Re: voiced/unvoiced detection (Alain de Cheveigne')

Subject: Re: voiced/unvoiced detection
From:    "Alain de Cheveigne'"  <alain(at)LINGUIST.JUSSIEU.FR>
Date:    Fri, 13 Nov 1998 15:14:18 +0100

Jont Allen <jba(at)> wrote:

>>To the extent that segmental information is carried by
>>spectral shape,
>
>This is clearly NOT the case. If it were, how would you ever hear out
>one speaker from a second, male from female.

There is confusion here between recognition and segregation.

Speech recognition by humans may or may not involve F0 to extract
segmental information (vowel identity, etc.). That question merits
study. The point I tried to make was that comparing voiced and
whispered speech may not be a good test, because they may differ in
more respects than just periodicity. I think we actually agree on this
point.

Speech recognition by machines typically uses gross spectral shape
(the first few cepstral or mel-cepstral coefficients, which describe
gross shape but not periodicity details), both static and dynamic
(delta-cepstrum). I know of several attempts to incorporate F0
information, either to improve "segmental" feature extraction or to
exploit prosodic information, but as far as I know F0 is not yet used
in "mainstream" ASR systems. Maybe someone knowledgeable could comment.

IF voiced speech differed from whispered speech in only periodicity,
THEN an ASR system trained on one should work on the other. As the
antecedent is doubtful, the consequent is anyone's guess. Surely
someone out there with some ASR software can give us the answer. I
think we agree on that point too.

Speech _segregation_ by humans can take advantage of periodicity, as
demonstrated by Summerfield and many others, including myself. That
does not mean that periodicity is necessary for recognition. After
all, identification in noise benefits from binaural disparities, but
that does not mean that we can't understand speech in quiet without
them.

In his thesis, Andrew Lea (1992) did experiments with mixtures of
"whispered" and voiced vowels. He found that a whispered vowel was no
less intelligible than a voiced vowel.
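The "gross spectral shape" features mentioned above can be sketched in a few lines. This is a generic real-cepstrum computation with a first-difference delta, not the front end of any particular recognizer; the frame size, coefficient count, and padding scheme are illustrative choices, not assumptions about any specific system.

```python
import numpy as np

def cepstral_features(frames, n_coef=13):
    """Low-order real cepstrum of each frame: inverse DFT of the log
    magnitude spectrum.  Keeping only the first few coefficients
    retains gross spectral shape and discards fine spectral detail
    such as harmonic (periodicity) structure."""
    spectrum = np.abs(np.fft.rfft(frames, axis=-1))
    log_spec = np.log(spectrum + 1e-10)          # guard against log(0)
    cepstrum = np.fft.irfft(log_spec, axis=-1)
    return cepstrum[..., :n_coef]

def delta(features):
    """First-order difference across frames ("delta-cepstrum"),
    padded with the last difference to keep the frame count."""
    d = np.diff(features, axis=0)
    return np.vstack([d, d[-1:]])

# Example: 10 frames of 512 samples each (synthetic noise stand-in).
frames = np.random.default_rng(0).standard_normal((10, 512))
c = cepstral_features(frames)   # static features, shape (10, 13)
d = delta(c)                    # dynamic features, shape (10, 13)
```

Because the low-order cepstrum smooths away harmonic fine structure, features of this kind carry little F0 information, which is why incorporating F0 into such a front end is a separate design effort.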
That was true whether it was isolated, mixed with a whispered vowel,
or mixed with a voiced vowel. The voicing state of the vowel being
identified made no difference! On the other hand, both voiced and
whispered vowels were less intelligible when mixed with a whispered
vowel than when mixed with a voiced vowel. Conclusion: segregation
depends on the harmonic state of the _interference_. Summerfield and
Culling (1992) found similar results, and so did my colleagues and I
in an extensive series of experiments.

[Note: Andy's "whispered" vowels were synthesized with the same
vocal-tract envelopes as voiced vowels. Excitation was noise-like and
had a -6 dB/octave roll-off (vs -12 dB/octave for voiced vowels).
Later experiments used more closely matched stimulus envelopes.]

Attempts to use F0 information for speech segregation by machine date
back to Parsons (1976) and Weintraub (1985), and many schemes have
been proposed since. But I'm not aware of an example where F0 is
exploited in a useful (say, commercial) system for the purpose of
segregation or noise reduction. Not yet. Again, someone more
knowledgeable might care to comment.

>Based on the results of Quentin Summerfield (and colleagues), you can
>only separate two simultaneous speakers (get a good AI score) if
>their f0's differ.

In double-vowel experiments, identification is typically well above
chance even when the F0s are the same. Scores certainly improve with
F0 differences, but do not become perfect. For equal-amplitude vowels
one typically gets a 10-25% increase in both-correct scores between 0
and 1 semitone, and little improvement beyond that. For unequal
amplitudes (15 to 25 dB difference), the effects on the weaker vowel
can be more spectacular. Another way of describing the effects is to
say that they correspond to a boost of about 15 dB in the SNR of the
weaker vowel (Culling, Summerfield and Marshall, 1994).
In summary, F0 differences help, but identification is still often
possible when the F0s are the same. I have a paper in review that
examines identification in detail (vowel by vowel) at DF0=0, and how
it improves when DF0!=0. However, for doubts about whether such
results extend to "real" speech, see Chris Darwin's page.

>How do you reconcile this observation with whispered speech, where f0
>is absent?

See the work quoted above, and the references below.

Alain

---

Assmann, P. F., and Summerfield, Q. (1989). "Modeling the perception
of concurrent vowels: Vowels with the same fundamental frequency,"
J. Acoust. Soc. Am. 85, 327-338.

Assmann, P. F., and Summerfield, Q. (1990). "Modeling the perception
of concurrent vowels: Vowels with different fundamental frequencies,"
J. Acoust. Soc. Am. 88, 680-697.

Culling, J. F., and Darwin, C. J. (1993). "Perceptual separation of
simultaneous vowels: Within and across-formant grouping by F0,"
J. Acoust. Soc. Am. 93, 3454-3467.

Culling, J. F., Summerfield, Q., and Marshall, D. H. (1994). "Effects
of simulated reverberation on the use of binaural cues and fundamental
frequency differences for separating concurrent vowels," Speech
Comm. 14, 71-95.

de Cheveigne, A. (1993). "Separation of concurrent harmonic sounds:
Fundamental frequency estimation and a time-domain cancellation model
of auditory processing," J. Acoust. Soc. Am. 93, 3271-3290.

de Cheveigne, A. (1997). "Concurrent vowel identification III: A
neural model of harmonic interference cancellation," J. Acoust.
Soc. Am. 101, 2857-2865.

de Cheveigne, A., Kawahara, H., Tsuzaki, M., and Aikawa, K. (1997).
"Concurrent vowel identification I: Effects of relative level and F0
difference," J. Acoust. Soc. Am. 101, 2839-2847.

de Cheveigne, A., McAdams, S., and Marin, C. (1997). "Concurrent vowel
identification II: Effects of phase, harmonicity and task," J. Acoust.
Soc. Am. 101, 2848-2856.

de Cheveigne, A., McAdams, S., Laroche, J., and Rosenberg, M. (1995).
"Identification of concurrent harmonic and inharmonic vowels: A test
of the theory of harmonic cancellation and enhancement," J. Acoust.
Soc. Am. 97, 3736-3748.

Lea, A. (1992). "Auditory models of vowel perception," unpublished
doctoral dissertation, University of Nottingham.

Parsons, T. W. (1976). "Separation of speech from interfering speech
by means of harmonic selection," J. Acoust. Soc. Am. 60, 911-918.

Summerfield, Q. (1992). "Roles of harmonicity and coherent frequency
modulation in auditory grouping," in "The auditory processing of
speech: From sounds to words," edited by M. E. H. Schouten (Mouton de
Gruyter, Berlin), 157-166.

Summerfield, Q., and Culling, J. F. (1992). "Periodicity of maskers
not targets determines ease of perceptual segregation using
differences in fundamental frequency," Proc. 124th meeting of the ASA,
2317(A).

Weintraub, M. (1985). "A theory and computational model of auditory
monaural sound separation," unpublished doctoral dissertation,
Stanford University.

------------------------------------------------------------------
Alain de Cheveigne'
Laboratoire de Linguistique Formelle, CNRS / Universite' Paris 7,
case 7003, 2 place Jussieu, 75251 Paris CEDEX 05, FRANCE.
phone: +33 1 44273633, fax: +33 1 44277919
e-mail: alain(at)
------------------------------------------------------------------

This message came from the mail archive
maintained by:
DAn Ellis <>
Electrical Engineering Dept., Columbia University