Re: voiced/unvoiced detection (Jont Allen)

Subject: Re: voiced/unvoiced detection
From:    Jont Allen  <jba(at)RESEARCH.ATT.COM>
Date:    Wed, 11 Nov 1998 15:25:57 +0000

Alain de Cheveigne' wrote:

> For whispered speech, one should probably distinguish the issues of
> transmitting segmental information ("phoneme" identity, etc.), and
> intonation. To the extent that segmental information is carried by
> spectral shape,

This is clearly NOT the case. If it were, how would you ever hear out
one speaker from a second, or male from female? For more on this see:

  Allen, J. B. (1994). "How do humans process and recognize speech?"
  IEEE Trans. on Speech and Audio Proc., 2(4), 567-577.

as well as Summerfield's speech AI work (somebody have the exact
reference, please?).

> it is coded equally well if the excitation is noise-like.

The spectrum will not be the same for voiced and whispered speech
unless the source is at exactly the same point and the source
impedance is the same. I doubt that either condition is true. In fact,
I expect we don't really know much about this. Does anybody know of
any measurements of the spectrum of whispered speech, re voiced
speech?

> A speech recognizer trained on voiced speech should work on whispered
> speech.

I strongly suspect that modern hidden Markov model (HMM) automatic
speech recognition (ASR) software would !massively! fail with
whispered speech as an input. Has anybody ever tried it?

> In principle. In practice there are issues such as the different
> spectral slopes of voiced and whispered excitation, and the fact that
> speakers might not articulate the same when they whisper as when they
> use voice.
>
> Intonation is another problem, as it is usually thought of as being
> coded by F0, which is absent in whispered speech. I think it has been
> suggested that F1 might be used in the place of F0 (how to reconcile
> this role with that of coding segmental information is another
> mystery). Other parameters are timing and intensity. Introspection
> tells me that whispered articulation is more marked than voiced
> articulation, something akin to a sort of "Lombard effect". It may be
> a mistake to equate "whispered speech" with "voiced speech minus the
> F0".

Based on the results of Quentin Summerfield (and colleagues), you can
only separate two simultaneous speakers (get a good AI score) if their
F0's differ. How do you reconcile this observation with whispered
speech, where F0 is absent?

> Alain

Jont Allen

--
Jont B. Allen (Technology Leader)
AT&T Labs-Research, Shannon Laboratory
180 Park Ave., Room E161
Florham Park NJ 07932-0971
973/360-8545 voice, x7111 fax
To send a fax that I get by email: 973/360-8545 (Experimental)
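As a concrete illustration of the voiced/unvoiced question in the
thread title, and of why the F0 cue discussed above disappears in
whisper, here is a minimal sketch of an autocorrelation-based
voiced/unvoiced (and F0) detector. It is only an illustrative example,
not anything from the posting: the function name voiced_unvoiced, the
frame sizes, and the 0.3 periodicity threshold are arbitrary choices.

import numpy as np

def voiced_unvoiced(x, fs, frame_ms=32.0, hop_ms=10.0,
                    f0_min=60.0, f0_max=400.0, vthresh=0.3):
    """Per-frame voiced/unvoiced decision from the normalized autocorrelation.

    x  : 1-D numpy array holding the waveform (assumed longer than one frame).
    fs : sampling rate in Hz.
    Returns (voiced, f0): a boolean flag and an F0 estimate in Hz per frame
    (0.0 where the frame is judged unvoiced, e.g. throughout a whisper).
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    lag_min = int(fs / f0_max)          # shortest period searched
    lag_max = int(fs / f0_min)          # longest period searched
    n_frames = 1 + (len(x) - frame_len) // hop
    voiced = np.zeros(n_frames, dtype=bool)
    f0 = np.zeros(n_frames)
    win = np.hanning(frame_len)

    for i in range(n_frames):
        fr = x[i * hop : i * hop + frame_len].astype(float)
        fr = (fr - fr.mean()) * win
        energy = np.dot(fr, fr)
        if energy < 1e-8:               # (near) silence: call it unvoiced
            continue
        # Autocorrelation, normalized so that lag 0 equals 1.
        ac = np.correlate(fr, fr, mode="full")[frame_len - 1:]
        ac = ac / ac[0]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        if ac[lag] > vthresh:           # strong periodicity -> voiced frame
            voiced[i] = True
            f0[i] = fs / lag
    return voiced, f0

On voiced vowels a detector of this kind marks most frames voiced with
an F0 near the true pitch; on whispered input the autocorrelation peak
stays low and essentially every frame comes out unvoiced, which is
exactly the situation the discussion above is about.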
