Re: voiced/unvoiced detection (Jont Allen)


Subject: Re: voiced/unvoiced detection
From:    Jont Allen  <jba(at)RESEARCH.ATT.COM>
Date:    Wed, 11 Nov 1998 15:25:57 +0000

Alain de Cheveigne' wrote:
>
> For whispered speech, one should probably distinguish the issues of
> transmitting segmental information ("phoneme" identity, etc.), and
> intonation. To the extent that segmental information is carried by
> spectral shape,

This is clearly NOT the case. If it were, how would you ever hear out
one speaker from a second, male from female? For more on this see:

  author = {Allen, J. B.},
  title = {How do humans process and recognize speech?},
  journal = {IEEE Trans. on Speech and Audio Proc.},
  volume = {2},
  number = {4},
  pages = {567-577},
  month = oct,
  year = 1994

as well as Summerfield's speech AI work (Somebody have the exact
reference please?)

> it is coded equally well if the excitation is noise-like.

The spectrum will not be the same for voiced and whispered speech
unless the source is at exactly the same point and the source
impedance is the same. I doubt that either condition is true. In fact,
I expect we don't really know much about this. Does anybody know of
any measurements of the spectrum of whispered speech, re voiced speech?

> A speech recognizer trained on voiced speech should work on whispered
> speech.

I strongly suspect that modern hidden Markov
(http://www-history.mcs.st-and.ac.uk:/history//Mathematicians/Markov.html)
model (HMM) automatic speech recognition (ASR) software would
!massively! fail with whispered speech as an input. Has anybody ever
tried it?

> In principle. In practice there are issues such as the different
> spectral slopes of voiced and whispered excitation, and the fact that
> speakers might not articulate the same when they whisper as when they use
> voice.
>
> Intonation is another problem, as it is usually thought of as being coded
> by F0 which is absent in whispered speech. I think it has been suggested
> that F1 might be used in the place of F0 (how to reconcile this role with
> that of coding segmental information is another mystery). Other parameters
> are timing and intensity. Introspection tells me that whispered
> articulation is more marked than voiced articulation, something akin to a
> sort of "Lombard effect". It may be a mistake to equate "whispered speech"
> with "voiced speech minus the F0".

Based on the results of Quentin Summerfield (and colleagues), you can
only separate two simultaneous speakers (get a good AI score) if their
f0's differ. How do you reconcile this observation with whispered
speech, where f0 is absent?

>
> Alain
>
> Email to AUDITORY should now be sent to AUDITORY(at)lists.mcgill.ca
> LISTSERV commands should be sent to listserv(at)lists.mcgill.ca
> Information is available on the WEB at http://www.mcgill.ca/cc/listserv

Jont Allen

--
Jont B. Allen (Technology Leader)
AT&T Labs-Research, Shannon Laboratory
180 Park Ave., Room E161
Florham Park NJ 07932-0971
973/360-8545 voice, x7111 fax, http://www.research.att.com/info/jba
To send a fax that I get by email: 973/360-8545 (Experimental)


This message came from the mail archive
http://www.auditory.org/postings/1998/
maintained by:
Dan Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University