Re: voiced/unvoiced detection



Jont Allen <jba@research.att.com> wrote:
>>To the extent that segmental information is carried by
>> spectral shape,
>
>This is clearly NOT the case. If it were, how would you ever hear-out
>one speaker from a second, male from female.

There is confusion here between recognition and segregation.

Speech recognition by humans may or may not involve F0 to extract segmental
information (vowel identity, etc.).  That question merits study.  The point
I tried to make was that comparing voiced and whispered speech may not be a
good test, because they may differ in more respects than just periodicity.
I think we actually agree on this point.

Speech recognition by machines typically uses gross spectral shape (the
first few cepstral or mel-cepstral coefficients, which describe gross shape
but not periodicity details), both static and dynamic (delta-cepstrum).  I
know of several attempts to incorporate F0 information, either to improve
"segmental" feature extraction or to exploit prosodic information, but as
far as I know F0 is not yet used in "mainstream" ASR systems.  Maybe
someone knowledgeable could comment.
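
For concreteness, here is a rough sketch in Python of the kind of front
end I mean (the librosa library, the file name, and the parameter values
are illustrative choices of mine, not a standard recipe):

    # Typical ASR front end: gross spectral envelope via mel-cepstral
    # coefficients, plus their time derivatives.
    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

    # First 13 MFCCs per frame: they capture gross spectral shape but
    # discard harmonic fine structure, hence most F0 information.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # "Dynamic" features: frame-to-frame derivatives (delta-cepstrum).
    delta = librosa.feature.delta(mfcc)

    # Static + dynamic features, stacked per frame.
    features = np.vstack([mfcc, delta])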

IF voiced speech differed from whispered speech only in periodicity, THEN
an ASR system trained on one should work on the other.  As the antecedent
is doubtful, the consequent is anyone's guess.  Surely someone out there
with some ASR software can give us the answer.  I think we agree on that
point too.

Speech _segregation_ by humans can take advantage of periodicity, as
demonstrated by Summerfield and many others, including myself.  That does
not mean that periodicity is necessary for recognition.  After all,
identification in noise benefits from binaural disparities, but that does
not mean that we can't understand speech in quiet without them.

In his thesis, Andrew Lea (1992) did experiments with mixtures of
"whispered" and voiced vowels.  He found that a whispered vowel was no less
intelligible than a voiced vowel, whether it was isolated, mixed with a
whispered vowel, or mixed with a voiced vowel.  The voicing state of the
vowel being identified made no difference!  On the other hand, both voiced
and whispered vowels were less intelligible when mixed with a whispered
vowel than when mixed with a voiced vowel.  Conclusion: segregation
depends on the harmonic state of the _interference_.  Summerfield and
Culling (1992) found similar results, and so did my colleagues and I in an
extensive series of experiments.  [Note: Andy's "whispered" vowels were
synthesized with the same vocal-tract envelopes as voiced vowels.
Excitation was noise-like and had a -6 dB/octave roll-off (vs -12
dB/octave for voiced vowels).  Later experiments used more closely matched
stimulus envelopes.]
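
To make the excitation difference concrete, here is a Python sketch of
the two source signals (the sampling rate, duration, and F0 are
illustrative numbers of mine, not Andy's actual parameters):

    # Whispered source: noise with a -6 dB/octave roll-off (amplitude
    # falls as 1/f).  Voiced source: harmonics of F0 with a -12
    # dB/octave roll-off (amplitude falls as 1/f^2).
    import numpy as np

    sr, dur, f0 = 16000, 0.5, 100.0        # Hz, seconds, Hz
    n = int(sr * dur)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    # Whispered excitation: shape white noise in the frequency domain.
    spec = np.random.randn(freqs.size) + 1j * np.random.randn(freqs.size)
    spec[1:] /= freqs[1:]                  # -6 dB/oct amplitude slope
    spec[0] = 0.0                          # no DC component
    whispered_src = np.fft.irfft(spec, n)

    # Voiced excitation: sum of harmonics up to Nyquist, 1/f^2 weights.
    t = np.arange(n) / sr
    k = np.arange(1, int(sr / 2 / f0))     # harmonic numbers
    voiced_src = np.sum(np.cos(2 * np.pi * np.outer(k * f0, t))
                        / (k * f0)[:, None] ** 2, axis=0)

    # Either source would then be passed through the same vocal-tract
    # (formant) filter, so both vowels share one spectral envelope.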

Attempts to use F0 information for speech segregation by machine date back
to Parsons (1976) and Weintraub (1985), and many schemes have been proposed
since.   But I'm not aware of an example where F0 is exploited in a useful
(say, commercial) system for the purpose of segregation or noise reduction.
Not yet.  Again, someone more knowledgeable might care to comment.
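
The core operation of the time-domain cancellation model (de Cheveigne,
1993, cited below) fits in a few lines.  This Python sketch assumes the
interferer's period is already known and is an integer number of samples:

    # Delay-and-subtract comb filter: its transfer function has zeros
    # at 1/T and all its multiples, so a periodic interferer of period
    # T is nulled while a target with a different (or no) period is
    # merely distorted.
    import numpy as np

    def cancel_periodic(x, period_samples):
        """Return y[t] = x[t] - x[t - T] for T = period_samples."""
        y = np.array(x, dtype=float)
        y[period_samples:] -= x[:-period_samples]
        return y

    # Example: cancel a 100 Hz interferer sampled at 16 kHz (T = 160).
    # y = cancel_periodic(mixture, 160)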

>Based on the results of Quentin Summerfield (and colleagues), you can only
>separate two simultaneous speakers (get a good AI score) if their f0's
>differ.

In double-vowel experiments, identification is typically way above chance
even when F0s are the same.  Scores certainly improve with F0 differences,
but do not become perfect.  For equal-amplitude vowels one gets typically a
10-25% increase in both-correct scores between 0 and 1 semitone, and little
improvement beyond that.  For unequal amplitudes (15 to 25 dB difference),
one can get more spectacular effects for the weaker vowel.  Another way of
describing the effects is to say that they correspond to a boost of about
15 dB in the SNR of the weaker vowel (Culling, Summerfield and Marshall,
1994).  In summary, F0 differences help, but identification is still often
possible when F0s are the same.  I have a paper in review that examines in
detail (vowel by vowel) identification at DF0=0, and how it improves when
DF0!=0.
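
(For those unused to the unit: one semitone is a frequency ratio of
2^(1/12), about 5.9%.  A small Python sketch, with illustrative numbers:

    # F0 difference expressed in semitones.
    import math

    def df0_semitones(f0_a, f0_b):
        return 12.0 * math.log2(f0_b / f0_a)

    # df0_semitones(100.0, 106.0)  ->  about 1.0

So a 1-semitone DF0 on a 100 Hz vowel is roughly 106 Hz.)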

However, for doubts about whether such results extend to "real" speech, see
Chris Darwin's page (http://www.biols.susx.ac.uk/Home/Chris_Darwin/).

>How do you reconcile this observation with whispered speech, where f0 is
>absent?

See the work quoted above and the references below.

Alain

---
Assmann, P. F., and Summerfield, Q. (1989). "Modeling the perception of
concurrent vowels: Vowels with the same fundamental frequency," J.
Acoust. Soc. Am. 85, 327-338.

Assmann, P. F., and Summerfield, Q. (1990). "Modeling the perception of
concurrent vowels: Vowels with different fundamental frequencies," J.
Acoust. Soc. Am. 88, 680-697.

Culling, J. F., and Darwin, C. J. (1993). "Perceptual separation of
simultaneous vowels: Within and across-formant grouping by F0," J.
Acoust. Soc. Am. 93, 3454-3467.

Culling, J. F., Summerfield, Q., and Marshall, D. H. (1994). "Effects of
simulated reverberation on the use of binaural cues and fundamental
frequency differences for separating concurrent vowels," Speech Comm. 14,
71-95.

de Cheveigne, A. (1993). "Separation of concurrent harmonic sounds:
Fundamental frequency estimation and a time-domain cancellation model of
auditory processing," J. Acoust. Soc. Am. 93, 3271-3290.

de Cheveigne, A., McAdams, S., Laroche, J., and Rosenberg, M. (1995).
"Identification of concurrent harmonic and inharmonic vowels: A test of
the theory of harmonic cancellation and enhancement," J. Acoust. Soc. Am.
97, 3736-3748.

de Cheveigne, A., Kawahara, H., Tsuzaki, M., and Aikawa, K. (1997).
"Concurrent vowel identification I: Effects of relative level and F0
difference," J. Acoust. Soc. Am. 101, 2839-2847.

de Cheveigne, A., McAdams, S., and Marin, C. (1997). "Concurrent vowel
identification II: Effects of phase, harmonicity and task," J. Acoust.
Soc. Am. 101, 2848-2856.

de Cheveigne, A. (1997). "Concurrent vowel identification III: A neural
model of harmonic interference cancellation," J. Acoust. Soc. Am. 101,
2857-2865.

Lea, A. (1992). "Auditory models of vowel perception," unpublished
doctoral dissertation, University of Nottingham.

Parsons, T. W. (1976). "Separation of speech from interfering speech by
means of harmonic selection," J. Acoust. Soc. Am. 60, 911-918.

Summerfield, Q. (1992). "Roles of harmonicity and coherent frequency
modulation in auditory grouping," in "The auditory processing of speech:
from sounds to words," edited by M. E. H. Schouten, Berlin, Mouton de
Gruyter, 157-166.

Summerfield, Q., and Culling, J. F. (1992). "Periodicity of maskers not
targets determines ease of perceptual segregation using differences in
fundamental frequency," Proc. 124th meeting of the ASA, 2317(A).

Weintraub, M. (1985). "A theory and computational model of auditory
monaural sound separation," unpublished doctoral dissertation, Stanford
University.
---


------------------------------------------------------------------
Alain de Cheveigne'
Laboratoire de Linguistique Formelle, CNRS / Universite' Paris 7,
case 7003, 2 place Jussieu, 75251 Paris CEDEX 05, FRANCE.
phone:   +33 1 44273633, fax: +33 1 44277919
e-mail:  alain@linguist.jussieu.fr
http://www.linguist.jussieu.fr/~alain/
------------------------------------------------------------------
