>In speech processing the voiced/unvoiced decision is usually considered
>more difficult than the measurement of pitch itself.
>How would you measure how strong the sensation of "pitchedness" is? Does
>this make sense at all, or is it a binary decision, that is, we either
>hear or don't hear a pitch?
>In particular, I'm looking for ideas about how to do voiced/unvoiced
>detection of speech using auditory-like processing, e.g. the summary
>autocorrelogram. In this case I'd guess I should measure how strongly the
>peak "dominates" the summary autocorrelogram. What would give a measure
>of this? E.g. does a narrower peak mean a more definite pitch sensation
>than a wide, diffuse one? Or is it the height of the peak compared to its
>neighborhood that counts? If so, how wide a "neighborhood" should I check?

Good question.

I've seen various suggestions to the effect that pitch might be weaker if
there are multiple (ambiguous) peaks, or if the period peak is wide.

More quantitatively, Kaernbach and Demany (1998) suggested using the
ratio between peak and background within a 6 ms portion of the
autocorrelation of the waveform as a measure of pitch strength (the period
was 10 ms).  This measure is not entirely without problems.

Yost (1996) showed that the ratio of period-peak to zero-lag peak of the
autocorrelation of the waveform is a good predictor of pitch strength of
IRN (iterated rippled noise) stimuli.  Wiegrebe, Patterson, Demany and
Carlyon (1998) have recently refined this result.  They showed that the
autocorrelation calculation must be modified for this measure to be valid
for a wider class of stimuli.
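
As a rough illustration, the basic peak ratio (without the modification
proposed by Wiegrebe et al.) might be computed along the following lines.
This is only a sketch: the function name, the candidate period range and
the fixed-length window are my assumptions for the example, not details
taken from the papers cited.

```python
import numpy as np

def acf_peak_ratio(x, fs, min_period=0.002, max_period=0.02):
    """Ratio of the largest ACF value in a candidate period range to the
    zero-lag value, plus the lag (in seconds) at which it occurs.

    The period range and windowing are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove DC so it does not inflate the ACF
    lo = int(round(min_period * fs))      # shortest candidate period, in samples
    hi = int(round(max_period * fs))      # longest candidate period, in samples
    w = len(x) - hi                       # same number of samples at every lag
    if lo < 1 or w < 1:
        raise ValueError("signal too short for the requested period range")
    r0 = np.dot(x[:w], x[:w])             # zero-lag value (energy of the window)
    if r0 == 0.0:
        return 0.0, None                  # silence: no periodicity to measure
    lags = np.arange(lo, hi + 1)
    r = np.array([np.dot(x[:w], x[lag:lag + w]) for lag in lags])
    k = int(np.argmax(r))                 # candidate period peak
    return r[k] / r0, lags[k] / fs
```

The ratio comes out close to 1 for a steady periodic signal and small for
white noise.  Note that a perfectly periodic signal has equally high peaks
at every multiple of the period, so the search range should be chosen with
that in mind.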

The same measure (peak ratios) can be derived from summary
autocorrelograms, but I'm not sure whether it predicts pitch strength as
well there.

The ratio of period-peak to zero-lag peak of the autocorrelation function
is directly related to the depth of "period valleys" relative to the rest
of the cancellation pattern in my own cancellation model (de Cheveigne,
1998).  For perfectly periodic stimuli the ACF peak ratio is 1, while the
cancellation pattern valley ratio is 0.
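
The link between the two ratios can be made explicit under a simplifying
assumption: take the cancellation pattern to be the mean squared
difference between the signal and its delayed copy, and treat the signal
as stationary, so that its power is the same before and after the delay.
The symbols d, r and tau are notation introduced for this sketch, and the
squared-difference form is a simplification rather than the model as
published.

```latex
\begin{align*}
d(\tau) &= \left\langle \bigl(x(t) - x(t-\tau)\bigr)^2 \right\rangle \\
        &= \left\langle x(t)^2 \right\rangle
         + \left\langle x(t-\tau)^2 \right\rangle
         - 2 \left\langle x(t)\, x(t-\tau) \right\rangle \\
        &= 2\,\bigl[\, r(0) - r(\tau) \,\bigr], \\[4pt]
\frac{d(\tau)}{2\, r(0)} &= 1 - \frac{r(\tau)}{r(0)} .
\end{align*}
```

Here 2 r(0) is the value the pattern takes at lags where r(tau) = 0, that
is, the background level.  At the period of a perfectly periodic stimulus
r(tau)/r(0) = 1, so the normalized valley is 0, as stated above.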

The cancellation pitch model is related to the AMDF method of speech F0
estimation.  The depth of the AMDF valley has been used as a "periodicity
measure" related to voicing.  A similar measure can be derived from peaks
of the autocorrelation function of the speech waveform.
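
A minimal sketch of such an AMDF-based periodicity measure follows.  The
function name, the period range, and the choice to normalize the valley by
the mean of the AMDF over the searched lags are all assumptions made for
the example; published methods differ in how they normalize.

```python
import numpy as np

def amdf_valley_ratio(x, fs, min_period=0.002, max_period=0.02):
    """Depth of the deepest AMDF valley relative to the mean AMDF level.

    Returns (ratio, period_s).  The ratio is near 0 for a strongly
    periodic frame and near 1 for an aperiodic one.  The normalization
    by the mean AMDF is an illustrative assumption.
    """
    x = np.asarray(x, dtype=float)
    lo = int(round(min_period * fs))      # shortest candidate period, in samples
    hi = int(round(max_period * fs))      # longest candidate period, in samples
    w = len(x) - hi                       # same number of samples at every lag
    if lo < 1 or w < 1:
        raise ValueError("signal too short for the requested period range")
    lags = np.arange(lo, hi + 1)
    # Average magnitude difference between the frame and its delayed copy
    d = np.array([np.mean(np.abs(x[:w] - x[lag:lag + w])) for lag in lags])
    background = d.mean()
    if background == 0.0:
        return 1.0, None                  # silent or constant frame: no valley
    k = int(np.argmin(d))                 # deepest valley = candidate period
    return d[k] / background, lags[k] / fs
```

A voicing decision could then be made by comparing the ratio with a
threshold, though, as noted below, periodicity and voicing are not quite
the same thing, so any such threshold is a heuristic.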

A difficulty in applying perceptual models to speech processing is of
course that pitch and F0 are not quite the same thing.  Also, voicing is
not quite synonymous with periodicity (glottal pulses are sometimes
irregular, and sometimes even occur in isolation).  I wouldn't claim that
an AMDF-derived periodicity measure solves the problem of voicing detection.


de Cheveigne, A. (1998). "Cancellation model of pitch perception,"  JASA
103, 1261-1271.

Kaernbach, C., and Demany, L. (1998). "Psychophysical evidence against the
autocorrelation theory of auditory temporal processing,"  JASA 104,
2298-2306.

Wiegrebe, L., Patterson, R. D., Demany, L., and Carlyon, R. P. (1998).
"Temporal dynamics of pitch strength in regular interval noises,"  JASA
104, 2307-2313.

Yost, W. A. (1996). "Pitch strength of iterated rippled noise,"  JASA 100,

Alain de Cheveigne'
Laboratoire de Linguistique Formelle, CNRS / Universite' Paris 7,
case 7003, 2 place Jussieu, 75251 Paris CEDEX 05, FRANCE.
phone:   +33 1 44273633, fax: +33 1 44277919
e-mail:  alain@linguist.jussieu.fr
