
Re: voiced/unvoiced detection



Dear Laszlo,

>In this case I'd guess I should measure how strongly the
>peak "dominates" the summary autocorrelogram. What would give a measure
>of this? E.g. a narrower peak means more definite pitch sensation than a
>wide, diffuse one? Or it is the height of the peak compared to its
>neighborhood that counts? If so, how wide "neighborhood" should I check?

Although most of the publications about pitch estimation using the
correlogram concentrate on the summary autocorrelation (or pooled all-order
interval histograms [Hi, Peter]), it might be worth considering the
correlogram "image" itself before summing across the frequency dimension.

If you consider each frequency band (or cochlear channel, whatever you want
to call it) separately, you'll find that the width of the peak corresponding
to the pitch period depends on the frequencies (and total number) of the
partials falling in that channel; lower frequencies (and fewer partials)
lead to wider peaks.  I subscribe to the interpretation that it is the
alignment of these peaks across multiple channels that generates a pitch
sensation rather than the "sharpness" of the peaks, either in individual
channels or in the summary. This alignment is, of course, reflected in the
summary autocorrelation, but summing across channels is only one of many
ways of detecting it (this fact is pointed out in some of the papers from
around 1990). And the width of the peak in the summary autocorrelation
depends more on the strength of the various partials in a harmonic signal
than it does on the "pitchiness" of the sound. So the degree of
"pitchiness" might be related to the degree of across-channel structure in
the image.
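
To make the across-channel idea concrete, here is a minimal sketch in Python (NumPy only). Everything specific in it is an assumption made for illustration, not something taken from a particular published model: the brick-wall FFT band-pass standing in for a cochlear filterbank, the half-wave rectification, the hand-picked band edges, the 0.5 threshold, and the function names themselves. It builds a small correlogram, picks the strongest lag in the summary, and then asks what fraction of channels individually agree on that lag.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude brick-wall band-pass by FFT masking (stand-in for a cochlear filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def correlogram(x, fs, edges, max_lag):
    """Rows are channels, columns are lags 0..max_lag-1 (normalized autocorrelation)."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.maximum(bandpass_fft(x, fs, lo, hi), 0.0)  # half-wave rectify
        band = band - band.mean()
        n = len(band)
        # Linear (zero-padded) autocorrelation computed through the power spectrum.
        acf = np.fft.irfft(np.abs(np.fft.rfft(band, 2 * n)) ** 2)[:max_lag]
        rows.append(acf / acf[0] if acf[0] > 0 else np.zeros(max_lag))
    return np.array(rows)

def summary_peak(cgram, min_lag):
    """Lag (in samples) of the largest summary-autocorrelation value at or above min_lag."""
    return int(np.argmax(cgram[:, min_lag:].mean(axis=0))) + min_lag

def across_channel_agreement(cgram, lag, threshold=0.5):
    """Fraction of channels whose normalized autocorrelation at `lag` exceeds `threshold`."""
    return float(np.mean(cgram[:, lag] > threshold))

# Illustrative test signals: a 200 Hz harmonic complex and white noise.
fs, n, f0 = 16000, 4000, 200.0
t = np.arange(n) / fs
harmonic = sum(np.cos(2 * np.pi * k * f0 * t) / k for k in range(1, 11))
noise = np.random.default_rng(0).standard_normal(n)
edges = [100, 300, 500, 700, 1100, 1500, 2100]  # assumed band edges in Hz

results = {}
for name, sig in (("harmonic", harmonic), ("noise", noise)):
    cgram = correlogram(sig, fs, edges, max_lag=200)
    lag = summary_peak(cgram, min_lag=20)
    results[name] = (lag, across_channel_agreement(cgram, lag))

print(results)
```

The agreement score is deliberately separate from the height or width of the summary peak: it only counts how many channels line up at the candidate lag, which is one simple way (among many) of looking for across-channel structure before summing.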

The main thrust of my argument should apply equally well to other methods
of periodicity-detection, such as the Auditory Image Model (and perhaps
some versions of the cancellation model), that employ band-pass frequency
analysis prior to periodicity-detection. The peaks in AIM are narrower
across the board because of the triggering mechanism, but if you look for
structure across bands, some of the apparent differences between AIM and
the correlogram vanish.

I don't claim that this line of thinking will work for all of the variously
constructed "pitched sounds", but it works extremely well for real-world
quasi-periodic sounds, like voiced speech, or in my case musical instruments.

There are some people (Hi again, Peter!) who will argue vehemently that you
don't need to preserve the information from individual channels -- that the
summary is all you need. As you can tell, I disagree, but the question is
certainly still open.

Cheers,

--Keith

-----
Keith D. Martin
MIT Media Lab Machine Listening Group
kdm@media.mit.edu
http://sound.media.mit.edu/~kdm
"Busy, busy, busy, is what we Bokononists whisper whenever we
think of how complicated and unpredictable the machinery of life really is."
