Re: voiced/unvoiced detection

Subject: Re: voiced/unvoiced detection
From:    "Keith D. Martin"  <kdm(at)>
Date:    Thu, 5 Nov 1998 10:11:30 -0500

Dear Laszlo,

>In this case I'd guess I should measure how strongly the
>peak "dominates" the summary autocorrelogram. What would give a measure
>of this? E.g. a narrower peak means more definite pitch sensation than a
>wide, diffuse one? Or it is the height of the peak compared to its
>neighborhood that counts? If so, how wide "neighborhood" should I check?

Although most of the publications about pitch estimation using the
correlogram concentrate on the summary autocorrelation (or pooled
all-order interval histograms [Hi, Peter]), it might be worth
considering the correlogram "image" itself before summing across the
frequency dimension. If you consider each frequency band (cochlear
channel, however you want to call it) separately, you'll find that the
width of the peak corresponding to the pitch period depends on the
frequencies (and total number) of the partials falling in that channel;
lower frequencies (and fewer partials) lead to wider peaks.

I subscribe to the interpretation that it is the alignment of these
peaks across multiple channels that generates a pitch sensation rather
than the "sharpness" of the peaks, either in individual channels or in
the summary. This alignment is, of course, reflected in the summary
autocorrelation, but summing across channels is only one of many ways
of detecting it (this fact is pointed out in some of the papers from
around 1990). And the width of the peak in the summary autocorrelation
depends more on the strength of the various partials in a harmonic
signal than it does on the "pitchiness" of the sound. So the degree of
"pitchiness" might be related to the degree of across-channel structure
in the image.

The main thrust of my argument should apply equally well to other
methods of periodicity-detection, such as the Auditory Image Model (and
perhaps some versions of the cancellation model), that employ band-pass
frequency analysis prior to periodicity-detection.
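As a rough illustration of the across-channel idea, here is a minimal
Python sketch. It is not taken from any of the published correlogram
models: the brick-wall FFT band-pass stands in for a real cochlear
filterbank, and the function names, the 80-400 Hz search range, the
one-sample tolerance and the 0.5 peak-height threshold are all
assumptions chosen for the example. It computes a per-channel
autocorrelation, picks the candidate pitch lag from the summary, and
then asks what fraction of channels show their own peak at that lag.

```python
import numpy as np

def band_pass(x, fs, lo, hi):
    """Crude brick-wall band-pass by FFT masking (stand-in for a cochlear filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def correlogram(x, fs, edges, max_lag):
    """Return one normalized autocorrelation row per band (channels x lags)."""
    n = len(x)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        y = np.maximum(band_pass(x, fs, lo, hi), 0.0)  # half-wave rectify
        y = y - y.mean()
        spec = np.fft.rfft(y, 2 * n)  # zero-pad so the correlation is linear
        ac = np.fft.irfft(spec * np.conj(spec))[:max_lag]
        rows.append(ac / ac[0] if ac[0] > 0 else np.zeros(max_lag))
    return np.array(rows)

def summary_peak(corr, min_lag):
    """Lag of the highest peak in the summary (mean across channels)."""
    summary = corr.mean(axis=0)
    return min_lag + int(np.argmax(summary[min_lag:]))

def channel_agreement(corr, lag, tol=1, height=0.5):
    """Fraction of channels with a local peak >= height within tol samples of lag."""
    votes = 0
    for row in corr:
        for k in range(max(lag - tol, 1), min(lag + tol, len(row) - 2) + 1):
            if row[k] >= height and row[k] > row[k - 1] and row[k] >= row[k + 1]:
                votes += 1
                break
    return votes / len(corr)

def analyze(x, fs, edges, f_lo=80.0, f_hi=400.0):
    """Return (pitch estimate in Hz, across-channel agreement in [0, 1])."""
    min_lag, max_lag = int(fs / f_hi), int(fs / f_lo) + 1
    corr = correlogram(x, fs, edges, max_lag)
    lag = summary_peak(corr, min_lag)
    return fs / lag, channel_agreement(corr, lag)
```

Calling analyze(x, fs, edges) with edges such as
np.arange(90, 3100, 300) gives a pitch estimate plus an agreement
score. For a steady harmonic complex every channel should vote for the
same lag, while for white noise few or none should, even though the
summary still has a maximum somewhere. The vote count is only one
possible measure of across-channel structure.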
The peaks in AIM are narrower across-the-board because of the
triggering mechanism, but if you look for structure across bands, some
of the apparent differences between AIM and the correlogram vanish.

I don't claim that this line of thinking will work for all of the
variously constructed "pitched sounds", but it works extremely well for
real-world quasi-periodic sounds, like voiced speech, or in my case
musical instruments.

There are some people (Hi again, Peter!) who will argue vehemently that
you don't need to preserve the information from individual channels --
that the summary is all you need. As you can tell, I disagree, but the
question is certainly still open.

Cheers,

--Keith

-----
Keith D. Martin
MIT Media Lab Machine Listening Group
kdm(at)

"Busy, busy, busy, is what we Bokononists whisper whenever we think of
how complicated and unpredictable the machinery of life really is."

