
Re: Robust method of fundamental frequency estimation.



Arturo,

>> That may be true, but there are other good time-domain
>> correlation-based pitch models that can NOT be expressed in terms of
>> the spectrum. For example, the Meddis & Hewitt or Meddis & O'Mard
>> models, or the Slaney & Lyon models, derived from Licklider's duplex
>> theory, which apply the ACF after what the cochlea model does, namely
>> separation into filter channels followed by half-wave rectification.

> I do not agree. If you know the frequency response of the cochlea, you
> can predict the spectrum of its output from the spectrum of its input.
> The effects of half-wave rectification and compression are more
> difficult to analyze, but not impossible. I remember reading a little
> bit about it in Anssi Klapuri's PhD thesis.

>> Did you consider any such models?

> I have used these models in the past, but I stopped using them. If I
> am not wrong, what Slaney & Lyon's model does is to apply a summary
> autocorrelation to the output of a gammatone filterbank (it does some
> extra steps, but the main idea is that one). Since this can be shown
> to be equivalent to applying autocorrelation to the original signal
> (using the Wiener-Khinchin theorem and the linearity of the Fourier
> transform), I have not used it anymore.
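
In an idealized setting that equivalence does hold exactly. The sketch
below (my own construction in Python/numpy, not Slaney & Lyon's code)
assumes a brick-wall filterbank that exactly partitions the spectrum, no
rectification, and circular ACFs over the whole signal; a gammatone bank
followed by a hair-cell stage satisfies none of these assumptions, which
is where my caveats below come in.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
X = np.fft.rfft(x)

def circ_acf(sig):
    # Time-domain circular ACF, computed lag by lag.
    return np.array([np.dot(sig, np.roll(sig, -k)) for k in range(len(sig))])

# Idealized filterbank: 8 brick-wall bands exactly partitioning the spectrum.
edges = np.linspace(0, len(X), 9, dtype=int)
channels = []
for lo, hi in zip(edges[:-1], edges[1:]):
    Xk = np.zeros_like(X)
    Xk[lo:hi] = X[lo:hi]
    channels.append(np.fft.irfft(Xk, n=len(x)))

# Summary ACF: sum of per-channel ACFs (no rectifier here).
summary = sum(circ_acf(ch) for ch in channels)

print(np.allclose(summary, circ_acf(x)))  # True under these assumptions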

For those unfamiliar, the Wiener-Khinchin theorem relates the power spectrum of a signal to its autocorrelation function: each is the Fourier transform of the other. The theorem is often invoked to claim 'equivalence' between spectral and temporal (or at least autocorrelation-based) approaches.
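
A minimal numerical check of the relation, for the circular ACF of a
finite frame:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

# Circular ACF computed directly as a lag-by-lag correlation sum.
acf_direct = np.array([np.dot(x, np.roll(x, -k)) for k in range(len(x))])

# The same ACF via the power spectrum (the Wiener-Khinchin route).
acf_wk = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real

print(np.allclose(acf_direct, acf_wk))  # True, up to rounding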


Note that the theorem relates waveforms defined over infinite time to spectra defined over infinite frequency. The 'equivalence' does not apply strictly to the short-term transforms used on practical signals (band-limited and considered over a short time interval), although the theorem is useful for insight into asymptotic behavior.

I believe the theorem can be extended to apply rigorously to the 'short-term ACF' and the 'short-term power spectrum' (Jont Allen might have more to say). However, the 'running ACF' used in the Licklider/Lyon/Slaney/Meddis & Hewitt models differs from the 'short-term ACF', so again the equivalence does not apply strictly.
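
To make the distinction concrete, here is a small sketch; the two
definitions below are my own reading of the terms, not anyone's published
code. In the running ACF the integration window slides along with the
lag, so every lag sees a full window of data; in the short-term ACF a
single excerpt is correlated with itself, so the overlap shrinks as the
lag grows.

import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 100 * t)          # 100 Hz tone, period = 80 samples

start, W, max_lag = 1000, 400, 200

# Running ACF: the integration window slides along with the lag.
running = np.array([np.dot(x[start:start + W],
                           x[start + k:start + k + W])
                    for k in range(max_lag)])

# Short-term ACF: one excerpt correlated with itself; the overlap
# shrinks as the lag grows, tapering the function toward zero.
seg = x[start:start + W]
short = np.array([np.dot(seg[:W - k], seg[k:]) for k in range(max_lag)])

print(running[80] / running[0])   # ~1.0: full peak at the period
print(short[80] / short[0])       # ~0.8: tapered by the shrinking overlap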

Note also that, when going from the waveform to the ACF via the power spectrum, the initial and final Fourier transforms (both linear) are separated by the power calculation (non-linear). Swapping the ACF and the filterbank therefore does not follow from linearity alone. It may nevertheless be allowed by orthogonality between the basis functions of the Fourier transform. That property is again lost, however, if a rectifier or hair-cell model is inserted at each filter output.
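
Continuing the earlier sketch (again my own construction): with disjoint
bands and no nonlinearity the summary ACF matches the signal ACF exactly,
but a half-wave rectifier at each filter output destroys the identity.

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(2048)
X = np.fft.rfft(x)

def circ_acf(sig):
    # Circular ACF via the power spectrum (Wiener-Khinchin).
    return np.fft.irfft(np.abs(np.fft.rfft(sig)) ** 2, n=len(sig))

edges = np.linspace(0, len(X), 9, dtype=int)
summary_lin = np.zeros(len(x))
summary_rect = np.zeros(len(x))
for lo, hi in zip(edges[:-1], edges[1:]):
    Xk = np.zeros_like(X)
    Xk[lo:hi] = X[lo:hi]
    ch = np.fft.irfft(Xk, n=len(x))
    summary_lin += circ_acf(ch)                    # linear channel
    summary_rect += circ_acf(np.maximum(ch, 0.0))  # half-wave rectified

ref = circ_acf(x)
print(np.allclose(summary_lin, ref))    # True: disjoint bands add up
print(np.allclose(summary_rect, ref))   # False: the rectifier breaks it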

There are at least two reasons why filtering into bands before the ACF might be useful.

One, which I think you mention in another email: by weighting channels inversely to their amplitude, one can counter the expansive property of the power (squared magnitude), which otherwise causes the ACF to be dominated by the high-amplitude parts of the spectrum (e.g. the formants of speech). The 'cepstrum' is another way to achieve a similar goal.
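
A minimal sketch of that weighting, under the assumption that 'weighting
inversely to amplitude' means normalizing each channel's ACF by its
lag-zero value (its energy) before summing:

import numpy as np

def weighted_summary_acf(channels, eps=1e-12):
    # Sum per-channel circular ACFs after dividing each by its energy
    # (the ACF at lag 0), so no channel dominates the summary.
    total = None
    for ch in channels:
        acf = np.fft.irfft(np.abs(np.fft.rfft(ch)) ** 2, n=len(ch))
        acf /= acf[0] + eps          # inverse-energy weighting
        total = acf if total is None else total + acf
    return total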

Another reason for using a filterbank is that, again by appropriate weighting of channels, you can discount spectral regions dominated by noise (or by another voice). Wu, Wang and Brown have recently made use of this property in the context of multiple-speaker F0 estimation, but I think the real pioneer was Dick Lyon in an ICASSP paper in 1983. He applied the idea to binaural localization and separation, and Mitch Weintraub (his student?) applied it a little later to speech separation on the basis of F0 cues.
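
For illustration only, a crude sketch of the idea (not Wu, Wang & Brown's
algorithm, nor Lyon's): given per-channel SNR estimates obtained somehow,
noise-dominated channels are simply excluded from the summary ACF.

import numpy as np

def reliability_weighted_acf(channels, snr_db, floor_db=0.0):
    # Weight each channel's circular ACF by a 0/1 reliability derived
    # from its SNR estimate (how snr_db is obtained is outside this
    # sketch); channels below the floor contribute nothing.
    total = None
    for ch, snr in zip(channels, snr_db):
        w = 1.0 if snr > floor_db else 0.0   # crude hard threshold
        acf = w * np.fft.irfft(np.abs(np.fft.rfft(ch)) ** 2, n=len(ch))
        total = acf if total is None else total + acf
    return total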

Best,
Alain