
Re: Robust method of fundamental frequency estimation.



Dick,

Below are my answers to your questions.

> Arturo, thanks for the pointer to your poster.  I
> had to go back to this message to decode the different algorithms you
> compared.  Now, I have some questions.
>
> You say "Although some of these algorithms were
> initially proposed using a time-domain approach, all of them can also be
> formulated using the spectrum of the signal, and that is the approach we
> took."  That may be true, but there are other good time-domain
> correlation-based pitch models that can NOT be expressed in terms of the
> spectrum.
> For example, the Meddis & Hewitt or Meddis & O'Mard models, or
> Slaney & Lyon models,
> derived from Licklider's duplex theory, which do the ACF after what the
> cochlea model does, which is a separation into filter channels and a
> half-wave rectification.

I do not agree. If you know the frequency response of the cochlea, you can
predict the spectrum of its output from the spectrum of its input. The
effects of half-wave rectification and compression are more difficult to
analyze, but not impossible; I remember reading a bit about this in Anssi
Klapuri's PhD thesis.
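
To make the linear part of that claim concrete, here is a minimal numpy
sketch (the filter shape is a toy low-pass, not a cochlear model, and all
names are mine):

    import numpy as np

    fs = 16000                            # sample rate (Hz)
    t = np.arange(2048) / fs
    x = np.sin(2 * np.pi * 200 * t)       # input signal
    X = np.fft.rfft(x)                    # input spectrum

    f = np.fft.rfftfreq(len(x), 1 / fs)
    H = 1.0 / (1.0 + (f / 1000.0) ** 2)   # toy magnitude response (low-pass)

    Y = H * X                             # predicted output spectrum
    y = np.fft.irfft(Y)                   # output waveform, if needed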

> Did you consider any such models?

I have used these models in the past, but I stopped using them. If I am
not mistaken, what Slaney & Lyon's model does is apply a summary
autocorrelation to the output of a gammatone filterbank (there are some
extra steps, but that is the main idea). Since this can be shown to be
equivalent to applying autocorrelation to the original signal (using the
Wiener-Khinchin theorem and the linearity of the Fourier transform), I no
longer use it.
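
As a quick numerical check of that equivalence (my sketch, not Slaney &
Lyon's code; a toy Gaussian filterbank stands in for the gammatones):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)
    X = np.fft.rfft(x)

    # toy filterbank: 8 Gaussian magnitude responses on normalized frequency
    f = np.fft.rfftfreq(len(x))
    centers = np.linspace(0.05, 0.45, 8)
    H = np.exp(-0.5 * ((f[None, :] - centers[:, None]) / 0.03) ** 2)

    # per channel: filter, then ACF via Wiener-Khinchin; sum over channels
    summary = sum(np.fft.irfft(np.abs(Hk * X) ** 2) for Hk in H)

    # the same quantity computed from the original signal's power spectrum,
    # weighted by the filterbank's combined power response
    direct = np.fft.irfft((H ** 2).sum(axis=0) * np.abs(X) ** 2)
    assert np.allclose(summary, direct)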

About the Meddis, Hewitt, and O'Mard models: applying half-wave
rectification to the output of the gammatone filterbank is a good idea
because it adds useful harmonics to the signals (see Klapuri's thesis).
Applying compression is also a good idea because it reduces the squaring
effects of autocorrelation. However, I did not include these models in my
study because they are level dependent. My experience with them is that
the firing-rate patterns vary a lot when the level of the signal changes,
and this produces changes in pitch with level. Since in applications (at
least the ones I work with) we do not know the level at which the signal
was recorded or the level at which it will be reproduced, I prefer not to
use level-dependent models. However, I recognize the utility of half-wave
rectification and compression, and I am currently working on a
model/algorithm that uses them to estimate pitch.
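
The harmonic-adding effect of half-wave rectification is easy to see
numerically. In this sketch (mine, with made-up frequencies), two
harmonics of an absent 200 Hz fundamental produce energy at 200 Hz after
rectification:

    import numpy as np

    fs = 8000
    t = np.arange(fs) / fs                # one second, so bin k = k Hz
    # two adjacent harmonics of 200 Hz; the fundamental itself is absent
    x = np.sin(2 * np.pi * 600 * t) + np.sin(2 * np.pi * 800 * t)

    r = np.maximum(x, 0.0)                # half-wave rectification
    R = np.abs(np.fft.rfft(r)) / len(r)

    for k in (200, 600, 800):
        print(k, "Hz:", round(R[k], 4))
    # 200 Hz is nonzero after rectification, although the input
    # contained no energy at the fundamental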

> Have others
> reported their results on that speech database?  I think that's really the
> competition if you have a new pitch model, especially if you want more
> generality beyond speech and music.
>
> Your poster says that the spectra were estimated
> using FFT, and the next sentence says using a gammatone filterbank.  Which
> is it?  Or both? Oh, I see, one says the algorithm and the other
> the model.  Why would you choose an algorithm that doesn't match the model?
> Why treat these as
> conceptually different things?  An algorithm is a computational model, is
> it not?
>

The reason I distinguished between the algorithm and the model is that
computational efficiency was a goal for the algorithm but not for the
model. When we created the algorithm, our goal was not to match the
perceived pitch of weird complex tones or noises, but to estimate the
pitch of more natural sounds like speech. It may seem that I am
contradicting myself, because what we used in the poster to show other
algorithms' pitfalls were complex tones and noises, but those tones and
noises were inspired by speech signals. For example, I was working with
simulated telephone speech when I discovered that the Harmonic Product
Spectrum (HPS) produced more errors for male voices than for female
voices. Analyzing the errors, I found that HPS does not work well when the
fundamental is missing, which is obvious from the definition of the
algorithm. Since the fundamental of male speech is most of the time below
the 300 Hz lower limit of the telephone band, it is clear that HPS is more
prone to fail for male than for female telephone speech. However, in the
poster we showed examples with complex tones instead of speech because
they are easier to describe and are just as good as speech for exposing
the problem. In any case, the poster also shows that our algorithm
performed well on Paul Bagshaw's speech database for pitch estimation,
which was our goal.
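
To illustrate the HPS failure mode, here is a hypothetical minimal HPS
(my own toy version, not the implementation behind the poster):

    import numpy as np

    fs = 8000
    t = np.arange(fs) / fs                    # one second, so bin k = k Hz
    f0 = 150                                  # male-like fundamental

    def hps_estimate(x, n_down=4):
        # product of downsampled magnitude spectra; argmax gives the f0 bin
        mag = np.abs(np.fft.rfft(x))
        prod = mag.copy()
        for k in range(2, n_down + 1):
            m = len(mag) // k
            prod[:m] *= mag[::k][:m]
        return np.argmax(prod[1:]) + 1        # skip the DC bin

    full = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 9))
    telephone = sum(np.sin(2 * np.pi * k * f0 * t) / k
                    for k in range(1, 9) if k * f0 >= 300)

    print("all harmonics:", hps_estimate(full), "Hz")            # 150
    print("fundamental removed:", hps_estimate(telephone), "Hz") # 300, an octave error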

Arturo

-- 
__________________________________________________

Arturo Camacho
PhD Candidate
Computer and Information Science and Engineering
University of Florida

E-mail: acamacho@xxxxxxxxxxxx
Web page: www.cise.ufl.edu/~acamacho
__________________________________________________