
FW: FW: mfcc filters gain


Here are some comments from my colleague, Dr. Skowronski, whose paper you
cited in your posting. I hope you find these useful.


-----Original Message-----
From: Mark Skowronski [mailto:markskow@cnel.ufl.edu]
Sent: Wednesday, November 03, 2004 6:31 PM
To: Rahul Shrivastav
Subject: Re: FW: mfcc filters gain


Feel free to post this reply to your listserv.

Regarding scaling of triangular filters in MFCC or HFCC, in short, it
doesn't matter.

In MFCC and HFCC, the FFT magnitude squared (power spectrum) is weighted by
a triangular filter, and the sum of those weighted terms is called the
filter output energy E(i) for i=1,...,N filters.  Now scale those output
energies by whatever scale factor you like (equal area triangles, equal
height, unity amplitude) and denote the scaled energies as E(i)*A(i) where
A(i) is the peak amplitude of the scaled triangular filter i.

In MFCC and HFCC, E(i)*A(i) is log transformed to log(E(i)) + log(A(i))
before the DCT.  Since the DCT is linear, log(E(i)), i=1,...,N transforms
to cE(j), j=1,...,M (M-point DCT) and log(A(i)) transforms to cA(j).

So log(E(i)) + log(A(i)) --> cE(j) + cA(j) via the DCT.
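
A minimal sketch of this linearity argument, using only the Python standard
library. The energies E, the amplitudes A, and the sizes N=5 and M=4 are
made-up illustration values, and dct2 is a plain unnormalized DCT-II written
out here for the purpose, not the routine of any particular MFCC or HFCC
implementation.

```python
import math

def dct2(x, m):
    """Return the first m coefficients of an unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * j * (i + 0.5) / n) for i in range(n))
        for j in range(m)
    ]

# Hypothetical filter output energies E(i) and peak amplitudes A(i), N = 5.
E = [2.0, 0.5, 1.5, 3.0, 0.8]
A = [1.0, 0.8, 0.6, 0.4, 0.2]
M = 4  # number of cepstral coefficients kept

# Cepstrum of the scaled energies E(i)*A(i).
c_scaled = dct2([math.log(e * a) for e, a in zip(E, A)], M)

# Cepstra of the two log terms taken separately.
cE = dct2([math.log(e) for e in E], M)
cA = dct2([math.log(a) for a in A], M)

# Largest deviation between c_scaled(j) and cE(j) + cA(j).
max_diff = max(abs(s - (p + q)) for s, p, q in zip(c_scaled, cE, cA))
print(max_diff)
```

The printed deviation should be at the level of floating-point rounding,
since the filter scaling only adds the fixed vector cA(j) to the cepstrum.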

For two different frames of speech under analysis, MFCC and HFCC will
produce two different cE(j) but the same term cA(j).

In computing an Lp distortion measure between the two different frames of
speech (Euclidean, p=2) as in the DTW, the cA(j) terms would cancel by
subtraction.  In probability models of the cepstral feature distributions
(hidden Markov model or Gaussian mixture model), the means of the pdfs of
all classes would translate by the same amount cA(j), j=1,...,M, in the
M-dimensional feature space, so the positions of the classes relative to
one another are unchanged.
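
The cancellation in the distortion measure can be checked with a short
sketch, again in standard-library Python. The two frames of filter energies
and the two amplitude sets (labelled unit_height and tapered) are
hypothetical numbers chosen for illustration, and mfcc_like is a stand-in
for the log-plus-DCT stage only, not a full MFCC front end.

```python
import math

def dct2(x, m):
    """Return the first m coefficients of an unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * j * (i + 0.5) / n) for i in range(n))
        for j in range(m)
    ]

def mfcc_like(energies, amplitudes, m):
    """Log of the scaled filter energies E(i)*A(i), followed by the DCT."""
    return dct2([math.log(e * a) for e, a in zip(energies, amplitudes)], m)

def euclidean(u, v):
    """L2 distance between two cepstral vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical filter output energies for two different speech frames.
frame1 = [2.0, 0.5, 1.5, 3.0, 0.8]
frame2 = [1.2, 0.9, 0.4, 2.5, 1.7]

# Two fixed scalings of the same filterbank (illustrative values).
unit_height = [1.0, 1.0, 1.0, 1.0, 1.0]
tapered = [1.0, 0.8, 0.6, 0.4, 0.2]

M = 4
d_unit = euclidean(mfcc_like(frame1, unit_height, M),
                   mfcc_like(frame2, unit_height, M))
d_tapered = euclidean(mfcc_like(frame1, tapered, M),
                      mfcc_like(frame2, tapered, M))
print(d_unit, d_tapered)
```

Both distances should agree up to rounding, because each scaling adds the
same cA(j) to both frames and the difference removes it.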

If you have A(i) changing in time (adapting to the input), that's another


> -----Original Message-----
> From: AUDITORY Research in Auditory Perception
> [mailto:AUDITORY@LISTS.MCGILL.CA] On Behalf Of Guillaume Lemaitre
> Sent: Wednesday, November 03, 2004 11:33 AM
> Subject: mfcc filters gain
> Dear list,
> In Malcolm Slaney's Matlab implementation of mel frequency cepstral
> coefficients, triangular filters are normalized "so that each filter has
> unit weight". Reading some papers dealing with mfcc, I noticed that most
> authors do not mention this normalization step (a few of them do, but
> without explanation).
> I am wondering what this normalization corresponds to. If I am
> correct, and if triangular filters are supposed to approximate critical
> band filtering, they should all have the same unit height, just as in a
> third-octave or Patterson's gammatone filterbank. Am I wrong?
> I am also wondering if some work has already been done to improve
> mfcc-like processing. As suggested in [1], Moore's ERB scale or the
> Bark scale seems more appropriate than the mel scale, and a
> gammatone filterbank should be much more accurate (even if probably more
> computationally expensive) than a triangular filterbank?
> Regards
> Guillaume
> [1] M. D. Skowronski and J. G. Harris
> "Improving the filterbank of a classic speech feature extraction
> IEEE Int. Symp. on Circuits and Systems, Bangkok, Thailand, 2003
> -------------------------------------------------------------------
> Guillaume Lemaitre, Ph.D.
> Post-doctoral fellow
> Project-team REVES (REndering and Virtual Environments with Sounds)
> INRIA Sophia-Antipolis                 tel: (+33) (0)4 92 38 50 83
> 2004 route des Lucioles               fax: (+33) (0)4 92 38 50 30
> BP 93, F-06902 Sophia-Antipolis, France
> Guillaume.Lemaitre@sophia.inria.fr,
> ------------------------------------------