Re: FW: FW: mfcc filters gain
Thank you for that posting. It seems related to a recent discussion I had
with a colleague regarding the need to eliminate the natural spectral tilt
of human speech before taking the DCT. By Dr. Skowronski's reasoning, it
appears clear that spectral tilt compensation is not required before taking
the DCT.
Best regards, Scott.
On Wed, 3 Nov 2004 19:30:33 -0500
Rahul Shrivastav <rahul@CSD.UFL.EDU> wrote:
> Here are some comments from my colleague, Dr. Skowronski, whose paper you
> cited in your posting. I hope you find these useful.
> -----Original Message-----
> From: Mark Skowronski [mailto:firstname.lastname@example.org]
> Sent: Wednesday, November 03, 2004 6:31 PM
> To: Rahul Shrivastav
> Subject: Re: FW: mfcc filters gain
> Feel free to post this reply to your listserv.
> Regarding scaling of triangular filters in MFCC or HFCC, in short, it
> doesn't matter.
> In MFCC and HFCC, the FFT magnitude squared (power spectrum) is scaled by
> a triangular filter, and the sum of those squared terms is called the
> filter output energy E(i) for i=1,...,N filters. Now scale those output
> energies by whatever scale factor you like (equal area triangles, equal
> height, unity amplitude) and denote the scaled energies as E(i)*A(i) where
> A(i) is the peak amplitude of the scaled triangular filter i.
> In MFCC and HFCC, E(i)*A(i) is log transformed to log(E(i)) + log(A(i))
> before the DCT. Since the DCT is linear, log(E(i)), i=1,...,N transforms
> to cE(j), j=1,...,M (M-point DCT) and log(A(i)) transforms to cA(j).
> So log(E(i)) + log(A(i)) --> cE(j) + cA(j) via the DCT.
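The linearity step above can be checked numerically. This is only a minimal sketch, assuming an unnormalised DCT-II and made-up values for E(i) and A(i); the message specifies neither, and the helper name dct2 is hypothetical.

```python
import math

def dct2(x, m):
    """First m coefficients of an unnormalised DCT-II of the sequence x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * j * (i + 0.5) / n) for i in range(n))
        for j in range(m)
    ]

# Illustrative values only (not from the original message), N = 6 filters.
E = [4.0, 2.5, 1.2, 0.8, 0.3, 0.1]    # filter output energies E(i)
A = [1.0, 0.8, 0.6, 0.45, 0.3, 0.2]   # peak amplitudes A(i) of the scaled filters
M = 4                                  # number of cepstral coefficients kept

# DCT of log(E(i)*A(i)), i.e. the cepstrum computed with scaled filters.
c_scaled = dct2([math.log(e * a) for e, a in zip(E, A)], M)

# DCT of each log term on its own.
cE = dct2([math.log(e) for e in E], M)
cA = dct2([math.log(a) for a in A], M)

# Linearity: c_scaled(j) should equal cE(j) + cA(j) up to rounding error.
max_err = max(abs(s - (p + q)) for s, p, q in zip(c_scaled, cE, cA))
print("largest deviation from cE + cA:", max_err)
```

Any other linear transform in place of dct2 would give the same result, since only linearity is used.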
> For two different frames of speech under analysis, MFCC and HFCC will
> produce two different cE(j) but the same term cA(j).
> In computing an Lp distortion measure between the two different frames of
> speech (Euclidean, p=2) as in the DTW, the cA(j) terms would cancel by
> subtraction. In probability models of the cepstral feature distributions
> (Hidden Markov model or Gaussian mixture model), the means of pdfs of all
> classes would translate by the same amount cA(j), j=1,...,M, in the
> M-dimensional feature space.
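The cancellation described above can be illustrated with a small sketch. It assumes an unnormalised DCT-II, two made-up frames of filter energies, and two hypothetical filter scalings (unity peak versus decreasing peak); none of these values come from the message.

```python
import math

def dct2(x, m):
    """First m coefficients of an unnormalised DCT-II of the sequence x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * j * (i + 0.5) / n) for i in range(n))
        for j in range(m)
    ]

def cepstrum(energies, amps, m):
    """DCT of log(E(i) * A(i)) for one frame."""
    return dct2([math.log(e * a) for e, a in zip(energies, amps)], m)

def euclidean(u, v):
    """L2 distance between two cepstral vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

M = 4
# Illustrative filter energies for two different speech frames (N = 6).
frame1 = [4.0, 2.5, 1.2, 0.8, 0.3, 0.1]
frame2 = [1.5, 3.0, 2.2, 0.4, 0.6, 0.2]

# Two hypothetical choices of peak amplitude A(i).
unity = [1.0] * 6
decreasing = [1.0, 0.8, 0.6, 0.45, 0.3, 0.2]

# Frame-to-frame distance under each scaling; the cA(j) term cancels.
d_unity = euclidean(cepstrum(frame1, unity, M), cepstrum(frame2, unity, M))
d_decreasing = euclidean(cepstrum(frame1, decreasing, M),
                         cepstrum(frame2, decreasing, M))
print("distance difference:", abs(d_unity - d_decreasing))

# Mean cepstrum over the two frames under each scaling.
def mean_cepstrum(amps):
    c1, c2 = cepstrum(frame1, amps, M), cepstrum(frame2, amps, M)
    return [(a + b) / 2 for a, b in zip(c1, c2)]

# The mean shifts by exactly cA(j), the DCT of log(A(i)).
shift = [d - u for d, u in zip(mean_cepstrum(decreasing), mean_cepstrum(unity))]
cA = dct2([math.log(a) for a in decreasing], M)
print("mean shift minus cA:", [s - c for s, c in zip(shift, cA)])
```

Both printed quantities should be zero up to floating-point rounding, provided the same A(i) is used for every frame.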
> If you have A(i) changing in time (adapting to the input), that's another