Re: mfcc filters gain
Just to add to the confusion: as well as changing the shape of the filters,
it's worth looking at their bandwidth and the non-linearity (traditionally a
log operation, as specified by the theory of homomorphic filtering).
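In the homomorphic view, the MFCC front end applies a compressive non-linearity to each filterbank output and then takes a discrete cosine transform. That makes the non-linearity a single step that can be swapped out. Below is a minimal sketch, assuming the filterbank energies for one frame are already computed. `dct_ii` and `cepstrum` are hypothetical helper names and do not come from any particular toolkit.

```python
import math
from typing import Callable, List, Sequence


def dct_ii(values: Sequence[float], num_coeffs: int) -> List[float]:
    """Unnormalised DCT-II of `values`, returning the first `num_coeffs` terms."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]


def cepstrum(
    energies: Sequence[float],
    compress: Callable[[float], float] = math.log,
    num_coeffs: int = 13,
) -> List[float]:
    """Compress each filter output, then take the DCT to get cepstral coefficients.

    With compress=math.log this is the traditional MFCC step; any other
    compressive function (e.g. a root) can be passed in instead.
    """
    return dct_ii([compress(e) for e in energies], num_coeffs)


# Hypothetical 16-channel filterbank output for one frame.
frame = [2.0, 3.5, 5.0, 4.0, 3.0, 2.5, 2.0, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
log_ceps = cepstrum(frame)                                    # traditional log
root_ceps = cepstrum(frame, compress=lambda e: e ** (1 / 3))  # cube-root alternative
```

The filter shape and bandwidth determine `energies`, and the non-linearity is the `compress` argument. The two knobs discussed here can therefore be varied independently.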
The problem with triangular filters is that their peaks are too sharp, so any
small change in frequency causes a large change in the MFCC values.
Almost any shape with a flatter top will work better, provided you get the
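The sketch below makes the sensitivity argument concrete. It compares a triangular filter with a trapezoidal (flat-topped) one for a pure tone moved slightly away from the filter centre. The trapezoid is only one example of a "flatter top", since the post does not say which shape was used. The centre frequency and widths are illustrative assumptions.

```python
import math


def triangular(f: float, centre: float, half_width: float) -> float:
    """Gain of a triangular filter: 1 at `centre`, falling linearly to 0 at centre +/- half_width."""
    return max(0.0, 1.0 - abs(f - centre) / half_width)


def flat_top(f: float, centre: float, half_width: float, flat_half_width: float) -> float:
    """Gain of a trapezoidal filter: 1 within +/- flat_half_width of `centre`,
    then a linear roll-off reaching 0 at centre +/- half_width."""
    d = abs(f - centre)
    if d <= flat_half_width:
        return 1.0
    if d >= half_width:
        return 0.0
    return (half_width - d) / (half_width - flat_half_width)


# Illustrative filter: centre 1000 Hz, 100 Hz either side, flat top +/- 50 Hz.
for tone_hz in (1000.0, 1010.0, 1020.0):
    tri = triangular(tone_hz, 1000.0, 100.0)
    flat = flat_top(tone_hz, 1000.0, 100.0, 50.0)
    # Log filter output, as used in conventional MFCCs.
    print(f"{tone_hz:6.0f} Hz  triangular log={math.log(tri):+.3f}  flat-top log={math.log(flat):+.3f}")
```

With these numbers, a 20 Hz shift lowers the triangular filter's gain from 1.0 to 0.8, which changes its log output by about 0.22. The flat-topped filter's output does not change until the tone leaves its flat region.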
The problem with logarithms is that they over-emphasise any very small
signals (which are most likely background noise). Traditionally, the solution
is to put a floor on the values being logged, but a "nicer" solution,
to my mind, is to use an Nth root operation instead. As N increases, the Nth
root gets closer and closer to a scaled and shifted log operation, while as
N decreases, the effects of low levels of noise become less and less. You
need to experiment with the value of N to suit the noise characteristics in
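The limiting behaviour follows from x**(1/N) = exp(ln(x)/N). For large N this is roughly 1 + ln(x)/N, so N*(x**(1/N) - 1) approaches ln(x). The sketch below compares this with a floored log. The floor value and the helper names are assumptions for illustration and are not values from the post.

```python
import math


def floored_log(x: float, floor: float = 1e-3) -> float:
    """Traditional compression: clamp small values to `floor` before taking the log."""
    return math.log(max(x, floor))


def nth_root(x: float, n: float) -> float:
    """Root compression x**(1/n) for non-negative x; defined at x = 0, unlike the log."""
    return x ** (1.0 / n)


def scaled_shifted_root(x: float, n: float) -> float:
    """n * (x**(1/n) - 1): a scaled, shifted N-th root that tends to ln(x) as n grows."""
    return n * (nth_root(x, n) - 1.0)


# Convergence towards the log for a moderate filter output.
for n in (2, 10, 100, 1000):
    print(n, scaled_shifted_root(2.0, n), math.log(2.0))

# A very small (noise-level) input: the log is large and negative unless
# floored, whereas a small-N root maps it close to zero.
tiny = 1e-6
print(math.log(tiny), floored_log(tiny), nth_root(tiny, 3))
```

The scale and shift are constants, so a recogniser trained on the root output only sees the difference in curvature. Varying N then moves smoothly between log-like behaviour and stronger suppression of low-level noise.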
By improving the shape and width of the filters, and optimising N in the Nth
root operation, you can get a 20 to 40% reduction in word
error rate, so it's worth looking into. These figures are based on my
experiments with telephone speech from the UK SpeechDat database.
My own work in this area is largely unpublished, but there was at least one
paper in the "Aurora" sessions of Eurospeech a few years ago which looked at
these issues and came to similar conclusions. Unfortunately, I can't find
any specific references at the moment either.
Dr S W Beet, Principal R & D Engineer,
Aculab plc, Lakeside, Bramley Road, Mount Farm,
Milton Keynes, Bucks., MK1 1PT, UK
Tel: (+44) 1908 273963 ; Fax: (+44) 1908 273801
----- Original Message -----
From: "Toth Laszlo" <tothl@INF.U-SZEGED.HU>
Sent: Wednesday, November 03, 2004 6:32 PM
Subject: Re: mfcc filters gain
You will find quite a few different scales in the literature, and sometimes
even several different formulas for the same scale. I have tried a couple
of them, and never found a significant difference in the recognition
results. In my sceptical opinion, there are much bigger inaccuracies in
current speech recognition technology, so these small differences don't
really matter. Anyway, probably the most interesting idea in this field
was when several authors tried to directly optimize the filters in order
to achieve the best possible recognition....
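As an illustration of "several different formulas for the same scale", here are two widely used mel definitions. The first is the HTK-style formula 2595*log10(1 + f/700), which is equivalent, to within rounding of the constant, to 1127*ln(1 + f/700). The second is the Slaney Auditory Toolbox variant, which is linear below 1 kHz and logarithmic above. This is a sketch, so check the constants against the toolkit you are trying to match.

```python
import math


def mel_htk(f_hz: float) -> float:
    """HTK-style mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_natural_log(f_hz: float) -> float:
    """Same curve written with a natural log; 1127 is 2595 / ln(10), rounded."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_slaney(f_hz: float) -> float:
    """Slaney (Auditory Toolbox) mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                  # Hz per mel in the linear region
    min_log_hz = 1000.0                 # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp     # = 15 mel
    log_step = math.log(6.4) / 27.0     # log-frequency step per mel above 1 kHz
    if f_hz < min_log_hz:
        return f_hz / f_sp
    return min_log_mel + math.log(f_hz / min_log_hz) / log_step


# The scales use different units, so compare them after normalising to 8 kHz.
for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
    htk = mel_htk(f) / mel_htk(8000.0)
    slaney = mel_slaney(f) / mel_slaney(8000.0)
    print(f"{f:6.0f} Hz  HTK={htk:.3f}  Slaney={slaney:.3f}")
```

Either function can be used to place filter centres. In the experience reported above, switching between such scales made little difference to recognition results.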