Re: mfcc filters gain

Just to add to the confusion: as well as changing the shape of the filters,
it's worth looking at their bandwidth and the non-linearity (traditionally a
log operation, as specified by the theory of homomorphic filtering).

The problem with triangular filters is that any small change in frequency
causes a large change in MFCC values, simply because the peak is too sharp.
Almost any shape with a flatter top will work better, provided you get the
width right.
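
To illustrate the point about peak sharpness, here is a minimal Python
sketch comparing a triangular weight with one flat-topped alternative. The
trapezoid, the function names, and the `flat_fraction` parameter are
assumptions chosen for illustration only; the message does not say which
flatter shape or which width works best.

```python
def triangular(f, centre, half_width):
    """Triangular weight: 1 at the centre, linear to 0 at +/- half_width."""
    return max(0.0, 1.0 - abs(f - centre) / half_width)


def trapezoidal(f, centre, half_width, flat_fraction=0.5):
    """Flat-topped weight (hypothetical shape, for illustration).

    The weight is 1 within flat_fraction * half_width of the centre and
    then falls linearly to 0 at +/- half_width.
    flat_fraction must be in [0, 1) to avoid a division by zero.
    """
    d = abs(f - centre)
    flat = flat_fraction * half_width
    if d <= flat:
        return 1.0
    return max(0.0, (half_width - d) / (half_width - flat))


if __name__ == "__main__":
    # A narrow spectral peak drifting away from the filter centre (Hz).
    for f in (1000.0, 1020.0, 1040.0):
        print(f, triangular(f, 1000.0, 100.0), trapezoidal(f, 1000.0, 100.0))
```

With these assumed values, a 20 Hz drift from the centre already lowers the
triangular weight, while the flat-topped weight is unchanged until the drift
exceeds the flat region.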

The problem with logarithms is that they over-emphasise any very small
signals (which are most likely background noise). Traditionally the solution
is to put a lower floor on the values being logged, but a "nicer" solution,
to my mind, is to use an Nth root operation instead. As N increases, the Nth
root gets closer and closer to a scaled and shifted log operation, while as
N decreases, the effects of low levels of noise become less and less. You
need to experiment with the value of N to suit the noise characteristics in
your data.
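
The relationship between the two non-linearities can be checked numerically.
The sketch below relies on the standard limit N * (x^(1/N) - 1) -> ln(x) as
N grows, which is the "scaled and shifted" form of the Nth root. The function
names, the floor value, and N = 3 are assumptions picked for the example,
not recommended settings.

```python
import math


def floored_log(energy, floor=1e-3):
    """Traditional approach: clamp to a floor (assumed value), then take ln."""
    return math.log(max(energy, floor))


def root_compress(energy, n=3.0):
    """Nth-root compression: energy ** (1 / n)."""
    return energy ** (1.0 / n)


def scaled_shifted_root(energy, n):
    """n * (energy ** (1 / n) - 1), which tends to ln(energy) for large n."""
    return n * (energy ** (1.0 / n) - 1.0)


if __name__ == "__main__":
    # Filter-bank energies spanning very quiet to loud.
    for e in (1e-6, 1e-3, 1.0, 1e3):
        print(e, math.log(e), scaled_shifted_root(e, 1e6),
              floored_log(e), root_compress(e))
```

Going from an energy of 1e-6 to 1e-3 moves the plain log by about 6.9, but
moves the cube root only from 0.01 to 0.1, so low-level noise has far less
influence on the compressed value.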

By improving the shape and width of the filters, and optimising N in the Nth
root operation, you can get a reduction in word error rate of somewhere
between 20% and 40%, so it's worth looking into. These figures are based on
my experiments with telephone speech from the UK SpeechDat database.

My own work in this area is largely unpublished, but there was at least one
paper in the "Aurora" sessions of Eurospeech a few years ago which looked at
these issues and came to similar conclusions. Unfortunately I too can't find
any specific references at the moment.


Steve Beet


Dr S W Beet, Principal R & D Engineer,
Aculab plc, Lakeside, Bramley Road, Mount Farm,
Milton Keynes, Bucks., MK1 1PT, UK
Tel: (+44) 1908 273963 ; Fax: (+44) 1908 273801


----- Original Message -----
From: "Toth Laszlo" <tothl@INF.U-SZEGED.HU>
Sent: Wednesday, November 03, 2004 6:32 PM
Subject: Re: mfcc filters gain

You will find quite a few different scales in the literature, and sometimes
even several different formulas for the same scale. I have tried a couple
of them, and never found a significant difference in the recognition
results. In my sceptical opinion, there are much bigger inaccuracies in
current speech recognition technology, so these little differences don't
really matter. Anyway, probably the most interesting idea in this field
was when several authors tried to directly optimize the filters in order
to achieve the best possible recognition....