
Re: More cepstrum flaws

I've been watching this thread with interest, but haven't contributed (yet)
because I can't provide citations for most of the work I've done in this
area - most of it has been unpublished, and some of it is "proprietary" to
my employer. However, I can say that I share the dislike of MFCCs, especially
in situations where the "back end" has the potential to provide enhancements
such as speaker and channel adaptation, background noise suppression, source
separation, etc. (all of which are much more difficult after the MFCC
analysis has clouded the respective issues).

As with most of the other contributors to this thread, I found that the
shape of the frequency scale didn't make a huge difference to an MFCC-based
recogniser. As far as I recall, a pure "log" scale worked slightly worse
than the mel scale, but there wasn't much in it. The ability to perform
(approximate) vocal tract length normalisation via a true log scale might
push the balance the other way, but I didn't investigate it. Perhaps Roy's
paper sheds some light here. A linear scale was worst of all.
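For concreteness, the three frequency scales compared above could be sketched
like this (a minimal illustration; the mel formula is the common
2595*log10(1 + f/700) variant, and the `warp` function name and the 1 Hz
floor on the log scale are my own illustrative choices, not anything from the
experiments described):

```python
import numpy as np

def warp(f_hz, scale="mel"):
    """Map frequency in Hz onto a perceptual scale.

    'mel'    : the common 2595*log10(1 + f/700) formula;
    'log'    : a pure logarithmic scale (floored at 1 Hz to avoid log(0));
    'linear' : frequencies left unchanged.
    """
    f = np.asarray(f_hz, dtype=float)
    if scale == "mel":
        return 2595.0 * np.log10(1.0 + f / 700.0)
    if scale == "log":
        return np.log(np.maximum(f, 1.0))
    if scale == "linear":
        return f
    raise ValueError(f"unknown scale: {scale}")
```

One appeal of the true log scale mentioned above: scaling all frequencies by
a constant factor (roughly what a change in vocal tract length does) becomes
a simple additive shift, since log(a*f) = log(a) + log(f), which is why it
lends itself to approximate vocal tract length normalisation.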

More important were the shapes and widths of the MFCC frequency windows,
along with the amplitude scales (both the choice between absolute amplitude
and power before the windowing, and the choice of log with a hard floor,
cube (or other) root, or some other compression after). By making a number
of "optimisations"
of this sort, and improving the orthogonality between the static and the
dynamic coefficient vectors, large-vocabulary connected-word error rates
were dramatically reduced (c. 60% reduction if I remember correctly).
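To make the "shapes and widths" knob concrete, here is a hedged sketch of a
conventional triangular mel filterbank with an added `width` parameter that
stretches each triangle's skirts (the function, its parameters, and the
stretching scheme are hypothetical illustrations of the kind of tuning
described, not the actual optimisations reported above):

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, width=1.0):
    """Triangular filters spaced evenly on the mel scale.

    `width` > 1 broadens each triangle (more overlap between adjacent
    filters); `width` < 1 narrows it. Illustrative only.
    """
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge/centre frequencies, evenly spaced in mel
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        lo = mid - width * (mid - lo)   # stretch the rising skirt
        hi = mid + width * (hi - mid)   # stretch the falling skirt
        up = (fft_freqs - lo) / (mid - lo)
        down = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb
```

Applying the filterbank is then a single matrix product with the power
spectrum, after which the chosen amplitude compression and the DCT follow.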

It's of note here, Arturo, that the cube (or other) root and the "hard
floor" are both ways of working around the problem of spectral zeroes.
Neither is perfect, nor mathematically justifiable, but the "Nth root"
approach does have the added appeal that it can be thought of as a very
approximate replication of the human peripheral auditory system.
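The two work-arounds for spectral zeroes can be put side by side in a few
lines (the floor value and the default root here are illustrative defaults
of mine, not tuned or recommended settings):

```python
import numpy as np

def compress(power_spec, method="root", floor=1e-10, root=3.0):
    """Two work-arounds for (near-)zero spectral values before the DCT.

    'log'  : log with a hard floor -- zeros are clamped to `floor`,
             so the output is finite but the floor is arbitrary;
    'root' : Nth-root compression (cube root by default), which is
             finite at zero without any floor, and loosely mimics the
             compressive behaviour of the peripheral auditory system.
    """
    p = np.asarray(power_spec, dtype=float)
    if method == "log":
        return np.log(np.maximum(p, floor))
    if method == "root":
        return p ** (1.0 / root)
    raise ValueError(f"unknown method: {method}")
```

Neither choice is mathematically principled, as noted above; the point of
the sketch is just that the root stays well-behaved at zero, whereas the log
needs the (arbitrary) hard floor to do so.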

In all my experiments though, I relied on a fairly large number of Gaussian
mixtures to handle speaker variability - because I felt (as Roy has already
said) that MFCCs are not well-suited to vocal tract normalisation. At the
time it seemed that the best way to handle inter-speaker variability would
be to selectively modify the weights of specific mixtures of each state to
match the expected characteristics of the speaker.

One advantage of such a scheme is that it can allow for even non-linear
variations (such as the effects of vocal tract wall resonance at low
frequencies, and the onset of higher propagation modes within the vocal
tract at higher frequencies). Another advantage is that it allows us to
stick with a front end which is already pretty well understood (MFCCs).
Having said all that, I never got as far as implementing any such scheme, so
these ideas are (as far as I know) completely unproven! Other approaches
have been published widely.
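Since the post stresses that the mixture-weight scheme was never implemented,
the following is purely a sketch of the idea as I read it: boost the weights
of the mixture components that match the expected speaker characteristics,
then renormalise. The function, its `speaker_affinity` scores, and the
`strength` interpolation are all hypothetical.

```python
import numpy as np

def adapt_weights(weights, speaker_affinity, strength=0.5):
    """Re-weight one state's Gaussian mixture toward a speaker profile.

    `weights`          : the state's original mixture weights;
    `speaker_affinity` : non-negative scores for how well each component
                         matches the expected speaker characteristics;
    `strength`         : 0 leaves the weights unchanged, 1 applies the
                         affinity scores at full strength.
    Illustrative only -- the scheme described above was never implemented.
    """
    w = np.asarray(weights, dtype=float)
    a = np.asarray(speaker_affinity, dtype=float)
    adapted = w * (a ** strength)     # boost matching components
    return adapted / adapted.sum()    # renormalise to sum to 1
```

Because only the weights move, the component means and covariances (and
hence any non-linear speaker effects they have captured, such as the
low-frequency wall-resonance and high-frequency propagation-mode effects
mentioned above) are left intact.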



Dr Steve Beet, Principal R&D Engineer, Aculab plc,
Lakeside, Bramley Road, Milton Keynes, Bucks., MK1 1PT, UK

Tel:    (+44/0) 1908 273963
Fax:    (+44/0) 1908 273801
WWW:    http://www.aculab.com/