Re: More cepstrum flaws (Steve Beet)


Subject: Re: More cepstrum flaws
From:    Steve Beet  <steve.beet@xxxxxxxx>
Date:    Mon, 12 Jan 2009 19:33:03 -0000
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

I've been watching this thread with interest, but haven't contributed (yet) because I can't provide citations for most of the work I've done in this area: most of it is unpublished, and some of it is "proprietary" to my employer. However, I can say that I share the dislike of MFCCs, especially in situations where the "back end" has the potential to provide enhancements such as speaker and channel adaptation, background noise suppression, source separation, etc. (all of which are much more difficult after the MFCC analysis has clouded the respective issues).

Like most of the other contributors to this thread, I found that the shape of the frequency scale didn't make a huge difference to an MFCC-based recogniser. As far as I recall, a pure "log" scale worked slightly worse than the mel scale, but there wasn't much in it. The ability to perform (approximate) vocal tract length normalisation via a true log scale might push the balance the other way, but I didn't investigate it. Perhaps Roy's paper sheds some light here? A linear scale was worst of all.

More important were the shapes and widths of the MFCC frequency windows, along with the amplitude scales (both the choice of absolute amplitude or power before the windowing, and the choice of log with a hard floor, cube (or other) root, or whatever, after). A number of "optimisations" of this sort, together with improved orthogonality between the static and the dynamic coefficient vectors, dramatically reduced large-vocabulary connected-word error rates (c. 60% reduction, if I remember correctly).

It's worth noting here, Arturo, that the cube (or other) root and the "hard floor" are both ways of working around the problem of spectral zeroes. Neither is perfect or mathematically justifiable, but the "Nth root" approach does have the added appeal that it can be thought of as a very approximate replication of the human peripheral auditory system.
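
As a rough illustration of the two workarounds for spectral zeroes, here is a minimal Python sketch. It is not the code used in the experiments described above: the function names, the floor value of 1e-6, the root order of 3 and the channel energies are all assumptions chosen only to show the behaviour at and near zero.

```python
import math

def log_with_floor(energies, floor=1e-6):
    """Log compression with a hard floor (floor value is an assumption).

    Any energy below `floor` is clamped to `floor` before the log is
    taken, so an exact zero no longer maps to minus infinity.
    """
    return [math.log(max(e, floor)) for e in energies]

def nth_root(energies, n=3.0):
    """Nth-root compression (n=3 gives the cube root).

    The root is finite at zero, so no explicit floor is needed.
    """
    return [e ** (1.0 / n) for e in energies]

# Hypothetical filterbank channel energies, including an exact zero.
energies = [0.0, 1e-9, 1e-3, 1.0, 8.0]

log_out = log_with_floor(energies)
root_out = nth_root(energies)

for e, lg, rt in zip(energies, log_out, root_out):
    print(f"energy={e:g}  log_floor={lg:.3f}  cube_root={rt:.3f}")
```

With the floor, every energy below the floor collapses to the same output value, so detail in very quiet channels is discarded. The root keeps those values distinct and passes smoothly through zero, at the cost of compressing large values less strongly than a log does.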
In all my experiments, though, I relied on a fairly large number of Gaussian mixtures to handle speaker variability, because I felt (as Roy has already said) that MFCCs are not well-suited to vocal tract normalisation. At the time it seemed that the best way to handle inter-speaker variability would be to selectively modify the weights of specific mixtures of each state to match the expected characteristics of the speaker. One advantage of such a scheme is that it can allow for even non-linear variations (such as the effects of vocal tract wall resonance at low frequencies, and the onset of higher propagation modes within the vocal tract at higher frequencies). Another advantage is that it allows us to stick with a front end which is already pretty well understood (MFCCs).

Having said all that, I never got as far as implementing any such scheme, so these ideas are (as far as I know) completely unproven! Other approaches have been published widely.

Steve

___________________________________________________________
Dr Steve Beet, Principal R&D Engineer,
Aculab plc, Lakeside, Bramley Road,
Milton Keynes, Bucks., MK1 1PT, UK
Tel: (+44/0) 1908 273963
Fax: (+44/0) 1908 273801
WWW: http://www.aculab.com/
___________________________________________________________


This message came from the mail archive
http://www.auditory.org/postings/2009/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University