MFCC flaws

Hello all,

MFCCs have what seems to me to be a fundamental flaw, quite separate from anything to do with the Mel scale. The problem was pointed out to me by Alain de Cheveigne. The Cepstral Coefficients (CCs) are COSINE coefficients: each one weights a cosine basis function that is fixed in position along the frequency axis. This means that they cannot shift with speaker size to capture the shift in formant frequencies that occurs as children grow up and their vocal tracts get longer. When the formants move, the values of the coefficients change instead. The formant shift is a big effect which has been observed repeatedly. See Lee et al. (1999) or Vorperian et al. (2007) for examples.
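To make the point concrete, here is a minimal sketch in Python (NumPy only). It is a toy illustration and not the analysis from our paper: the single Gaussian "formant", the 32 channels assumed to be evenly spaced in log frequency, the 4-channel shift and the names `dct2` and `formant_profile` are all assumptions chosen for the example. A change in vocal tract length is modelled as a pure translation of the spectral envelope along the channel axis.

```python
import numpy as np

def dct2(x):
    """Unnormalised DCT-II of a 1-D array, written out with NumPy only."""
    n = len(x)
    k = np.arange(n)[:, None]   # coefficient index
    m = np.arange(n)[None, :]   # channel index
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ x

def formant_profile(centre, n_channels=32, width=2.0):
    """Toy spectral envelope: one Gaussian bump on a log-frequency channel axis."""
    channels = np.arange(n_channels)
    return np.exp(-0.5 * ((channels - centre) / width) ** 2)

# Hypothetical speakers: same envelope shape, moved up by 4 channels.
adult = formant_profile(centre=12.0)
child = formant_profile(centre=16.0)

# Keep the first 13 cosine coefficients, as is common for MFCC front ends.
cc_adult = dct2(adult)[:13]
cc_child = dct2(child)[:13]

# The two envelopes are translated copies of one another ...
shift_error = np.max(np.abs(adult[:-4] - child[4:]))
# ... but the coefficient vectors differ in value; they do not shift.
cc_distance = np.linalg.norm(cc_adult - cc_child)

print("max difference after undoing the shift:", shift_error)
print("distance between coefficient vectors:  ", cc_distance)
```

In this toy case the envelopes match once the shift is undone, while the two coefficient vectors are clearly different. A recognizer that sees only the coefficients has no simple way to tell that the two vectors describe the same envelope shape.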

We recently presented a paper explaining why recognizers trained on the data of a man cannot possibly be expected to recognize the speech of a woman, let alone a child. The CCs for a given phoneme are different for men, women and children. This is one of the reasons that training sets have to be so large and training has to take so long.

The paper was presented at Acoustics08 in Paris. It also describes an alternative set of auditory features that are largely scale invariant. The reference is:

Monaghan, J. J. M., Feldbauer, C., Walters, T. C. and Patterson, R. D. (2008) “Low-dimensional, auditory feature vectors that improve vocal-tract-length normalization in automatic speech recognition,” Acoustics08, Paris, paper H000688.

I can provide a pdf of the paper on request, and I would be interested in your comments on the ideas in the paper.

Regards, Roy P

* ** *** * ** *** * ** *** * ** *** * ** *** *
Roy D. Patterson
Centre for the Neural Basis of Hearing
Department of Physiology, Development and Neuroscience
University of Cambridge
Downing Street, Cambridge, CB2 3EG

phone: +44 (1223) 333819 office
fax:   +44 (1223) 333840 department
email: rdp1@xxxxxxxxx