ATR Human Information Processing Res. Labs., 2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, 619-02 Japan
Speaker-independent speech recognition experiments using an auditory model front end with a spectro-temporal masking model demonstrated improved recognition performance, outperforming both auditory front ends without the masking model and traditional LPC-based front ends. The auditory front end, composed of an adaptive Q cochlear filter bank incorporating spectro-temporal masking, was proposed previously [J. Acoust. Soc. Am. 92, 2476 (A) (1992)]. The spectro-temporal masking model enhances common phonetic features by eliminating the speaker-dependent spectral tilt that reflects individual source variation, and it also enhances the spectral dynamics that convey phonological information in speech signals. These advantages yield an effective new spectral parameter for speech modeling in speaker-independent recognition. Speaker-independent word and phoneme recognition experiments were carried out on Japanese word and phrase databases. The masked spectrum was calculated by subtracting the masking level from the logarithmic power spectra extracted by a 64-channel adaptive Q cochlear filter bank; the masking levels were calculated as a weighted sum of the smoothed preceding spectra. To cover the variability of the spectral time sequences, multi-template DTW and hidden Markov models were used as back-end recognition mechanisms.
a) Also at ATR Auditory and Visual Perception Res. Labs.
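The masking computation described above — subtracting a masking level, formed as a weighted sum of smoothed preceding spectra, from the current log power spectrum — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel-smoothing width, the masking weights, and the function name `masked_spectrum` are assumptions introduced here for clarity.

```python
import numpy as np

def masked_spectrum(log_spec, weights, smooth_width=3):
    """Apply spectro-temporal masking to a log power spectrogram.

    log_spec:     (frames, channels) log power spectra, e.g. channels=64
                  for a 64-channel cochlear filter bank.
    weights:      weight of each preceding frame (weights[0] applies to
                  frame t-1, weights[1] to t-2, ...). Illustrative values;
                  the actual weights are not given in the abstract.
    smooth_width: width of a simple moving-average smoother across
                  channels (an assumed smoothing scheme).
    """
    frames, channels = log_spec.shape

    # Smooth each frame across channels with a moving average (assumption).
    kernel = np.ones(smooth_width) / smooth_width
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 1, log_spec)

    masked = np.empty_like(log_spec)
    for t in range(frames):
        # Masking level: weighted sum of the smoothed preceding spectra.
        level = np.zeros(channels)
        for k, w in enumerate(weights, start=1):
            if t - k >= 0:
                level += w * smoothed[t - k]
        # Masked spectrum: current log spectrum minus the masking level.
        masked[t] = log_spec[t] - level
    return masked
```

Because the masking level is built only from preceding frames, a speaker-dependent spectral tilt common to successive frames is largely cancelled, while frame-to-frame spectral change (the phonologically relevant dynamics) is preserved.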