NTT Basic Res. Labs., 3-9-11, Midori-cho, Musashino, Tokyo, 180 Japan
The performance of several auditory front ends was evaluated in a phoneme recognition task using a VQ-HMM or an LVQ2 back end, and in a word recognition task using a DTW back end. The auditory front ends used in the experiments were different combinations of a fixed-Q cochlear filter (FQF), an adaptive-Q circuit (AQC) [T. Hirahara et al., Proc. ICASSP, 496–499 (1989)], an inner-hair-cell model (IHC), and lateral inhibition (LINH). Traditional DFT-based and LPC-based front ends were examined for comparison. In the VQ-HMM/LVQ2 phoneme recognition task, the AQC and the LINH improved performance in both noise-free and noisy conditions. However, the best performance provided by an auditory front end equaled that of the DFT front end. Unexpectedly, the IHC degraded performance in most cases. Hence, it was concluded that the auditory front ends provide little benefit in this task. In the DTW word recognition task, the AQC improved robustness to noise and to speaker variation. The LINH greatly improved robustness to noise but degraded robustness to speaker variation. The best performance for noisy speech was obtained with the combination of FQF, AQC, and LINH, which clearly outperformed the traditional front ends. Similarly, the best performance for multiple speakers was obtained with the FQF in conjunction with the AQC, a combination that outperformed all other front ends. Hence, it was concluded that the auditory front end pays off in this task. These results indicate that the choice of back end and the difficulty of the task strongly affect how a front end is evaluated.