ASA 124th Meeting New Orleans 1992 October

5pSP4. Robust speech recognition in a multimedia teleconferencing environment.

Chi Wei Che

Mazin Rahim

James Flanagan

CAIP Center, Rutgers Univ., Piscataway, NJ 08855-1390

In many speech recognition systems, a mismatch between training and testing conditions (e.g., differences in noise, reverberation, and microphone type and characteristics) typically results in an unacceptable degradation in recognition accuracy. For example, an experiment conducted in our multimedia laboratory demonstrated that the word accuracy of the Sphinx recognition system degrades from 96% to 71% when the close-talking Sennheiser microphone (CLS) at 5 in., used in training, is replaced by a hands-free wideband line array (ARR) at 10 ft. This paper describes a neural network architecture for improving the robustness of speech recognizers in a multimedia teleconferencing environment. A multilayer perceptron (MLP) is trained to map cepstral parameters from the ARR to the CLS. An experiment conducted on three male speakers using the ARR shows that an MLP with a hidden layer of eight nodes improves the recognition accuracy of the Sphinx system from 71% to 90%. Furthermore, cross-speaker validation (i.e., training on one speaker and testing on others) provided an 84% word accuracy. This encouraging result implies that the neural network "learns" the room reverberation characteristics and performs the environment adaptation largely irrespective of the speaker or the spoken text.
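The mapping idea can be illustrated with a minimal sketch. It is not the authors' implementation. Only the hidden layer of eight nodes and the ARR-to-CLS direction of the mapping come from the abstract. The number of cepstral coefficients per frame (12), the tanh hidden units, the linear output layer, the squared-error loss, plain stochastic gradient descent, and the synthetic frame pairs that stand in for simultaneously recorded ARR and CLS cepstra are all assumptions made for illustration.

```python
import math
import random

N_CEP = 12     # cepstral coefficients per frame (assumed; not stated in the abstract)
N_HIDDEN = 8   # hidden-layer size reported in the abstract
N_LATENT = 4   # toy-data parameter only (assumed)

def init_mlp(rng):
    """Small random weights for a one-hidden-layer network."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W1": mat(N_HIDDEN, N_CEP), "b1": [0.0] * N_HIDDEN,
            "W2": mat(N_CEP, N_HIDDEN), "b2": [0.0] * N_CEP}

def forward(net, x):
    """Return (hidden activations, output frame) for one input frame."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y

def train_step(net, x, target, lr):
    """One stochastic gradient step on the squared error (assumed loss)."""
    h, y = forward(net, x)
    err = [yi - ti for yi, ti in zip(y, target)]
    # Back-propagate to the hidden layer before W2 is modified.
    dh = [(1.0 - hj * hj) * sum(err[k] * net["W2"][k][j] for k in range(N_CEP))
          for j, hj in enumerate(h)]
    for k, ek in enumerate(err):
        for j, hj in enumerate(h):
            net["W2"][k][j] -= lr * ek * hj
        net["b2"][k] -= lr * ek
    for j, dj in enumerate(dh):
        for i, xi in enumerate(x):
            net["W1"][j][i] -= lr * dj * xi
        net["b1"][j] -= lr * dj

def mse(net, pairs):
    """Mean squared error between mapped ARR frames and CLS targets."""
    total = sum((yi - ti) ** 2
                for x, t in pairs
                for yi, ti in zip(forward(net, x)[1], t))
    return total / (len(pairs) * N_CEP)

# Fixed mixing matrix for the toy data (purely illustrative).
MIX = [[0.5 * math.cos((i + 1) * (j + 1)) for j in range(N_LATENT)]
       for i in range(N_CEP)]

def make_synthetic_pairs(rng, n_frames):
    """Toy (ARR, CLS) frame pairs: the 'ARR' frame is a scaled, offset,
    noisy copy of the 'CLS' frame. Real data would be paired recordings."""
    pairs = []
    for _ in range(n_frames):
        z = [rng.gauss(0.0, 1.0) for _ in range(N_LATENT)]
        cls = [sum(m * zj for m, zj in zip(row, z)) for row in MIX]
        arr = [0.6 * c + 0.3 + rng.gauss(0.0, 0.05) for c in cls]
        pairs.append((arr, cls))
    return pairs

rng = random.Random(0)
train_pairs = make_synthetic_pairs(rng, 200)
test_pairs = make_synthetic_pairs(rng, 50)

net = init_mlp(rng)
mse_before = mse(net, test_pairs)
for _ in range(30):                      # epoch count is an arbitrary choice
    for x, t in train_pairs:
        train_step(net, x, t, lr=0.01)
mse_after = mse(net, test_pairs)
print(f"held-out MSE before: {mse_before:.3f}, after: {mse_after:.3f}")
```

In a setup like the one described, the trained network would be applied frame by frame to ARR cepstra, and the mapped frames would be passed to a recognizer trained on CLS speech. How the frames were paired and fed to Sphinx is not specified in the abstract.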