Spoken Language Systems Group, Lab. for Comput. Sci., MIT, 545 Technology Sq., Cambridge, MA 02139
This study examines the effects of microphone variations on SUMMIT, the MIT segment-based speech recognition system. Specifically, phonetic classification and recognition performance are evaluated on utterances extracted from the TIMIT corpus. The TIMIT corpus offers phonetically transcribed and time-aligned data for three different microphones: a Sennheiser close-talking, noise-canceling microphone, a Brüel & Kjær (B&K) far-field pressure microphone, and a telephone handset (plus channel distortion). These transducers introduce different convolutional, additive, and bandwidth effects into the speech waveform. Experimental procedures are established to measure and analyze system performance under varying training and testing conditions. Classification uses Gaussian models on a feature vector consisting of Mel-frequency cepstral coefficients and their time derivatives, plus duration. The experiments show that performance in phonetic classification and recognition degrades from the Sennheiser (27% classification error) to the B&K (29%) and the telephone (43%). Performance degrades further when training and testing conditions are mismatched. Closer examination reveals that these degradations are concentrated in specific phonetic classes. For example, confusions between voiced and unvoiced fricative pairs account for a large percentage of the additional errors when training on the Sennheiser and testing on the B&K.
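The matched versus mismatched train/test protocol described above can be illustrated with a minimal sketch. It is not the SUMMIT implementation: it assumes one diagonal-covariance Gaussian per class (the abstract does not specify the covariance structure), uses synthetic two-dimensional features in place of cepstral vectors, and models a microphone change as a fixed feature offset, since a convolutional channel effect becomes an additive shift in the cepstral domain. All names (`GaussianClassifier`, `make_condition`, the `clean` and `shifted` conditions) are hypothetical.

```python
import numpy as np


class GaussianClassifier:
    """One diagonal-covariance Gaussian per class (an assumption of this sketch)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small floor keeps variances strictly positive.
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes_])
        self.log_priors_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        # Per-class log-likelihood, shape (n_samples, n_classes).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.vars_)[None, :, :] + diff ** 2 / self.vars_[None, :, :]
        ).sum(axis=2)
        return self.classes_[np.argmax(log_lik + self.log_priors_, axis=1)]


def error_rate(clf, X, y):
    """Fraction of samples assigned to the wrong class."""
    return float(np.mean(clf.predict(X) != y))


def make_condition(rng, n_per_class, channel_offset, noise_scale):
    """Synthetic stand-in for one microphone: class means plus a fixed offset and noise."""
    class_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
    X = np.vstack([
        m + channel_offset + noise_scale * rng.standard_normal((n_per_class, 2))
        for m in class_means
    ])
    y = np.repeat(np.arange(len(class_means)), n_per_class)
    return X, y


rng = np.random.default_rng(0)
conditions = {
    "clean": dict(channel_offset=np.array([0.0, 0.0]), noise_scale=0.5),
    "shifted": dict(channel_offset=np.array([2.0, 2.0]), noise_scale=0.5),
}
train = {name: make_condition(rng, 200, **p) for name, p in conditions.items()}
test = {name: make_condition(rng, 200, **p) for name, p in conditions.items()}

# Error rate for every (training condition, testing condition) pair.
errors = {
    (tr, te): error_rate(GaussianClassifier().fit(*train[tr]), *test[te])
    for tr in conditions
    for te in conditions
}

for (tr, te), err in errors.items():
    print(f"train={tr:8s} test={te:8s} error={err:.3f}")
```

On this toy data the matched pairs give low error and the mismatched pairs give much higher error, mirroring the qualitative trend reported in the abstract; the numbers themselves have no relation to the TIMIT results.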