ASA 129th Meeting - Washington, DC - 1995 May 30 .. Jun 06

1aSC39. Vocal tract normalization in speech recognition: Compensating for systematic speaker variability.

Jordan Cohen

CCR-IDA, Princeton, NJ

Terri Kamm

Andreas G. Andreou

Johns Hopkins Univ., Baltimore, MD 21218

The performance of speech recognition systems is often improved by accounting explicitly for sources of variability in the data. In the SWITCHBOARD corpus, studied during the 1994 CAIP workshop [Frontiers in Speech Processing Workshop II, CAIP (August 1994)], an attempt was made to compensate for the systematic variability due to different vocal tract lengths of various speakers. The method found a maximum probability parameter for each speaker which mapped an acoustic model to the mean of the models taken from a homogeneous speaker population. The underlying acoustic model was that of a straight tube, and the parameter estimation was accomplished by warping the spectrum of each speaker linearly over a 20% range (actually accomplished by digitally resampling the data), and finding the maximum a posteriori probability of the data given the warp. The technique produces statistically significant improvements in accuracy on a speech transcription task using each of four different speech recognition systems. The best parametrizations were later found to correlate well with vocal tract estimates computed manually from spectrograms.