B. H. Juang
AT&T Bell Labs., 600 Mountain Ave., Murray Hill, NJ 07974
The acoustic front end, which converts the signal waveform into a parsimonious representation, is a critical component of a speech recognizer. Traditionally, this conversion has relied on the short-time spectral analysis framework, in which the power spectra of speech segments, each on the order of 10 ms, are estimated successively and independently via all-pole modeling or a bank of filters. This paper elaborates on several key considerations in front-end design, ranging from the computational structure to the associated dissimilarity measures, in the context of prevalent short-time spectral representations. The strengths and weaknesses of various speech analysis models are discussed, with an attempt to point out possible ways to improve existing methods. It is further suggested how auditory modeling may help enhance the robustness of a speech recognizer by providing perceptually reliable measurements of speech.
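
The short-time spectral analysis framework described above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, non-overlapping 10 ms frames, a Hamming window, and the autocorrelation method for the all-pole fit. The function names (`short_time_power_spectra`, `all_pole_power_spectrum`) and the default parameters are hypothetical choices made for this sketch, not taken from the paper.

```python
import numpy as np


def short_time_power_spectra(signal, sample_rate, frame_ms=10.0, n_fft=256):
    """Estimate one power spectrum per frame, each frame treated independently.

    Assumption: consecutive, non-overlapping frames of frame_ms milliseconds.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    window = np.hamming(frame_len)
    # Zero-padded FFT of each windowed frame; squared magnitude is the
    # (unnormalized) short-time power spectrum.
    spectra = np.fft.rfft(frames * window, n=n_fft, axis=1)
    return np.abs(spectra) ** 2


def all_pole_power_spectrum(frame, order=10, n_fft=256):
    """All-pole (linear prediction) spectrum of one frame.

    Assumption: autocorrelation method, solved directly as a Toeplitz
    linear system rather than with the Levinson-Durbin recursion.
    """
    x = frame * np.hamming(len(frame))
    # Autocorrelation lags r[0..order] of the windowed frame.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz matrix R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1 : order + 1])  # predictor coefficients
    err = r[0] - np.dot(a, r[1 : order + 1])       # prediction error energy
    # Inverse filter A(z) = 1 - sum_k a_k z^{-k}, evaluated on the unit circle.
    inverse_filter = np.fft.rfft(np.concatenate(([1.0], -a)), n=n_fft)
    return err / np.abs(inverse_filter) ** 2
```

For a signal sampled at 8 kHz, each 10 ms frame holds 80 samples, and `short_time_power_spectra(signal, 8000)` returns one row of 129 spectral values per frame. The all-pole estimate gives a smoothed envelope of the same frame's spectrum, with the model order controlling how much detail is retained.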