Spoken Language Systems Group, Lab. for Comput. Sci., MIT, 545 Technology Sq., Cambridge, MA 02139
Currently, most speech recognition architectures model the speech signal as a nonoverlapping sequence of phonetic segments. A set of phonetic models is created that attempts to capture the acoustic-phonetic properties of individual phones but does not explicitly model the transitions between phones. It is readily apparent, however, that these transitions contain important information about the identity of neighboring phones. While context-dependent phonetic modeling may capture some of this information, it is likely that more explicit models of phonetic transitions could offer performance improvements. In this talk, the use of phonetic transition models will be discussed within the context of SUMMIT, a segment-based continuous speech recognition system [Zue et al., ``Acoustic Segmentation and Phonetic Classification in the SUMMIT Speech Recognition System,'' Proc. ICASSP 89, pp. 389--392, Glasgow, Scotland (1989)]. The transition models use a feature vector based on Mel-frequency spectral coefficients (MFSCs). The vector is created by concatenating multiple spectral averages on both sides of a transition. For example, in one configuration eight averages spanning a time interval of 150 ms were used. To reduce the number of dimensions, a principal component analysis was performed. A set of diagonal Gaussian models is used to model the transitions. The models were tested by applying them to the N-best sentence hypotheses from the recognition system. Each of the N-best hypotheses is rescored using a linear combination (optimized on training data) of the segment and transition scores. Initial experiments have resulted in 10%--20% reductions in word error rates.
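The pipeline described above can be sketched in a few steps: concatenate the spectral averages flanking each transition, reduce dimensionality with principal component analysis, score transitions with diagonal-covariance Gaussians, and rescore the N-best list with a linear combination of segment and transition scores. The sketch below is illustrative only; the dimensions (40 MFSCs, 8 averages, 32 retained components), the combination weight, and all function and variable names are assumptions, not details from the abstract.

```python
import numpy as np

# Hypothetical dimensions: 40 MFSCs per average, 8 averages per
# transition (4 per side), 32 dimensions kept after PCA.
N_MFSC, N_AVG, N_DIMS = 40, 8, 32

def transition_vector(averages):
    """Concatenate the spectral averages flanking a transition."""
    return np.concatenate(averages)  # shape: (N_MFSC * N_AVG,)

def fit_pca(X, n_dims):
    """Fit a PCA projection: returns the data mean and the top
    principal components (rows of Vt from an SVD of centered data)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_dims]

def project(x, mean, components):
    """Project one feature vector onto the retained components."""
    return components @ (x - mean)

class DiagonalGaussian:
    """Diagonal-covariance Gaussian log-density transition model."""
    def __init__(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6  # variance floor for stability
    def log_prob(self, x):
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (x - self.mu) ** 2 / self.var)

def rescore(hypotheses, weight):
    """Pick the best N-best hypothesis under a linear combination of
    its segment score and summed transition scores; in practice the
    weight would be optimized on training data."""
    return max(hypotheses,
               key=lambda h: h["segment_score"]
                             + weight * sum(h["transition_scores"]))
```

A usage pattern would be to fit the PCA and one Gaussian per transition class on training vectors, then at test time project each hypothesized transition, sum its log-probabilities into `transition_scores`, and call `rescore` on the N-best list.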