Spoken Language Systems Group, Lab. for Comput. Sci., MIT, 545 Technology Sq., Cambridge, MA 02139
This talk presents phonetic models that capture both the dynamic characteristics and the statistical dependencies of acoustic attributes in a segment-based framework. The approach is based on the creation of a track, T_α, for each phonetic unit α. The track serves as a model of the dynamic trajectories of the acoustic attributes over the segment. The statistical framework for scoring incorporates the auto- and cross-correlation properties of the track error over time, within a segment. On a vowel classification task [W. Goldenthal and J. Glass, ``Modeling Spectral Dynamics for Vowel Classification,'' Proc. Eurospeech 93, pp. 289--292, Berlin, Germany (1993)], this methodology achieved classification performance of 68.9%. This result compares favorably with other studies using the TIMIT corpus. This talk extends that work by presenting context-independent and context-dependent experiments for all phones. Context-independent classification performance of 76.8% is demonstrated. The key to the context-dependent classifier is the merging of tracks trained separately on left and right contexts to synthesize any desired context during classification; this makes it possible to synthesize a track for triphone contexts not seen in the training set. Using a total of 4167 gender-dependent biphone tracks, 58 phonetic statistical models, and no phone grammar, a context-dependent classification performance of 80.5% was achieved. This result increases to 85.8% when a trigram phone grammar is added.
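The track-based approach above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the fixed track length, the linear time-normalization, the full-covariance Gaussian over the stacked error (one simple way to capture auto- and cross-correlations of the error across time), and the linear cross-fade used to merge left- and right-context tracks are all assumptions made for the sketch.

```python
import numpy as np

TRACK_LEN = 10  # frames per time-normalized track (assumed value)

def time_normalize(segment, track_len=TRACK_LEN):
    """Map a variable-length segment (frames x attributes) onto a
    fixed number of frames by linear interpolation."""
    n_frames, n_attrs = segment.shape
    src = np.linspace(0.0, 1.0, n_frames)
    dst = np.linspace(0.0, 1.0, track_len)
    return np.stack(
        [np.interp(dst, src, segment[:, j]) for j in range(n_attrs)], axis=1
    )

def train_track(segments):
    """Track T_alpha: mean time-normalized trajectory over training tokens."""
    return np.stack([time_normalize(s) for s in segments]).mean(axis=0)

def train_error_model(segments, track):
    """Full-covariance Gaussian over the stacked track error, so the
    covariance holds the error's auto- and cross-correlations over time."""
    errors = np.stack([(time_normalize(s) - track).ravel() for s in segments])
    mean = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False) + 1e-6 * np.eye(errors.shape[1])
    return mean, cov

def log_score(segment, track, mean, cov):
    """Log-likelihood of a segment's track error under the Gaussian model;
    classification picks the phone alpha with the highest score."""
    e = (time_normalize(segment) - track).ravel() - mean
    _, logdet = np.linalg.slogdet(cov)
    d = e.size
    return -0.5 * (d * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(cov, e))

def merge_tracks(left_track, right_track):
    """Synthesize a track for an unseen triphone context from separately
    trained left- and right-context biphone tracks (here a simple linear
    cross-fade; the actual merging scheme is not specified in the abstract)."""
    w = np.linspace(1.0, 0.0, left_track.shape[0])[:, None]
    return w * left_track + (1.0 - w) * right_track
```

Because the merge operates on the tracks themselves, any triphone context can be synthesized at classification time from the inventory of biphone tracks, without requiring that triphone to appear in the training set.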