Department of Elec. and Comput. Eng., Univ. Waterloo, ON N2L 3G1, Canada
Recent efforts will be reported on the design and implementation of a speech recognizer motivated by a modern form of phonology that argues for the use of multi-dimensional articulatory structures [Browman and Goldstein, Phonetica 49 (1992)]. The earlier work [Deng and Sun, J. Acoust. Soc. Am. (to be published)] has recently been extended so that the primitive speech units, constructed via constrained overlap among five-dimensional articulatory features/gestures (lips, tongue blade, tongue dorsum, velum, and larynx), have changed from their previous ``static'' representation to the current ``dynamic'' one. The statistical model underlying the speech recognizer is a version of the nonstationary-state HMM [Deng, Signal Process. 27 (1992)], in which the bulk of the Markov states are characterized by time-varying Gaussian-mean functions (implemented as mixtures of polynomial functions of sojourn time within the Markov state). These nonstationary Markov states are constructed only for the feature constellations arising from assimilation of one or more primary articulatory features (i.e., lips, tongue blade, tongue dorsum). The physical reality is that manipulations of the articulatory structure indexed by any primary articulatory feature(s) during speech production necessarily generate acoustic signals that are transitional in nature. In addition to the above physical motivations, algorithmic issues relating to implementation of the speech recognizer will also be addressed. Evaluation of the recognizer on the TIMIT phonetic classification task, which covers all classes of English sounds, is currently in progress.
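As an illustration of the nonstationary-state idea, the following is a minimal sketch of a Markov-state output distribution whose Gaussian mean is a polynomial function of the sojourn time within the state. All function names and the choice of a single polynomial (rather than a mixture of polynomials, as described above) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def state_mean(coeffs, tau):
    """Time-varying Gaussian mean for a nonstationary Markov state.

    coeffs: array of shape (P + 1, feature_dim) holding polynomial
            coefficients c_0..c_P for each feature dimension.
    tau:    sojourn time (frames) since entering the state.
    Returns mu(tau) = sum_p c_p * tau**p, shape (feature_dim,).
    """
    powers = np.array([float(tau) ** p for p in range(len(coeffs))])
    return coeffs.T @ powers

def log_likelihood(x, coeffs, var, tau):
    """Diagonal-covariance Gaussian log-likelihood of frame x at sojourn tau."""
    mu = state_mean(coeffs, tau)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

# Example: a 2-D feature vector with a linear (P = 1) trend in the mean.
coeffs = np.array([[1.0, 0.5],     # c_0 for each dimension
                   [0.2, -0.1]])   # c_1 for each dimension
var = np.ones(2)
x = np.array([1.4, 0.3])
ll_enter = log_likelihood(x, coeffs, var, tau=0)  # mean at state entry
ll_later = log_likelihood(x, coeffs, var, tau=2)  # mean two frames in
```

In such a model, the same observation can be more or less likely depending on how long the state has been occupied, which is how the transitional (rather than stationary) nature of the acoustics is captured.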