Dept. Elec. Eng., Univ. Waterloo, Waterloo, ON N2L 3G1, Canada
Spoken Language Systems Group, Lab. Comput. Sci., MIT, Cambridge, MA 02139
The conventional design of speech recognizers tends to treat the ``front end'' (signal representation of speech) and the ``back end'' (lexical representation plus signal modeling and pattern matching) as separate modules. In the integrated framework being developed, signal representation is regarded as a process of ``constrained optimization,'' with the objective function determined by the phonological (featural) structure of speech and by a statistical representation of its acoustic/auditory correlates. The constraints are based on significant properties known to be employed by the human auditory system. A preliminary version of a speech recognizer incorporating these ideas will be presented, in which bundles of overlapping articulatory features are used as the basis for describing the phonological structure of speech. Several multi-valued features are assigned uniquely to each quasi-phonemic unit with ``minimal'' redundancy and ``maximal'' separability. Major contextual variations in speech are modeled as a natural result of overlapping the ``intrinsic'' values of one or more of these features across adjacent phonemic units. Knowledge from speech production and speech perception is utilized to limit the allowable feature overlaps. Linkage of this feature-based lexical representation to speech recognition is achieved by establishing a one-to-one mapping from a feature-overlap pattern to a directed graph, where each node represents a unique composition of the features and is characterized by a time-series model. Given the explicit feature specification for each node in the overall speech space, a general set of acoustic/auditory measurements can be assigned to the individual nodes based on an understanding of the acoustic/auditory correlates of the features.
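The feature-overlap-to-graph mapping described above can be illustrated with a minimal sketch. All feature names, values, and the two-unit example below are hypothetical placeholders, not the paper's actual feature inventory; the sketch only shows the idea that spreading one ``intrinsic'' feature value across a unit boundary generates intermediate graph nodes, each a distinct feature composition that would carry its own time-series model.

```python
def overlap_graph(unit_a, unit_b, spread):
    """Map a feature-overlap pattern for two adjacent units to a directed graph.

    unit_a, unit_b: feature bundles (dicts of feature name -> value).
    spread: features of unit_b permitted to overlap leftward into unit_a
            (anticipatory coarticulation), applied one at a time.
    Returns (nodes, edges): nodes are feature compositions, edges are
    left-to-right arcs indexing into the node list.
    """
    nodes = [dict(unit_a)]
    for f in spread:
        mixed = dict(nodes[-1])
        mixed[f] = unit_b[f]          # node with the overlapped feature value
        nodes.append(mixed)
    nodes.append(dict(unit_b))
    edges = [(i, i + 1) for i in range(len(nodes) - 1)]
    return nodes, edges


# Illustrative only: lip rounding from the second unit spreads leftward,
# yielding an intermediate node distinct from either pure unit.
nodes, edges = overlap_graph(
    {"place": "alveolar", "round": "-"},
    {"place": "dorsal", "round": "+"},
    spread=["round"],
)
# nodes[1] == {"place": "alveolar", "round": "+"}  (the mixed composition)
```

In a full recognizer, each node would be scored by its own statistical time-series model, with the allowable `spread` lists constrained by speech-production and speech-perception knowledge as the abstract describes.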