L. D. Braida
R. M. Uchanski
L. A. Delhorne
Res. Lab. of Electron., MIT, Cambridge, MA 02139
Current progress in the development of automatic speech recognition (ASR) systems may soon permit discrete symbolic speechreading supplements to be derived from the speech signal. Such supplements could be similar to those used in manual cued speech, in which the talker uses discrete hand positions and shapes to distinguish consonants and vowels that are often confused in speechreading. Highly trained receivers of manual cued speech can achieve nearly perfect reception of everyday connected speech materials at normal speaking rates through the visual sense alone. To understand the accuracy that might be achieved with automatically generated cues, we measured how well trained spectrogram readers and an automatic speech recognizer could assign cues for various cue systems. A model of audiovisual integration was then applied to these measurements and to published data on human recognition of consonant and vowel segments via speechreading. This analysis suggests that, with cues derived from current recognizers, consonant and vowel segments can be received with accuracies in excess of 80%, roughly equivalent to the segment reception accuracy required to account for observed levels of manual cued speech reception. To provide guidance for the development of automatic cueing systems, we describe techniques for determining optimum cue groups for a given recognizer and speechreader, and estimate the cueing performance that might be achieved if the performance of current recognizers were improved.
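As a rough illustration of the kind of integration calculation described above (this is a sketch under stated assumptions, not the authors' model, and all confusion probabilities below are invented): if the speechreading channel and the cue channel are treated as conditionally independent given the spoken segment, and the receiver is modeled as a maximum-likelihood decoder over the joint percept, the expected integrated segment accuracy can be computed from the two channels' confusion matrices.

```python
# Hypothetical sketch: combined accuracy of a visual (speechreading) channel
# and a discrete cue channel, assuming conditional independence of the two
# channels and maximum-likelihood decoding. All numbers are invented.
import numpy as np

def integrated_accuracy(P_visual, P_cue, priors=None):
    """P_visual[i, v] = P(visual percept v | segment i); P_cue[i, c] likewise.
    Returns the expected accuracy of an ML decoder over joint percepts (v, c)."""
    n = P_visual.shape[0]
    if priors is None:
        priors = np.full(n, 1.0 / n)  # segments assumed equally likely
    acc = 0.0
    for v in range(P_visual.shape[1]):
        for c in range(P_cue.shape[1]):
            joint = priors * P_visual[:, v] * P_cue[:, c]  # P(segment i, v, c)
            acc += joint.max()  # ML decoder credits the most likely segment
    return acc

# Toy 3-segment example: speechreading confuses segments 0 and 1 but
# separates segment 2; the cue channel separates segment 0 from {1, 2}.
P_visual = np.array([[0.45, 0.45, 0.10],
                     [0.45, 0.45, 0.10],
                     [0.10, 0.10, 0.80]])
P_cue = np.array([[0.9, 0.1],
                  [0.1, 0.9],
                  [0.1, 0.9]])
print(round(integrated_accuracy(P_visual, P_cue), 3))  # → 0.81
```

In this toy case the combined accuracy (0.81) exceeds either channel alone (about 0.57 visually, 0.60 from cues), which is the qualitative effect that well-chosen cue groups exploit: the cues need only resolve distinctions that speechreading leaves ambiguous.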