4pSCa3. Visible speech revisited: An acoustically driven model of lip and tongue motion.

Session: Thursday Afternoon, June 19

Author: Jay T. Moody
Location: Dept. of Cognit. Sci., Univ. of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0515, jmoody@cogsci.ucsd.edu
Author: Maureen Stone
Location: Univ. of Maryland Med. School, Baltimore, MD 21201


A method is presented for converting acoustic speech data into a ``speech readable'' movie of a canonical talking face using a neural network. Target values for the network are created by finding the principal components of a set of video frames, each frame consisting of side-by-side images of the face and tongue of a single speaker. Tongue contours are extracted from mid-sagittal ultrasound images. Input to the network consists of a set of cepstral parameters and their derivatives calculated over 22-ms windows (overlapping by 11 ms). Two such input frames are matched to each 33-ms video frame. Recurrent (backward) connections in the network encourage it to learn not only the acoustic-articulatory pairings but also articulatory trajectories (expectations of the next articulatory state based on recent articulatory states). It is hypothesized that this supplemental trajectory information helps alleviate the uncertainties inherent in the vocal-tract inverse mapping. After training, the network is presented with new, untrained audio-only tokens of utterances from the training corpus, and the network's video output is recorded and compared to the actual recorded video. [Work supported by US Dept. of Education and NIH.]
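The target-generation and frame-alignment steps described above can be sketched in a few lines. This is a hedged illustration only, not the authors' implementation: the image dimensions, sample rate, and number of retained components (`k`) are all assumptions, and PCA is computed here via a plain SVD on mean-centered, flattened frames.

```python
import numpy as np

# --- PCA targets from side-by-side face/tongue video frames ---
# All sizes below are hypothetical stand-ins for the recorded data.
rng = np.random.default_rng(0)
n_frames, h, w = 200, 32, 64            # assumed frame count and image size
video = rng.random((n_frames, h, w))    # stand-in for the composite frames

X = video.reshape(n_frames, -1)         # flatten each frame to a vector
mean = X.mean(axis=0)
Xc = X - mean                           # center before PCA

# PCA via SVD; the first k coefficients per frame serve as network targets
k = 8                                   # assumed number of components kept
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
targets = Xc @ Vt[:k].T                 # shape: (n_frames, k)

# A frame is approximately recoverable from its k coefficients,
# so the network's k-dimensional output can be rendered back to video
recon = (targets @ Vt[:k] + mean).reshape(n_frames, h, w)

# --- Acoustic framing: 22-ms windows with an 11-ms hop ---
# Two acoustic frames (2 x 11 ms hop) span one ~33-ms video frame.
sr = 16000                              # assumed sample rate
win = int(0.022 * sr)                   # 352 samples per analysis window
hop = int(0.011 * sr)                   # 176-sample hop
acoustic_frames_per_video_frame = 2
```

The key design point the abstract relies on is that the PCA coefficients give a low-dimensional, invertible encoding of the video, so the network can be trained on a small target vector yet still drive a full reconstructed movie.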

ASA 133rd meeting - Penn State, June 1997