ASA 127th Meeting M.I.T. 1994 June 6-10

5pSP9. Beyond visemes: Using disemes in synthetic speech with facial animation.

Caroline Henton

Linguistic Res. Ctr., Univ. of California, Santa Cruz, CA 95064

Realistic on-screen computer ``assistants'' need synthetic visible speech with accurate, pleasing articulation; they also need to run in real time on a personal computer. Previous visible speech systems have used 9--32 ``visemes,'' minimal contrastive units of visible articulation. Viseme-based animation has been choppy, inaccurate, and insufficiently plastic. In contrast, concatenative speech synthesis draws on a large inventory of diphone units. Two improvements are therefore needed: expansion beyond simple visemes, and reduction of the number of diphones so that they can be mapped to facial images in real time. To this end, visible and articulatory archiphones and diphone ``aliases'' were formalized using standard phonological distinctive features, and a system of disemes was created. A diseme begins during one viseme (phone) in one archiphonic family and ends somewhere during the following viseme (phone) in another archiphonic family. In this way, many of the transitions occurring in the approximately 1800 diphones of General American English can be depicted visually by the same diseme, owing to their similarity in lip, teeth, and tongue positions. The effectiveness of mapping these disemes is demonstrated using a variety of on-screen agents and faces.
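
To make the reduction concrete, the following is a minimal sketch, not the authors' implementation, of how diphones might be collapsed into disemes: each phone is assigned to an archiphonic family on the basis of its visible articulation, and every diphone whose start and end phones fall in the same ordered pair of families is aliased to a single diseme. The family labels and phone memberships below are illustrative assumptions, not the distinctive-feature analysis described in the abstract.

    # Illustrative sketch: collapsing diphones into disemes via archiphonic families.
    # Family labels and phone memberships here are hypothetical toy examples.
    from itertools import product

    # Hypothetical archiphonic families grouped by similar lip/teeth/tongue appearance.
    ARCHIPHONE_FAMILY = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "t": "alveolar", "d": "alveolar", "n": "alveolar", "s": "alveolar", "z": "alveolar",
        "k": "velar", "g": "velar",
        "i": "spread_vowel", "e": "spread_vowel",
        "u": "rounded_vowel", "o": "rounded_vowel",
        "a": "open_vowel",
    }

    def diseme_for(diphone):
        """Map a (start_phone, end_phone) diphone to its diseme alias:
        the ordered pair of archiphonic families it crosses."""
        start, end = diphone
        return (ARCHIPHONE_FAMILY[start], ARCHIPHONE_FAMILY[end])

    # All possible diphones over this toy phone set.
    phones = list(ARCHIPHONE_FAMILY)
    diphones = list(product(phones, phones))

    # Group diphones by the diseme that visually represents them.
    disemes = {}
    for dp in diphones:
        disemes.setdefault(diseme_for(dp), []).append(dp)

    print(f"{len(diphones)} diphones collapse to {len(disemes)} disemes")
    # e.g. ('p','i'), ('b','e'), ('m','i') all share the diseme ('bilabial', 'spread_vowel'),
    # because their lip, teeth, and tongue positions look alike on screen.

Under these toy groupings, a few hundred diphones already collapse to a few dozen disemes; the system described above performs the analogous reduction over the roughly 1800 diphones of General American English so that the resulting images can be driven in real time.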