Speech Systems Res. Dept., Bellcore, 445 South St., Morristown, NJ 07960
A phonemic segment has different acoustic-phonetic realizations depending on many contextual factors, e.g., nearby phonemes and position in the syllable, word, and phrase. Appropriate acoustic variation in the phonemes is necessary for intelligible, natural-sounding synthetic speech. Two basic approaches to achieving these variations are: (1) articulatory synthesis, which models the human vocal apparatus, attempting to account automatically for the desired acoustic variability, and (2) acoustic synthesis, which bypasses the articulatory level and operates on acoustic patterns directly, either by controlling phoneme-based formant targets and transitions or by concatenating prerecorded units. In a phoneme-based system, whether articulatory or acoustic, the necessary acoustic variations for each phoneme are produced entirely by rules whose goal is to capture linguistic and articulatory regularities. However, knowledge of these rules is incomplete. The rationale for concatenative systems is that by recording and storing multiple variants of each phoneme, or units longer than phonemes, the units themselves incorporate some of the acoustic variation. Basic units for concatenation range in size and phonetic nature from phonemes or allophones, through dyads or diphones, polyphones, and demisyllables, to units covering polysyllables or words. This talk discusses interactions among the modeling approach, the basic unit, and the rules for combining units.