Michael W. Macon
Mark A. Clements
Dept. of Elec. and Comput. Eng., Georgia Inst. of Tech., Atlanta, GA 30332
A general framework for waveform synthesis in a concatenation-based text-to-speech (TTS) system is presented. Natural speech is segmented into subword units and analyzed using an iterative analysis-by-synthesis procedure originally presented in [E. B. George and M. J. T. Smith, J. Audio Eng. Soc. 40, 497--516 (1992)]. Synthesis is then performed by an efficient overlap-add resynthesis and modification method. This method eliminates the need for precise, hand-corrected pitch pulse marking in analysis (as required in some other popular concatenation methods) by incorporating a pitch pulse onset time estimation function based on [R. J. McAulay and T. F. Quatieri, Proc. ICASSP, 1713--1715 (1986)]. The sinusoidal model is capable of natural-sounding prosodic modification of both continuous speech and concatenated segments, making it an ideal candidate for application in a TTS system. Furthermore, the model provides a computationally tractable and conceptually simple decoupling of various properties of the speech signal, making it an excellent platform for other transformations of the synthesized speech.
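As a rough illustration of the overlap-add sinusoidal synthesis idea mentioned in the abstract, the following minimal Python sketch builds each frame as a sum of harmonically related sinusoids and combines frames with a triangular window. It is not the analysis-by-synthesis/overlap-add model of George and Smith, nor the onset time estimator of McAulay and Quatieri. The function names (`synth_frame`, `harmonic_frames`, `overlap_add`), the 1/k harmonic amplitudes, the triangular window, and the `pitch_scale` parameter are all assumptions made for illustration only.

```python
import math


def synth_frame(amps, freqs_hz, phases, length, fs):
    """Return one frame as a sum of sinusoids.

    Each phase is referenced to the frame centre (sample length // 2).
    """
    centre = length // 2
    return [
        sum(a * math.cos(2 * math.pi * f * (n - centre) / fs + p)
            for a, f, p in zip(amps, freqs_hz, phases))
        for n in range(length)
    ]


def harmonic_frames(f0_hz, n_harmonics, n_frames, hop, fs, pitch_scale=1.0):
    """Build frames of a steady harmonic tone (hypothetical test signal).

    pitch_scale multiplies every harmonic frequency; phases are set from
    absolute time so that neighbouring frames stay coherent.
    """
    frames = []
    for i in range(n_frames):
        centre_time = (i + 1) * hop / fs  # frame i is centred at (i + 1) * hop
        freqs = [pitch_scale * f0_hz * k for k in range(1, n_harmonics + 1)]
        amps = [1.0 / k for k in range(1, n_harmonics + 1)]  # assumed 1/k roll-off
        phases = [2 * math.pi * f * centre_time for f in freqs]
        frames.append(synth_frame(amps, freqs, phases, 2 * hop, fs))
    return frames


def overlap_add(frames, hop):
    """Overlap-add frames of length 2 * hop with a triangular window.

    Adjacent triangular windows sum to one, so a signal whose frames are
    phase-coherent is reproduced exactly away from the two edges.
    """
    length = 2 * hop
    window = [1.0 - abs(n - hop) / hop for n in range(length)]
    out = [0.0] * (hop * (len(frames) + 1))
    for i, frame in enumerate(frames):
        for n in range(length):
            out[i * hop + n] += window[n] * frame[n]
    return out
```

For example, `overlap_add(harmonic_frames(100.0, 3, 5, 80, 8000, pitch_scale=2.0), 80)` yields the same tone one octave higher with its duration unchanged. In this toy setting, that is the sense in which a sinusoidal representation decouples pitch from timing.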