R. J. McAulay
T. F. Quatieri
Lincoln Lab., MIT, 200 Wood St., Lexington, MA 02173-9108
It has been shown that high-quality speech can be synthesized using a sinusoidal model when the amplitudes, frequencies, and phases are derived from a high-resolution analysis of the short-time Fourier transform (STFT). It has also been shown that high quality is retained when the measured sine-wave frequencies are replaced by a harmonic set of frequencies whose fundamental frequency is chosen to make the harmonic model a ``best fit'' to the measured sine-wave data, provided the amplitudes and phases are obtained by sampling the STFT at the harmonic frequencies. A model has also been developed for the sine-wave phases. It has a linear component corresponding to the onset time of the glottal pulse, a minimum-phase component due to the dispersive characteristics of the vocal tract, and a random component that represents the degree to which the speech segment is unvoiced. Pitch and voicing are coded with conventional methods, while the sine-wave amplitudes are coded using high-order all-pole models. Scalar quantization of the line spectral frequencies offers good performance at rates of 4800--8000 bps, while a multiband vector quantizer gives quite good performance at 2400 bps.
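
As a rough illustration of the harmonic model and the three-part phase model described above, the following Python sketch synthesizes one frame as a sum of harmonic sine waves. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `harmonic_phases` and `synthesize_frame` are hypothetical, the vocal-tract (minimum-phase) term is simply passed in as a list rather than derived from an all-pole model, and the random term is assumed to be uniform on [-pi, pi) and scaled by one minus a voicing value in [0, 1].

```python
import math
import random


def harmonic_phases(num_harmonics, f0_hz, onset_time_s, system_phases, voicing, rng):
    """Return one phase per harmonic as the sum of three terms.

    linear term : -2*pi*k*f0*onset_time, set by the glottal pulse onset time
    system term : vocal-tract phase at harmonic k (supplied by the caller here)
    random term : assumption: uniform noise scaled by (1 - voicing)
    """
    phases = []
    for k in range(1, num_harmonics + 1):
        linear = -2.0 * math.pi * k * f0_hz * onset_time_s
        noise = (1.0 - voicing) * rng.uniform(-math.pi, math.pi)
        phases.append(linear + system_phases[k - 1] + noise)
    return phases


def synthesize_frame(amplitudes, f0_hz, phases, sample_rate_hz, num_samples):
    """Sum harmonic cosines at k*f0 with the given amplitudes and phases."""
    frame = []
    for n in range(num_samples):
        t = n / sample_rate_hz
        sample = 0.0
        for k, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
            sample += amp * math.cos(2.0 * math.pi * k * f0_hz * t + phase)
        frame.append(sample)
    return frame


if __name__ == "__main__":
    amps = [1.0, 0.5, 0.25]            # illustrative harmonic amplitudes
    onset = 10 / 8000.0                # onset at sample 10 for an 8 kHz rate
    rng = random.Random(0)             # seeded for repeatability
    voiced = harmonic_phases(3, 100.0, onset, [0.0] * 3, 1.0, rng)
    frame = synthesize_frame(amps, 100.0, voiced, 8000.0, 80)
    print(max(frame), frame.index(max(frame)))
```

In the fully voiced case with a zero vocal-tract term, all harmonics align at the onset time, so the frame reaches the sum of the amplitudes at that sample. Lowering `voicing` toward 0 spreads the phases and removes that pulse-like alignment, which is the qualitative behavior the random component is meant to capture.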