Thomas J. Watson Research Center, P. O. Box 704, Yorktown Heights, NY 10598
A speech synthesizer for use in analysis-by-synthesis speech recognition is described. Coarticulation is modeled by applying linear FIR filters to target gestures in a pseudo-articulatory domain. This domain is treated as a hidden, unobservable layer between the phonetic input to the synthesizer and the acoustic spectrum output. That output is obtained from the hidden-layer signal by means of a memoryless nonlinear transformation that is implemented by a neural net with elliptic basis functions. The entire model, including phonetic targets, FIR filter shapes, and neural-net parameters, is trained by a pre-conditioned conjugate gradient method, using the mean-squared error between synthetic and actual spectra as the objective function. The gradient is calculated by a back-propagation algorithm. It is found that after training, the FIR filter shapes typically resemble noncausal two-pole lowpass characteristics, with one pole in the right and the other in the left half-plane, representing right-to-left (anticipatory) and left-to-right coarticulation, respectively. Performance of the synthesizer on isolated-word speech, continuous read speech, and spontaneous speech will be described, and results from recognition experiments on speaker-dependent and speaker-independent tasks will be reported.
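The forward pass described above (phonetic targets, FIR smoothing in the hidden pseudo-articulatory domain, then a memoryless elliptic-basis-function mapping to spectra, scored by mean-squared error) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, array shapes, Gaussian basis functions with diagonal (axis-aligned) widths, and the truncated two-sided exponential used for the filter taps are all hypothetical choices made for the example. The two-sided exponential is used only because its continuous-time counterpart has one pole in each half-plane, matching the filter shape the abstract reports after training. Training by pre-conditioned conjugate gradient and the back-propagated gradient are not shown.

```python
import numpy as np

def two_sided_exponential(half_len, tau_left, tau_right):
    """Hypothetical noncausal lowpass taps, n = -half_len .. +half_len.
    Taps with n < 0 weight future targets (anticipatory coarticulation);
    taps with n > 0 weight past targets (left-to-right coarticulation)."""
    n = np.arange(-half_len, half_len + 1)
    h = np.where(n < 0, np.exp(n / tau_left), np.exp(-n / tau_right))
    return h / h.sum()  # unit DC gain, so a sustained target is reached

def coarticulate(targets, fir):
    """Filter each hidden channel with its own FIR filter.
    targets: (T, K) target gestures; fir: (K, 2*H+1) taps, with T >= 2*H+1."""
    out = np.empty(targets.shape, dtype=float)
    for k in range(targets.shape[1]):
        # mode="same" keeps length T and aligns the center tap with n = 0
        out[:, k] = np.convolve(targets[:, k], fir[k], mode="same")
    return out

def ebf_net(hidden, centers, widths, weights, bias):
    """Memoryless map from hidden frames to spectra (assumed Gaussian form).
    hidden: (T, K); centers, widths: (J, K); weights: (J, D); bias: (D,)."""
    z = (hidden[:, None, :] - centers[None, :, :]) / widths[None, :, :]
    phi = np.exp(-0.5 * np.sum(z ** 2, axis=2))  # (T, J) basis activations
    return phi @ weights + bias                  # (T, D) synthetic spectra

def synthesize(targets, fir, centers, widths, weights, bias):
    """Targets -> hidden trajectories -> synthetic spectra."""
    return ebf_net(coarticulate(targets, fir), centers, widths, weights, bias)

def mse(synthetic, actual):
    """Mean-squared error between synthetic and actual spectra."""
    return float(np.mean((synthetic - actual) ** 2))

# Small demonstration with random parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, K, J, D, H = 40, 3, 5, 8, 4
fir = np.stack([two_sided_exponential(H, 2.0, 3.0) for _ in range(K)])
targets = np.repeat(rng.normal(size=(4, K)), T // 4, axis=0)  # 4 "phones"
centers = rng.normal(size=(J, K))
widths = np.full((J, K), 1.5)
weights = rng.normal(size=(J, D))
bias = np.zeros(D)
spectra = synthesize(targets, fir, centers, widths, weights, bias)
```

In this sketch every quantity that the abstract lists as trainable (the target values, the FIR taps, and the basis-function centers, widths, and output weights) appears as a plain array argument, so the objective returned by `mse` is a differentiable function of all of them.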