Information Sci. Res. Lab., NTT Basic Res. Labs., 3-9-11 Midori-cho, Musashino-shi, Tokyo, 180 Japan
An articulatory-to-acoustic mapping was estimated from real speech data and articulatory data measured by a magnetic position-sensing device. The mapping function consisted of an acoustic model of the vocal tract and a multi-layered neural network that maps articulatory positions on the midsagittal plane into the vocal-tract log area function. The input variables to the network included eight positions on the tongue, lower and upper lips, jaw, and velum, as well as two unknown inputs that locally control the log area function below and around the larynx. The network weights and the unknown inputs were iteratively trained by the gradient method so as to minimize the cepstral distance between the pre-emphasized input speech spectrum and the synthesized spectrum. When a database of 27 sentences was used to examine the mapping accuracy for a three-layered neural network with 112 hidden nodes, the spectral error of the mapping was 1.3 dB for vowel data and 1.8 dB for sentence data. This accuracy was sufficient to produce natural-sounding speech from real articulatory movements.
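The training scheme described above, in which the network weights and the two unknown inputs are updated jointly by gradient descent, can be illustrated with a toy NumPy sketch. All dimensions, data, and learning-rate choices below are invented for illustration, and a squared-error loss on a synthetic target stands in for the paper's cepstral distance between real and synthesized spectra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's): 8 measured articulator
# coordinates plus 2 unknown (trainable) inputs map to a log area function.
N_ART, N_UNK, N_HID, N_AREA = 8, 2, 16, 20

# Synthetic "measured" articulatory frame and synthetic target log area function.
x_art = rng.normal(size=N_ART)
target = rng.normal(size=N_AREA)

# One-hidden-layer network (tanh activation); weights and the two unknown
# inputs are all updated jointly by gradient descent, as in the paper.
W1 = rng.normal(scale=0.1, size=(N_HID, N_ART + N_UNK))
b1 = np.zeros(N_HID)
W2 = rng.normal(scale=0.1, size=(N_AREA, N_HID))
b2 = np.zeros(N_AREA)
u = np.zeros(N_UNK)  # unknown inputs controlling the area below/around the larynx

lr = 0.1
for step in range(3000):
    x = np.concatenate([x_art, u])
    h = np.tanh(W1 @ x + b1)
    y = W2 @ h + b2                 # predicted log area function
    err = y - target
    loss = 0.5 * np.mean(err ** 2)  # proxy for the cepstral distance

    # Backpropagation through the two layers and down to the inputs.
    dy = err / N_AREA
    dW2 = np.outer(dy, h); db2 = dy
    dh = W2.T @ dy
    dz = dh * (1.0 - h ** 2)        # tanh derivative
    dW1 = np.outer(dz, x); db1 = dz
    dx = W1.T @ dz
    du = dx[N_ART:]                 # gradient w.r.t. the unknown inputs only

    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    u -= lr * du

print(f"final loss: {loss:.6f}")
```

In the actual system the loss gradient would be propagated through the vocal-tract acoustic model before reaching the network outputs; the sketch collapses that stage into a direct target on the log area function.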