5pSC11. Modeling of pitch, loudness, and segmental durations in Finnish using neural networks.

Session: Friday Afternoon, December 6

Time: 4:35

Author: Toomas Altosaar
Location: Helsinki Univ. of Technol., Otakaari 5A, 02150 Espoo, Finland
Author: Martti Vainio
Location: Univ. of Helsinki, PL 35, 00014 Helsinki, Finland
Author: Matti Karjalainen
Location: Helsinki Univ. of Technol., Otakaari 5A, 02150 Espoo, Finland


Several facets of the man-machine interface in the spoken language realm, such as speech synthesis and recognition, can be modeled using neural networks. Here, neural networks were applied to model the lexical prosodic parameters of Finnish: segmental duration, loudness, and pitch. The resulting prosodic models can be used in currently viable applications such as speech synthesis to further improve their naturalness. The text input stream was first converted into a phoneme sequence, from which the input representation for the nets was generated. Inputs included phoneme position in word, number of phonemes in word, and context in terms of previous and future phonemes. Optimal input representations for each type of prosodic net were searched for by varying the size of the input vector. The number of hidden nodes was also varied to determine the complexity of the problem. Estimating duration required class-specific nets for the error to drop below 20%, the difference limen. For loudness the error was 2.2 phon (1 phon is just noticeable), while pitch networks performed well with an error of 3.5% (equal to 0.6 semitones at 100 Hz, which is less than the 1.5-semitone perceptual intonation threshold).
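
The pitch figures quoted in the abstract can be checked with a short calculation. The sketch below is not part of the original work; it assumes the standard equal-tempered definition of a semitone (12 semitones per octave), and the function name `relative_error_to_semitones` is a hypothetical helper introduced only for illustration.

```python
import math


def relative_error_to_semitones(relative_error: float) -> float:
    """Interval in semitones between f0 and f0 * (1 + relative_error).

    Assumes 12 equal-tempered semitones per octave (hypothetical helper,
    not taken from the abstract).
    """
    return 12.0 * math.log2(1.0 + relative_error)


f0_hz = 100.0                  # reference pitch quoted in the abstract
predicted_hz = f0_hz * 1.035   # a prediction that is 3.5% too high
interval = 12.0 * math.log2(predicted_hz / f0_hz)

print(round(interval, 2))      # → 0.6
print(interval < 1.5)          # → True (below the quoted 1.5-semitone threshold)
```

Under this assumption a 3.5% error corresponds to roughly 0.6 semitones, consistent with the value reported above.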

ASA 132nd meeting - Hawaii, December 1996