NTT Telecommun. Networks Labs., Midori-cho, Musashino-city, Tokyo, 180 Japan
This paper investigates how various factors affect the quality of speech synthesized by rule. Speech synthesis by rule will be an important technique for providing various telecommunication services in future intelligent networks. The quality of synthetic speech is generally measured either by subjectively evaluating its intelligibility or by comparing it with other types of synthetic speech. However, developing a practical speech synthesis method for use in telecommunication networks requires an overall quality evaluation, including intelligibility and naturalness, and the quality should be compared with that of natural telephone speech. To establish an overall quality evaluation method, the effects of several factors on the overall quality, expressed as a mean opinion score (MOS), of speech synthesized by several Japanese text-to-speech systems are quantitatively compared with their effects on natural speech degraded by additive speech-correlated white noise, which serves as the reference material. Experimental results show that factors such as subject, listening experience, average pitch frequency, and text affect the evaluated quality of synthetic speech more than that of natural speech. The quality evaluation characteristics due to these factors are discussed, and an overall quality evaluation method for synthetic speech is proposed.