Stephen A. Zahorian
Dept. of Elec. and Comput. Eng., Old Dominion Univ., Norfolk, VA 23529
Vowel tokens were synthesized from sinusoids using two methods. For the first method (formant sinusoids), three variable-frequency sinusoids were used, with frequencies adjusted to match the first three formants extracted from naturally spoken vowels. The amplitudes were filtered at -12 dB/oct to approximate the roll-off of the glottal source. For the second method (spectral shape sinusoids), 16 fixed-frequency sinusoids, approximately equally spaced on a Bark scale, were used. For this method, the amplitudes of the sinusoids were adjusted so that the overall spectral shape of the naturally spoken token was preserved. A forced-choice identification experiment was conducted using the five vowels /a, i, u, æ, ɛ/. As a control, tokens were also generated using one period of the original speech, periodically extended to match the length (1 s) of the synthesized tokens. The average percentages of tokens correctly identified were 87%, 37%, and 67% for the original, formant sinusoids, and spectral shape sinusoids, respectively. These results show that vowel stimuli that preserve formant frequencies but distort spectral shape are perceptually impoverished. In contrast, vowel stimuli that preserve spectral shape but only approximately preserve formants are identified with greater accuracy. However, neither set of stimuli is identified as accurately as the original tokens.
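The three kinds of token described above can be illustrated with a minimal sketch, written under several assumptions that the abstract does not confirm. The Hz-to-Bark conversion uses Traunmüller's approximation, since no formula is named. The sample rate, the 200 to 5000 Hz band, the 500 Hz reference frequency, and all function names are hypothetical. Formant frequencies are held constant here for simplicity, whereas the study used variable-frequency sinusoids, and the `envelope` callable stands in for a spectral shape measured from a natural token.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Hz to Bark (Traunmueller approximation; an assumed choice of formula)."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    z = np.asarray(z, dtype=float)
    return 1960.0 * (z + 0.53) / (26.28 - z)

def rolloff_amplitudes(freqs_hz, ref_hz=500.0):
    """Amplitudes falling at about -12 dB/oct: doubling f divides amplitude by 4."""
    f = np.asarray(freqs_hz, dtype=float)
    return (f / ref_hz) ** -2.0

def sum_of_sinusoids(freqs_hz, amps, duration_s=1.0, sample_rate=16000):
    """Sum constant-frequency sinusoids and normalize the peak to 1."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    freqs = np.asarray(freqs_hz, dtype=float)[:, None]
    amps = np.asarray(amps, dtype=float)[:, None]
    signal = np.sum(amps * np.sin(2.0 * np.pi * freqs * t), axis=0)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def formant_sinusoid_token(formants_hz, **kwargs):
    """Method 1: one sinusoid per formant, amplitudes rolled off at -12 dB/oct."""
    return sum_of_sinusoids(formants_hz, rolloff_amplitudes(formants_hz), **kwargs)

def bark_spaced_frequencies(n=16, low_hz=200.0, high_hz=5000.0):
    """n fixed frequencies equally spaced on the Bark scale (band is assumed)."""
    z = np.linspace(hz_to_bark(low_hz), hz_to_bark(high_hz), n)
    return bark_to_hz(z)

def spectral_shape_token(envelope, n=16, **kwargs):
    """Method 2: fixed Bark-spaced sinusoids weighted by a spectral envelope."""
    freqs = bark_spaced_frequencies(n)
    return sum_of_sinusoids(freqs, envelope(freqs), **kwargs)

def periodic_extension(period, duration_s=1.0, sample_rate=16000):
    """Control: repeat one pitch period until the token reaches duration_s."""
    period = np.asarray(period, dtype=float)
    n = int(duration_s * sample_rate)
    reps = int(np.ceil(n / len(period)))
    return np.tile(period, reps)[:n]

# Placeholder inputs for illustration only, not values from the study.
example_formants = [700.0, 1200.0, 2500.0]
example_envelope = lambda f: 1.0 / (1.0 + (f / 1000.0) ** 2)

formant_token = formant_sinusoid_token(example_formants)
shape_token = spectral_shape_token(example_envelope)
```

In an actual replication, the formant tracks and the envelope would be estimated from the recorded vowels rather than supplied as constants, and the sinusoid frequencies for the first method would follow those tracks over time.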