4pSC26. Fast speaker-independent acoustic modeling and speaker adaptation.

Session: Thursday Afternoon, December 5

Time: 5:00

Author: Shoichi Matsunaga
Location: NTT Human Interface Lab., 1-2356, Take, Yokosuka, Kanagawa, 238-03 Japan
Author: Masahiro Tonomura
Location: ATR-ITL, Kyoto, 619-02 Japan
Author: Tetsuo Kosaka
Location: Canon Media Technol. Labs., Kanagawa, 211 Japan


A fast acoustic modeling method for speaker-independent speech recognition and a speaker-adaptation method that is effective even with only a small amount of speech data are described. The speaker-independent phoneme models are generated by composing representative speaker-dependent phoneme models, which are selected from among all speaker-dependent models by clustering the models without Baum-Welch parameter re-estimation. This generation method reduces the computational cost of creating the speaker-independent HMMs to approximately 1/20 to 1/50 of that of the Baum-Welch method. The speaker-adaptation algorithm unifies two conventional techniques: maximum a posteriori (MAP) estimation and transfer vector field smoothing. A priori knowledge from the initial models is statistically combined with a posteriori knowledge derived from the adaptation data to compensate for the sparseness of the adaptation data. Transfer vector field smoothing is used to interpolate the parameters that receive no adaptation data. Furthermore, in order to obtain suitable a priori knowledge concerning speaker characteristics, a speaker-clustering model, generated from the speech of a selected speaker cluster, is used as the initial model. The cluster selection is performed with a tree-structured speaker clustering technique that determines the number of speakers and the members of the cluster based on speaker similarity. [Work supported by ATR-ITL.]
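
The two adaptation steps named above can be illustrated with a minimal sketch. It is not the authors' implementation; it only shows the general idea under stated assumptions: frames are hard-assigned to a single Gaussian, the MAP update uses the standard conjugate-prior form for a mean with a hypothetical prior weight `tau`, and the smoothing step uses inverse-distance weights over the `k` nearest trained means (the abstract does not specify the weighting). The function names `map_mean` and `smooth_untrained` are invented for illustration.

```python
import math


def map_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean from hard-assigned frames.

    tau is a hypothetical prior weight: large tau keeps the estimate
    near the initial model, small tau trusts the adaptation data.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def smooth_untrained(initial_means, adapted, k=2):
    """Shift means that saw no adaptation data.

    adapted maps a mean index to its adapted value. Each trained mean
    defines a transfer vector (adapted - initial). An untrained mean is
    moved by the inverse-distance weighted average of the transfer
    vectors of its k nearest trained neighbours (assumed weighting).
    """
    transfer = {
        i: [a - b for a, b in zip(adapted[i], initial_means[i])]
        for i in adapted
    }
    result = []
    for j, mu in enumerate(initial_means):
        if j in adapted:
            result.append(list(adapted[j]))
            continue
        if not transfer:
            # No trained means at all: nothing to interpolate from.
            result.append(list(mu))
            continue
        neighbours = sorted(
            transfer, key=lambda i: math.dist(mu, initial_means[i])
        )[:k]
        weights = [
            1.0 / (math.dist(mu, initial_means[i]) + 1e-9) for i in neighbours
        ]
        total = sum(weights)
        shift = [
            sum(w * transfer[i][d] for w, i in zip(weights, neighbours)) / total
            for d in range(len(mu))
        ]
        result.append([m + s for m, s in zip(mu, shift)])
    return result
```

In this sketch, a mean with adaptation frames is first updated with `map_mean`, and the resulting dictionary of adapted means is passed to `smooth_untrained` so that the remaining means follow their neighbours' shifts.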

ASA 132nd meeting - Hawaii, December 1996