R. C. Rose
B. H. Juang
C. H. Lee
AT&T Bell Labs., Murray Hill, NJ 07964
MIT, Cambridge, MA 02139
This paper is concerned with small-vocabulary speech recognition from conversational utterances over the telephone network. Modeling techniques are investigated for dealing with the large numbers of out-of-vocabulary words and artifacts that arise in these utterances. A hidden Markov model (HMM) based continuous speech recognition system with a frame-synchronous Viterbi beam search decoder is used for recognition. Keyword models compete in a finite-state network with "filler" models of non-keyword speech. Three issues relating to the quality of the acoustic and language representations for this task were investigated. The first was the definition of acoustic subword units using allophone clustering procedures. The second was the size of the vocabulary used for modeling non-keyword utterances. The third was the use of language models in unconstrained speech tasks. Experimental results are presented for a 20-keyword recognition task in which performance was evaluated on continuous utterances from 22 speakers. The results show that all of the procedures, including decision-tree-based allophone clustering, better out-of-vocabulary speech representations, and language models, contributed to overall recognition performance. The best-performing system provided 76% average probability of keyword detection at 5.8 false alarms per keyword per hour.
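The operating point quoted above combines two quantities: the fraction of true keyword occurrences that are detected, and the number of false alarms normalized by both the keyword vocabulary size and the duration of the test speech. A minimal sketch of that arithmetic follows. The function names and the counts in the usage lines are hypothetical, chosen only to illustrate the normalization; they are not taken from the experiments, and only the 20-keyword vocabulary size comes from the text.

```python
def detection_rate(hits: int, true_occurrences: int) -> float:
    """Fraction of true keyword occurrences that were correctly detected."""
    if true_occurrences <= 0:
        raise ValueError("true_occurrences must be positive")
    return hits / true_occurrences


def false_alarms_per_keyword_per_hour(
    false_alarms: int, num_keywords: int, hours: float
) -> float:
    """False alarm count normalized by vocabulary size and speech duration."""
    if num_keywords <= 0 or hours <= 0:
        raise ValueError("num_keywords and hours must be positive")
    return false_alarms / (num_keywords * hours)


# Hypothetical counts (assumptions for illustration, not experimental data):
# 76 of 100 keyword occurrences detected, 58 false alarms in 0.5 hours
# of speech, with the 20-keyword vocabulary mentioned in the text.
p_detect = detection_rate(hits=76, true_occurrences=100)
fa_rate = false_alarms_per_keyword_per_hour(
    false_alarms=58, num_keywords=20, hours=0.5
)
print(f"detection rate: {p_detect:.2f}")
print(f"false alarms per keyword per hour: {fa_rate:.1f}")
```

Normalizing by the number of keywords makes the false alarm rate comparable across tasks with different vocabulary sizes, which is why the two figures are usually reported together as a single operating point.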