ASA 130th Meeting - St. Louis, MO - 1995 Nov 27 .. Dec 01

5aSC25. Recent progress in the INRS speech recognition system.

Douglas O'Shaughnessy

Zhishon Li

Azarshid Farhat

INRS-Telecommunications, 16 Pl. du Commerce, Verdun, PQ H3E 1H6, Canada

For large-vocabulary continuous-speech recognition, a two-pass search allows the first pass to use inexpensive models, with the pruned search space represented as a word graph. More powerful language models and detailed acoustic-phonetic models are then applied in the second pass. A first-pass Viterbi lexicon search is avoided by using tables of estimated phone scores and durations, obtained from backward-Viterbi searches of much smaller graphs that impose diphone rather than full lexical constraints on phonetic transcriptions. These estimates are used to calculate approximate acoustic matches for arbitrary phonetic transcriptions at a cost of one floating-point operation per phone. The speaker-independent system uses WSJ0 data (5000-word vocabulary), with separate male and female models: 3-state full right-context models, and code books of 14 static and 15 dynamic cepstral parameters. The first pass uses VQ models with one covariance matrix and 256 means. The word inclusion rate is about 97%. For the second pass, trigram language models with perplexity 104 and continuous-HMM acoustic models achieved about 90% word-recognition accuracy on the development set. To achieve a good trade-off between the complexity and the trainability of the acoustic models, a shared-distribution clustering approach uses distortion measures based only on the weights of the Gaussian mixtures rather than on all parameters. Word accuracy increased by 6% on the ATIS corpus. [Work supported by NSERC-Canada.]
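
The table-based approximate acoustic match can be illustrated with a minimal Python sketch. It is not the INRS implementation. It assumes, for illustration only, that the precomputed table is keyed by (previous phone, current phone) and holds one estimated log-score per diphone. The name `approximate_match`, the `<s>` start symbol, and the table layout are all hypothetical, and the duration estimates are left out. The point is only that scoring an arbitrary transcription reduces to one addition per phone.

```python
from typing import Dict, List, Tuple

# Hypothetical table layout (an assumption, not from the abstract):
# (previous phone, current phone) -> estimated log-score for the current phone.
DiphoneTable = Dict[Tuple[str, str], float]


def approximate_match(phones: List[str], table: DiphoneTable,
                      start: str = "<s>") -> float:
    """Sum precomputed per-phone score estimates for a phonetic transcription.

    Each phone costs one table lookup and one floating-point addition.
    """
    score = 0.0
    prev = start  # hypothetical start-of-utterance symbol
    for phone in phones:
        score += table[(prev, phone)]  # one addition per phone
        prev = phone
    return score


# Toy table with made-up values, for illustration only.
toy_table: DiphoneTable = {
    ("<s>", "k"): -1.5,
    ("k", "ae"): -2.0,
    ("ae", "t"): -0.5,
}
toy_score = approximate_match(["k", "ae", "t"], toy_table)
```

In this sketch the cost of scoring a candidate word depends only on its length in phones, which is what makes it cheap enough to stand in for a full first-pass lexicon search.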