Chin-Hui Lee, Wu Chou, Biing-Hwang Juang, Lawrence R. Rabiner, Jay G. Wilpon
AT&T Bell Labs., 600 Mountain Ave., Murray Hill, NJ 07974
Accurate and robust connected digit recognition is essential for a wide range of telecommunication services. When we trained and tested on only clean network digit data, using the same whole-word model architecture as in the TI/NIST connected digit testing, the string error rate increased from less than 1% to more than 5%. Performance degraded even further when the system was evaluated on data collected under different network conditions. Most of the observed errors were caused by changing channel characteristics, highly variable digit pronunciations, and inadequate modeling of cross-digit coarticulation. We present results for a number of context-dependent whole-word and subword modeling techniques developed to overcome some of these problems. The most effective is a new acoustic subword modeling approach in which each digit model consists of three parts: head, body, and tail subword units. Each digit may have multiple heads and tails, one for each of the 11 possible preceding and following digits and the background. Cross-digit coarticulation is modeled by connecting a pair of adjacent digits through the corresponding tail and head units. In tests on about 12 000 digit strings collected from five regions, this new model architecture reduced the string error rate to under 2%.
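
The head-body-tail structure described above can be illustrated with a minimal sketch that expands a digit string into a sequence of context-dependent unit labels. It is not the authors' implementation. The digit vocabulary (ten digits plus "oh"), the background label `bg`, the label format, and the function names `unit_inventory` and `expand` are all assumptions made for illustration. Acoustic modeling and decoding are omitted entirely.

```python
# Assumed vocabulary: the 11 digit words are taken to be "zero".."nine" plus "oh".
DIGITS = ["zero", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine", "oh"]
BACKGROUND = "bg"  # hypothetical label for the background (non-digit) context
CONTEXTS = DIGITS + [BACKGROUND]  # 11 digits + background = 12 contexts


def unit_inventory():
    """List every subword unit label: one body per digit, plus one head per
    preceding context and one tail per following context."""
    heads = [f"{d}.head[{left}]" for d in DIGITS for left in CONTEXTS]
    bodies = [f"{d}.body" for d in DIGITS]
    tails = [f"{d}.tail[{right}]" for d in DIGITS for right in CONTEXTS]
    return heads + bodies + tails


def expand(digit_string):
    """Expand a list of digit words into head/body/tail unit labels.

    The head of each digit is selected by the preceding digit and the tail
    by the following digit; string boundaries use the background context.
    """
    for d in digit_string:
        if d not in DIGITS:
            raise ValueError(f"unknown digit word: {d!r}")
    units = []
    for i, d in enumerate(digit_string):
        left = digit_string[i - 1] if i > 0 else BACKGROUND
        right = digit_string[i + 1] if i + 1 < len(digit_string) else BACKGROUND
        units += [f"{d}.head[{left}]", f"{d}.body", f"{d}.tail[{right}]"]
    return units


if __name__ == "__main__":
    print(len(unit_inventory()))
    print(expand(["one", "two"]))
```

Under these assumptions, the string "one two" joins the two digits through `one.tail[two]` followed by `two.head[one]`, which is where cross-digit coarticulation would be modeled. The inventory size printed by the sketch follows from the assumed vocabulary and is not a figure taken from the paper.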