Continuous Speech Recognition Group, IBM T. J. Watson Res. Ctr., Box 704, Yorktown Heights, NY 10598
Of interest here are statistical language models, which assign to each possible string of words a probability of being generated. Grammatical language models may, in addition, assign probabilities to parses of the given word string. Language models are used as components of recognition systems, speech and handwriting recognition being current examples of interest. They are also used in text understanding and machine translation systems, where grammatical information is especially desirable. Because of the richness of language, language models are usually constructed for particular discourse domains, such as medical reports, office correspondence, or legal documents. Construction is based on large amounts of sample text and should be as automatic as possible. For grammatical language models, it is desirable to estimate from the supplied text not only the statistical parameters but the grammatical rules themselves. The talk will discuss various methods of language model construction, their relative performance characteristics, and some of the many open problems.
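As a concrete illustration of the statistical approach sketched above, the following is a minimal bigram language model with add-one smoothing, trained on a toy "office correspondence" corpus. The bigram form, the smoothing method, and all names here are illustrative assumptions, not the specific methods of the talk; the point is only how such a model assigns a probability of being generated to each word string.

```python
from collections import defaultdict
import math

def train_bigram(corpus):
    # Count unigrams and bigrams over sentences padded with boundary markers.
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    vocab = set()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]
        vocab.update(words)
        for w in words:
            unigrams[w] += 1
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
    return unigrams, bigrams, vocab

def log_prob(sentence, unigrams, bigrams, vocab):
    # Add-one (Laplace) smoothing so unseen bigrams still get nonzero probability.
    V = len(vocab)
    words = ["<s>"] + sentence + ["</s>"]
    lp = 0.0
    for a, b in zip(words, words[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + V)
        lp += math.log(p)
    return lp

# Tiny hypothetical in-domain sample text.
corpus = [["the", "report", "was", "filed"],
          ["the", "report", "was", "signed"]]
uni, bi, vocab = train_bigram(corpus)

# A word string resembling the training domain receives a higher
# log-probability than the same words in scrambled order.
print(log_prob(["the", "report", "was", "filed"], uni, bi, vocab) >
      log_prob(["filed", "was", "report", "the"], uni, bi, vocab))
```

In a recognition system, such scores would be combined with an acoustic or handwriting model to rank candidate transcriptions; domain-specific training text is what makes the ranking useful.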