IBM's Superhuman Speech

You just don't understand!

IBM's Superhuman Speech initiative clears conversational confusion.

by Sam Howard-Spink

It's a vision of the future that's been promised for decades — humans and machines interacting with each other by voice. Yet despite over 50 years of research, today's speech recognition systems are not nearly accurate enough for widespread use. They work when customized to recognize a trained individual's voice, such as a secretary accustomed to dictation, or limited data sets such as telephone numbers or medical vocabulary. A real-time, natural-language recognizer that can cope with all types of speech remains an elusive goal.

IBM competes fiercely to improve automatic speech recognition (ASR). The company has focused on speech technologies for more than 30 years and is currently embarking on an eight-year mission to develop a Superhuman Speech system (SHS) — a recognizer that actually performs better than humans.

The SHS initiative is an umbrella research effort encompassing all of IBM's speech recognition projects and product tracks. There are roughly 100 IBM speech researchers worldwide. Teams work on such problems as machine comprehension or "natural-language understanding," embedded ASR for cars and mobile devices, and transcription tools for specific industries. All of these projects require increasingly accurate recognition capabilities, and SHS is leading the effort on their behalf.

David Nahamoo, department group manager of human language technologies at IBM Research, joined the company's speech recognition team from Purdue University in 1982. "My professors at the time told me that I should forget all about speech recognition, since I wouldn't get anywhere with it in my lifetime," he says.

"But today we have some very good recognition technologies. With the Superhuman project we'll be dealing with a lot of issues we haven't handled in the past. These include accents, high noise environments, all kinds of variability in the delivery channel, the mood of the speaker, the spontaneity of the speech, and other variables. Depending on the task, we're still a factor of three to a factor of 10 behind human performance. SHS aims to close that gap."

Better than human
Following early pioneering work at MIT, Carnegie Mellon, and other universities, speech research began in earnest at IBM in 1970. The breakthrough came when researchers applied the principles of statistical pattern recognition to the problem of ASR. They constructed a system that could learn the likelihood of word sequences simply by crunching vast quantities of data. Once researchers figured out how to convert linguistics into mathematics, speech recognition moved from theory to practice.
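
The idea of learning "the likelihood of word sequences" from data can be illustrated with a toy bigram model, which estimates how likely one word is to follow another by counting pairs in training text. This is only a minimal sketch under stated assumptions: the article does not describe IBM's actual models, and the function names and the three-sentence corpus below are hypothetical, chosen purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows another in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are conventional start and end markers.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            follow_counts[prev][cur] += 1
    return follow_counts

def bigram_probability(model, prev, cur):
    """Estimate P(cur | prev) as a simple relative frequency."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    if total == 0:
        return 0.0  # the preceding word was never seen in training
    return counts[cur] / total

# Hypothetical toy corpus; a real system would train on vastly more text.
corpus = [
    "recognize speech",
    "recognize speech today",
    "wreck a nice beach",
]
model = train_bigram_model(corpus)
print(bigram_probability(model, "recognize", "speech"))
print(bigram_probability(model, "speech", "today"))
```

A recognizer built on statistics of this kind can prefer the word sequence that is more probable given what it has seen, which is why the quantity of training data matters so much.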

Speech research at IBM has come full circle in the past 30 years, says Nahamoo. Today there are four speech recognition product tracks at IBM: dictation for professionals and consumers (ViaVoice™); embedded recognition for devices such as PDAs and automotive applications; telephony (WebSphere™ Voice Server); and transcription tools for business and professional use in medical, legal, and other fields (WebSphere Transcription Server). In 2000 IBM began providing voice services as part of its e-business infrastructure offerings.

Rather than introducing new products, the SHS initiative will improve upon IBM's existing speech products by substantially reducing error rates. The project aims to develop technology that will meet two objectives. First, researchers hope that SHS will eliminate almost all need for customization so a speech-recognition package can be used by anyone in any circumstances. The other major goal is to get the systems to perform as well as or better than humans. At that point, the economic benefits of the technology are expected to dictate wider deployment and drive what could be a $30 billion to $50 billion-a-year market.

"We have reached a point where we've closed the loop from innovation to product delivery," says Nahamoo. "Now we're reexamining all the necessary components to take this technology to the next level. We're really concentrating on matching and even exceeding human performance. That is what Superhuman Speech is about."

Hearing above the din
The proposal to develop a speech recognizer that is "better than human" seems far-fetched at first, until one remembers that humans also make errors in recognizing speech. Current technologies have error rates of around one in 20 words — nearly 10 times worse than human performance. The SHS project is not seeking to perfect speech recognition, but rather to refine it to easily tolerable levels.
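
An error rate of "one in 20 words" corresponds to a word error rate of 5 percent. As a rough sketch of how such a figure can be computed, the hypothetical function below counts the fewest word substitutions, insertions, and deletions needed to turn a recognizer's output into a reference transcript, then divides by the length of the reference. This is a common way to score recognizers, but the article does not say how IBM measured its figures, so treat the code and its names as assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                cur[j - 1] + 1,      # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

# Hypothetical 20-word transcript with a single misrecognized word.
reference_words = [f"word{i}" for i in range(20)]
hypothesis_words = list(reference_words)
hypothesis_words[7] = "wrong"
print(word_error_rate(" ".join(reference_words), " ".join(hypothesis_words)))
```

By this measure, performing 10 times better would mean roughly one error in every 200 words.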

One of the challenges facing researchers in speech technology today is achieving accurate recognition in noisy environments. Today's recognizers do a good job when tied to a trained individual's voice, but ambient and sudden sounds in the background can produce frequent errors. Reducing these noise-related errors is one of SHS's primary goals.

Another objective is to construct a universal recognizer that can handle different input systems. These might include a telephone for navigating through call center options; a desktop microphone for dictation and transcription; or a handheld device such as a cell phone or PDA to allow for the input of data without using tiny buttons. Each input mechanism currently requires a new customization for an existing recognizer, but the ideal is to have one recognizer that can handle input from multiple sources.

Researchers face the additional challenge of building a system that is truly domain-independent. Speech recognizers can currently handle specific data such as numbers or a limited, specialized vocabulary (for example, booking a flight over the telephone), but the goal is to have a system that can accommodate unfamiliar data.

Talking history
To achieve these goals, IBM's researchers need an abundance of language data to crunch. They've found a rich source in the exhaustive MALACH Project, one of several activities under way in the SHS initiative. IBM and a consortium of academic and industrial researchers, including Johns Hopkins University, the University of Maryland, and Steven Spielberg's Visual History Foundation, have received a National Science Foundation grant to transcribe a database containing over 100,000 hours of interviews and conversations with Holocaust survivors. In addition to offering tremendous social and historical value, the recordings provide some of the most linguistically challenging speech available. The testimony is in 32 different languages; it's heavily accented with frequent hesitation and language switching; and it's imbued with emotion. All of these factors make automatic recognition of the recordings in the MALACH (the Hebrew word for "messenger") database a unique challenge.

"A lot of spoken data has been collected in terms of oral history and news broadcasts, and people want to extract information from it," says Michael Picheny, head of the Superhuman Speech group. "The magnitude of the MALACH database makes it literally and practically impossible to transcribe, unless you can automate the process. If we achieve any breakthroughs it will be valuable not only for this particular data, but the techniques we develop will also be applicable to any other type of recorded material."

So how will accurate and ubiquitous speech recognition affect our world? According to SHS researchers, innovations in this area will likely accelerate commerce and information access, especially through wireless and remote devices. Audio and video broadcasts will become more like print: just as newspapers are produced in quantity and read when their buyers have time, multimedia information processed by speech recognition can be stored and retrieved when convenient, even if one's hands are busy. For instance, driving directions could be stored on cell phones that use global positioning services to track where you are and when you need the directions.

Superhuman Speech even has the potential to narrow the digital divide. People around the world will be able to interact with technology more easily and access more information than ever before, regardless of literacy and skill levels. Combined with the potential for real-time speech translators, which are already being developed, language barriers between cultures could become a thing of the past.

Having a stimulating conversation with a machine may still be far in the future, but before the decade is over, your computer should make a great listener.