NIPS and Audio Perception (Malcolm Slaney)

Subject: NIPS and Audio Perception
From:    Malcolm Slaney  <malcolm(at)INTERVAL.COM>
Date:    Wed, 7 Dec 1994 13:32:47 +0000

I was honored with an invitation to speak at the NIPS conference last week (11/28 to 12/2/94), and I thought some of you might be interested in what was presented about audition there. This is, of course, a rather biased view. I'll try to provide names or pointers so people can find out more. While the emphasis of the conference was not on audio perception, there was enough good work and people to keep me very busy.

TUTORIALS

There were a couple of tutorials of interest to audio people. Teuvo Kohonen reviewed his lab's work on self-organizing maps. He showed the now-classic picture of the vowels being organized. More indirectly, Leo Breiman, a statistician at Berkeley, presented a wonderful tutorial on statistical methods and fitting regression lines. He provided lots of references to little pieces, but no overall reference. If you ever get a chance to see him speak, do it.

CONSCIOUSNESS

Monday morning was spent on consciousness. Most of it wasn't auditory, but one quote sticks out: "Conscious perception involves a selection of perceptual alternatives." I'm not convinced this is true, but it should be something that is testable.

LEARNING THEORY

DeLiang Wang (OSU) presented some interesting work on using cortical oscillators to model figure separation. He also mentioned, but didn't describe, some preliminary work he's done using the oscillators to perform auditory scene analysis. I don't remember if he has results yet. He can be reached at dwang(at)

CHIPS

Cauwenberghs (JHU) and Pedroni (Caltech) described an architecture for an analog VLSI vector quantizer. This isn't just an audio thing, but it's something I wanted when I was trying to design a chip to implement the front-end of a cochlear ASR system. With this design and a cochlear front-end, you're all set. Also, Horiuchi (Caltech) described an enhancement to the Caltech binaural hearing chips that modifies its output based on where the eyes are looking.

ASR

Chang and Lippmann (MIT) described training an ASR system by generating additional voice templates. A single example was modified into many more by artificially warping the spectral profile of each utterance. They showed a reduction in error, but didn't talk about how the error rate compared to simply adding noise to the examples (thus preventing over-learning).

SPEECH and SIGNAL PROCESSING

I presented my own work on correlogram inversion, and explained why model inversion is one part of establishing the correlogram as a viable auditory representation.

I'm afraid that I completely missed the next talk by Waterhouse/Robinson (Cambridge). The title was "Non-linear prediction of acoustic vectors using hierarchical mixtures of experts."

Sid Fels (Toronto) presented his work on using a NN to learn and recognize hand positions and control a formant synthesizer. The system was much easier to use than previous work and the video was quite wonderful. He also talked about his work on automatically going from sounds to gestures at a workshop. He's going to talk more about this work at a CCRMA Hearing Seminar next year.

Finally, Movellan (UCSD) talked about his system to do lip-reading. They showed a system that used HMM models to recognize the digits 1-4.

Bell and Sejnowski (Salk Institute) presented some modifications to the standard blind-deconvolution work that allowed it to work better with non-gaussian noise. They played an impressive 10-source separation (but I find it hard to get excited about such demos when they are digitally mixed). Note to tony(at) for more information.

ANTHROPOMORPHIC SPEECH RECOGNITION

Hynek Hermansky (OGI) organized a one-day workshop on anthropomorphic speech recognition. The workshop was held on Dec. 2 and was attended by about 20-30 people.

Andreas Andreou (JHU) presented some work his lab has been doing by hooking up an analog cochlea chip to a speech recognizer.
He got disappointing results, but claimed that Chalapathy Neti (IBM) enhanced the short-time events so they didn't get lost in the 10 ms frame time, and beat all the current techniques. (Sorry, I don't have more information.)

I led a discussion on why I thought perception hadn't solved the ASR problem. I observed that some things were well accepted by the ASR community (MFCC and RASTA/CMN). We talked about issues like short vs. medium term adaptation and which cochlear features were needed for speech recognition.

Misha Pavel (OGI) talked about tests they did comparing RASTA to psychophysics and forward masking. It was interesting to note that one feature of RASTA (the zero in the filter response at DC) is needed for ASR, but is definitely a mistake for perception.

Hynek Hermansky (OGI) talked about why short-term temporal phenomena are problematic. In this vein, Nelson Morgan (ICSI) presented his new ASR system, which tries to look only for the transitions and ignores all the steady-state information.

Jont Allen (Bell Labs) led a VERY active discussion on how humans recognize speech. He has been looking at old data collected by Harvey Fletcher in the early part of the century and asking what it says about how we recognize speech. This work was published in the Oct. 94 issue of IEEE Transactions on Speech. The two basic conclusions are:

1) The amount of information in the speech signal is relatively flat as a broad function of frequency. (Information content was measured by filtering out portions of the spectrum and looking at error rates.)
2) The speech recognition system recognizes features, phones, then words, in a hierarchical manner.

The first conclusion sounds great; I'm not sure I agree with the second. We're hoping to have Jont present his work at a CCRMA Hearing Seminar next year.

WORKSHOP CONCLUSIONS

Hynek presented this list of conclusions (partially tongue-in-cheek) at the end of the workshop. My comments are in ().

Do we need features for non-linear classifiers?
    Yes - To suppress information we do not need.
What do we not need?
    What we do not hear
Models of Hearing?
    Frequency Axis Warping (but this models articulation, not perception)
    Medium-scale temporal processing
Anything else?
    Possibly, but benefits yet unclear
Big Questions:
    Do we need different recognition paradigms?
        Quite Likely (HMMs aren't necessarily the best recognition paradigm.)
    How do we find out what to change?
        Careful interpretation of the experimental data (perception)
    Would we qualify as scientists?
        Probably not quite yet.

CONCLUSIONS

I've undoubtedly missed something. I apologize for that. Was it a good meeting? Yes! I met lots of interesting people (including some doing new work on rhythm perception and cortical oscillators for separation)!!! Thanks to the NIPS organizers for inviting me. I learned lots and hope to return.

-- Malcolm

This message came from the mail archive
maintained by:
DAn Ellis <>
Electrical Engineering Dept., Columbia University