
NIPS and Audio Perception



I was honored with an invitation to speak at the NIPS conference last week
(11/28 to 12/2/94) and I thought some of you might be interested in what
was presented about audition there.

This is, of course, a rather biased view.  I'll try to provide names or
pointers so people can find out more.  While the emphasis of the conference
was not on audio perception, there was enough good work, and enough good
people, to keep me very busy.

TUTORIALS
There were a couple of tutorials of interest to audio people.  Teuvo
Kohonen reviewed his lab's work on self-organizing maps.  He showed the
now-classic picture of the vowels being organized (a quick sketch of the
basic SOM update follows this paragraph).  Less directly related to audio,
Leo Breiman, a statistician at Berkeley, presented a wonderful tutorial on
statistical methods and fitting regression lines.  He provided lots of
references to little pieces, but no overall reference.  If you ever get a
chance to see him speak, do it.
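
For anyone who hasn't played with self-organizing maps, the core update is
simple enough to sketch.  This is my own toy illustration (not Kohonen's
code): each input vector pulls its best-matching map unit, and that unit's
grid neighbors, a little closer.

    import numpy as np

    rng = np.random.default_rng(0)
    grid, dim = 10, 12                         # 10x10 map of 12-dim feature vectors
    weights = rng.normal(size=(grid, grid, dim))
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)

    def som_step(x, lr=0.1, sigma=2.0):
        """One Kohonen-style update: move the best-matching unit (and its
        neighbors on the map grid) toward the input vector x."""
        dist = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
        weights[...] += lr * h * (x - weights)

    for _ in range(1000):                      # train on random stand-in "vowel" frames
        som_step(rng.normal(size=dim))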

CONSCIOUSNESS
Monday morning was spent on consciousness.  Most of it wasn't auditory, but
one quote sticks out, "Conscious perception involves a selection of
perceptual alternatives." I'm not convinced this is true, but it should be
something that is testable.

LEARNING THEORY
DeLiang Wang (OSU) presented some interesting work on using cortical
oscillators to model figure separation.  He also mentioned, but didn't
describe, some preliminary work he's done using the oscillators to perform
auditory scene analysis.  I don't remember if he has results yet.  He can
be reached at dwang@cis.ohio-state.edu.
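
Since not everyone has seen the oscillator approach: the idea, as I
understand it, is that units representing features of the same object are
coupled strongly and synchronize, while units from different objects do
not.  Wang's actual model (as I recall) uses relaxation oscillators with
local excitation and global inhibition; the toy below is just a
Kuramoto-style phase model I put together to show synchrony-based grouping.

    import numpy as np

    rng = np.random.default_rng(1)
    # Toy "scene": two clusters of feature values; coupling is strong within
    # a cluster and weak across clusters.
    features = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
    coupling = np.exp(-np.abs(features[:, None] - features[None, :]))
    np.fill_diagonal(coupling, 0.0)

    phase = rng.uniform(0, 2 * np.pi, size=features.size)
    omega = 2 * np.pi * (1.0 + 0.05 * features)    # slightly different frequencies
    dt, k = 0.01, 2.0
    for _ in range(5000):
        # Each oscillator is pulled toward the phases of those it is
        # strongly coupled to.
        pull = np.sum(coupling * np.sin(phase[None, :] - phase[:, None]), axis=1)
        phase = (phase + dt * (omega + k * pull)) % (2 * np.pi)

    # Oscillators 0-2 end up locked together, as do 3-5, but the two groups
    # drift relative to each other -- two "streams".
    print(np.round(phase, 2))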

CHIPS
Cauwenberghs (JHU) and Pedroni (Caltech) described an architecture for an
analog VLSI vector quantizer.  This isn't just an audio thing, but it's
something I wanted when I was trying to design a chip to implement the
front-end of a cochlear ASR system.  With this design and a cochlear
front-end, you're all set (a small sketch of the computation a VQ performs
follows this paragraph).  Also, Horiuchi (Caltech) described an enhancement
to the Caltech binaural hearing chips that modifies their output based on
where the eyes are looking.
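
For context, a vector quantizer just maps each input vector to the nearest
entry in a codebook.  The chip does this search in analog hardware; the
computation it replaces looks roughly like this little sketch of mine:

    import numpy as np

    def vector_quantize(frames, codebook):
        """Return the index of the nearest codeword (squared Euclidean
        distance) for each input frame."""
        # frames: (n_frames, dim); codebook: (n_codewords, dim)
        d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return np.argmin(d2, axis=1)

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(256, 16))      # e.g., 256 codewords, 16-dim features
    frames = rng.normal(size=(100, 16))        # e.g., 100 cochlear/spectral frames
    indices = vector_quantize(frames, codebook)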

ASR
Chang and Lippmann (MIT) described training an ASR system by generating
additional voice templates.  A single example was modified into many more
by artificially warping the spectral profile of each utterance.  They
showed a reduction in error, but didn't talk about how the error rate
compared to simply adding noise to the examples (thus preventing
over-learning).
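
I don't know the exact warping Chang and Lippmann used, but the flavor of
the idea is easy to sketch: stretch or compress the frequency axis of each
spectral frame and add the warped copies to the training set.  Everything
below (the linear warp, the warp factors) is my own guess at an
illustration, not their method.

    import numpy as np

    def warp_spectrum(frame, alpha):
        """Resample one spectral frame along a linearly warped frequency
        axis: alpha > 1 compresses the spectrum, alpha < 1 stretches it."""
        n = len(frame)
        f = np.arange(n)
        return np.interp(np.clip(alpha * f, 0, n - 1), f, frame)

    def augment(utterance, alphas=(0.9, 0.95, 1.05, 1.1)):
        """Turn one utterance (frames x frequency bins) into several
        artificially warped copies."""
        return [np.array([warp_spectrum(fr, a) for fr in utterance])
                for a in alphas]

    rng = np.random.default_rng(0)
    utterance = np.abs(rng.normal(size=(50, 64)))   # 50 frames, 64 spectral bins
    extra_templates = augment(utterance)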

SPEECH and SIGNAL PROCESSING
I presented my own work on correlogram inversion, and explained why model
inversion is one part of establishing the correlogram as a viable auditory
representation.
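
For readers who haven't run into the correlogram: it's a short-time
autocorrelation computed in each cochlear channel, as a function of time.
The inversion work is about getting from that picture back to a sound; the
forward direction looks roughly like the sketch below (simplified, with a
trivial stand-in for the cochlear filterbank).

    import numpy as np

    def correlogram(channels, win=400, lags=200, hop=100):
        """Short-time autocorrelation of each cochlear channel.
        channels: (n_channels, n_samples) filterbank outputs.
        Returns (n_frames, n_channels, lags)."""
        n_ch, n = channels.shape
        frames = []
        for start in range(0, n - win - lags, hop):
            seg = channels[:, start:start + win + lags]
            frame = np.empty((n_ch, lags))
            for lag in range(lags):
                frame[:, lag] = np.sum(seg[:, :win] * seg[:, lag:lag + win], axis=1)
            frames.append(frame)
        return np.array(frames)

    # Trivial stand-in for a cochlear filterbank (a real model would use
    # gammatone-like filters): a few smoothed versions of the same signal.
    rng = np.random.default_rng(0)
    x = rng.normal(size=8000)
    channels = np.stack([np.convolve(x, np.ones(k) / k, mode="same")
                         for k in (2, 4, 8, 16)])
    cgram = correlogram(channels)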

I'm afraid that I completely missed the next talk by Waterhouse/Robinson
(Cambridge).  The title was "Non-linear prediction of acoustic vectors
using hierarchical mixtures of experts."

Sid Fels (Toronto) presented his work on using a NN to learn and recognize
hand positions and control a formant synthesizer.  The system was much
easier to use than previous work and the video was quite wonderful.  He
also talked about his work on automatically going from sounds to gestures
at a workshop.  He's going to talk more about this work at a CCRMA Hearing
Seminar next year.

Finally, Movellan (UCSD) talked about his system to do lip-reading.  The
system used HMMs to recognize the digits 1-4.

Bell and Sejnowski (Salk Institute) presented some modifications to the
standard blind-deconvolution work that allowed it to work better with
non-Gaussian noise.  They played an impressive 10-source separation (but I
find it hard to get excited about such demos when the sources are digitally
mixed).  Send a note to tony@salk.edu for more information.
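
For what it's worth, the flavor of learning rule behind this line of work
(the infomax rule, here in its natural-gradient form) is simple enough to
sketch.  This is only my own toy two-source version of the kind of
digitally mixed demo they played, not their code.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20000
    s = rng.laplace(size=(2, n))               # two super-Gaussian sources
    A = np.array([[1.0, 0.6], [0.4, 1.0]])     # digital mixing matrix
    x = A @ s                                  # observed mixtures

    W = np.eye(2)                              # unmixing matrix to learn
    lr = 0.05
    for _ in range(500):
        u = W @ x
        y = 1.0 / (1.0 + np.exp(-u))           # logistic nonlinearity
        # Infomax update, natural-gradient form: dW = (I + (1 - 2y) u^T / n) W
        W += lr * (np.eye(2) + (1.0 - 2.0 * y) @ u.T / n) @ W

    recovered = W @ x                          # sources, up to scale and permutation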


ANTHROPOMORPHIC SPEECH RECOGNITION
Hynek Hermansky (OGI) organized a one day workshop on anthropomorphic
speech recognition. The workshop was held on Dec. 2 and was attended by
about 20-30 people.

Andreas Andreou (JHU) presented some work his lab has been doing by hooking
up an analog cochlea chip to a speech recognizer.  He got disappointing
results, but claimed that Chalapathy Neti (IBM) enhanced the short-time
events so they didn't get lost in the 10ms frame time, and that this beat
all the current techniques.  (Sorry, I don't have more information.)

I led a discussion on why I thought perception hadn't solved the ASR
problem.  I observed that some things were well accepted by the ASR
community (MFCC and RASTA/CMN).  We talked about issues like short vs.
medium term adaptation and which cochlear features were needed for speech
recognition.
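
For anyone who hasn't run into CMN: cepstral mean normalization just
subtracts the per-utterance mean of each cepstral coefficient, which
cancels any fixed (convolutive) channel.  It's nearly a one-liner:

    import numpy as np

    def cepstral_mean_normalize(cepstra):
        """cepstra: (n_frames, n_coeffs) MFCC-style features for one utterance.
        A fixed channel filter adds a constant to the log spectrum, and hence
        to each cepstral coefficient, so subtracting the mean removes it."""
        return cepstra - cepstra.mean(axis=0, keepdims=True)

    rng = np.random.default_rng(0)
    mfcc = rng.normal(size=(300, 13))          # e.g., 300 frames of 13 MFCCs
    mfcc_cmn = cepstral_mean_normalize(mfcc)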

Misha Pavel (OGI) talked about tests they did comparing RASTA to
psychophysics and forward masking.  It was interesting to note that one
feature of RASTA (the zero in the filter response at DC) is needed for ASR,
but is definitely a mistake as a model of perception.
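
To make the DC point concrete: RASTA band-pass filters the time trajectory
of each (log) spectral component, and the numerator of that filter sums to
zero, so a constant offset (a stationary channel) is removed.  The sketch
below uses a simple leaky differentiator rather than the actual RASTA
coefficients, but it has the same zero at DC.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_like_filter(log_spec, pole=0.94):
        """Band-pass filter each spectral trajectory over time.
        log_spec: (n_frames, n_bands) log-spectral features.
        The numerator [1, -1] sums to zero -> zero at DC, so constant
        offsets vanish; the pole keeps slower modulations."""
        return lfilter([1.0, -1.0], [1.0, -pole], log_spec, axis=0)

    rng = np.random.default_rng(0)
    log_spec = rng.normal(size=(300, 20)) + 3.0    # +3.0 plays the fixed channel
    filtered = rasta_like_filter(log_spec)
    print(abs(filtered[50:].mean()))               # the offset is essentially gone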

Hynek Hermansky (OGI) talked about why short-term temporal phenomena are
problematic.  In this vein, Nelson Morgan (ICSI) presented his new ASR
system which tries to look only for the transitions and ignores all the
steady-state information.
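
Morgan's system is more elaborate than this, but the simplest way I know to
bias a front-end toward transitions is to keep only delta-style
(time-difference) features and drop the static ones.  A quick sketch:

    import numpy as np

    def delta_features(features, span=2):
        """Regression-style delta: a weighted slope of each feature trajectory
        over +/- span frames.  Keeping only these (and dropping the static
        features) emphasizes transitions over steady-state regions."""
        n_frames, _ = features.shape
        padded = np.pad(features, ((span, span), (0, 0)), mode="edge")
        num = sum(k * (padded[span + k:span + k + n_frames] -
                       padded[span - k:span - k + n_frames])
                  for k in range(1, span + 1))
        return num / (2 * sum(k * k for k in range(1, span + 1)))

    rng = np.random.default_rng(0)
    spectra = rng.normal(size=(200, 13))
    deltas = delta_features(spectra)           # transition-only representation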

Jont Allen (Bell Labs) led a VERY active discussion on how humans
recognize speech.  He has been looking at old data collected by Harvey
Fletcher in the early part of the century and asking what it says about how
we recognize speech.  This work was published in the Oct. 94 issue of IEEE
Transactions on Speech and Audio Processing.  The two basic conclusions
are: 1) the amount of information in the speech signal is relatively flat
as a broad function of frequency.  (Information content was measured by
filtering out portions of the spectrum and looking at error rates.)  2) The
speech recognition system recognizes features, then phones, then words, in
a hierarchical manner.  The first conclusion sounds great; I'm not sure I
agree with the second.
We're hoping to have Jont present his work at a CCRMA Hearing Seminar next
year.
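
The filtering experiments behind conclusion 1) rest on Fletcher's
band-independence result: as I understand Allen's account, the error
probability for full-band speech is roughly the product of the error
probabilities of independent sub-bands.  A quick numerical illustration
(the error rates are made up):

    # Fletcher/Allen band-independence, as I understand it: split the band
    # into independent sub-bands and the full-band error rate is roughly the
    # product of the per-band error rates.
    low_band_error = 0.20      # hypothetical error using only the low half
    high_band_error = 0.20     # hypothetical error using only the high half
    full_band_error = low_band_error * high_band_error
    print(full_band_error)     # ~0.04; comparable per-band errors = "flat" information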

WORKSHOP CONCLUSIONS
Hynek presented this list of conclusions (partially tongue-in-cheek) at the
end of the workshop.  My comments are in ().

Do we need features for non-linear classifiers?
        Yes - To suppress information we do not need.

What do we not need?
        What we do not hear

Models of Hearing?
        Frequency Axis Warping (but this models articulation, not perception)
        Medium-scale temporal processing

Anything else?
        Possibly, but benefits yet unclear

Big Questions:
        Do we need different recognition paradigms?
                Quite Likely (HMMs aren't necessarily the best recognition
                                paradigm.)
        How do we find out what to change?
                Careful interpretation of the experimental data (perception)
        Would we qualify as scientists?
                Probably not quite yet.



CONCLUSIONS
I've undoubtedly missed something.  I apologize for that.

Was it a good meeting?  Yes!  I met lots of interesting people and heard
about work that was new to me (including rhythm perception and cortical
oscillators for separation)!

Thanks to the NIPS organizers for inviting me.  I learned lots and hope to
return.

-- Malcolm