Computational Auditory Scene Analysis Workshop Summary

Report on the Computational Auditory Scene Analysis Workshop
Malcolm Slaney, Dan Ellis, Dave Rosenthal
Montreal, Quebec, Canada, August 19 and 20th, 1995.

The first workshop on Computational Auditory Scene Analysis (CASA) was held
August 19 and 20th at the 1995 IJCAI (International Joint Conference on
Artificial Intelligence) in Montreal.  Organized by Hiroshi Okuno and David
Rosenthal, the workshop was attended by about thirty people doing work on
scientific and engineering models of human audition and signal processing.

Perhaps the workshop will best be remembered as the largest gathering to
date of people interested in computer models of auditory scene analysis
(ASA).  The attendees were nearly evenly split between those who are
interested in understanding human auditory perception and those who want
to solve problems in auditory perception, perhaps using some of the
techniques of auditory scene analysis.

Al Bregman served as keynote speaker for the conference. His book,
"Auditory Scene Analysis," motivates this new field and most of the talks
at the workshop were attempts to implement the principles of scene analysis
in computer models.  His keynote address emphasized the old-plus-new
principle of auditory organization (old sounds are stable, new sounds are
formed from sound components that can't be explained by the old), and
architectures for computer models of auditory scene analysis.
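
As a rough illustration of the old-plus-new idea (not Bregman's own
formulation, and not a system shown at the workshop), the principle can be
sketched on a single spectral frame.  The function name and the toy spectra
are hypothetical, and the sketch assumes that magnitudes simply add across
sources, which real mixtures only approximate.

```python
def old_plus_new(old_spectrum, mixture_spectrum):
    """Split a mixture's magnitude spectrum into an 'old' part and a 'new' residual.

    Hypothetical helper: both inputs are lists of non-negative magnitudes,
    one value per frequency channel.
    """
    if len(old_spectrum) != len(mixture_spectrum):
        raise ValueError("spectra must have the same number of channels")
    # The old sound is assumed to continue unchanged, but it cannot claim
    # more energy than the mixture actually contains in a channel.
    old_part = [min(o, m) for o, m in zip(old_spectrum, mixture_spectrum)]
    # Whatever the old sound cannot explain is attributed to a new sound.
    new_part = [m - o for o, m in zip(old_part, mixture_spectrum)]
    return old_part, new_part


old = [4.0, 1.0, 0.0, 2.0]      # spectrum of the ongoing sound
mixture = [4.0, 3.0, 5.0, 1.0]  # spectrum after a second sound starts
old_part, new_part = old_plus_new(old, mixture)
print(old_part)  # [4.0, 1.0, 0.0, 1.0]
print(new_part)  # [0.0, 2.0, 5.0, 0.0]
```

In this toy case the residual in the second and third channels is what a
listener would, by the principle, hear as a newly started sound.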

The rest of the workshop was devoted to poster presentations by the
workshop participants and group discussions. The three discussions will be
described first, then the posters will be listed.

For more information, look for the book on "Computational Auditory Scene
Analysis" to be published by Erlbaum early next year!

DISCUSSION 1 - "A critique of pure audition" or, "Bottom-up vs. top-down"
Led by Malcolm Slaney.

Malcolm motivated the discussion with a series of very telling counter-
examples to the notion that vision is purely bottom-up, followed by some
analogous examples in audition.  Percepts such as apparent motion are
often considered to be locally-calculated, yet a quartet of ambiguous
flashing dots all appear to move in the *same* one of two ambiguous
choices; clearly there is a context-dependent, top-down aspect to this

A discussion of the distinction between bottom-up and top-down
processing suggested they could be functionally equivalent, but that
top-down systems - where information flows both ways between processing
modules - should be a more efficient implementation.  It is unclear how to
interpret the physiological reality of efferent fibres at every stage of
the auditory chain in terms of the algorithm they implement.

Much of the auditory scene analysis work is based on Marr's theory of
vision.  Marr's theory is very beautiful and stimulated much good research.
But its premise that humans reconstruct the entire visual scene in their
brains may not be appropriate.  Perhaps an AI-like or top-down approach like
Nakatani's is more appropriate.  This led to a vigorous discussion about
information flow during perception tasks.

It was also noted that many of the phenomena such as auditory induction
and restoration might simply be limited to the very last stage of
processing - more a case of top-top processing than top-down.  Perhaps the
final element of a bottom-up processing chain is the addition of
information from external sources such as memory and expectation to form
the resulting percept.

In summary, despite the fact that many existing models of auditory
organization are almost exclusively bottom-up, working from raw data to
result, with no adaptation or context-dependence, few modelers would deny
the value of top-down processing.  However, there is much less agreement
about which features of a model actually reflect this top-downness.

DISCUSSION 2 - "CASA: Physiological vs. Functional Models."
Led by Hamid Nawab

A quick vote showed that the participants were split evenly between those
wanting to simulate the physiology (at different levels) and those who
wanted to build models that solve an engineering problem.  The
most important outcome of the discussion was that there are two equally
valid problems, one based on the science of understanding our auditory
system and the other based on finding engineering solutions to auditory

Other questions and conclusions included: Does the brain process/evaluate
symbols? Can we build models of psychophysical data without modeling the
physiology?  Marr's distinction between implementation, algorithm, and
theory is important. Is physiological ASA analogous to learning to fly by
studying birds? Do the physiological experiments inform functional models?
Are we doing science or engineering? Is localization important to the
problem? Question from the physiologists for the functionalists: How do we
use massive parallelism? Question from the functionalists for the
physiological people: What auditory cues are there?

DISCUSSION 3 - "Hard problems in representation and auditory scene analysis"
Led by Dave Rosenthal and Dan Ellis.

To emphasize the fact that different participants use representations in
very different ways, the discussion started with a poll of the audience to
see who identified with a variety of 'ists': Representation-ists,
Agent-ists, Blackboard-ists, Neural-ists, Oscillator-ists, Pattern-ists,
Application-ists, Physiology-ists and Psychology-ists.  Average votes per
participant were about 2.2 so perhaps this covered the field.  We then
tried to come up with a list of goals we wanted to achieve with our
representations, by way of motivating the desirable features.  This
resulted in a range of responses varying from 'isomorphism with human
behavior' to 'detecting the sound of harmonic oscillator systems in a
mixture' - perhaps reflecting the physiology/application split that
emerged in discussion 2.  Participants then described some of the different
representations they use, with justifications, ranging from simple
filtered versions of the raw audio (to emphasize a feature like a
particular musical pitch), to adaptive filterbank processing (to achieve a
context-dependent form of representation always best suited to the signal
at hand), to symbolic descriptions of the inferred source characteristics
of all the sound-sources in a given environment (such as a discussion),
collapsed across the many sound-events produced by each source.

The second part of the discussion looked more generally at the question of
hard problems in the field.  The idea was to identify some specific
problem areas that must be addressed before a 'complete model' of auditory
organization can be built, then to try to predict or set targets for
progress in each area two years from now.  One domain for improvement
is to refine our existing organization models to deal with less restricted
classes of sound.  For instance, in two years' time, will a system be able
to take a monophonic recording of two speakers with similar pitch ranges
and separate their speech, including both voiced and unvoiced portions?
The Sheffield-ATR 'ShATR' multi-speaker corpus CD-ROM was identified as an
excellent resource for this kind of project.  There was also significant
discussion of issues relating to non-speech sounds.  Do we need a better
taxonomy, or perhaps some kind of comprehensive corpus, to redress the
excessive attention given to auditory organization as applied exclusively
to speech sounds?

The final part of this discussion considered what the successor to this
workshop should be.  A comprehensive list of the conferences preferred by
the workshop participants revealed no outright favorites, although the
Acoustical Society meetings and the Neural Information Processing Systems
conferences were popular and had the advantage of being strongly
multi-disciplinary.  The possibility of staging a workshop independent of
a parent conference was well received, possibly affiliated with an
organization such as the IEEE to add credibility.

A brief mention of the AUDITORY list expressed the view that it was
under-utilized, possibly because participants are wary of wasting the time
of other members.  The list could function well as a place to post
abstracts and announce the availability of pre-prints, something rarely
seen at present.


The poster sessions were distributed throughout the workshop. They gave
everybody a chance to see everybody's latest work and to have many
informal discussions.  Here is a brief synopsis of each presenter's work.

Jean Rouat presented his ideas on a new representation for speech
recognition based on short-time autocorrelation. (With Miguel Garcia)

Frank Klassner explained how discrepancy diagnosis and signal reprocessing
contribute to his environmental-sound separating blackboard system. (With
Victor Lesser and Hamid Nawab)

Alon Fishbach presented a mid-level auditory representation consisting of
spectrogram segments and their features, the segmentation being based on
discontinuities in sound.

Steven Boker's model was able to predict where listeners placed the
downbeat in rhythmic patterns using a measure of local information
theoretic entropy.

Kunio Kashino's poster detailed his comprehensive music-understanding
system that uses Bayesian networks to integrate levels of information.
(With Kazuhiro Nakadai, Tomoyoshi Kinoshita, and Hidehiko Tanaka)

Joern Grabke presented a processor that separated two voices using a
binaural processor implemented with simple delay lines. (With Jens Blauert)

Brian Karlsen was launching the ShATR CD-ROM, a large database of real
multi-speaker discussion complete with extensive annotation and tools.
(With Guy Brown, Martin Cooke, Malcolm Crawford, Phil Green, and Steve

Ray Meddis described a number of psychophysical principles of pitch and
sound stream suppression. (With Lowel O'Mard)

Darryl Godsmark presented his blackboard system for accumulating
contextual evidence in order to model human-like competition of cues. (With
Guy Brown)

Dan Ellis presented an overview of auditory representations and a new
representation known as Wefts, based on correlograms. (With David

Hideki Kawahara presented his experimental work on the control of pitch by
humans and the control feedback between perception and synthesis.

Frederic Berthommier discussed the implications of physiological mechanisms
for amplitude modulation. (With Christian Lorenzi)

Masataka Goto presented his real-time system for perceiving the beats in
musical audio. (With Yoichi Muraoka)

Ludger Solbach compared the wavelet transform to conventional auditory
models as a preprocessor for ASA. (With Rolf Wohrmann and Jorg

Lonce Wyse presented his work on analyzing audio for content (speech versus
audio) and detecting changes in the speaker. (With Stephen Smoliar)

DeLiang Wang presented his work on relaxation oscillator networks and
showed that they can be used for modeling auditory stream segregation.

Hamid Nawab presented his knowledge-based system for recognizing speech
contaminated with environmental sounds by directed reanalysis. (With Carol
Espy-Wilson, Ramamurthy Mani, and Nabil Bitar)

Nicholas Saint-Arnaud presented his work on synthesizing realistic audio
textures given a small sample. (With Kris Popat)

Eric Scheirer presented his system for correlating a musical score with an
audio signal and using the differences to infer expressive performance

Guy Brown presented work on using cortical oscillators to model stream
segregation and showed results consistent with psychophysical data. (With
Martin Cooke)

Tomohiro Nakatani showed an agent-based system for grouping binaural sounds
using harmonic analysis and binaural information. (With Masataka Goto,
Takatoshi Ito, and Hiroshi Okuno)

Malini Bhandaru explained work to give an environmental sound recognition
system the ability to extend its models in response to new examples. (With
Victor Lesser)