Re: CASA problems and solutions

Dear Dr. Bates,

There is a distinction between auditory scene analysis (ASA) as defined by Bregman,
and computational auditory scene analysis (CASA).  From my reading of your writeup,
your criticisms are aimed at ASA.  To a CASA researcher like me, the problem is
well-defined: how to extract auditory events (objects) from one sensor (monaural),
two sensors (binaural), or more (sensor array).  This definition is, in Marr's
framework, at the level of computational theory, independent of Fourier analysis,
cochlear models, or even the auditory system.  One may insist on biological
plausibility, or one may pay close attention to how the auditory system
solves the problem (which we don't know) in order to borrow ideas and solutions.

But, regardless of approach, the CASA problem remains unsolved.  I think that a lot
of people are not really biased toward any one approach, and some frankly don't care about
human audition.  That the problem is still open speaks to its difficulty.  Moreover, one need
not be too pessimistic.  Think about computer vision, where A LOT more people have been
working, and artificial intelligence.

What I am getting at is that, if you can manage to separate multiple sources spatially
and reconstruct them, it will be a great technological breakthrough.  I don't know if you
can do it and I have doubts (see below), but we will study your approach.

Location-based source separation has been attempted before with some success.  But
it is far from solving the CASA problem.  Successes in a few demos do not amount to a general
solution.  There are well-documented test databases to measure success in CASA in a
systematic way (see references below).  Results on these databases would be
a lot more revealing.

Since you have made a connection with vision, spatial analysis in audition would
roughly correspond to depth perception from two images, a problem that has
been studied since the early days of computer vision.  The challenge there, which
remains to this day, is the correspondence problem: which pixels of one image correspond
to which pixels of the other.  It's hard to find a solution to the correspondence problem
without image analysis (grouping and segmentation). Similarly, it's hard for me to
imagine a solution to CASA purely on the basis of spatial analysis without other cues of
ASA.  The replies by Al Bregman and John Culling suggest that location may not even be
a major cue for ASA.  I'd like to be proven wrong, so that at least we have one solution to
rely on.
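
To make the stereo analogy concrete, here is a minimal sketch of the audio counterpart of
matching pixels across two images: finding the lag at which one sensor signal best matches
the other.  It is only an illustration under simple assumptions, not anyone's proposed
method.  The function name best_lag, the 5-sample delay, and the random test signal are
all hypothetical, and the example uses a single source with no noise or reverberation.

```python
import random


def best_lag(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive result means `right` is a delayed copy of `left`.
    """
    def score(lag):
        # Cross-correlation at this lag: sum of products over the overlap.
        total = 0.0
        for n in range(len(left)):
            m = n + lag
            if 0 <= m < len(right):
                total += left[n] * right[m]
        return total

    # Pick the lag with the highest correlation score.
    return max(range(-max_lag, max_lag + 1), key=score)


# Hypothetical single broadband source reaching two sensors.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(400)]
delay = 5  # assumed inter-sensor delay in samples
left = source
right = [0.0] * delay + source[:-delay]

estimated = best_lag(left, right, max_lag=20)
print(estimated)
```

With one source the correlation has a single clear peak at the true delay.  With several
overlapping sources there are several competing peaks, and deciding which part of one
signal corresponds to which part of the other is the kind of ambiguity described above.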

Some recent references about CASA:

D.F. Rosenthal and H.G. Okuno, Eds., Computational auditory scene analysis. Mahwah NJ: Lawrence
Erlbaum, 1998.

D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on oscillatory
correlation," IEEE Trans. Neural Net., vol. 10, pp. 684-697, 1999.  (PDF file available on my web page.)


DeLiang Wang
Dr. DeLiang Wang
Department of Computer and Information Science
The Ohio State University
2015 Neil Ave.
Columbus, OH 43210-1277, U.S.A.

Email: dwang@cis.ohio-state.edu
Phone: 614-292-6827 (OFFICE); 614-292-7402 (LAB)
Fax: 614-292-2911
URL: http://www.cis.ohio-state.edu/~dwang