Re: CASA problems and solutions

Dear Dr. Wang,
  I'm afraid that you misunderstood my objectives in my essay on CASA.
Although I used a lot of references to human auditory perception (ASA), what
I tried to make clear was that I wanted to derive the design requirements
and constraints leading to a practical CASA system that could (ideally) do
what the human ear does.  In other words, in order to design a system that
could solve auditory problems it would be necessary to find the fundamental
requirements as Seebeck might have seen them.  So let's start from scratch
to design a hearing system.
  The main requirements are:
    (1) separate the sources, and
    (2) at any instant, select the primary source for attention.
  Whether we want to separate sources by spatial localization or by
other source features, an objective analysis of the first problem says that
we need the best possible time resolution.  We haven't been able to get
that by Fourier methods, so what else is there but to seek it by other
means?  If that's the primary constraint, then it is necessary to find a
way to combine high time resolution with source information.  My solution
is to use pixel-like particles that I call waveform information vectors
(WIVs), which are sampled at waveform zeros.
From this starting point, we then select, combine, and recognize patterns
in the data, ultimately arriving at the perception of speech as well as all
other kinds of sounds.  Obviously, this method has nothing to do with the
biophysics of the ear.
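  As a rough illustration of sampling at waveform zeros, the Python sketch
below locates each zero crossing of a sampled signal by linear interpolation
and emits one small record per crossing.  This message does not say what a
WIV actually contains, so the record fields used here (time, direction,
interval since the previous zero) and the names zero_crossings and
waveform_records are assumptions made only for illustration, not the WIV
definition.

```python
def zero_crossings(samples, sample_rate):
    """Return a list of (time_s, direction) pairs, one per sign change.

    time_s is linearly interpolated between the two samples that straddle
    zero; direction is +1 for a rising crossing and -1 for a falling one.
    """
    crossings = []
    for n in range(1, len(samples)):
        a, b = samples[n - 1], samples[n]
        if a < 0.0 <= b or a >= 0.0 > b:
            frac = a / (a - b)  # fraction of one sample period past sample n-1
            direction = 1 if b > a else -1
            crossings.append(((n - 1 + frac) / sample_rate, direction))
    return crossings


def waveform_records(samples, sample_rate):
    """Build one hypothetical record per zero crossing (fields are assumed)."""
    records = []
    previous_time = None
    for time_s, direction in zero_crossings(samples, sample_rate):
        interval = None if previous_time is None else time_s - previous_time
        records.append({"time": time_s, "direction": direction,
                        "interval": interval})
        previous_time = time_s
    return records
```

  For a 100 Hz sine sampled at 8 kHz, successive records come out about
5 ms apart, one per half period, so the time resolution is set by the
waveform itself rather than by an analysis window.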
  To demonstrate that this method is feasible I have included a few
illustrative experiments on my Web site. <http://home.computer.net/~jkbates>
  To satisfy the second objective, we must return to abstractions
concerning how, when, and why we want our machine to listen to a
particular sound.  That is the really tough problem because the ultimate
purpose of a CASA machine is to act like a listener who has a reason to
survive in whatever mission it is designated to carry out.  Hence, my
discussion on the philosophy of survival objectives.  To my mind,
implementing this concept of survival in practical form is necessary: it is
the only way to arrive at a robust speech recognizer.
  As for the subject of spatial separation, I'm aware that verification
takes a lot more than a few experiments.  I have done a variety of tests
under varying conditions for a number of years, and the system seems
reliable.  I feel that with improved testing facilities and software,
results will be even better.
  With reference to your mention of pixel correspondence in getting
direction of arrival, I believe that my method solves a similar problem in
my interaural time difference scheme.  It turned out to be easy because it
was a variation on a system I had designed and patented in 1975.  Actually,
the system looks a lot like the Jeffress model except that I use zero
crossings instead of phase.  The system is described briefly in my paper
presented at the Mohonk97 workshop. [1]
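  To make the idea of getting an interaural time difference from zero
crossings rather than phase more concrete, here is a minimal Python sketch.
It is not the patented system or the Mohonk97 implementation, whose details
are not given in this message.  It simply histograms the delays between
rising zero crossings in the left and right channels, with each histogram
bin playing the role of one coincidence detector in a Jeffress-style delay
line.  The function names, the 0.8 ms delay limit, and the 20 microsecond
bin width are all assumptions.

```python
from collections import Counter


def rising_zero_times(samples, sample_rate):
    """Interpolated times (seconds) of negative-to-positive zero crossings."""
    times = []
    for n in range(1, len(samples)):
        a, b = samples[n - 1], samples[n]
        if a < 0.0 <= b:
            times.append((n - 1 + a / (a - b)) / sample_rate)
    return times


def estimate_itd(left, right, sample_rate, max_itd=0.0008, bin_width=0.00002):
    """Return the most common left-to-right crossing delay in seconds.

    Every pair of rising crossings whose delay lies within +/- max_itd
    casts one vote for its delay bin; the winning bin is the estimate.
    Returns None when no pair falls inside the allowed range.
    """
    right_times = rising_zero_times(right, sample_rate)
    votes = Counter()
    for t_left in rising_zero_times(left, sample_rate):
        for t_right in right_times:
            delay = t_right - t_left
            if abs(delay) <= max_itd:
                votes[round(delay / bin_width)] += 1
    if not votes:
        return None
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_width
```

  A positive result means the right channel lags the left.  The all-pairs
loop is quadratic in the number of crossings, which is acceptable for short
frames in a sketch like this one.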
  While other location-based source separation attempts have had some
success, the important point is, as you mention, whether or not the methods
relate to the entire CASA system.  This is what I have always tried to keep
in mind.  What I have shown so far has been only what is necessary to
establish feasibility: that the ultimate objectives are possible.
  Many thanks for your comments; they are very helpful.

   Best wishes,

   John Bates
   Time/Space Systems
   Pleasantville, New York

 [1]  J.K. Bates, "Modeling the Haas effect: a first step for solving the
CASA problem," Proc. of IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics, October 1997

>There is a distinction between auditory scene analysis (ASA) as defined by
>and computational auditory scene analysis (CASA).  From my reading of your
>your criticisms are targeted to ASA.  To a CASA researcher like me, the
>problem is
>well-defined: how to extract auditory events (objects) from one sensor
>two sensors (binaural), or more (sensor array).  This definition is, in
>framework, at the level of computational theory, independent of Fourier
>cochlear models, or even the auditory system.  One may insist on biological
>plausibility, or one may pay close attention to how the auditory system
>solves the problem (which we don't know) in order to borrow ideas and
>But, regardless of approaches, the CASA problem remains unsolved.  I think
>that a lot of people are not really biased in terms of approaches and some
>frankly don't care about human audition.  This speaks to the challenge of
>the problem itself.  Moreover, one need not be too pessimistic.  Think
>about computer vision, where A LOT more people have been working, and
>artificial intelligence.
>What I am getting at is that, if you can manage to separate multiple
>sources spatially and reconstruct them, it will be a great technological
>breakthrough.  I don't know if you can do it and I have doubts (see below),
>but we will study your approach.
>Location-based source separation has been attempted before with some
>success.  But it is far from solving the CASA problem.  Successes in a few
>demos are far from a general solution.  There are well-documented test
>databases to measure success in CASA in a systematic way (see references
>below).  Results on these databases would be a lot more revealing.
>Since you have made a connection with vision, spatial analysis in audition
>roughly corresponds to depth perception from two images, a problem that has
>been studied since the early days of computer vision.  The challenge there,
>which remains to this day, is the correspondence problem: which pixels of
>one image correspond to which pixels of the other.  It's hard to find a
>solution to the correspondence problem without image analysis (grouping and
>segmentation).  Similarly, it's hard for me to imagine a solution to CASA
>purely on the basis of spatial analysis without other cues of ASA.  The
>replies by Al Bregman and John Culling suggest that location may not even
>be a major cue for ASA.  I'd like to be proven wrong, so that at least we
>have one solution to rely on.
>Some recent references about CASA:
>D.F. Rosenthal and H.G. Okuno, Ed. Computational auditory scene analysis.
>Mahwah NJ: Lawrence Erlbaum, 1998.
>D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds
>based on oscillatory correlation," IEEE Trans. Neural Net., vol. 10, pp.
>684-697, 1999.  (PDF file available on my web.)
>DeLiang Wang