
Re: Computational ASA

To: AUDITORY@xxxxxxxxxxxxxxx
Subject: Re: Computational ASA
From: Paris Smaragdis <paris@xxxxxxxxxxxxx>
Date: Fri, 30 Apr 2004 18:11:42 -0400
Delivery-date: Fri Apr 30 18:33:10 2004
In-reply-to: <E2A41F3CFE5FD411ABC900508BAE0BEE041431F5@power.coe.montana.edu>
References: <E2A41F3CFE5FD411ABC900508BAE0BEE041431F5@power.coe.montana.edu>
Reply-to: Paris Smaragdis <paris@xxxxxxxxxxxxx>
Sender: AUDITORY Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

To add one more thing to Rob's excellent points, part of the difficulty
in source separation is that it is more of an art than a science.  There
are no meaningful measures of success, and what is perceived as a good
result is whatever sounds good.  Therefore by hand tuning it is easy to
come up with very good results on a case by case basis (hence the art);
but doing so automatically and consistently with arbitrary inputs is
very hard because we have no idea what numbers to strive for.

Paris


Maher, Rob wrote:

Jon--
I think the inherent difficulty of computational source separation has to do
with the generally ill-posed nature of the research problem:  given a
composite observation vector 'A' that is a linear sum of N unknown
time-varying signal vectors  'B', 'C', ..., determine estimates of 'B', 'C',
.... In other words, one equation in N unknowns, where N > 1.  Without some
other valid source of information, there can be no unique solution to the
problem.
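
A minimal sketch of that non-uniqueness, assuming two made-up signals 'B' and 'C' (a sine tone and some noise; the names, frequencies, and sample rate are illustrative only). Given just the sum 'A', any perturbation 'D' added to one source and subtracted from the other produces a different pair that explains the observation equally well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical source signals (assumed for illustration only).
t = np.arange(1000) / 8000.0
B = np.sin(2 * np.pi * 440.0 * t)
C = 0.5 * rng.standard_normal(t.size)

# The single observed mixture: one equation, two unknowns.
A = B + C

# Any signal D gives another pair of "sources" with the same sum.
D = np.sin(2 * np.pi * 100.0 * t)
B_alt = B + D
C_alt = C - D

same_mixture = np.allclose(B_alt + C_alt, A)
different_sources = not np.allclose(B_alt, B)
print(same_mixture, different_sources)
```

Nothing in 'A' alone distinguishes (B, C) from (B_alt, C_alt); only outside knowledge about what the sources should look like can rule one of them out.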

To obtain the "other valid source of information," the CASA field has a
variety of threads.  One thread involves the use of conventional DSP
techniques to transform the composite signal into a (typically)
time-frequency representation, then to perform pattern extraction in the
transform domain.  Another thread uses biologically-inspired signal
processing via cochlear models and perceptually-derived nonlinear functions
borrowed from the perceptual audio coding field.  Yet another thread starts
with human psychoacoustical data in an attempt to exploit the cognitive
concepts of source segregation and streaming.
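
The first thread (transform, then extract patterns) can be sketched in a deliberately easy toy case. Everything here is an assumption for illustration: two stationary sine tones that do not overlap in frequency, frequencies chosen to fall exactly on FFT bins, non-overlapping rectangular frames, and a hand-picked 1 kHz cutoff standing in for the "other valid source of information." Real mixtures overlap in time and frequency, which is where the difficulty lies.

```python
import numpy as np

fs = 8000        # sample rate in Hz (assumed)
frame = 256      # frame length in samples (assumed)
n_frames = 20

t = np.arange(frame * n_frames) / fs
low = np.sin(2 * np.pi * 500.0 * t)           # hypothetical source 1
high = 0.7 * np.sin(2 * np.pi * 2000.0 * t)   # hypothetical source 2
mix = low + high

# Time-frequency representation: FFT of non-overlapping rectangular frames.
frames = mix.reshape(n_frames, frame)
spec = np.fft.rfft(frames, axis=1)
freqs = np.fft.rfftfreq(frame, d=1.0 / fs)

# "Pattern extraction" reduced to a hand-chosen rule:
# bins below 1 kHz are assigned to one source, the rest to the other.
mask = freqs < 1000.0

low_est = np.fft.irfft(spec * mask, n=frame, axis=1).reshape(-1)
high_est = np.fft.irfft(spec * ~mask, n=frame, axis=1).reshape(-1)

err_low = np.max(np.abs(low_est - low))
err_high = np.max(np.abs(high_est - high))
print(err_low, err_high)
```

The separation works here only because the cutoff encodes prior knowledge about where each source lives in frequency; without that knowledge the mask is exactly the unknown one is trying to estimate.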

It is sometimes argued that "humans can do separation, so the problem must
be soluble."  I would argue that humans do source _identification and
tracking_ very effectively, but perhaps humans do not actually solve the
computational _separation_ problem, in the sense that the individual vectors
'B', 'C', etc. are extracted in a neural signal processing context.

A computational system that can reliably classify the number,
identity, and duration of overlapping sonic events seems like a first step
in the process.  Yet, I don't know of any system to date that comes close to
a casual human's ability to determine the orchestration of a musical
selection or recognize the doorbell at a noisy party.

We certainly need some new insights into the problem, so welcome aboard!

Rob Maher

--
Robert C. (Rob) Maher, Ph.D.
Associate Professor of Electrical and Computer Engineering
Montana State University-Bozeman
rob.maher@montana.edu

-----Original Message-----
From: Jon Boley [mailto:jdb@jboley.com]
Sent: Friday, April 30, 2004 7:59 AM
To: AUDITORY@LISTS.MCGILL.CA
Subject: Computational ASA

Hi all,
I am a grad student in the University of Miami's Music
Engineering program, and I am just starting to learn about
auditory scene analysis, particularly computational ASA models.

I know there are several CASA experts on this list, so I'd
like to ask why source separation seems to be so difficult.
It seems like the general consensus is that source
separation is far too difficult, and research has focused on
understanding features within a mix.  Yet, from what I've
read, current methods of feature extraction work quite well.
It only seems natural that we could write an algorithm that
groups these features according to their perceived source and
creates separate audio streams based on this information.
While this would be much more difficult in noisy or
reverberant environments, I would imagine it would be quite
simple in a less complex environment.
What is it that makes source separation so difficult?

Thanks,
Jon Boley
