
[no subject]

Several people have recently commented on the computer program I developed
with Mabel Wu -- which was described on AUDITORY on December 5th.
Most of the comments have been of the sort `Surely we need to consider
temporal aspects of the signal ...'   I'd like to respond to these comments.

In the ensuing decades (perhaps centuries?) there will be innumerable attempts
to build "auditory scene analyzers."  Some of these projects will be rather
ad hoc -- from which little useful knowledge will be gained.  Other projects
will be more systematic, and researchers will learn something from their
successes and failures.

Drawing on my own humbling experience, let me offer the following advice
concerning how this sort of research ought to proceed.  First, in brief:

   (1) Clarify whether the problem you're attempting to solve is auditory scene
       analysis or acoustic scene analysis.

   (2) Evaluate your system.

   (3) Investigate one hypothesis at a time.

Expanding on these points:

(1) Clarify whether the problem you're attempting to solve is auditory scene
    analysis or acoustic scene analysis.

There has already been a helpful discussion on AUDITORY concerning this issue.
Any research group needs to clarify whether their goal is to emulate human
behavior (with all its foibles and intricacies), or whether their goal is a
description of the acoustic scene.  (Of course this doesn't preclude
fashioning an ACOUSTIC scene analyzer along the lines of the AUDITORY system.)

(2) Evaluate your system.

If we want to have a sense of the success of some system, we need to evaluate
it in ways that admit comparisons between different approaches.  In the field
of optical character recognition (OCR), for example, there are standardized
samples of printed and hand-written letters.  These samples have been carefully
gathered to be representative of the population of printed and written texts
in various languages and for various tasks -- such as hand-written zip codes.
These samples provide a standard measurement tool for comparative evaluation
of different OCR systems.  Without these measurement tools it's possible to
tweak system parameters so that the performance looks impressive for some
arbitrary subset of inputs.

In my opinion, we will need to develop a similar set of benchmark stimuli
for acoustic and auditory scene analysis.  That is, we need to assemble a
set of recorded stimuli from a variety of acoustic situations, including
mixtures of speech, music, environmental sounds, differing numbers of acoustic
sources, harmonic and inharmonic sources, heavily masked and unmasked
scenes, etc.  In the case of auditory scene analysis, we would also need
corresponding perceptual data for each stimulus.  For example, we would need
to know how many sound sources typical listeners identify for each scene,
the variance in listener responses, etc.  Of course, developing such a
benchmark set of stimuli would be a major project.
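
To make the proposal concrete, a single benchmark entry might pair a recorded
stimulus with both kinds of ground truth: the acoustic facts of the mix, and
listeners' perceptual reports.  The sketch below is purely illustrative -- no
such corpus exists, and every field name is an assumption:

```python
from dataclasses import dataclass, field
from statistics import mean, variance

@dataclass
class BenchmarkStimulus:
    """One entry in a hypothetical benchmark set for scene analysis.

    All field names are illustrative, not drawn from any existing corpus.
    """
    path: str                # location of the recorded stimulus
    category: str            # e.g. "speech-mixture", "music", "environmental"
    true_source_count: int   # acoustic ground truth: sources mixed in
    listener_counts: list = field(default_factory=list)  # per-listener reports

    def perceived_sources(self):
        """Mean and variance of listeners' source-count reports."""
        if len(self.listener_counts) < 2:
            raise ValueError("need at least two listener reports")
        return mean(self.listener_counts), variance(self.listener_counts)

s = BenchmarkStimulus("mix01.wav", "speech-mixture", 3,
                      listener_counts=[3, 2, 3, 3, 4])
m, v = s.perceived_sources()
```

Storing the variance of listener responses, not just a single "correct" answer,
is what distinguishes an auditory benchmark from an acoustic one.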

Notice that without such a benchmark set of stimuli, each auditory/acoustic
scene analyzer can be optimized for some limited set of stimuli -- and so
can be made to look like it performs pretty well -- a problem that has plagued
OCR development.  If we want to compare different approaches to scene analysis,
it would be helpful to have some standard measurement tool(s).

(3) Investigate one hypothesis at a time.

We COULD program a single scene analyzer that included cochlear modeling,
frequency- and pitch-domain pattern recognition, amplitude co-modulation,
onset synchrony, temporal grouping, gestalt heuristics, etc. etc.  But we
would spend the next hundred years trying to optimize a thousand parameters.
Parameter-tweaking is a black hole that will lead to little knowledge.  It's
better for researchers to begin with a single model (such as a cochlear model)
and see how far it takes us.  Then we need to try an alternative approach and
compare the results.  Is method A better than method B (given our benchmark
stimuli)?  Does method A subsume the successes of method B or vice versa?
Does method A work better for a different class of stimuli than for method B?
And so on.  By comparing competing scene analysis hypotheses, we'll make
faster progress.

In other words, in the initial stages at least, it's better to examine one
approach at a time, rather than pasting together a number of approaches.
Certainly, we ought to avoid `everything but the kitchen sink' approaches.
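
The comparison questions above (is A better overall? does A subsume B's
successes?) can be posed mechanically once both methods are scored on shared
benchmark stimuli.  A minimal sketch, with invented names and per-stimulus
accuracies assumed to lie in [0, 1]:

```python
def compare_methods(scores_a, scores_b):
    """Compare two scene-analysis methods on a shared benchmark.

    scores_a / scores_b map stimulus ids to per-stimulus accuracy in [0, 1].
    Returning the stimuli each method wins (rather than a single average)
    lets us ask whether one method subsumes the other's successes, or
    whether each excels on a different class of stimuli.
    """
    shared = scores_a.keys() & scores_b.keys()
    a_wins = {s for s in shared if scores_a[s] > scores_b[s]}
    b_wins = {s for s in shared if scores_b[s] > scores_a[s]}
    ties = shared - a_wins - b_wins
    return a_wins, b_wins, ties

a = {"mix01": 0.9, "mix02": 0.4, "mix03": 0.7}
b = {"mix01": 0.6, "mix02": 0.8, "mix03": 0.7}
a_wins, b_wins, ties = compare_methods(a, b)
```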


In light of the above three principles, you can see why I think some of the
recent responses to the work I reported aren't especially helpful:

First, our program was not intended to be a "model" for scene analysis.
Our program was an implementation of Terhardt's virtual pitch algorithm
coupled with some peak-tracing methods.  The question we wanted to answer
was: how well can virtual pitch information alone contribute to successful
scene parsing?
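
For readers unfamiliar with the approach: Terhardt's model infers a virtual
pitch from the subharmonic coincidences of spectral components.  The toy
sketch below is a crude subharmonic-summation caricature in that spirit --
it is NOT Terhardt's actual algorithm, nor the program Mabel Wu and I wrote,
and the 0.84 compression weight is borrowed from Hermes-style subharmonic
summation as an assumed parameter:

```python
def virtual_pitch_candidate(peaks, f0_grid, max_harmonic=8, tol=0.03):
    """Toy subharmonic-summation pitch estimate (illustrative only).

    peaks: list of (frequency_hz, amplitude) spectral peaks
    f0_grid: candidate fundamental frequencies to score
    A peak lying near an integer multiple of a candidate f0 adds support,
    weighted by its amplitude and compressed for higher harmonic numbers.
    """
    scores = {}
    for f0 in f0_grid:
        total = 0.0
        for freq, amp in peaks:
            n = round(freq / f0)              # nearest harmonic number
            if 1 <= n <= max_harmonic and abs(freq - n * f0) <= tol * f0:
                total += amp * 0.84 ** (n - 1)  # compress higher partials
        scores[f0] = total
    return max(scores, key=scores.get)

# A complex tone with a missing fundamental: partials at 400, 600, 800 Hz
peaks = [(400.0, 1.0), (600.0, 0.8), (800.0, 0.6)]
best = virtual_pitch_candidate(peaks, f0_grid=[100.0, 200.0, 400.0])
```

The missing-fundamental case is the interesting one: the 200 Hz candidate
wins even though no 200 Hz component is present in the spectrum.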

Second, I never suggested that temporal factors wouldn't be a useful approach.
Quite the reverse; here's what I wrote in my note of December 5th:

  "Our original intention was to push the pitch-unit spectrum approach as
   far as we could, and then try an independent tack by examining temporal
   information ..."

-- in short, one hypothesis at a time.

In my opinion, the major problem with the work we did was that we screwed up
on principles (1) and (2).  We never got to the EVALUATION for a number of
reasons.  Not least of these was that we'd made the first error -- namely,
failing to decide whether our goal was acoustic scene analysis or auditory
scene analysis.  This may seem like a trivial distinction, but it has
concrete repercussions for how you measure the system's performance.  In the
first case, you need to enumerate the acoustic sources used to generate the
recorded stimuli; in the second case you need to compare the system to
subjective reports evoked by the stimuli.  Such lessons are hard-won.
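
The distinction cashes out as two different scoring rules.  A minimal sketch
(names and the modal-response summary are my assumptions, not an established
protocol):

```python
def acoustic_score(predicted_count, true_source_count):
    """Acoustic scene analysis: compare against the enumerated
    sources actually mixed into the recording."""
    return predicted_count == true_source_count

def auditory_score(predicted_count, listener_counts):
    """Auditory scene analysis: compare against what listeners report,
    here crudely summarized as the modal reported source count."""
    modal = max(set(listener_counts), key=listener_counts.count)
    return predicted_count == modal

# A scene mixed from 4 sources that most listeners hear as 3:
listeners = [3, 3, 2, 3, 4]
acoustic_ok = acoustic_score(3, 4)          # fails the acoustic criterion
auditory_ok = auditory_score(3, listeners)  # passes the auditory criterion
```

The same system output can fail one criterion and pass the other, which is
why the goal has to be settled before evaluation begins.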

Perhaps people would care to comment on my suggestion concerning a set of
benchmark stimuli for scene analysis.  I'd be interested to hear suggestions
as to the type of stimuli that would need to be included, whether real-world
sampling is important, and if so, what sort of sampling method would be
appropriate.  Perhaps we could distribute a collection of such stimuli on a
limited-run CD.

David Huron