At 1:17 AM -0800 3/31/98, Sue Johnson wrote:
>I think there must be some way the brain splits up (deconvolves) the
>signal before applying a speech recogniser.
Who says the brain operates in only a bottom-up manner?
The best counter-example I know of is a song by Miriam Makeba where she
sings in an African click language. To my non-African ears, a click in the
middle of speech is heard as speech, but when the same sound is accompanied
by music it is heard as a drum beat. An ambiguous sound changes its
grouping based on the context. A similar argument can be made about
sine-wave speech--it has both speech-like and tone-like components.
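(For the curious, here is a minimal sketch of how sine-wave speech is
made, after Remez et al.: the lowest formants of an utterance are
replaced by time-varying sinusoids. The Python below is only an
illustration--the formant tracks are invented; real ones would come from
an LPC or formant-tracking analysis of recorded speech.)

import numpy as np

def sinewave_speech(freqs, amps, fs=16000):
    # freqs, amps: (n_formants, n_samples) arrays of per-sample
    # formant frequencies (Hz) and linear amplitudes.
    # Integrate frequency to get phase, then sum the sinusoids.
    phases = 2 * np.pi * np.cumsum(freqs, axis=1) / fs
    return np.sum(amps * np.sin(phases), axis=0)

# Invented tracks: three formants gliding over one second.
fs, n = 16000, 16000
t = np.linspace(0.0, 1.0, n)
freqs = np.vstack([ 500 + 200 * t,    # F1
                   1500 - 300 * t,    # F2
                   2500 + 100 * t])   # F3
amps = np.vstack([0.5 * np.ones(n), 0.3 * np.ones(n), 0.2 * np.ones(n)])
y = sinewave_speech(freqs, amps, fs)   # play y at rate fs to hear it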
Perhaps an even better example is the McGurk effect--wow, a low-level
auditory decision changes based on visual input. (Hear /ba/ while
watching lips say /ga/ and you perceive /da/.) Certainly the visual
system isn't connected at a low level to the auditory system. Some
information *must* be travelling top-down. It's expectation-driven.
Most work on Computational Auditory Scene Analysis has assumed a bottom-up
processing model. That's certainly the easy, engineering approach. But
there is much evidence that life is not so simple. In a chapter I wrote
called "A Critique of Pure Audition" I argue that there are too many
(often conflicting) claims about which processing stage must come first
for audition to be purely bottom-up.
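(To make "bottom-up" concrete, here is a toy Python sketch of the kind
of pipeline I mean: a crude bandpass filterbank standing in for a
cochlear model, envelopes, and then grouping by one of Bregman's
low-level cues, common onset. Every filter, threshold, and window below
is invented for illustration; real CASA systems are far more elaborate,
but the information still flows strictly from signal to streams.)

import numpy as np
from scipy.signal import butter, lfilter, hilbert

def filterbank(x, fs, center_freqs, bw=0.2):
    # Crude bandpass bank standing in for a cochlear model.
    channels = []
    for fc in center_freqs:
        lo, hi = fc * (1 - bw), fc * (1 + bw)
        b, a = butter(2, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
        channels.append(lfilter(b, a, x))
    return np.array(channels)

def group_by_common_onset(channels, fs, thresh=0.1, window=0.02):
    # One of Bregman's bottom-up cues: channels whose envelopes start
    # within `window` seconds of each other are grouped into one stream.
    env = np.abs(hilbert(channels, axis=1))
    onsets = np.argmax(env > thresh * env.max(), axis=1)
    order = np.argsort(onsets)
    groups, current = [], [order[0]]
    for ch in order[1:]:
        if onsets[ch] - onsets[current[-1]] < window * fs:
            current.append(ch)
        else:
            groups.append(current)
            current = [ch]
    groups.append(current)
    return groups   # lists of channel indices, one list per "stream"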
Bregman's book mostly discusses bottom-up grouping cues. But I'm sure
there must be grouping based on language and similar high-level constructs.
I don't know how you would prove it. (Anecdotally, I did notice at a
Japanese cocktail party that it was easier to separate native English
speakers, probably because their prosody fit my expectations.)
Many examples of these effects (including the Click Song) are available at
An early version of the chapter, before a massive edit to clean up the
language, is online at
Unfortunately, copyright restrictions mean that the final chapter is not
online. You'll have to get a copy of the book.