Re: CASA problems and solutions
Al Bregman wrote:
> Apart from spatial origin, the following sorts of information are used by
> (A) For integrating components that arrive overlapped in time:
> 1. harmonic relations
> 2. asynchrony of onset and offset
> 3. spectral separation
> 4. Independence of amplitude changes in different parts of the spectrum
> (B) For integrating components over time:
> 5. Spectral separation
> 6. Separation in time (interacts with other factors)
> 7. Differences in spectral shape
> 8. Differences in intensity (a weak effect)
> 9. Abruptness/smoothness of transition from one sound
> to the next
> I'm not sure whether your rejection of the Fourier method extends to all
> methods of decomposing the input into spectral components. However if it
> does, we should bear in mind that factors 3, 4, and 5, 7, and probably 1,
> listed above, are most naturally stated on a frequency x time
> representation -- that is, on a spectrogram or something like it.
I apologize that this is so long....
Al's handout and list are very useful in organizing what we want to explain, and
Al has done more than anyone in keeping questions of perceptual organization on
the table. The question is what kind of auditory representation we need to
explain the structure of perceptual organization that we observe in ourselves.
Certainly if you look at a spectrogram of a harmonic complex, the low pitch of
the complex (at the fundamental) is not readily apparent (unlike what we hear),
and its estimation involves some fairly elaborate inferences from the spectral
In order to determine the pitch, one either needs to identify the frequencies
of the harmonics and then derive their greatest common divisor
or use some harmonic template or trained neural network pattern recognizer
(subharmonic sieves take one out of the Fourier perspective). A log frequency
plot obscures the harmonic structure, a linear frequency plot distorts
distances between frequencies. No Fourier description shows pattern similarities
between octaves -- there is nothing special about octaves in a spectrogram.
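A minimal sketch of the "greatest common divisor" route to the fundamental, assuming the partial frequencies have already been read off the spectrum as integers in Hz. The function name `fundamental_from_partials` and the example frequencies are hypothetical, chosen only for illustration. It also shows how brittle the exact computation is: mistuning one partial by 3% collapses the result, which is one reason templates or sieves with tolerances are needed in a spectral approach.

```python
from functools import reduce
from math import gcd


def fundamental_from_partials(partials_hz):
    """Return the greatest common divisor of integer partial frequencies (Hz)."""
    return reduce(gcd, partials_hz)


# Harmonics 3, 4 and 5 of a 200 Hz fundamental that is itself absent.
in_tune = [600, 800, 1000]

# Same complex with the middle partial mistuned upward by 3% (illustrative value).
mistuned = [600, 824, 1000]

print(fundamental_from_partials(in_tune))   # 200
print(fundamental_from_partials(mistuned))  # 8
```

The exact GCD of the mistuned set is 8 Hz, which says nothing useful about the pitch, so any practical spectral estimator has to add some tolerance rule on top of this arithmetic.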
It is very hard to determine from a spectrogram
if there are slight mistunings of harmonics, but we readily hear these.
Some birds recognize mistuned harmonics within a couple of pitch cycles; my
recollection is that we can do it within about 50 ms.
Maybe a spectrogram isn't bad for isolated pure tones, but this map becomes
messy and uninterpretable when there are multiple objects (e.g. several musical
instruments playing different notes) with complex, overlapping spectra.
Yet we hear out the different objects with relatively little difficulty.
We should unpack the phrase "decomposing the input into spectral components".
If one passes a signal through an array of band-pass filters, but uses the time
structure that comes out of the filters rather than reading off their activation
magnitudes (by whatever measure), then is this operation a decomposition?
It is possible to have frequency-by-frequency processing in the
time domain without necessarily ever using a spectral representation per se.
One gets neighborhood interactions and many of the other properties that we
usually think of in terms of running spectra, but the primitives of this
processing are fine temporal patterns (neurally, these are spike times, interspike
intervals) rather than profiles of frequency-channel activations.
The problem with frequency-time representations as a basis of scene analysis
is that these usually eliminate underlying temporal fine structure (or phase
information). Although our perception of pitch and timbre of stationary sounds is
largely insensitive to the phase spectrum (components < 2 kHz), the mechanisms by
which auditory objects are formed appear to be sensitive to abrupt changes in
phase (I am thinking of demonstrations of separation of partials from a harmonic
complex by abrupt changes
in phase and intensity --- following the change, the partial is heard out and
back into the whole).
Precessions of relative phase relations can be used to separate sounds.
This is very apparent in the double-vowel separations, where the two
vowel fundamentals are separated by a semitone or more. Each vowel has its own
repeating waveform pattern that generates, by virtue of phase-locking, a
correlation pattern. The relations between the two patterns associated with the
vowels are constantly shifting relative to each other, but the patterns are
internally invariant. If one has a mechanism by which
a temporal pattern is built up when it recurs, then such a mechanism will build
up each of the two vowels as different invariant patterns and separate them.
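A toy sketch of the build-up idea, under strong simplifying assumptions: each "vowel" is just a short repeating sample pattern, and each "mechanism" is a delay loop that averages whatever recurs at its own period. This is a plain synchronous-averaging loop, not the recurrent timing nets referred to in this post, and the name `build_up` and the two patterns are hypothetical.

```python
def build_up(signal, period):
    """Average the signal across successive cycles of a loop `period` samples long."""
    sums = [0.0] * period
    counts = [0] * period
    for t, x in enumerate(signal):
        sums[t % period] += x
        counts[t % period] += 1
    return [s / c for s, c in zip(sums, counts)]


def remove_offset(loop):
    """Subtract the constant floor left behind by the other, smeared-out pattern."""
    floor = min(loop)
    return [round(v - floor, 6) for v in loop]


# Two repeating patterns with different periods (10 and 11 samples), summed.
pattern_a = [3, 1, 0, 0, 2, 0, 0, 0, 0, 0]
pattern_b = [0, 2, 0, 0, 0, 1, 0, 3, 0, 0, 0]
n = 1100  # ten full cycles of the combined 110-sample pattern
mixture = [pattern_a[t % 10] + pattern_b[t % 11] for t in range(n)]

# Each loop reinforces the pattern that recurs at its own period; the other
# pattern drifts through every phase of the loop and averages to a constant.
recovered_a = remove_offset(build_up(mixture, 10))
recovered_b = remove_offset(build_up(mixture, 11))

print(recovered_a)
print(recovered_b)
```

Because the relative phase of the two patterns precesses from cycle to cycle, each loop ends up holding only the pattern that is invariant at its own period, which is the separation described above.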
Formation of objects can thus depend on coherence of temporal fine structure. I
think this is close to the kinds of relational, correlational processes that the
Gestaltists had in mind.
I have a paper and a poster on my website on recurrent timing nets that build up
If one adopts the standard frequency-time perspective, these scene analysis
mechanisms that depend on temporal fine structure
are no longer available, and one must then search around for other means of
building up objects (e.g. prior expectations).
> Furthermore, when you look at a spectrographic representation of an auditory
> signal, the visual grouping that occurs is often directly analogous to the
> auditory organization (provided that the time and frequency axes are
> properly scaled).
I don't think this works very well for the pitches of harmonic and inharmonic
> Why would this be so if some sort of frequency axis were
> not central to auditory perception, playing a role analogous to a spatial
> dimension in vision?
We probably should not appeal to sensory systems whose operation is not
well understood. Is there a cogent, compelling theory out there of how visual
forms are neurally represented?
(The "pixel pattern" model of vision doesn't work very well either when multiple
objects enter visual scenes.)
But what if vision depends on spatiotemporal correlations between spikes (fine
between spikes at different retinal locations), rather than a retinal rate-place
It turns out that they have a hyperacuity problem in vision that is not unlike
ours for frequency,
and that vernier acuity limits can be explained if visual neurons can respond
to edges with jitters of a millisecond or less (provided this information can be
Following Bialek and others, the vision community is slowly finding out that
there is stimulus information
in spike timings down to a millisecond and maybe less (the limits of their
It could well be the case that the limits of visual acuity, like the limits of
frequency discrimination, depend on precisions of spike timings, not on rate
If so, then the retina could be regarded as a gigantic temporal
with vision as a system that looks not unlike binaural cross-correlation, but
with temporal correlations
across many retinal positions rather than across two corresponding cochlear
> Perhaps the Fourier transform is not the best
> approach to forming this frequency dimension, but something that does a
> similar job is required.
IMHO, something like running auto- and cross-correlations that retain fine time
structure is needed. They share many common properties with spectrograms
without throwing away half the information. Or perhaps some way of operating on
the running phase spectrum. One needs a process that is sensitive to phase
invariants and changes for object formation and separation, but then one
subsequently needs a process that is largely phase insensitive for pitch and
timbre comparisons. It could be the case (as John Culling suggests for location)
that the objects are formed first and then their properties (such as pitch,
timbre, location) are analyzed/compared. This might explain why harmonic
relations are more important for grouping than ITD cues.
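A minimal sketch of a time-domain pitch estimate, assuming a sampled waveform rather than spike trains: an ordinary autocorrelation over a 50 ms window, searched for its largest peak. The sample rate, window length, lag range and function names are all illustrative assumptions. The example complex has partials at 600, 800 and 1000 Hz with no energy at 200 Hz, and the estimate comes out the same for two different sets of component phases.

```python
import math

FS = 8000  # sample rate in Hz (assumed for the example)


def harmonic_complex(freqs_hz, phases, n):
    """Sum of equal-amplitude cosines sampled at FS, n samples long."""
    return [
        sum(math.cos(2 * math.pi * f * t / FS + p) for f, p in zip(freqs_hz, phases))
        for t in range(n)
    ]


def best_lag(x, window, lags):
    """Return the lag (in samples) with the largest autocorrelation over the window."""
    def acf(lag):
        return sum(x[t] * x[t + lag] for t in range(window))
    return max(lags, key=acf)


freqs = [600, 800, 1000]   # harmonics 3, 4, 5 of a missing 200 Hz fundamental
lags = range(8, 73)        # candidate periods from 1000 Hz down to about 110 Hz
window = 400               # 50 ms at 8 kHz
n = window + max(lags)

for phases in ([0.0, 0.0, 0.0], [0.3, 2.1, 4.0]):
    lag = best_lag(harmonic_complex(freqs, phases, n), window, lags)
    print(lag, FS / lag)   # 40 200.0 for both phase sets
```

The peak sits at the period of the fundamental even though no component has that frequency, and the steady-state result does not depend on the phases, while the waveform itself (and hence any change in it) still carries the phase information.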
> Finally there is overwhelming physiological
> evidence that the human nervous system does a frequency analysis of the
> sound and retains separate frequency representations all the way to the
We should be careful about this. (What evidence do you have in mind?)
There is a bit of a disconnect between tonotopy studies and the
neural representations that subserve (fine) perceptual discriminations.
Tonotopic organization of a relatively coarse nature is only seen if
one looks at/near neural response thresholds, and this organization invariably
breaks down at moderate to high levels. It may not be a "representation" in
the functional sense. Tonotopy could simply be a reflection
of the organization of the (cochlear) receptor surface and the tendency of
correlated inputs to cluster in local spatial neighborhoods, rather than the
neural representational mechanism through which fine pitch discriminations are made.
(If we looked at rate-place profiles in the auditory nerve in response to a
complex presented at 80 dB SPL, rate patterns would be very broad,
and we would be very hard pressed to estimate with
any degree of accuracy or reliability the frequency of the fundamental -- maybe
we could get within a half-octave if we were lucky, but this is 2 orders of
magnitude coarser than pitch jnd's.) I have the same feeling when I look at
systematic current-source density analysis of cortical responses to
harmonic complex tones -- one begins to see
spatial activation structure only when the harmonics are separated by about
half an octave (350 Hz for an 800 Hz BF). At the level of the auditory nerve,
the information for fine periodicity analysis is in the spike timing,
not in the (tonotopically organized) rate profiles. What aspects of cortical
response subserve pitch discriminations of 0.5% or less in frequency is an open question.
(Does anyone have an answer to this question?)
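A quick check of the arithmetic behind the "2 orders" comparison, taking the 0.5% pitch jnd and the 800 Hz BF from the text above and assuming that "half an octave" means a frequency ratio of the square root of 2. The variable names are illustrative.

```python
import math

half_octave_fraction = math.sqrt(2) - 1   # fractional frequency change of half an octave
jnd_fraction = 0.005                      # 0.5% pitch jnd, as quoted above

ratio = half_octave_fraction / jnd_fraction
orders_of_magnitude = math.log10(ratio)
spacing_hz = 800 * half_octave_fraction   # half an octave above an 800 Hz BF

print(round(half_octave_fraction, 3))     # 0.414
print(round(ratio))                       # 83
print(round(orders_of_magnitude, 1))      # 1.9
print(round(spacing_hz))                  # 331
```

Under these assumptions a half-octave error is roughly 80 times the jnd, close to two orders of magnitude, and half an octave above 800 Hz is about 330 Hz, in the same range as the 350 Hz figure quoted.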
Perhaps I'm completely wrong in my wariness of the textbook view, as far as it
I think it's better not to wallpaper over the difficulties that auditory theory
currently faces. We need to make them as clear as we can.
Recognition of where current theory breaks down is absolutely
essential to future progress.