# Re: CASA problems and solutions (Peter Cariani)

```
Subject: Re: CASA problems and solutions
From:    Peter Cariani  <peter(at)EPL.MEEI.HARVARD.EDU>
Date:    Wed, 31 Jan 2001 17:36:31 -0500

Al Bregman wrote:
> Apart from spatial origin, the following sorts of information are used by
> humans:
> (A) For integrating components that arrive overlapped in time
>     1.  harmonic relations
>     2.  asynchrony of onset and offset
>     3.  spectral separation
>     4.  Independence of amplitude changes in different parts of the spectrum

> (B) For integrating components over time:
>     5.  Spectral separation
>     6.  Separation in time (interacts with other factors)
>     7.  Differences in spectral shape
>     8.  Differences in intensity (a weak effect)
>     9.  Abruptness/smoothness of transition from one sound
>          to the next
> ...

> I'm not sure whether your rejection of the Fourier method extends to all
> methods of decomposing the input into spectral components.  However if it
> does, we should bear in mind that factors 3, 4, and 5, 7, and probably 1,
> listed above, are most naturally stated on a frequency x time
> representation -- that is, on a spectrogram or something like it.

I apologize that this is so long....

Al's handout and list are very useful in organizing what we want to explain,
and certainly Al has done more than anyone in keeping questions of perceptual
organization on the table.  The question is what kind of auditory
representation do we need to explain this structure of perceptual organization
that we observe in ourselves.

Certainly if you look at a spectrogram of a harmonic complex, the low pitch of
the complex (at the fundamental) is not readily apparent (unlike what we hear),
and its estimation involves some fairly elaborate inferences from the spectral
pattern.  In order to determine the pitch, one either needs to identify the
frequencies of the harmonics and then derive their greatest common divisor,
or use some harmonic template or trained neural network pattern recognizer
(subharmonic sieves take one out of the Fourier perspective).  A log frequency
plot obscures the harmonic structure; a linear frequency plot distorts
perceptual distances between frequencies.  No Fourier description shows pattern
similarities between octaves -- there is nothing special about octaves in a
spectrogram.

It is very hard to determine from a spectrogram if there are slight mistunings
of harmonics, but we readily hear these deviations.  Some birds recognize
mistuned harmonics within a couple of pitch cycles; my recollection is that we
can do it within about 50 ms.

Maybe a spectrogram isn't bad for isolated pure tones, but this map becomes
rather messy and uninterpretable when there are multiple objects (e.g. several
musical instruments playing different notes) with complex, overlapping spectra.
Yet we hear out the different objects with relatively little difficulty.

We should unpack the phrase "decomposing the input into spectral components".
If one passes a signal through an array of band-pass filters, but uses the time
structure that comes out of the filters rather than reading off their
activation magnitudes (by whatever measure), then is this operation a
decomposition?  It is possible to have frequency-by-frequency processing in the
time domain without necessarily ever using a spectral representation per se.
One gets neighborhood interactions and many of the other properties that we
usually think of in terms of running spectra, but the primitives of this
representation are fine temporal patterns (neurally, these are spike times,
interspike intervals) rather than profiles of frequency-channel activations.

The problem with frequency-time representations as a basis of scene analysis
is that these usually eliminate underlying temporal fine structure (or phase
structure if one prefers).

Although our perception of pitch and timbre of stationary sounds is largely
insensitive to phase spectrum (components < 2 kHz), the mechanisms by which
auditory objects are formed appear to be sensitive to abrupt changes in phase
(I am thinking of Kubovy's demonstrations of separation of partials from a
harmonic complex by abrupt changes in phase and intensity --- following the
change, the partial is heard out and then blends back into the whole).

Precessions of relative phase relations can be used to separate sounds.
This is very apparent in the double-vowel separations, where the two
vowel fundamentals are separated by a semitone or more.  Each vowel has its own
repeating waveform pattern that generates, by virtue of phase-locking, a
multichannel temporal correlation pattern.  The relations between the two
patterns associated with the two vowels are constantly shifting relative to
each other, but the patterns themselves are internally invariant.  If one has a
mechanism by which a temporal pattern is built up when it recurs, then such a
mechanism will build up each of the two vowels as different invariant patterns
and separate them.  Formation of objects can thus depend on coherence of
temporal fine structure.  I think this is close to the kinds of relational,
correlational processes that the

I have a paper and a poster on my website on recurrent timing nets that build up
such patterns:
www.cariani.com
http://peter-office.meei.harvard.edu/ARO2kCariani.pdf
http://peter-office.meei.harvard.edu/TImingNets99.pdf

If one adopts the standard frequency-time perspective, these scene analysis
strategies that depend on temporal fine structure are no longer available, and
one must then search around for other means of building up objects (e.g. prior
expectations).

>
> Furthermore, when you look at a spectrographic representation of an auditory
> signal, the visual grouping that occurs is often directly analogous to the
> auditory organization (provided that the time and frequency axes are
> properly scaled).

I don't think this works very well for the pitches of harmonic and inharmonic
complexes.

> Why would this be so if some sort of frequency axis were
> not central to auditory perception, playing a role analogous to a spatial
> dimension  in vision?

We probably should not appeal to sensory systems whose operation is not
currently well understood.  Is there a cogent, compelling theory out there of
how visual forms are neurally represented?  (The "pixel pattern" model of
vision doesn't work very well either when multiple objects enter visual
scenes.)

But what if vision depends on spatiotemporal correlations between spikes (fine
temporal correlations between spikes at different retinal locations), rather
than a retinal rate-place model?  It turns out that they have a hyperacuity
problem in vision that is not unlike ours for frequency, and that vernier
acuity limits can be explained if visual neurons can respond to edges with
jitters of a millisecond or less (provided this information can be used).
Following Bialek and others, the vision community is slowly finding out that
there is stimulus information in spike timings down to a millisecond and maybe
less (the limits of their measurement precisions).  It could well be the case
that the limits of visual acuity, like the limits of auditory frequency
discrimination, depend on precisions of spike timings, not on rate tunings.
If so, then the retina could be regarded as a gigantic temporal
cross-correlator, with vision as a system that looks not unlike binaural
cross-correlation, but with temporal correlations across many retinal
positions rather than across two corresponding cochlear positions.

> Perhaps the Fourier transform is not the best
> approach to forming this frequency dimension, but something that does a
> similar job is required.

IMHO, something like running auto- and cross-correlations that retain fine time
structure is needed.  They share many common properties with spectrograms
without throwing away half the information.  Or perhaps some way of operating
on the running phase spectrum.  One needs a process that is sensitive to phase
invariants and changes for object formation and separation, but then one
subsequently needs a process that is largely phase insensitive for pitch and
timbre comparisons.  It could be the case (as John Culling suggests for
location) that the objects are formed first and then their properties (such as
pitch, timbre, location) are analyzed/compared.  This might explain why
harmonic relations are more important for grouping than ITD cues.

> Finally there is overwhelming physiological
> evidence that the human nervous system does a frequency analysis of the
> sound and retains separate frequency representations all the way to the
> brain.

There is a bit of a disconnect between tonotopy studies and the
neural representations that subserve (fine) perceptual discriminations.
Tonotopic organization of a relatively coarse nature is only seen if
one looks at/near neural response thresholds, and this organization invariably
breaks down at moderate to high levels. It may not be a "representation" in
the functional sense. Tonotopy could simply be a reflection
of the organization of the (cochlear) receptor surface and the tendency of
correlated inputs to cluster in local spatial neighborhoods, rather than the
neural representational mechanism through which fine pitch discriminations are
effected.
(If we looked at rate-place profiles in the auditory nerve in response to a
harmonic complex presented at 80 dB SPL, rate patterns would be very broad,
and we would be very hard pressed to estimate with any degree of accuracy or
reliability the frequency of the fundamental -- maybe we could get within a
half-octave if we were lucky, but this is 2 orders of magnitude coarser than
pitch jnd's.)  I have the same feeling when I look at Steinschneider's
systematic current-source density analysis of cortical responses to
harmonic complex tones -- one begins to see spatial activation structure only
when the harmonics are separated by about half an octave (350 Hz for an
800 Hz BF).  At the level of the auditory nerve, the information for fine
periodicity analysis is in the spike timing, not in the (tonotopically
organized) rate profiles.  What aspects of cortical neural response subserve
pitch discriminations of 0.5% or less in frequency is an open question.
(Does anyone have an answer to this question?)

Perhaps I'm completely wrong in my wariness of the textbook view, as far as it
goes, but I think it's better not to wallpaper over the difficulties that
auditory theory currently faces.  We need to make them as clear as we can.
Recognition of where current theory breaks down is absolutely essential to
future progress.

Peter Cariani
```
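
As an illustration of the point in the message above about harmonic complexes, here is a minimal sketch, not taken from the posting itself, of how a time-domain autocorrelation can recover the low pitch of a complex whose spectrum has no energy at the fundamental. The function names, the 16 kHz sampling rate, the 50 ms window, the 50-500 Hz search range, and the choice of harmonics (600, 800 and 1000 Hz, implying a 200 Hz fundamental) are all assumptions made for the example. It is a plain signal-level autocorrelation, not a model of the neural timing nets the author describes.

```python
import math

def harmonic_complex(freqs_hz, fs, n_samples):
    """Sum of equal-amplitude cosines at the given frequencies."""
    return [sum(math.cos(2 * math.pi * f * n / fs) for f in freqs_hz)
            for n in range(n_samples)]

def autocorrelation(x, lag):
    """Biased (unnormalized) autocorrelation of x at a single lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def pitch_from_autocorrelation(x, fs, f_min=50.0, f_max=500.0):
    """Pitch estimate: fs divided by the lag with the largest autocorrelation.

    Only lags corresponding to pitches between f_min and f_max are searched.
    """
    min_lag = int(fs / f_max)
    max_lag = int(fs / f_min)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(x, lag))
    return fs / best_lag

def dft_magnitude(x, fs, freq_hz):
    """Magnitude of the DFT of x evaluated at one frequency."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * n / fs)
             for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * n / fs)
             for n, v in enumerate(x))
    return math.hypot(re, im)

FS = 16000                      # assumed sampling rate in Hz
N = 800                         # 50 ms window at 16 kHz

# Harmonics 3, 4 and 5 of 200 Hz; the fundamental itself is absent.
signal = harmonic_complex([600.0, 800.0, 1000.0], FS, N)

pitch_hz = pitch_from_autocorrelation(signal, FS)
energy_at_f0 = dft_magnitude(signal, FS, 200.0)
energy_at_600 = dft_magnitude(signal, FS, 600.0)

print("autocorrelation pitch estimate (Hz):", pitch_hz)
print("DFT magnitude at 200 Hz:", energy_at_f0)
print("DFT magnitude at 600 Hz:", energy_at_600)
```

The largest autocorrelation peak falls at the common period of the three components, so the pitch estimate lands on the missing fundamental, while the single-frequency DFT shows energy at 600 Hz and essentially none at 200 Hz. The biased autocorrelation is used deliberately: it weights shorter lags more heavily, which favors the fundamental period over its multiples.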

This message came from the mail archive
http://www.auditory.org/postings/2001/
maintained by:
Dan Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University