
Analytical approach to temporal coding, CASA and other matters

Hi Drs. Bates, Blumschein, Bregman, Cariani, Greenberg, Deliang Wang,
and other list people,

I: Introduction:

We (Yadong Wang and I) have been preparing to respond to many of
the recent postings (see below) in this forum and other related
venues (e.g., Blumschein's). I apologize for using last names and for
the length of this posting. Our approach to signal representation (at
various stages of development) using a possible temporal code is
documented in our recent publications (see refs. in the manuscript)
and in a manuscript (http://www.ele.uri.edu/~ydwang/Jasa.pdf). We would
like to relate these to the recent comments made on this topic.

Recent postings by Bates, Bregman, Blumschein, Cariani, Wang, Greenberg
and a whole lot of related literature point to the fact that temporal
information (not just the spectral energy as seen in a spectrogram)
has to play an important role in auditory function. A key question is
how this temporal information is coded and processed in higher
auditory centers. Of course, the short and definitive answer is that
we don't know. Nobody knows. However, we have a possible analytical
approach to this problem and this is the topic of this posting.

II: References:

Many people on this Auditory list have made insightful comments on
related problems of temporal coding (rate/inter-spike intervals),
the spectrogram, coincidence detection in the higher auditory centers,
etc. We refer to some of these below.

1 E.Blumschein, "Why and how to revise the traditional theory of
auditory function" at

2 J.Bates, "How to hear everything and listen to anything " at

3 A.Bregman, "Spectrogram frequency axis is central to auditory
perception" at http://www.psych.mcgill.ca/labs/auditory/findings.html

4 Peter Cariani, "The problem with frequency-time representations as a
basis of scene analysis is that these usually eliminate underlying
temporal fine structure.." at http://www.cariani.com

5 Peter Meijer, "Note that even the spectrogram itself can be viewed
as an attempt to design a useful artificial cross-modal mapping from
the auditory to the visual domain!" at

6 D. Wang, "CASA problems and solutions " at

We have some specific remarks on some of these.

III: Main Points

We have divided our comments into four main categories.

A) Motivation for our temporal code.
B) A brief summary of our temporal code
C) Speculations on how this might be achieved in the auditory periphery
D) Specific comments on other people's remarks

III.A) Motivation for our temporal code:

1) Is there a temporal code that can faithfully represent a
   bandpass signal?

The auditory periphery (specifically, the inner ear) converts an
acoustic signal that is incident on it into a neural code. This code
is believed to be some combination of spike rate and relative timing
between spikes. Experimental evidence seems to suggest that
interaural time difference is discernible to an accuracy of
microseconds (as we have read in many papers; apparently it is even
better for bats). If that is so, then it is conceivable that an
acoustic signal's characteristics (envelope, phase,
frequency, intensity) are also accurately encoded in the spike train
timing information. Therefore, a natural question to ask is: "Is it
at all possible to code a signal's characteristics with accurate
timing information?" If YES, then maybe we should look for such a
code. If we can find one, then one may further ask: Does this
code have ANYTHING to do with how the auditory system
represents and perceives signals?

2) Do current models of the cochlea help in discovering such a code?

Cochlear signal processing involves the incredibly delicate inner ear
mechanisms. The components of this mechanism (outer/inner hair cells,
OHC/IHC; membranes, BM/TM; etc.) have been modeled extensively in the
literature. Although these models
attempt to explain experimentally measured physical quantities
like membrane displacement/velocity, pressure etc., they have not
shed new light on how signal processing is achieved (other than
gross, maybe nonlinear, filtering) and how this leads to a temporal
code. Basically, these modeling approaches attempt to explain the
cochlear mechanics accurately, and assume that the temporal encoding
of signal characteristics will more or less take care of itself (i.e.,
that they somehow get represented in the spike rate/inter-spike
intervals). Contrary to this belief, it is possible that the
motor-like OHC action may indeed be there (at least partly) to
facilitate accurate temporal encoding of signals. Based on this
premise, we propose a bottom-up approach: if
we propose a temporal coding strategy (based on a theoretically
sound signal representation), then we might start
looking for possible cochlear mechanisms that may be able to achieve
such a coding. This is certainly not conventional. But this is our

3) How to go about looking for such a code?

Assume that a bandpass signal, say, centered around 2000 Hz, is
applied to the auditory periphery. For purposes of visualization,
assume that this bandpass signal is composed of a few tones
(1800 Hz to 2200 Hz, spaced 100 Hz apart). This signal could represent
a speech formant. After filtering by various peripheral structures,
this bandpass signal is encountered by the organ of
Corti (OC). The OC has an opportunity to 'look' at this signal for a
duration of T seconds (perhaps a few milliseconds, depending on the
characteristic frequency). Think of these T seconds as the 'dwell' time
on the OC. In this T-second duration the OC has to come up with a
representation for this signal. Then the 'dwell window' slides on
continually to the next T seconds and so on. A general model
for this signal, s(t), during these T seconds is (in LaTeX notation)

s(t)=A a(t) cos(\omega_c t+ \phi(t)),

where A is the scale factor representing the 'bigness' of the signal,
a(t) is the normalized envelope (i.e., the maximum of a(t) is 1),
\omega_c is the nominal center frequency and \phi(t) is the signal
phase. Any bandpass signal would fit this mold. We may point out that
(large) A and \omega_c are immediately visible in the spectrogram as
dark regions. It is the details of a(t) and \phi(t) that are obscured
in a spectrographic presentation.
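
To make the model concrete, here is a minimal numerical sketch (not
taken from the manuscript). It builds the five-tone example mentioned
above and splits it into A, a(t) and \phi(t) using the analytic
signal. The sampling rate, the 20 ms window (longer than the few
milliseconds suggested above, chosen only so that every tone completes
a whole number of cycles) and the helper name analytic_signal are
assumptions made for illustration.

```python
import numpy as np

fs = 16000.0                       # sampling rate in Hz (assumed)
T = 0.02                           # dwell window in seconds (assumed)
t = np.arange(int(fs * T)) / fs
tones = np.arange(1800.0, 2201.0, 100.0)   # 1800, 1900, ..., 2200 Hz
s = sum(np.cos(2 * np.pi * f * t) for f in tones)

def analytic_signal(x):
    """Return x + j*Hilbert{x} via the FFT (x real, even length)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0               # keep DC
    weights[n // 2] = 1.0          # keep Nyquist
    weights[1:n // 2] = 2.0        # double positive frequencies
    return np.fft.ifft(np.fft.fft(x) * weights)

omega_c = 2 * np.pi * 2000.0       # nominal center frequency (rad/s)
z = analytic_signal(s)
envelope = np.abs(z)
A = envelope.max()                 # scale factor ('bigness')
a = envelope / A                   # normalized envelope, max(a) = 1
phi = np.unwrap(np.angle(z)) - omega_c * t   # phase relative to the carrier

# Putting the pieces back together recovers the waveform
s_rebuilt = A * a * np.cos(omega_c * t + phi)
print("A =", A, " max reconstruction error =", np.max(np.abs(s_rebuilt - s)))
```

Here A and \omega_c play the role of the 'place' information, while
a and phi hold the detail that a spectrogram does not show.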

In short, \omega_c (and possibly A) constitute (loosely speaking) the
'tonotopic place' information and a(t) and \phi(t) are conveyed by the
inter-spike intervals. (The question is how?) In principle, this signal
model could be applied to different spectral regions
of a signal and thereby is a representation for  the entire (speech)

III.B) Our Temporal Code:

1) Zero-Crossings of s(t) are insufficient:

How do we represent s(t) by a temporal code? Don't even think about
using the zero-crossings of s(t) (i.e., producing a spike at every
other zero-crossing of s(t)), because it is well known that it
is not always possible to represent an arbitrary bandpass signal
using its zero-crossing information. Only in special cases
can the zero-crossings of a signal be used to represent it. These
zero-crossing representations were studied by Logan in
the early 1970s. The references may be found in our papers.
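
One simple way to see why the zero-crossings of s(t) alone fall short
(a sketch of our own for this posting, with made-up tone frequencies
and a hypothetical helper zero_crossing_indices): multiplying a
bandpass signal by any strictly positive, slowly varying function
gives a different bandpass signal with exactly the same zero-crossings.

```python
import numpy as np

def zero_crossing_indices(x):
    """Indices i where x[i] and x[i+1] have opposite signs."""
    negative = np.signbit(x)
    return np.nonzero(negative[:-1] != negative[1:])[0]

fs = 16000.0                                  # sampling rate in Hz (assumed)
t = (np.arange(320) + 0.5) / fs               # 20 ms of samples

# Two-tone bandpass signal (frequencies chosen only for illustration)
s1 = np.cos(2 * np.pi * 2000 * t) + 0.5 * np.cos(2 * np.pi * 2100 * t)

# A strictly positive, slowly varying gain (never below 0.2)
gain = 1.0 + 0.8 * np.cos(2 * np.pi * 100 * t)
s2 = s1 * gain                                # still bandpass, different envelope

zc1 = zero_crossing_indices(s1)
zc2 = zero_crossing_indices(s2)
print("same zero-crossings:", np.array_equal(zc1, zc2))
print("same waveform:      ", np.allclose(s1, s2))
```

The two waveforms differ, yet a spike train triggered by their
zero-crossings would be identical, so the envelope difference is lost.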

2) Adaptive demodulation may be the key:

Remember that the modulation information (a(t) and \phi(t))
needs to be represented as timing information (at least as per
our model). Our key contribution is that by
'adaptively demodulating' (explained below) an arbitrary
bandpass signal s(t), we can represent its envelope a(t)
and phase \phi(t) using only (spike) timing information.
Along with A and \omega_c (the 'place' info), these then
completely characterize s(t). Hence our
representation is a form of sampling theorem.
The Shannon/Whittaker sampling theorem represents signals
by their sample values. In contrast, we represent a signal's
phase and envelope using certain zero-crossing locations
that arise in the process of adaptive demodulation.
(These are NOT the zero-crossings of s(t).)
Many versions of this algorithm are possible. The details are in our
paper but the mechanics of the algorithm are described briefly below.

3) Some essentials of adaptive demodulation:

We synthesize a signal called r(t) (see Figure 9 in the manuscript)
and use it to demodulate s(t). That is, we define e(t)=s(t)r(t).
We choose r(t) so as to minimize the energy in e(t) (the integral
of the square of e(t) over the T seconds). It turns out that
the resulting signal,
r(t), contains information about the envelope of s(t) and is
completely representable by its zero-crossings. (Curiously, for
this to happen it turns out that the spectrum of r(t) should not
overlap that of the stimulus s(t). See Figures 9a and 9b.) Also,
e(t) contains information about the phase of s(t) and is also
completely representable by its zero-crossings. We call the process
of demodulating s(t) and representing it
by zero-crossings (of e(t) and/or r(t))
Real-Zero-Conversion or RZC. Figure 1 in the manuscript
summarizes the temporal coding method, i.e., it is
basically a filterbank followed by Real-Zero-Conversion.
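
The energy-minimization step can be illustrated with a toy
least-squares problem. This is only a sketch under our own
assumptions, not the algorithm in the manuscript: here r(t) is taken
to be 1 plus a few low-frequency harmonics of the T-second window
(the leading 1 is a normalization we impose so that r(t)=0 is not a
solution), and the number of harmonics K is arbitrary. It does not
attempt the zero-crossing representation itself.

```python
import numpy as np

fs = 16000.0                       # sampling rate in Hz (assumed)
T = 0.02                           # window length in seconds (assumed)
t = np.arange(int(fs * T)) / fs
s = sum(np.cos(2 * np.pi * f * t) for f in np.arange(1800.0, 2201.0, 100.0))

K = 8                              # harmonics used for r(t) (assumed)
Omega = 2 * np.pi / T              # fundamental of the window (50 Hz here)

# r(t) = 1 + sum_k [c_k cos(k Omega t) + d_k sin(k Omega t)]
basis = np.column_stack(
    [np.cos(k * Omega * t) for k in range(1, K + 1)]
    + [np.sin(k * Omega * t) for k in range(1, K + 1)]
)

# e(t) = s(t) r(t) = s(t) + (s(t) * basis) @ coeffs
# Minimizing sum(e**2) over coeffs is an ordinary least-squares problem.
B = s[:, None] * basis
coeffs, *_ = np.linalg.lstsq(B, -s, rcond=None)

r = 1.0 + basis @ coeffs           # the adapted demodulating signal
e = s * r                          # the demodulated signal

energy_s = np.sum(s ** 2)
energy_e = np.sum(e ** 2)
print("energy of e(t) relative to s(t):", energy_e / energy_s)
```

In this toy setup the harmonics of r(t) stop at 400 Hz, well below the
1800 Hz to 2200 Hz band of s(t), so the two spectra do not overlap;
that is by construction here rather than a result.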

4) For the mathematically inclined (It is not rocket science,
knowledge of college level algebra required):

The philosophy behind our approach is borrowed from
results in complex variable theory. Hadamard is quoted
in the Boston Science Museum as saying (to the effect) "To understand
real numbers one has to first understand complex numbers".
This is true for signals as well. To understand real
signals one has to  first understand complex (or analytic)
signals.  Remarkably, it turns out that any analytic bandpass
function can be decomposed into  two functions, one completely
characterized by its envelope (or magnitude)
and the second with a monotonic phase. We exploit these
properties in our temporal code. We like to fantasize that
these basic properties of functions/signals have something to
do with the way the auditory system represents signals.

5) A possible cochlear scenario (blasphemy):

For understanding our point of view, imagine that the cochlea is
an analog delay line in which the (filtered, T-second long)
stimulus, s(t), is stored. It is useful to visualize r(t) (see above),
also T seconds long, as the motor force exerted by the OHCs.
Then r(t) is adjusted (or adapted) such that
e(t)=s(t)r(t) is as small as possible (in energy).
e(t) may be visualized as the composite motion
due to both the stimulus and the feedback motor action due
to OHCs. Since the energy of e(t) is minimized by adapting r(t), it
has a compressive effect. The zero or level crossings
related to e(t) and/or r(t) (which are presumably picked
up by the IHC/auditory nerve) have sufficient information to
uniquely represent the stimulus s(t).

III.C) Is this kind of representation  possible in the auditory
 periphery or is it purely  mathematical mumbo-jumbo?

1) Temporal encoding should begin at the periphery (Obviously!)

I think we have to get over this general impression that the mammalian
auditory periphery is a dumb device which kind of 'wiggles and
jiggles' and somehow represents complex signals, which are then
perceived with great clarity. Any temporal encoding must manifest
at the periphery itself (maybe this is obvious). If not, how is
the rest of the auditory system going to rely on phenomena like
coincidence detection to perceive the stimulus? If anything,
down the line, at higher nuclei, the spike timing accuracy
will deteriorate, right?

2) Auditory periphery is an adaptive signal processor:

The inner ear is clearly an adaptive signal processor. This
is obvious even from current evidence. For example, we know
from measurements that if the intensity of a stimulus is
increased (say beyond 50 dB SPL) then the cochlear filters
become broader (Patuzzi and Johnstone). We are suggesting
that such adaptive processing is not just limited to modifying
the filter characteristics, but is also involved in adaptively
demodulating the signal such that its envelope and phase are
encoded in the spike timing. It is known that the OHC must exert
force on the BM via a feedback action. In fact it may be a
cycle-by-cycle action like the one we suggest.

As mentioned above it is curious that the spectrum of r(t) must be
'offset' from the spectrum of s(t) (see figure 9a and 9b).
This reminds me of a similar 1/3 or 1/2 octave shift in the
so-called second cochlear map.

4) Not convinced?

OK, you are not convinced that this math has anything to do
with the temporal encoding in the auditory system. Well, I am not
entirely convinced either. But we have to start somewhere.
At the least we have proposed some  analytical tools to  help
understand temporal encoding. Do you have a better strategy?
Can you improve  or revise what we are proposing?

III.D) Comments on recent postings:

1) Bates' suggestions:

John Bates makes some very interesting comments on the
engineering aspects of the evolution of the auditory system.
He then proposes a time-domain method called interstitial
waveform sampling. The WIV (wave information
vectors) seem to be triggered by zero-crossings of the signal.
He also invokes the results of Voelcker. (We are also great fans
of Voelcker.) Voelcker was (and is) an unrecognized
pioneer in signal processing.
He first proposed modeling signals using
ratios of complex polynomials in the early sixties (well before the
arrival of Matlab). He also related these models to zeros/poles
and made valiant attempts (with his students) to represent signals and
images by using zero-crossings. David Marr, of Vision fame, also
tried this approach a little bit. However (according to Ed Titlebaum, a
former colleague of Herbert Voelcker at Univ. of Rochester), after
a discussion with Prof. Longuet-Higgins on the futility of using
zero-crossings to represent functions, he gave up signal processing
and moved on to Manufacturing engineering at Cornell. He recently

It appears that Voelcker left the field just before the arrival
of adaptive signal processing (filtering) methods (like equalizers,
inverse filters, maximum entropy methods etc.) in the early 1970s.
If he had stayed on, sooner or later he would have realized that
the dual of an adaptive filter (an adaptive demodulator) could be
used to represent the envelope and phase of a signal, just as
an adaptive equalizer is used to compensate for the magnitude and
phase/group delay of filters. This is what we have been doing.

2) Bregman/Cariani's comments on the spectrogram:

We agree with Bregman's point that some sort of frequency analysis
(dimension) must be present in the auditory analysis, even at high
intensities, simply because of the presence of resonant structures
in the periphery. The analysis need not be strict Fourier analysis
but an equivalent filterbank-type analysis, which is then followed by
temporal coding (hopefully, something like our RZC). This will also
address Cariani's problem. Cariani laments the loss of
temporal fine structure in the spectrogram because we throw away the
spectral phase (half the information, on the average). The spectral
phase is implicitly captured in our zero-crossings. (Don't try to
capture this info from the zero-crossings of the signal; as I said,
that is impossible, in general.)
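
A small numerical sketch of what is lost with the spectral phase (our
own toy example, using a single-frame magnitude spectrum as a stand-in
for one column of a spectrogram; the signal, the random seed and the
variable names are assumptions): two waveforms can share exactly the
same magnitude spectrum and still have very different temporal fine
structure.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the example is repeatable
fs = 16000.0                       # sampling rate in Hz (assumed)
t = np.arange(320) / fs            # one 20 ms frame
s = sum(np.cos(2 * np.pi * f * t) for f in np.arange(1800.0, 2201.0, 100.0))

S = np.fft.rfft(s)
magnitude = np.abs(S)              # what a spectrogram column keeps

# Replace the original phase of every bin by a random one
random_phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=S.shape))
random_phase[0] = 1.0              # DC bin must stay real
random_phase[-1] = 1.0             # Nyquist bin must stay real
s_scrambled = np.fft.irfft(magnitude * random_phase, n=len(s))

same_magnitude = np.allclose(np.abs(np.fft.rfft(s_scrambled)), magnitude)
same_waveform = np.allclose(s_scrambled, s)
print("same magnitude spectrum:", same_magnitude)
print("same waveform:          ", same_waveform)
```

Any code that keeps only the magnitude cannot tell these two frames
apart; a code that also carries the phase can.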

3) Deliang Wang's comments:

We fully agree with Wang's comments on CASA, both the importance of
CASA and that separation of sources by relying only on angle of
arrival of sound is infeasible. However, both these problems
(separation of sources and CASA in general) might
benefit from accurate temporal encoding of signals at the periphery.