Subject: Analytical approach to temporal coding, CASA and other matters
From: Ramdas Kumaresan <kumar(at)ELE.URI.EDU>
Date: Wed, 21 Feb 2001 17:10:34 -0500
Hi Drs. Bates, Blumschein, Bregman, Cariani, Greenberg, Deliang Wang, and other list people,

I: Introduction:

We (Yadong Wang and I) have been preparing to respond to many of the recent postings (see below) in this forum and other related venues (e.g., Blumschein's). I apologize for using last names and for the length of this posting. Our approach to signal representation (at various stages of development) using a possible temporal code is documented in our recent publications (see refs. in manuscript) and manuscript (http://www.ele.uri.edu/~ydwang/Jasa.pdf). We would like to relate these to the recent comments made on this topic.

Recent postings by Bates, Bregman, Blumschein, Cariani, Wang, Greenberg and a whole lot of related literature point to the fact that temporal information (not just the spectral energy as seen in a spectrogram) has to play an important role in auditory function. A key question is how this temporal information is coded and processed in higher auditory centers. Of course, a short and definitive answer is that we don't know. Nobody knows. However, we have a possible analytical approach to this problem and this is the topic of this posting.

II: References:

Many people on this Auditory list have made insightful comments on related problems of temporal coding (rate/inter-spike intervals), the spectrogram, coincidence detection in the higher auditory centers, etc. We refer to some of these below.

1 E. Blumschein, "Why and how to revise the traditional theory of auditory function" at http://iesk.et.uni-magdeburg.de/~blumsche/M30.html
2 J. Bates, "How to hear everything and listen to anything" at http://home.computer.net/~jkbates/thela1.htm
3 A. Bregman, "Spectrogram frequency axis is central to auditory perception" at http://www.psych.mcgill.ca/labs/auditory/findings.html
4 Peter Cariani, "The problem with frequency-time representations as a basis of scene analysis is that these usually eliminate underlying temporal fine structure.."
at http://www.cariani.com
5 Peter Meijer, "Note that even the spectrogram itself can be viewed as an attempt to design a useful artificial cross-modal mapping from the auditory to the visual domain!" at http://sound.media.mit.edu/dpwe-bin/mhmessage.cgi/AUDITORY/postings/2001/102
6 D. Wang, "CASA problems and solutions" at http://sound.media.mit.edu/dpwe-bin/mhmessage.cgi/AUDITORY/postings/2001/82

We have some specific remarks on some of these.

III: Main Points

We have divided our comments into four main categories.

A) Motivation for our temporal code
B) A brief summary of our temporal code
C) Speculations on how this might be achieved in the auditory periphery
D) Specific comments on other people's remarks

III.A) Motivation for our temporal code:

1) Is there a temporal code that can faithfully represent a bandpass signal? The auditory periphery (specifically, the inner ear) converts an acoustic signal that is incident on it into a neural code. This code is believed to be some combination of spike rate and relative timing between spikes. However, experimental evidence seems to suggest that interaural time difference is discernible to an accuracy of microseconds (as we have read in many papers; apparently it is even better for bats). If that is so, then it is conceivable that an acoustic signal's characteristics (envelope, phase, frequency, intensity) are also accurately encoded in the spike train timing information. Therefore, a natural question to ask is: "Is it at all possible to code a signal's characteristics with accurate timing information?" If YES, then maybe we should look for such a code. If we can find one, then one may further ask: Does this code have ANYTHING to do with how the auditory system represents and perceives signals?

2) Do current models of the cochlea help in discovering such a code? Cochlear signal processing involves the incredibly delicate inner ear mechanisms.
The components of this mechanism (outer/inner hair cells OHC/IHC, membranes BM/TM etc.) have been modeled extensively in the literature. Although these models attempt to explain the experimentally measured physical quantities like membrane displacement/velocity, pressure etc., they have not shed new light on how signal processing is achieved (other than gross, maybe nonlinear, filtering) and how this leads to a temporal code. Basically, these modeling approaches attempt to explain the cochlear mechanics accurately, and assume that the temporal encoding of signal characteristics will kind of take care of itself (i.e., that they somehow get represented in the spike rate/inter-spike intervals). Contrary to this belief, it is possible that the motor-like OHC action may indeed be there (at least partly) to facilitate accurate temporal encoding of signals. Based on this premise, we propose a bottom-up approach: it seems to us that if we propose a temporal coding strategy (based on a theoretically sound signal representation), then we might start looking for possible cochlear mechanisms that may be able to achieve such a coding. This is certainly not conventional. But this is our motivation.

3) How to go about looking for such a code? Assume that a bandpass signal, say, centered around 2000 Hz, is applied to the auditory periphery. For purposes of visualization, assume that this bandpass signal is composed of a few tones (1800 Hz to 2200 Hz, spaced 100 Hz apart). This signal could represent a speech formant. After filtering by various peripheral structures, this bandpass signal is encountered by the organ of Corti (OC). The OC has an opportunity to 'look' at this signal for a duration of T seconds (perhaps a few milliseconds, depending on the characteristic frequency). Think of these T seconds as the 'dwell' time on the OC. In this T-second duration the OC has to come up with a representation for this signal.
Then the 'dwell window' slides on continually to the next T seconds and so on. A general model for this signal, s(t), during these T seconds is (in LaTeX notation)

s(t) = A a(t) \cos(\omega_c t + \phi(t)),

where A is the scale factor representing the 'bigness' of the signal, a(t) is the normalized envelope (i.e., the maximum of a(t) is 1), \omega_c is the nominal center frequency and \phi(t) is the signal phase. Any bandpass signal would fit this mold. We may point out that (large) A and \omega_c are immediately visible in the spectrogram as dark regions. It is the details of a(t) and \phi(t) that are obscured in a spectrographic presentation. In short, \omega_c (and possibly A) constitute (loosely speaking) the 'tonotopic place' information, and a(t) and \phi(t) are conveyed by the inter-spike intervals. (The question is how?) In principle, this signal model could be applied to different spectral regions of a signal and thereby serve as a representation for the entire (speech) signal.

III.B) Our Temporal Code:

1) Zero-crossings of s(t) are insufficient: How do we represent s(t) by a temporal code? Don't even think about using the zero-crossings of s(t) (i.e., producing a spike at every other zero-crossing of s(t)), because it is well known that it is not always possible to represent an arbitrary bandpass signal using its zero-crossing information. Only in special cases can the zero-crossings of a signal be used to represent it. Zero-crossing representations were studied by Logan in the early 1970s. The references may be found in our papers.

2) Adaptive demodulation may be the key: Remember that the modulation information (a(t) and \phi(t)) needs to be represented as timing information (at least as per our model). Our key contribution is that by 'adaptively demodulating' (explained below) an arbitrary bandpass signal s(t), we can represent its envelope a(t) and phase \phi(t) using only (spike) timing information.
Along with A and \omega_c (the 'place' info), these then completely characterize s(t). Hence our representation is a form of sampling theorem. The Shannon/Whittaker sampling theorem represents signals by their sample values. In contrast, we represent a signal's phase and envelope using certain zero-crossing locations that arise in the process of adaptive demodulation. (These are NOT the zero-crossings of s(t).) Many versions of this algorithm are possible. The details are in our paper, but the mechanics of the algorithm are described briefly below.

3) Some essentials of adaptive demodulation: We synthesize a signal called r(t) (see figure 9 in the manuscript) and use it to demodulate s(t). That is, we define e(t)=s(t)r(t). The energy in e(t) (the integral of the square of e(t) over the T seconds) is minimized by choosing r(t) appropriately. It turns out that the resulting signal, r(t), contains information about the envelope of s(t) and is completely representable by its zero-crossings. (Curiously, for this to happen it turns out that the spectrum of r(t) should not overlap that of the stimulus s(t). See Figure 9a and 9b.) Also, e(t) contains information about the phase of s(t) and is also completely representable by its zero-crossings. We call the process of demodulating s(t) and representing it by zero-crossings (of e(t) and/or r(t)) Real-Zero-Conversion or RZC. Figure 1 in the manuscript summarizes the temporal coding method, i.e., it is basically a filterbank followed by Real-Zero-Conversion.

4) For the mathematically inclined (it is not rocket science; knowledge of college-level algebra required): The philosophy behind our approach is borrowed from results in complex variable theory. Hadamard is quoted in the Boston Science Museum as saying (to the effect) "To understand real numbers one has to first understand complex numbers". This is true for signals as well. To understand real signals one has to first understand complex (or analytic) signals.
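To make the analytic-signal viewpoint concrete, here is a minimal numerical sketch (in Python/NumPy) of the signal model s(t) = A a(t) cos(\omega_c t + \phi(t)) from III.A.3. It is only an illustration, not code from the manuscript. The sampling rate, the 10 ms dwell time, the tone amplitudes and phases, and the helper name analytic_signal are all assumptions made up for this example. The sketch builds the five-tone signal, forms its analytic signal with an FFT, reads off A, a(t) and \phi(t), and checks that the model reproduces s(t).

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: keep positive frequencies, drop negative ones."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # DC bin kept once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # Nyquist bin kept once
        weights[1:n // 2] = 2.0           # positive frequencies doubled
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

# Hypothetical test signal: five tones, 1800 Hz to 2200 Hz, 100 Hz apart.
fs = 16000.0                              # sampling rate in Hz (assumed)
T = 0.010                                 # dwell time in seconds (assumed)
t = np.arange(int(round(fs * T))) / fs
freqs = np.arange(1800.0, 2201.0, 100.0)
rng = np.random.default_rng(0)
amps = rng.uniform(0.5, 1.0, freqs.size)            # arbitrary amplitudes
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)  # arbitrary phases
s = np.sum(amps[:, None]
           * np.cos(2.0 * np.pi * freqs[:, None] * t + phases[:, None]),
           axis=0)

# Read the model parameters off the analytic signal z(t).
z = analytic_signal(s)
omega_c = 2.0 * np.pi * 2000.0            # nominal center frequency (rad/s)
A = np.abs(z).max()                       # scale factor ('bigness')
a = np.abs(z) / A                         # normalized envelope, max(a) = 1
phi = np.unwrap(np.angle(z)) - omega_c * t  # phase relative to the carrier

# The model A a(t) cos(omega_c t + phi(t)) should reproduce s(t).
s_model = A * a * np.cos(omega_c * t + phi)
print("max model error:", np.max(np.abs(s_model - s)))
```

Note that this only shows that the envelope and phase are well defined once the analytic signal is known; it says nothing yet about how to encode them as timing information.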
Remarkably, it turns out that any analytic bandpass function can be decomposed into two functions, one completely characterized by its envelope (or magnitude) and the second with a monotonic phase. We exploit these properties in our temporal code. We like to fantasize that these basic properties of functions/signals have something to do with the way the auditory system represents signals.

5) A possible cochlear scenario (blasphemy): To understand our point of view, imagine that the cochlea is an analog delay line in which the (filtered, T-second-long) stimulus, s(t), is stored. It is useful to visualize r(t) (see above), also T seconds long, as the motor force exerted by the OHCs. Then r(t) is adjusted (or adapted) such that e(t)=s(t)r(t) is as small as possible (in energy). e(t) may be visualized as the composite motion due to both the stimulus and the feedback motor action of the OHCs. Since the energy of e(t) is minimized by adapting r(t), it has a compressive effect. The zero or level crossings related to r(t) (and/or e(t)), which are presumably picked up by the IHC/auditory nerve, have sufficient information to uniquely represent the stimulus s(t).

III.C) Is this kind of representation possible in the auditory periphery or is it purely mathematical mumbo-jumbo?

1) Temporal encoding should begin at the periphery (obviously!): I think we have to get over this general impression that the mammalian auditory periphery is a dumb device which kind of 'wiggles and jiggles' and somehow represents complex signals, which are then perceived with great clarity. Any temporal encoding must manifest at the periphery itself (maybe this is obvious); if not, how is the rest of the auditory system going to rely on phenomena like coincidence detection to perceive the stimulus? If anything, down the line, at higher nuclei, the spike timing accuracy will deteriorate, right?

2) Auditory periphery is an adaptive signal processor: The inner ear is clearly an adaptive signal processor.
This is obvious from even current evidence. For example, we know from measurements that if the intensity of a stimulus is increased (say beyond 50 dB SPL) then the cochlear filters become broader (Patuzzi and Johnstone). We are suggesting that such adaptive processing is not just limited to modifying the filter characteristics, but is also involved in adaptively demodulating the signal such that its envelope and phase are encoded in the spike timing. It is known that the OHC must exert force on the BM via a feedback action. In fact it may be a cycle-by-cycle action like the one we suggest. As mentioned above, it is curious that the spectrum of r(t) must be 'offset' from the spectrum of s(t) (see figure 9a and 9b). This reminds me of a similar 1/3 or 1/2 octave shift in the so-called second cochlear map.

4) Not convinced? OK, you are not convinced that this math has anything to do with the temporal encoding in the auditory system. Well, I am not entirely convinced either. But we have to start somewhere. At the least we have proposed some analytical tools to help understand temporal encoding. Do you have a better strategy? Can you improve or revise what we are proposing?

III.D) Comments on recent postings:

1) Bates' suggestions: John Bates makes some very interesting comments on the engineering aspects of the evolution of the auditory system. He then proposes a time-domain method called interstitial waveform sampling. The WIV (wave information vectors) seem to be triggered by zero-crossings of the signal. He also invokes the results of Voelcker. (We are also great fans of Voelcker.) Voelcker was (and is an unrecognized) pioneer in signal processing. He first proposed modeling signals using ratios of complex polynomials in the early sixties (well before the arrival of Matlab). He also related these models to zeros/poles and made valiant attempts (with his students) to represent signals and images by using zero-crossings.
David Marr, of Vision fame, also tried this approach a little bit. However (according to Ed Titlebaum, a former colleague of Herbert Voelcker at Univ. of Rochester), after a discussion with Prof. Longuet-Higgins on the futility of using zero-crossings to represent functions, Voelcker gave up signal processing and moved on to manufacturing engineering at Cornell. He recently retired. It appears that Voelcker left the field just before the arrival of adaptive signal processing (filtering) methods (like equalizers, inverse filters, maximum entropy methods etc.) in the early 1970s. If he had stayed on, sooner or later he would have realized that the dual of an adaptive filter (an adaptive demodulator) could be used to represent the envelope and phase of a signal, just like an adaptive equalizer is used to compensate for the magnitude and phase/group delay of filters. This is what we have been doing.

2) Bregman/Cariani's comments on the spectrogram: We agree with Bregman's point that some sort of frequency analysis (dimension) must be present in the auditory analysis, even at high intensities, simply because of the presence of resonant structures in the periphery. The analysis need not be a strict Fourier analysis but an equivalent filterbank-type analysis, which is then followed by temporal coding (hopefully, something like our RZC). This will also address Cariani's problem. Cariani laments the loss of temporal fine structure in the spectrogram because we throw away the spectral phase (half the information, on the average). The spectral phase is implicitly captured in our zero-crossings. (Don't try to capture this info from the zero-crossings of the signal; as I said, that is impossible in general.)

3) Deliang Wang's comments: We fully agree with Wang's comments on CASA, both on the importance of CASA and on the point that separation of sources by relying only on the angle of arrival of sound is infeasible.
However, both these problems (separation of sources and CASA in general) might benefit from accurate temporal encoding of signals at the periphery.
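
A numerical footnote to III.B.3: the energy minimization of e(t)=s(t)r(t) can be sketched in a few lines of Python/NumPy. This is a toy illustration under assumptions that are NOT taken from the manuscript: it works on the complex analytic signal rather than the real s(t), it takes r(t) = 1 + \sum_{k=1}^{P} c_k e^{j k \Omega t} with \Omega = 2\pi/T (the leading 1 is fixed so that the trivial choice r(t)=0 is excluded), and the function name adaptive_demodulate, the order P=4, the sampling rate and the tone parameters are all made up for the example. It does not perform the zero-crossing (RZC) step.

```python
import numpy as np

def adaptive_demodulate(s_a, order):
    """Least-squares r(t) minimizing the energy of e(t) = s_a(t) r(t).

    Assumed parametrization (not from the posting):
    r(t) = 1 + sum_{k=1..order} c_k exp(j k Omega t), with Omega = 2 pi / T.
    """
    n = len(s_a)
    t_norm = np.arange(n) / n                           # t / T on [0, 1)
    k = np.arange(1, order + 1)
    basis = np.exp(2j * np.pi * np.outer(t_norm, k))    # columns exp(j k Omega t)
    # e = s_a + (s_a * basis) @ c, so minimize ||e||^2 over the coefficients c.
    design = s_a[:, None] * basis
    c, *_ = np.linalg.lstsq(design, -s_a, rcond=None)
    r = 1.0 + basis @ c
    return r, s_a * r

# Hypothetical stimulus: analytic version of five tones, 1800 Hz to 2200 Hz.
fs = 16000.0                                # sampling rate in Hz (assumed)
T = 0.010                                   # dwell time in seconds (assumed)
t = np.arange(int(round(fs * T))) / fs
freqs = np.arange(1800.0, 2201.0, 100.0)
rng = np.random.default_rng(1)
amps = rng.uniform(0.5, 1.0, freqs.size)            # arbitrary amplitudes
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)  # arbitrary phases
s_a = np.sum(amps[:, None]
             * np.exp(1j * (2.0 * np.pi * freqs[:, None] * t + phases[:, None])),
             axis=0)

r, e = adaptive_demodulate(s_a, order=4)
energy_s = np.sum(np.abs(s_a) ** 2)
energy_e = np.sum(np.abs(e) ** 2)
print("energy of s_a:", energy_s)
print("energy of e  :", energy_e)
```

Because r(t)=1 is among the candidates, the minimized energy of e(t) cannot exceed that of the stimulus, which is one way to see the compressive effect mentioned in III.B.5. In this toy setup r(t) occupies 0 to 400 Hz, so its spectrum does not overlap the 1800 Hz to 2200 Hz stimulus.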