Re: CASA problems and solutions (Peter Cariani)

Subject: Re: CASA problems and solutions
From: Peter Cariani <peter(at)EPL.MEEI.HARVARD.EDU>
Date: Wed, 31 Jan 2001 17:36:31 -0500

Al Bregman wrote:

> Apart from spatial origin, the following sorts of information are used by humans:
> (A) For integrating components that arrive overlapped in time
> 1. harmonic relations
> 2. asynchrony of onset and offset
> 3. spectral separation
> 4. Independence of amplitude changes in different parts of the spectrum
> (B) For integrating components over time:
> 5. Spectral separation
> 6. Separation in time (interacts with other factors)
> 7. Differences in spectral shape
> 8. Differences in intensity (a weak effect)
> 9. Abruptness/smoothness of transition from one sound
> to the next
> ...
> I'm not sure whether your rejection of the Fourier method extends to all
> methods of decomposing the input into spectral components. However if it
> does, we should bear in mind that factors 3, 4, and 5, 7, and probably 1,
> listed above, are most naturally stated on a frequency x time
> representation -- that is, on a spectrogram or something like it.

I apologize that this is so long....

Al's handout and list are very useful in organizing what we want to explain, and certainly Al has done more than anyone in keeping questions of perceptual organization on the table. The question is what kind of auditory representation do we need to explain this structure of perceptual organization that we observe in ourselves.

Certainly if you look at a spectrogram of a harmonic complex, the low pitch of the complex (at the fundamental) is not readily apparent (unlike what we hear), and its estimation involves some fairly elaborate inferences from the spectral pattern. In order to determine the pitch, one either needs to identify the frequencies of the harmonics and then derive their greatest common divisor or use some harmonic template or trained neural network pattern recognizer (subharmonic sieves take one out of the Fourier perspective).
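To make the spectral route concrete, here is a minimal sketch of the "identify the harmonics, then take their greatest common divisor" step. It assumes the component frequencies have already been measured and rounded to whole Hz, which real spectra rarely allow; the function name and the example frequencies are illustrative only.

```python
from functools import reduce
from math import gcd

def fundamental_from_harmonics(freqs_hz):
    """Greatest common divisor of integer component frequencies, in Hz."""
    return reduce(gcd, freqs_hz)

# Harmonics 3, 4 and 5 of a missing 200 Hz fundamental.
print(fundamental_from_harmonics([600, 800, 1000]))  # 200

# Mistune the middle component by 10 Hz and the exact GCD collapses.
print(fundamental_from_harmonics([600, 810, 1000]))  # 10
```

The second call suggests why an exact divisor is too brittle on its own: a small mistuning drives the answer far from the pitch that is heard, so some tolerance (a template, a sieve, or a trained recognizer) has to be added on top.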
A log frequency plot obscures the harmonic structure; a linear frequency plot distorts perceptual distances between frequencies. No Fourier description shows pattern similarities between octaves -- there is nothing special about octaves in a spectrogram. It is very hard to determine from a spectrogram if there are slight mistunings of harmonics, but we readily hear these deviations. Some birds recognize mistuned harmonics within a couple of pitch cycles; my recollection is that we can do it within about 50 ms. Maybe a spectrogram isn't bad for isolated pure tones, but this map becomes rather messy and uninterpretable when there are multiple objects (e.g. several musical instruments playing different notes) with complex, overlapping spectra. Yet we hear out the different objects with relatively little difficulty.

We should unpack the phrase "decomposing the input into spectral components". If one passes a signal through an array of band-pass filters, but uses the time structure that comes out of the filters rather than reading off their activation magnitudes (by whatever measure), then is this operation a decomposition? It is possible to have frequency-by-frequency processing in the time domain without necessarily ever using a spectral representation per se. One gets neighborhood interactions and many of the other properties that we usually think of in terms of running spectra, but the primitives of this representation are fine temporal patterns (neurally, these are spike times, interspike intervals) rather than profiles of frequency-channel activations.

The problem with frequency-time representations as a basis of scene analysis is that these usually eliminate underlying temporal fine structure (or phase structure if one prefers).
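As a rough illustration of reading pitch from time structure rather than from a spectral profile, the sketch below autocorrelates a harmonic complex whose fundamental is absent. It is a plain waveform autocorrelation, not the per-channel, interspike-interval analysis discussed here, and the sample rate, frequencies, and names are all assumptions chosen for the example.

```python
import math

def autocorrelation(x, max_lag, window):
    """Unnormalized autocorrelation of x for lags 0..max_lag over a fixed window."""
    return [sum(x[i] * x[i + lag] for i in range(window))
            for lag in range(max_lag + 1)]

fs = 8000              # sample rate in Hz (assumed)
f0 = 200               # missing fundamental in Hz (assumed)
harmonics = (3, 4, 5)  # components at 600, 800 and 1000 Hz only
max_lag = 60           # longest lag examined, in samples
window = 1600          # 40 whole periods of the 200 Hz fundamental

x = [sum(math.cos(2 * math.pi * h * f0 * t / fs) for h in harmonics)
     for t in range(window + max_lag)]

ac = autocorrelation(x, max_lag, window)

# Largest peak away from lag 0 marks the common period of the components.
best_lag = max(range(1, max_lag + 1), key=lambda lag: ac[lag])
pitch_hz = fs / best_lag
print(best_lag, pitch_hz)  # 40 200.0
```

No component at 200 Hz is present in the signal, yet the strongest non-zero lag sits at the 5 ms period of the fundamental, with no separate step of identifying harmonics and inferring their common divisor.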
Although our perception of pitch and timbre of stationary sounds is largely insensitive to phase spectrum (components < 2 kHz), the mechanisms by which auditory objects are formed appear to be sensitive to abrupt changes in phase (I am thinking of Kubovy's demonstrations of separation of partials from a harmonic complex by abrupt changes in phase and intensity --- following the change, the partial is heard out and then blends back into the whole).

Precessions of relative phase relations can be used to separate sounds. This is very apparent in the double-vowel separations, where the two vowel fundamentals are separated by a semitone or more. Each vowel has its own repeating waveform pattern that generates, by virtue of phase-locking, a multichannel temporal correlation pattern. The relations between the two patterns associated with the two vowels are constantly shifting relative to each other, but the patterns themselves are internally invariant. If one has a mechanism by which a temporal pattern is built up when it recurs, then such a mechanism will build up each of the two vowels as different invariant patterns and separate them. Formation of objects can thus depend on coherence of temporal fine structure. I think this is close to the kinds of relational, correlational processes that the Gestaltists had in mind.

I have a paper and a poster on my website on recurrent timing nets that build up such patterns:

www.cariani.com
http://peter-office.meei.harvard.edu/ARO2kCariani.pdf
http://peter-office.meei.harvard.edu/TImingNets99.pdf

If one adopts the standard frequency-time perspective, these scene analysis strategies that depend on temporal fine structure are no longer available, and one must then search around for other means of building up objects (e.g. prior expectations).
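The build-up idea can be illustrated with a deliberately simplified toy: two repeating waveforms with different periods are mixed, and a loop whose length matches one period averages whatever recurs at that period. This is ordinary period-synchronous averaging offered as a loose analogy, not the recurrent timing network described in the linked papers; the periods are assumed to be known in advance, and every name and waveform here is hypothetical.

```python
import math

def periodic(pattern, n):
    """Repeat a one-period pattern to fill n samples."""
    return [pattern[i % len(pattern)] for i in range(n)]

def build_up(signal, period):
    """Average the signal slot by slot around a loop of `period` samples."""
    sums = [0.0] * period
    counts = [0] * period
    for i, s in enumerate(signal):
        sums[i % period] += s
        counts[i % period] += 1
    return [total / count for total, count in zip(sums, counts)]

# Two zero-mean "vowel-like" patterns with periods of 10 and 11 samples
# (a period ratio of 11:10, a little over one and a half semitones).
pattern_a = [math.sin(2 * math.pi * k / 10) + 0.5 * math.sin(4 * math.pi * k / 10)
             for k in range(10)]
pattern_b = [math.sin(2 * math.pi * k / 11) - 0.5 * math.sin(6 * math.pi * k / 11)
             for k in range(11)]

n = 10 * 11 * 20  # a whole number of cycles of both patterns
mix = [a + b for a, b in zip(periodic(pattern_a, n), periodic(pattern_b, n))]

recovered_a = build_up(mix, 10)
recovered_b = build_up(mix, 11)

err_a = max(abs(r - p) for r, p in zip(recovered_a, pattern_a))
err_b = max(abs(r - p) for r, p in zip(recovered_b, pattern_b))
print(err_a < 1e-9, err_b < 1e-9)  # True True
```

Within each loop the matching pattern always lands in the same slots and accumulates, while the other pattern precesses through the slots and averages out. That is the sense in which the two patterns are internally invariant but constantly shifting relative to each other.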
> Furthermore, when you look at a spectrographic representation of an auditory
> signal, the visual grouping that occurs is often directly analogous to the
> auditory organization (provided that the time and frequency axes are
> properly scaled).

I don't think this works very well for the pitches of harmonic and inharmonic complexes.

> Why would this be so if some sort of frequency axis were
> not central to auditory perception, playing a role analogous to a spatial
> dimension in vision?

We probably should not appeal to sensory systems whose operation is not currently well understood. Is there a cogent, compelling theory out there of how visual forms are neurally represented? (The "pixel pattern" model of vision doesn't work very well either when multiple objects enter visual scenes.)

But what if vision depends on spatiotemporal correlations between spikes (fine temporal correlations between spikes at different retinal locations), rather than a retinal rate-place model? It turns out that they have a hyperacuity problem in vision that is not unlike ours for frequency, and that vernier acuity limits can be explained if visual neurons can respond to edges with jitters of a millisecond or less (provided this information can be used). Following Bialek and others, the vision community is slowly finding out that there is stimulus information in spike timings down to a millisecond and maybe less (the limits of their measurement precisions). It could well be the case that the limits of visual acuity, like the limits of auditory frequency discrimination, depend on precisions of spike timings, not on rate tunings. If so, then the retina could be regarded as a gigantic temporal cross-correlator, with vision as a system that looks not unlike binaural cross-correlation, but with temporal correlations across many retinal positions rather than across two corresponding cochlear positions.
> Perhaps the Fourier transform is not the best
> approach to forming this frequency dimension, but something that does a
> similar job is required.

IMHO, something like running auto- and cross-correlations, which retain fine time structure, is needed. They share many common properties with spectrographs without throwing away half the information. Or perhaps some way of operating on the running phase spectrum. One needs a process that is sensitive to phase invariants and changes for object formation and separation, but then one subsequently needs a process that is largely phase insensitive for pitch and timbre comparisons. It could be the case (as John Culling suggests for location) that the objects are formed first and then their properties (such as pitch, timbre, location) are analyzed/compared. This might explain why harmonic relations are more important for grouping than ITD cues.

> Finally there is overwhelming physiological
> evidence that the human nervous system does a frequency analysis of the
> sound and retains separate frequency representations all the way to the
> brain.

We should be careful about this. (What evidence do you have in mind?) There is a bit of a disconnect between tonotopy studies and the neural representations that subserve (fine) perceptual discriminations. Tonotopic organization of a relatively coarse nature is only seen if one looks at/near neural response thresholds, and this organization invariably broadens and/or breaks down at moderate to high levels. It may not be a "representation" in the functional sense. Tonotopy could simply be a reflection of the organization of the (cochlear) receptor surface and the tendency of correlated inputs to cluster in local spatial neighborhoods, rather than the neural representational mechanism through which fine pitch discriminations are effected.
(If we looked at rate-place profiles in the auditory nerve in response to a harmonic complex presented at 80 dB SPL, rate patterns would be very broad, and we would be very hard pressed to estimate with any degree of accuracy or reliability the frequency of the fundamental -- maybe we could get within a half-octave if we were lucky, but this is 2 orders of magnitude coarser than pitch jnd's. I have the same feeling when I look at Steinschneider's systematic current-source density analysis of cortical responses to harmonic complex tones -- one begins to see spatial activation structure only when the harmonics are separated by about half an octave (350 Hz for an 800 Hz BF).)

At the level of the auditory nerve, the information for fine periodicity analysis is in the spike timing, not in the (tonotopically organized) rate profiles. What aspects of cortical neural response subserve pitch discriminations of 0.5% or less in frequency is an open question. (Does anyone have an answer to this question?)

Perhaps I'm completely wrong in my wariness of the textbook view, as far as it goes, but I think it's better not to wallpaper over the difficulties that auditory theory currently faces. We need to make them as clear as we can. Recognition of where current theory breaks down is absolutely essential to future progress.

Peter Cariani

This message came from the mail archive
http://www.auditory.org/postings/2001/
maintained by:

Dan Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University