Re: CASA problems and solutions (Peter Meijer)

Subject: Re: CASA problems and solutions
From:    Peter Meijer  <peter.b.l.meijer(at)PHILIPS.COM>
Date:    Thu, 1 Feb 2001 10:53:45 +0100

Peter Cariani writes:

> Maybe a spectrogram isn't bad for isolated pure tones, but this map becomes
> rather messy and uninterpretable when there are multiple objects (e.g. several
> musical instruments playing different notes) with complex, overlapping spectra.

Quite true for "natural" sound scenes with "interleaved" harmonic spectra from multiple independently sounding objects, but not necessarily true for synthesized artificial sound scenes where objects are, by the nature of the mapping, kept localized in the time-frequency plane.

Although I understand that CASA mainly deals with "natural" sound environments and speech and music and the like (and that is what you were probably referring to), it does become a relevant issue when you try to relate to what happens in terms of grouping and segregation in (natural) visual scene analysis. I respond here mainly because you also discuss some links with visual scene analysis, and apart from studying analogies between vision and hearing for natural auditory and visual scenes, one might then also consider analogies arising from artificial auditory or visual scenes that were *designed* to relate to each other. Note that even the spectrogram itself can be viewed as an attempt to design a useful artificial cross-modal mapping from the auditory to the visual domain!

> there is nothing special about octaves in a spectrogram.

True again, but if you start out with a spectrogram that resembles a visual scene and then synthesize a corresponding sound, then you don't want any such "spurious" harmonic relations because they are meaningless in the original visual scene.

In other words, a lot will depend on the types of auditory scenes that CASA "intends" to cover. If CASA is only about more or less natural auditory scenes, then I agree with your objections against spectrograms. If it is also about synthesized auditory scenes that are designed to relate to vision, then one has to be careful.

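To make the idea of a "designed" image-to-sound mapping concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the algorithm of any particular system: the function names, the 500-5000 Hz range, the exponential frequency spacing and the 16 kHz sample rate are all hypothetical choices. Each image column is treated as a time slice, each row drives one sinusoid, and pixel brightness sets that sinusoid's amplitude, so a bright pixel stays localized in the time-frequency plane.

```python
import math


def row_frequencies(n_rows, f_lo=500.0, f_hi=5000.0):
    """Assign one frequency per image row (hypothetical choice:
    exponential spacing, with the top row, index 0, highest)."""
    if n_rows == 1:
        return [f_lo]
    ratio = f_hi / f_lo
    return [f_lo * ratio ** ((n_rows - 1 - r) / (n_rows - 1))
            for r in range(n_rows)]


def image_to_sound(image, duration=1.0, sample_rate=16000,
                   f_lo=500.0, f_hi=5000.0):
    """Scan a grayscale image (list of rows, values in 0..1) from left
    to right and return a list of audio samples.

    Column index maps to time, row index to frequency, and pixel
    brightness to the amplitude of that row's sinusoid.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    freqs = row_frequencies(n_rows, f_lo, f_hi)
    n_samples = round(duration * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        # Which image column is "sounding" at this instant.
        col = min(n * n_cols // n_samples, n_cols - 1)
        s = sum(image[r][col] * math.sin(2 * math.pi * freqs[r] * t)
                for r in range(n_rows))
        samples.append(s / n_rows)  # keep the result within [-1, 1]
    return samples


if __name__ == "__main__":
    # A 2x2 "scene" with a single bright pixel at top left: the sound
    # is a high tone during the first half and silence afterwards.
    scene = [[1.0, 0.0],
             [0.0, 0.0]]
    audio = image_to_sound(scene, duration=0.1)
    print(len(audio), "samples")
```

In this sketch the only spectral components present are the ones the image puts there; no harmonics are added on top of a row's own sinusoid.
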
> (The "pixel pattern" model of vision doesn't work very well either
> when multiple objects enter visual scenes.)

Why is that? Do you mean occlusion effects? The "pixel pattern" model of vision is just about all the retina gets to work with, and it is up to higher brain centers to try and sort out occlusion, parallax, visual perspective, shading and so on in order to do visual grouping and segregation (e.g., segmentation as part of object identification). I do not see how the "pixel pattern" model of vision is limited like the spectrogram indeed is for natural auditory scenes.

Perhaps CASA could grow into a new research field coined "CMSA", for (computational) multisensory scene analysis? After all, such fairly high-level cognitive functions like perceptual grouping and segregation are likely to be at least partially shared among multiple senses? Some of this resource sharing in the brain would also subserve multisensory integration and Gestalt formation...

Best wishes,

Peter Meijer

Seeing with Sound - The vOICe

This message came from the mail archive
maintained by:
DAn Ellis
Electrical Engineering Dept., Columbia University