Re: CASA problems and solutions

Peter Cariani writes:

> Maybe a spectrogram isn't bad for isolated pure tones, but this map becomes
> rather messy and uninterpretable when there are multiple objects (e.g. several
> musical instruments playing different notes) with complex, overlapping spectra.

Quite true for "natural" sound scenes with "interleaved" harmonic
spectra from multiple independently sounding objects, but not
necessarily true for synthesized artificial sound scenes where
objects are, by the nature of the mapping, kept localized in the
time-frequency plane. Although I understand that CASA mainly deals
with "natural" sound environments and speech and music and the like
(and that is what you were probably referring to), it does become
a relevant issue when you try to relate to what happens in terms
of grouping and segregation in (natural) visual scene analysis.
I respond here mainly because you also discuss some links with
visual scene analysis, and apart from studying analogies between
vision and hearing for natural auditory and visual scenes, one
might then also consider analogies arising from artificial auditory
or visual scenes that were *designed* to relate to each other.

Note that even the spectrogram itself can be viewed as an attempt
to design a useful artificial cross-modal mapping from the auditory
to the visual domain!

> there is nothing special about octaves in a spectrogram.

True again, but if you start out with a spectrogram that resembles
a visual scene and then synthesize a corresponding sound, then you
don't want any such "spurious" harmonic relations because they are
meaningless in the original visual scene. In other words, a lot
will depend on the types of auditory scenes that CASA "intends" to
cover. If CASA is only about more or less natural auditory scenes,
then I agree with your objections against spectrograms. If it is
also about synthesized auditory scenes that are designed to relate
to vision, then one has to be careful.
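
As a rough illustration of the image-to-sound direction described above,
here is a minimal Python sketch. It is a toy under stated assumptions,
not the mapping used by any particular system: the function name
image_to_sound, the default parameter values, and the exponential
frequency spacing are all hypothetical choices made for this example.
Columns are scanned left to right in time, each row drives one sinusoid
(higher rows get higher frequencies), and pixel brightness sets the
amplitude, so each object stays localized in the time-frequency plane.

```python
import math

def image_to_sound(image, duration=1.0, sample_rate=8000,
                   f_low=500.0, f_high=5000.0):
    """Turn a grayscale image into a list of audio samples.

    image: list of rows (top row first), each a list of brightness
    values in the range 0..1. All parameter defaults are assumptions
    chosen only for illustration.
    """
    n_rows = len(image)
    n_cols = len(image[0])

    # One sinusoid per row; the top row gets the highest frequency.
    if n_rows == 1:
        freqs = [f_low]
    else:
        ratio = f_high / f_low
        freqs = [f_low * ratio ** ((n_rows - 1 - r) / (n_rows - 1))
                 for r in range(n_rows)]

    n_samples = round(duration * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        # Horizontal position in the image maps to time.
        col = min(n * n_cols // n_samples, n_cols - 1)
        # Brightness scales the amplitude of each row's sinusoid.
        s = sum(image[r][col] * math.sin(2 * math.pi * freqs[r] * t)
                for r in range(n_rows))
        samples.append(s / n_rows)  # keep the sum within -1..1
    return samples

# Usage: a single bright pixel in the top-left corner gives a short
# high-pitched tone followed by silence.
tone = image_to_sound([[1.0, 0.0], [0.0, 0.0]], duration=0.01)
```

In such a mapping any harmonic relation between two rows is an accident
of the chosen frequency scale and carries no meaning for the image.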

> (The "pixel pattern" model of vision doesn't work very well either
> when multiple objects enter visual scenes.)

Why is that? Do you mean occlusion effects? The "pixel pattern"
model of vision is just about all the retina gets to work with,
and it is up to higher brain centers to try and sort out occlusion,
parallax, visual perspective, shading and so on in order to do
visual grouping and segregation (e.g., segmentation as part of
object identification). I do not see how the "pixel pattern"
model of vision is limited in the way that the spectrogram
indeed is for natural auditory scenes.

Perhaps CASA could grow into a new research field called "CMSA",
for (computational) multisensory scene analysis? After all, fairly
high-level cognitive functions such as perceptual grouping and
segregation are likely to be at least partially shared among
multiple senses. Some of this resource sharing in the brain would
also subserve multisensory integration and Gestalt formation...

Best wishes,

Peter Meijer

Seeing with Sound - The vOICe