Subject: Computational Estimation of the Number of Acoustic Streams
In response to Pierre Divenyi's query, here is a brief summary of the
auditory scene work I did with Mabel Wu, mentioned earlier on AUDITORY.
We started in 1991 and suspended work last summer.
Monophonic stimuli were initially processed with FFTs using overlapping
windows of up to 80 ms. The resulting spectral information was passed
through a simple masking algorithm and then on to Terhardt's pitch model.
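A minimal sketch of the first stage only (overlapping windows followed by a
magnitude spectrum), not the original code. The Hann window, the 50% overlap,
and the function name frame_spectra are assumptions; the post specifies only
FFTs with overlapping windows of up to 80 ms. A naive DFT stands in for the FFT
to keep the example dependency-free. The masking and Terhardt stages are not
shown.

```python
import cmath
import math

def frame_spectra(samples, sample_rate, window_ms=80.0, overlap=0.5):
    """Return one magnitude spectrum per overlapping analysis frame."""
    n = int(round(sample_rate * window_ms / 1000.0))   # window length in samples
    hop = max(1, int(round(n * (1.0 - overlap))))      # step between frame starts
    # Hann window (assumed; the window shape is not given in the post)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    spectra = []
    for start in range(0, len(samples) - n + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(n)]
        mags = []
        # Naive DFT for clarity; a real system would call an FFT routine
        for k in range(n // 2 + 1):
            acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                      for i in range(n))
            mags.append(abs(acc))
        spectra.append(mags)
    return spectra

# Usage: a 100 Hz tone sampled at 1 kHz, analysed with 80 ms windows
sample_rate = 1000
tone = [math.sin(2.0 * math.pi * 100.0 * i / sample_rate) for i in range(200)]
spectra = frame_spectra(tone, sample_rate)
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / 80.0  # bin spacing = sample_rate / window length
```

Longer windows give finer frequency resolution at the cost of time resolution,
which is presumably why the window length was left adjustable.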
We decided to use the "pitch domain" rather than the frequency domain for
several reasons: (1) pitch is a good handle for auditory objects,
(2) Terhardt's method generates output for inharmonic as well as harmonic
inputs, and (3) it handles residue pitches well.
The output from the Terhardt model is pitch units (p.u.) versus virtual or
spectral pitch weight (p.w.) for successive spectra. The result is a sort of
running spectrum -- only with pitch units replacing frequency and pitch weight
replacing amplitude. Estimating the number of streams amounts to identifying
the number of concurrent (major) "mountain ranges" in the running pitch-unit
spectrum. The computational problem is then to trace the "ranges" horizontally
across successive spectra, which is hard because there are gaps and noise.
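To make the "major mountain ranges" idea concrete for a single spectrum, here
is a small sketch that keeps only the heavily weighted pitch candidates. The
(pitch, weight) pair representation, the relative threshold of 0.5, and the
name major_peaks are illustrative assumptions, not details of the system
described here.

```python
def major_peaks(frame, rel_threshold=0.5):
    """Keep candidates whose weight is at least rel_threshold times the
    largest weight in the frame.

    frame: list of (pitch, weight) pairs for one analysis window
           (a hypothetical representation of one Terhardt-model output).
    Returns the surviving pairs sorted by pitch.
    """
    if not frame:
        return []
    top = max(weight for _, weight in frame)
    return sorted((p, w) for p, w in frame if w >= rel_threshold * top)

# Usage: three candidates, one of them weak
frame = [(440.0, 1.0), (220.0, 0.6), (330.0, 0.2)]
kept = major_peaks(frame)
```

Counting the survivors in one frame gives only a crude instantaneous estimate;
because of the gaps and noise, the candidates still have to be linked over time.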
Any clustering task assumes some proximity metric -- such as frequency or
log freq. In our case we used log p.u. (I am hoping that the Sheffield
group will come up with a better proximity metric some day.)
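The following is a hedged sketch of how a log-scale proximity metric can drive
a simple peak tracer. It is a generic greedy nearest-neighbour linker, not the
peak-tracing method used in this work. The thresholds max_jump and max_gap, the
minimum track length, and all function names are hypothetical.

```python
import math

def log_distance(p1, p2):
    """Distance between two positive pitch values on a log2 scale."""
    return abs(math.log2(p1 / p2))

def trace_tracks(frames, max_jump=0.1, max_gap=2):
    """Greedily link per-frame pitch peaks into tracks.

    frames:   list of lists of positive pitch values, one list per spectrum.
    max_jump: largest log2 distance allowed between linked peaks (assumed).
    max_gap:  number of consecutive missing frames a track may bridge (assumed).
    Returns a list of tracks, each a list of (frame_index, pitch) pairs.
    """
    tracks = []
    for t, peaks in enumerate(frames):
        unused = list(peaks)
        for track in tracks:
            last_t, last_p = track[-1]
            if t - last_t > max_gap + 1 or not unused:
                continue  # track silent for too long, or no peaks left to link
            best = min(unused, key=lambda p: log_distance(p, last_p))
            if log_distance(best, last_p) <= max_jump:
                track.append((t, best))
                unused.remove(best)
        # Peaks that matched no existing track start new tracks
        tracks.extend([[(t, p)] for p in unused])
    return tracks

def count_streams(tracks, min_length=2):
    """Count tracks long enough to be treated as a stream, not noise."""
    return sum(1 for track in tracks if len(track) >= min_length)

# Usage: two slowly drifting pitches with a gap, plus one isolated peak
frames = [[440.0, 220.0], [442.0], [], [221.0, 445.0], [880.0]]
tracks = trace_tracks(frames)
n_streams = count_streams(tracks)
```

A greedy linker like this is easily confused when tracks cross or when noise
peaks fall close to a real track, which is the kind of failure a stronger
tracing algorithm would need to address.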
When the running pitch-unit spectra were displayed visually, we could clearly
see the multiple streams in many of our test stimuli. But our peak-tracing
methods weren't powerful enough to capture what we could see. Fortunately,
this is a common problem in pattern recognition, so I expect that major
improvements can be had by becoming more informed about the latest peak-tracing
algorithms. There may even be some neural nets that do this well.
Overall, I'd have to say the results weren't very good. We never did formally
evaluate the system performance, primarily because we were sure that the
performance would improve significantly with a better peak-tracing algorithm.
On a conceptual level, another problem relates to the engineering versus
psychology issue I raised in regard to the discussion of terminology. If our
goal is to estimate the number of "acoustic streams" rather than
"auditory streams," I'm not sure what the logic is for using masking, spectral
dominance-shaping, or pitch estimation. For example, eliminating the masking
algorithm might well improve the system's performance. It is true that humans
have evolved a fantastic ability to parse auditory scenes -- and so in
computational ventures we might start with perceptually-inspired heuristics.
But is masking a help or a hindrance in this process?
Our original intention was to push the pitch-unit spectrum approach as far as
we could, and then try an independent tack by examining temporal information
(amplitude co-modulation, onset synchrony, etc.). Presumably, a combination
of spectral and temporal approaches might produce good results.
I would be interested to hear from other people who have attempted