(David Huron - Conrad Grebel )

From:    David Huron - Conrad Grebel  <dhuron(at)WATSERV1.UWATERLOO.CA>
Date:    Sat, 5 Dec 1992 13:56:04 -0500

Subject: Computational Estimation of the Number of Acoustic Streams

In response to Pierre Divenyi's query, here is a brief summary of the auditory scene work I did with Mabel Wu mentioned earlier on AUDITORY. We started in 1991 and suspended work last summer.

Monophonic stimuli were initially processed with FFTs using overlapping windows of up to 80 ms. The resulting spectral information was passed through a simple masking algorithm and then on to Terhardt's pitch model. We decided to use the "pitch domain" rather than the frequency domain for several reasons: (1) pitch is a good handle for auditory objects, (2) Terhardt's method generates output for inharmonic as well as harmonic inputs, and (3) it handles residue pitches well.

The output from the Terhardt model is pitch units (p.u.) versus virtual or spectral pitch weight (p.w.) for successive spectra. The result is a sort of running spectrum -- only with pitch units replacing frequency and pitch weight replacing amplitude. Estimating the number of streams amounts to identifying the number of concurrent (major) "mountain ranges" in the running pitch-unit spectrum. The computational problem is then to trace the "ranges" horizontally -- since there are gaps and noise.

Any clustering task assumes some proximity metric -- such as frequency or log freq. In our case we used log p.u. (I am hoping that the Sheffield group will come up with a better proximity metric some day.)

When the running pitch-unit spectra were displayed visually, we could clearly see the multiple streams in many of our test stimuli. But our peak-tracing methods weren't powerful enough to capture what we could see. Fortunately, this is a common problem in pattern recognition, so I expect that major improvements can be had by becoming more informed about the latest peak-tracing algorithms. There may even be some neural nets that do this well. Overall, I'd have to say the results weren't very good.
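
The peak-tracing step described above can be illustrated with a minimal sketch. This is not the system described in this message. It is a simple greedy tracker written under stated assumptions: each frame is already reduced to a list of (pitch, weight) peaks with positive pitch values, proximity is measured as a log ratio of pitch values, and the names trace_tracks and estimate_stream_count, along with the parameters max_jump, max_gap, min_weight and min_len, are hypothetical.

```python
import math

def trace_tracks(frames, max_jump=0.06, max_gap=2, min_weight=0.1):
    """Greedily link per-frame pitch peaks into horizontal tracks.

    frames:     list of frames; each frame is a list of (pitch, weight) peaks.
    max_jump:   largest |log2(p_new / p_old)| allowed when linking (assumed value).
    max_gap:    how many consecutive empty frames a track may bridge (assumed value).
    min_weight: peaks weaker than this are ignored as noise (assumed value).
    """
    active, finished = [], []
    for t, frame in enumerate(frames):
        # Consider the strongest peaks first so they claim tracks first.
        peaks = sorted((pw for pw in frame if pw[1] >= min_weight),
                       key=lambda pw: -pw[1])
        used = set()  # indices of tracks already extended in this frame
        for pitch, _weight in peaks:
            best, best_dist = None, max_jump
            for i, track in enumerate(active):
                if i in used:
                    continue
                dist = abs(math.log2(pitch / track["pitch"]))
                if dist <= best_dist:
                    best, best_dist = i, dist
            if best is None:
                # No nearby track: start a new one.
                active.append({"start": t, "end": t, "pitch": pitch})
                used.add(len(active) - 1)
            else:
                active[best]["pitch"] = pitch
                active[best]["end"] = t
                used.add(best)
        # Retire tracks that have been silent for more than max_gap frames.
        still_active = []
        for track in active:
            if t - track["end"] > max_gap:
                finished.append(track)
            else:
                still_active.append(track)
        active = still_active
    return finished + active

def estimate_stream_count(frames, min_len=5, **kwargs):
    """Return the largest number of long tracks that overlap in any one frame."""
    tracks = [tr for tr in trace_tracks(frames, **kwargs)
              if tr["end"] - tr["start"] + 1 >= min_len]
    counts = [sum(1 for tr in tracks if tr["start"] <= t <= tr["end"])
              for t in range(len(frames))]
    return max(counts, default=0)

# Synthetic example: two steady pitch "ranges", one with a one-frame gap,
# plus a single isolated noise peak.
frames = []
for t in range(10):
    frame = [(220.0 * (1 + 0.002 * t), 1.0)]
    if t != 4:
        frame.append((330.0, 0.8))
    if t == 6:
        frame.append((500.0, 0.5))
    frames.append(frame)

print(estimate_stream_count(frames))
```

On this synthetic input the gap is bridged and the isolated peak is discarded for being too short, so two tracks survive. A greedy tracker of this kind is exactly the sort of simple method that tends to fail on real data with wider gaps and crossing ranges.
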
We never did formally evaluate the system performance, primarily because we were sure that the performance would improve significantly with a better peak-tracing algorithm.

On a conceptual level, another problem relates to the engineering versus psychology issue I raised in regard to the discussion of terminology. If our goal is to estimate the number of "acoustic streams" rather than "auditory streams," I'm not sure what the logic is for using masking, spectral dominance-shaping, or pitch estimation. For example, eliminating the masking algorithm might well improve the system's performance. It is true that humans have evolved a fantastic ability to parse auditory scenes -- and so in computational ventures we might start with perceptually-inspired heuristics. But is masking a help or a hindrance in this process?

Our original intention was to push the pitch-unit spectrum approach as far as we could, and then try an independent tack by examining temporal information (amplitude co-modulation, onset synchrony, etc.). Presumably, a combination of spectral and temporal approaches might produce good results.

I would be interested to hear from other people who have attempted this problem.

David Huron.

This message came from the mail archive
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University