
From:    David Huron - Conrad Grebel  <dhuron(at)WATSERV1.UWATERLOO.CA>
Date:    Sat, 19 Dec 1992 00:22:20 -0500

Several people have recently commented on the computer program I developed with Mabel Wu -- which was described on AUDITORY on December 5th. Most of the comments have been of the sort `Surely we need to consider temporal aspects of the signal ...' I'd like to respond to these comments.

In the ensuing decades (perhaps centuries?) there will be innumerable attempts to build "auditory scene analyzers." Some of these projects will be rather ad hoc -- from which little useful knowledge will be gained. Other projects will be more systematic, and researchers will learn something from their successes and failures. Drawing on my own humbling experience, let me offer the following advice concerning how this sort of research ought to proceed.

First, in brief:

(1) Clarify whether the problem you're attempting to solve is auditory scene analysis or acoustic scene analysis.
(2) Evaluate your system.
(3) Investigate one hypothesis at a time.

Expanding on these points:

(1) Clarify whether the problem you're attempting to solve is auditory scene analysis or acoustic scene analysis. There has already been a helpful discussion on AUDITORY concerning this issue. Any research group needs to clarify whether their goal is to emulate human behavior (with all its foibles and intricacies), or whether their goal is a description of the acoustic scene. (Of course this doesn't preclude fashioning an ACOUSTIC scene analyzer along the lines of the AUDITORY system.)

(2) Evaluate your system. If we want to have a sense of the success of some system, we need to evaluate it in ways that admit comparisons between different approaches. In the field of optical character recognition (OCR), for example, there are standardized samples of printed and hand-written letters. These samples have been carefully gathered to be representative of the population of printed and written texts in various languages and for various tasks -- such as hand-written zip codes.
These samples provide a standard measurement tool for comparative evaluation of different OCR systems. Without these measurement tools it's possible to tweak system parameters so that the performance looks impressive for some arbitrary subset of inputs.

In my opinion, we will need to develop a similar set of benchmark stimuli for acoustic and auditory scene analysis. That is, we need to assemble a set of recorded stimuli from a variety of acoustic situations, including mixtures of speech, music, environmental sounds, differing numbers of acoustic sources, harmonic and inharmonic sources, heavily masked and unmasked scenes, etc. In the case of auditory scene analysis, we would also need corresponding perceptual data for each stimulus. For example, we would need to know how many sound sources typical listeners identify for each scene, the variance in listener responses, etc. Of course, developing such a benchmark set of stimuli would be a major project. Notice that without such a benchmark set, each auditory/acoustic scene analyzer can be optimized for some limited set of stimuli -- and so can be made to look like it performs pretty well -- a problem that has plagued OCR development. If we want to compare different approaches to scene analysis, it would be helpful to have some standard measurement tool(s).

(3) Investigate one hypothesis at a time. We COULD program a single scene analyzer that included cochlear modeling, frequency- and pitch-domain pattern recognition, amplitude co-modulation, onset synchrony, temporal grouping, gestalt heuristics, etc. But we would spend the next hundred years trying to optimize a thousand parameters. Parameter-tweaking is a black hole that will lead to little knowledge. It's better for researchers to begin with a single model (such as a cochlear model) and see how far it takes us. Then we need to try an alternative approach and compare the results. Is method A better than method B (given our benchmark stimuli)?
Does method A subsume the successes of method B, or vice versa? Does method A work better than method B for a different class of stimuli? And so on. By comparing competing scene analysis hypotheses we'll make faster headway. In other words, in the initial stages at least, it's better to examine one approach at a time rather than pasting together a number of approaches. Certainly, we ought to avoid `everything but the kitchen sink' approaches.

----------------------------------------------------------

In light of the above three principles, you can see why I think some of the recent responses to the work I reported aren't especially helpful.

First, our program was not intended to be a "model" for scene analysis. Our program was an implementation of Terhardt's virtual pitch algorithm coupled with some peak-tracing methods. The question we wanted to answer was: how well can virtual pitch information alone contribute to successful scene parsing?

Second, I never suggested that temporal factors wouldn't be a useful approach. Quite the reverse; here's what I wrote in my note of December 5th: "Our original intention was to push the pitch-unit spectrum approach as far as we could, and then try an independent tack by examining temporal information ..." -- in short, one hypothesis at a time.

In my opinion, the major problem with the work we did was that we screwed up on principles (1) and (2). We never got to the EVALUATION for a number of reasons. Not least of these was that we'd made the first error -- namely, failing to decide whether our goal was acoustic scene analysis or auditory scene analysis. This may seem like a trivial distinction, but it has concrete repercussions for how you measure the system's performance. In the first case, you need to enumerate the acoustic sources used to generate the recorded stimuli; in the second case, you need to compare the system's output to subjective reports evoked by the stimuli. Such lessons are hard-won.
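To make the evaluation idea in points (1) and (2) concrete, here is a minimal sketch of how a benchmark set might be used to compare two scene analyzers. Every name and number below is invented for illustration -- the stimuli, the analyzers, and the ground-truth values are toy stand-ins, not any real benchmark or system:

```python
# Each benchmark item pairs a stimulus with two kinds of ground truth:
# the actual number of acoustic sources (the ACOUSTIC target) and the
# mean number of sources reported by listeners (the AUDITORY target).
BENCHMARK = [
    {"stimulus": "speech_pair",  "true_sources": 2, "listener_mean": 2.0},
    {"stimulus": "string_trio",  "true_sources": 3, "listener_mean": 2.4},
    {"stimulus": "street_noise", "true_sources": 5, "listener_mean": 3.1},
]

def analyzer_a(stimulus):
    """Toy stand-in for method A (say, a pitch-based parser)."""
    return {"speech_pair": 2, "string_trio": 3, "street_noise": 4}[stimulus]

def analyzer_b(stimulus):
    """Toy stand-in for method B (say, a temporal-grouping parser)."""
    return {"speech_pair": 2, "string_trio": 2, "street_noise": 3}[stimulus]

def mean_abs_error(analyzer, target_key):
    """Average |predicted - target| over the benchmark, where the target
    is either the acoustic or the perceptual ground truth."""
    errors = [abs(analyzer(item["stimulus"]) - item[target_key])
              for item in BENCHMARK]
    return sum(errors) / len(errors)

# With these toy numbers, method A scores better against the acoustic
# target while method B scores better against the perceptual one --
# which method is `better' depends on which problem you chose to solve.
for name, fn in [("A", analyzer_a), ("B", analyzer_b)]:
    print(f"method {name}: "
          f"acoustic MAE = {mean_abs_error(fn, 'true_sources'):.2f}, "
          f"auditory MAE = {mean_abs_error(fn, 'listener_mean'):.2f}")
```

The point of the sketch is simply that the same benchmark set can rank two systems differently depending on whether it carries acoustic ground truth or perceptual data, which is why principle (1) has to be settled before principle (2) can be applied.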
Perhaps people would care to comment on my suggestion concerning a set of benchmark stimuli for scene analysis. I'd be interested to hear suggestions as to the type of stimuli that would need to be included, whether real-world sampling is important, and if so, what sort of sampling method would be appropriate. Perhaps we could distribute a collection of such stimuli on a limited-run CD.

David Huron
