"Hard problems in computational auditory scene analysis" (Dan Ellis )

Subject: "Hard problems in computational auditory scene analysis"
From:    Dan Ellis  <dpwe(at)MEDIA.MIT.EDU>
Date:    Thu, 3 Aug 1995 18:20:42 -0400

Dear AUDITORY list (and IJCAI CASA workshop participants) -

A couple of months ago I was independently contacted by several students who were curious about computer models of auditory processing and looking for advice on a neat project. Responding to them made me realize that I was a little confused about how far 'my field' extended, and, as a result, I started working on an essay to clarify my ideas. Since one purpose of the essay was to search for a generally-acceptable statement of the 'hard problems' in the field of computational auditory scene analysis, I thought I would send it out to this list.

I enclose an ascii version of the paper; a slightly better-formatted version can be found on the web at:

    http://sound.media.mit.edu/~dpwe/writing/hard-probs.html

It is my hope that the essay, presumptuous as it is, may be the starting point for debate!

--  DAn Ellis  <dpwe(at)media.mit.edu>  <http://sound.media.mit.edu/~dpwe/>
    MIT Media Lab Perceptual Computing - Machine Listening Group.

- - - - - - - - - - ~/tmp/hard-probs.txt - - - - - - - - - -

HARD PROBLEMS IN COMPUTATIONAL AUDITORY SCENE ANALYSIS
Dan Ellis, dpwe(at)media.mit.edu, 1995aug03

// Introduction

One of the difficulties of working in the field of computational auditory scene analysis (CASA) - building computer models of higher auditory functions - is the nebulous nature of its goals. In the related field of speech recognition, it is relatively easy to define a widely-acceptable target, such as a machine that can sit in a meeting room and transcribe the discussion. Machine vision is another analogous field, and it too tends to be very goal-driven (finding the faces in a scene, recognizing particular body gestures); perhaps the fact that most vision researchers have abandoned the idea of a general-purpose scene analyzer in favor of more limited and specific goals should serve as stern advice to researchers in audition.
Outside of speech recognition, similar goal-driven applications don't pop up with the same urgency in the acoustic domain. As a result, there are wide differences of opinion over the essence of computational auditory scene analysis; the body of researchers who identify with the CASA banner can sometimes feel perplexingly out of sympathy with the colleagues they find beneath it.

This paper is an effort towards ameliorating that confusion by offering a common focus for the field in the form of a description of a set of hard problems. These might constitute a starting point for a debate within our community over what truly are the questions that we should be trying to answer. It is unlikely (and of questionable desirability) that a neat consensus will result, with everybody persuaded to work on the same goals. But it would be valuable to have an overt description of the different perspectives in the field, and a statement of the common problems that may be being studied by several researchers using subtly different formulations.

// Aiming high : holy grails

If a talented student expresses an interest in auditory information processing and its modeling, what guidance might he or she be given concerning an area to study? This question has obvious practical relevance, since we who believe in the importance and interest of this area presumably wish to encourage its growth. While one sure way of discouraging potential recruits is to direct them towards an intractable problem, I feel that identifying the ideal goals, the `holy grails' of the field, would help both in motivating research and in identifying relevant and valuable areas for work. Here are my proposals for this category:

The sound-scene describer. This is a program that processes a real environmental acoustic signal and converts it into a symbolic description corresponding to a listener's perception of the different sound events and sources that are present.
The description might be verbal, akin to that which a person might produce if asked to describe the sound-scene, or an analogous abstract representation. Applications for this kind of system include aids for the deaf (to convert acoustic cues into text or another modality) and automatic indexing of soundtrack data (e.g. to find explosions or helicopter sounds in a database of movie audio).

The source-separator or `unmixer'. Rather than converting a sound mixture into an abstract description, one could imagine a machine that takes a single input and produces several separate output channels, each composed of the sound from a single source in the input. In most cases, human listeners would be able to judge if such a system was `working', i.e. whether the separated outputs matched the listener's internal perception of the different contributions; this is the closest we have to a rigorous formulation of the problem. Applications for a system of this kind include the restoration of recordings corrupted by unwanted noises (e.g. coughs at a concert) or hearing-aids for cocktail-party situations.

Predictive human model. A major obstacle to certain research projects is that listening tests must be included in the development loop - a perfect example being high-quality masking-based audio compression. In theory, an automatic system could process a sound to predict its subjectively-rated similarity to an original, and the obtrusiveness of any distortion introduced. (Of course, the understanding of human perception that permitted the construction of the model would also have a profound influence on the design of such encoding algorithms.) The insights afforded by such a system would also inform a range of activities from treatment of hearing loss to entertainment sound-design.

// Hard Problems

The goals and applications described above are unlikely to be achieved in the short term, but they comprise a context within which to propose and compare more feasible projects.
The task of identifying the `hard problems' in the field thus becomes a question of focusing on the major stumbling-blocks separating us from these ultimate goals. Since the goals are distant, the stumbling blocks are also indistinct, but the following constitute my perception of the critical breakthroughs that need to be made:

The nature of cues: While the importance of certain cues (such as those discussed below) is generally accepted, it is likely that there are more subtle cues being used that we have not yet uncovered. For example, the phenomenon of comodulation masking release, where different frequency channels are fused strongly on the basis of shared aperiodic modulation, would seem to present tantalizing evidence for a broader mechanism of across-frequency fusion.

Onset and common-period detectors: These strongest of cues to fusion and event formation still elude really convincing signal-processing implementations, despite numerous attempts. Simple first-order differencing on energy in each frequency channel is confused by sweeping tones, and harmonic trackers have difficulty deciding if certain frequency ratios are adequate for fusion. These kinds of low-level cue detectors must be ripe for definitive modeling, although the trick may lie in their codependence on higher-level analysis.

Binaural cue detection: The correct detection and integration of interaural timing and level differences is probably closer to a satisfactory model, although using these to partition the sound energy into separate objects presumably still relies on integration with as-yet unknown higher-level functions.

Factoring-out channel characteristics: Human listeners are highly successful at ignoring all but the most extreme of fixed colorations and blurrings resulting from fixed acoustic channel characteristics (e.g. room reverberation).
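(As an aside, a crude engineering analogue of this factoring-out exists under the strong assumption of a fixed, linear channel: such a channel adds the same constant to every frame of a log-spectrogram, so subtracting the per-channel time average cancels it. The sketch below is purely illustrative, my own toy example rather than anything proposed in the literature discussed here, and certainly not a model of the auditory mechanism.)

```python
import numpy as np

def remove_fixed_coloration(log_spectra):
    """Subtract the per-channel time average of a log-spectrogram.

    Assumes the channel is fixed and linear, so its log frequency
    response adds the same constant to every frame; averaging over
    time and subtracting removes it.  log_spectra has shape
    (n_frames, n_channels).
    """
    return log_spectra - log_spectra.mean(axis=0, keepdims=True)

# Toy check: a 'source' log-spectrogram plus a fixed coloration
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 16))       # clean log-spectra
coloration = np.linspace(-3.0, 1.0, 16)   # fixed channel response (log)
observed = source + coloration            # same offset added to every frame
restored = remove_fixed_coloration(observed)
# restored equals the mean-normalized source; the coloration is gone
```

Real reverberation, of course, also smears energy across time in a way that no fixed spectral correction can undo, and listeners plainly do far more than this.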
This must be achieved through a combination of low-level suppression of reflections with more abstract steadiness constraints, though their precise nature remains mysterious.

Event formation: The core of most work on computational auditory scene analysis has consisted of cue detectors driving an algorithm that simulates the fusion of energy from different frequency bands into single `perceptual events'. A proper model of human performance would deal with a broader range of event classes.

Properties of events: Distinct from the local attributes of acoustic energy that are the cues to event formation, each distinctly-perceived sound object has its own global properties, such as pitch and `timbre', that are somehow derived from its components. Choosing the right representation for these properties and discovering how they are calculated is a prerequisite for successful models of higher stages of abstraction.

Sequential processing and stream formation: Beyond the level of auditory events is the process of grouping events into `streams' - sets of distinct sounds perceived as arising from a single source. The manner in which these patterns of temporally-distinct energy are processed and organized involves a completely different set of principles from those governing the formation of events. In all likelihood, modeling this process will require adapting the low-level processing to the expectations derived at this and higher levels.

Short-term context-sensitivity: Many psychoacoustic phenomena (such as those associated with the `old-plus-new' principle) underscore the importance of short-term context/expectation/potentiation in auditory perception. This is not easily incorporated into largely stateless signal-processing front ends, whose adaptability is generally limited to automatic gain control.

Internal representation and storage: Human listeners are able to remember, generalize and classify instances of sound events.
The imprecise nature of this process, whereby every unique bark of a dog sounds somewhat the same, presents an interesting representational challenge: extracting and storing only the `important' parts of each sound-object. The problem of recognition may appear distinct from that of segregation/organization, but in practice the partial detection of a known sound is bound to influence the organization process.

Constructive analysis of mixtures: While the preceding issues apply even to the treatment of isolated sound events, our real interest lies in the ability of the auditory system to handle complex mixtures of sound that overlap in time and frequency. Illusion and restoration phenomena suggest that this is a constructive process, i.e. a question of coming up with a hypothetical real-world situation that would be consistent with the received acoustic signal, rather than directly deducing each scene element from parts of the input data. The finer detail of this process remains unclear.

Evidence integration: Sound-organization systems often face the problem of combining information obtained from a wide range of cue detectors and other sources, such as vision and other modalities. A typical approach has been to implement an algorithm that combines types of cue in a fixed sequence. In contrast, the robustness of the human listener under a wide range of confusions implies that a more adaptive or general process is at work. Principled evidence integration (such as Bayesian belief networks) seems closer to the right approach.

Neural plausibility: While the fact that we are trying to model a system built out of neurons is often ignored (perhaps wisely), the question of how a given algorithm might be implemented in a biological brain defines a boundary around the kinds of models we can reasonably propose.
Unfortunately, the structure of the digital computer and its common programming languages is very far removed from the brain's architecture; this gap (and its impact on models) might be reduced with a more brain-like (parallel, distributed) computational paradigm.

// Projects for enthusiasts

The previous list presents a set of intellectual problems that need eventually to be solved, but for which no solution seems likely in the short term. To return to our original scenario of the enthusiastic student looking for a topic, it might be useful to accumulate a set of `ideologically-approved' projects that will encourage the researcher to think about the problems we consider important while at the same time advancing our efforts towards solving them. Here are a few suggestions:

Breaking-glass detector. This idea was actually suggested by Josh Wachman, a student of Roz Picard here at the Media Lab. Their interest is in automatic media annotation, specifically the idea of using the soundtrack as well as the moving image to derive information from a recording. Their domain of action movies contains many catastrophic events (explosions, crashes, things shattering) with ecologically-characteristic transient sound patterns. Compared to, say, detecting the sound of a car engine, it should be relatively easy to pick out many of the gunshots and punches, and classify them according to a few parameters derived from their spectra.

Voice counter/streamer. In a similar domain, soundtracks known to contain principally speech might be processed by today's harmonic-sound extractor algorithms to detect all the voiced-syllable entities, which could then be streamed into separate monologues and possibly identified with known participants based on larger-scale statistics such as pitch range and syllable rate. (At the Media Lab, Michael Hawley and the students of Chris Schmandt have considered systems of this ilk.)

Sound similarity model.
High-performance sound compression algorithms look for the loosest approximation to the original signal that still sounds good to a human listener. This is a highly complex and poorly-understood criterion, but a similarity metric that ignored static phase and magnitude distortion, while emphasizing gating of high-frequency energy, seems technically feasible and might be a useful approximation to the `human model' holy grail. The wideband audio coding community is the natural home for this work.

Constructive explanation in restricted domains. The idea of explaining a signal by guessing the components that have caused it and then checking what they would predict (forward modeling), rather than deriving their characteristics directly from the resulting signal (backward modeling), entails a new kind of algorithm that is currently little understood. Defining a somewhat tractable problem of this kind might entail dramatically restricting the domain to, e.g., only noise bursts or steady tonal events. Such a `toy problem' could yield valuable insights into the general properties of such analysis-by-synthesis systems.

Streaming systems. One popular topic for sound-organization models has been the simulation of musical streaming and the reproduction of phenomena such as the `trill threshold' (e.g. Beauvois & Meddis, and the neural-network system of Brown & Cooke). While the ecological significance of these stimuli is a little obscure, it is a neat place to start, with plenty of experimental results to match. Streaming systems that better explain the influence of `timbre' would be a worthy achievement.

Short-term learning of percussive music. I have thought about using some of the techniques of machine learning to build a system that derives the minimum set of individual spectra which can be combined to form an observed series of composite events.
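In the simplest additive, noise-free case, and with the pattern of which instruments sound in which event assumed known (an assumption the real problem does not grant: there, the activations and the spectra must be inferred jointly), the idea reduces to a linear solve. The following toy sketch is mine, offered purely to fix the idea:

```python
import numpy as np

# Toy sketch: composite percussion events as additive mixtures of a
# small set of fixed instrument spectra.  'activations' records which
# instruments sound in each observed event (here assumed known).
activations = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [1, 1, 0],
                        [1, 0, 1],
                        [0, 1, 1],
                        [1, 1, 1]], dtype=float)

rng = np.random.default_rng(1)
true_spectra = rng.uniform(0.1, 1.0, size=(3, 8))  # 3 instruments, 8 bands
observed = activations @ true_spectra              # composite event spectra

# With the activations known and of full column rank, the individual
# instrument spectra are recovered by a least-squares solve.
est_spectra, *_ = np.linalg.lstsq(activations, observed, rcond=None)
# est_spectra matches true_spectra exactly in this noise-free toy
```

A real system would of course also have to hypothesize the activations, and to revise its inferred set of spectra as evidence accumulates, as when two previously unison instruments separate.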
The exemplary domain for this is dense percussion music, where instruments rarely occur in isolation but are usually cosynchronous with their peers. Listening to such music, one rapidly builds up an idea of the identity and number of instruments present, which can suddenly alter if two previously unison instruments separate. This indicates the important role of short-term memory in event fusion, i.e. that common onset is a flexible cue strongly influenced by recent experience.

// Conclusion

In an effort to define a unifying focus for researchers modeling higher auditory functions, I have listed my vision of what the ultimate goals of this work might be, and what specific discoveries or techniques must be developed to get us there. I have added a collection of more tractable projects based on the same ideas, perhaps to provide some inspiration for newcomers to the field. I do not presume to have done a definitive job with any of these lists, but I hope that other members of the community will share my interest in producing this kind of manifesto, and will either suggest some of their ideas for inclusion in future versions of this document, or produce alternative versions of their own.

// Acknowledgments

This paper has had the benefit of direct input from the following people, whose contributions are gratefully acknowledged: Bill Gardner, Kunio Kashino, Keith Martin, David Rosenthal and Lonce Wyse.

Copyright (c) 1995 Dan Ellis. You may redistribute this article to anyone for any non-commercial purpose. The current version is available at:

    http://sound.media.mit.edu/~dpwe/writing/hard-probs.html

- - - - - - - - - - - - - - - - - - - - - - - - - - -

This message came from the mail archive
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University