Dissertation: Prediction-driven Computational Auditory Scene (Dan Ellis)

Subject: Dissertation: Prediction-driven Computational Auditory Scene
From:    Dan Ellis  <dpwe(at)MEDIA.MIT.EDU>
Date:    Wed, 15 May 1996 15:58:42 -0400

After busting through umpteen deadlines of varying degrees of severity, I have actually finished my Ph.D. The dissertation might be of interest to people on this list, so I am including the abstract at the bottom of this message. You can also read the abstract and contents, download the postscript of the whole document, and listen to the sound examples at its web site:

  http://sound.media.mit.edu/~dpwe/pdcasa/

Personal note: I have now left MIT to be a post-doc with Nelson Morgan's speech recognition group at the International Computer Science Institute, attached to the University of California, Berkeley. I hope to find ways to use the ideas of computational auditory scene analysis for the practical benefit of speech recognition systems. I have a new email address, dpwe(at)icsi.berkeley.edu, but will probably retain my MIT address for the foreseeable future.

  DAn.

- - - - - - - - - - ~/public_html/pdcasa/front.txt - - - - - - - - - -

Prediction-driven computational auditory scene analysis

by Daniel P. W. Ellis

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering at the Massachusetts Institute of Technology, June 1996.

ABSTRACT

The sound of a busy environment, such as a city street, gives rise to a perception of numerous distinct events in a human listener - the `auditory scene analysis' of the acoustic information. Recent advances in the understanding of this process from experimental psychoacoustics have led to several efforts to build a computer model capable of the same function. This work is known as `computational auditory scene analysis'. The dominant approach to this problem has been as a sequence of modules, the output of one forming the input to the next. Sound is converted to its spectrum, cues are picked out, and representations of the cues are grouped into an abstract description of the initial input.
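[Editor's illustration - not part of the original message.] The data-driven pipeline the abstract describes (spectrum, then cues, then grouping, each stage feeding only forward) can be sketched roughly as follows. This is a minimal stand-in, not the dissertation's actual front end: the function names, the thresholded-cell "cues", and the onset-based "grouping" are all invented for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame=256, hop=128):
    """Short-time spectrum: slice the waveform into frames and take |FFT|."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def pick_cues(spectrum, threshold=0.5):
    """Pick out crude cues: time-frequency cells well above the global mean."""
    return spectrum > threshold * spectrum.mean()

def group_cues(cue_mask):
    """Group cue-bearing frames into runs -- a toy stand-in for grouping rules."""
    onsets = np.flatnonzero(cue_mask.any(axis=1))
    return np.split(onsets, np.flatnonzero(np.diff(onsets) > 1) + 1)

# Each stage's output is the next stage's input; nothing flows back upstream,
# so the interpretation of a sound cannot depend on its context.
signal = np.random.randn(4096)
events = group_cues(pick_cues(stft_magnitude(signal)))
```

The one-way data flow is the point: the weaknesses discussed next (context-insensitivity, inability to infer masked sounds) follow from the absence of any feedback path in this structure.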
This `data-driven' approach has some specific weaknesses in comparison to the auditory system: it will interpret a given sound in the same way regardless of its context, and it cannot `infer' the presence of a sound for which direct evidence is hidden by other components.

The `prediction-driven' approach is presented as an alternative, in which analysis is a process of reconciliation between the observed acoustic features and the predictions of an internal model of the sound-producing entities in the environment. In this way, predicted sound events will form part of the scene interpretation as long as they are consistent with the input sound, regardless of whether direct evidence is found.

A blackboard-based implementation of this approach is described which analyzes dense, ambient sound examples into a vocabulary of noise clouds, transient clicks, and a correlogram-based representation of wide-band periodic energy called the weft.

The system is assessed through experiments that firstly investigate subjects' perception of distinct events in ambient sound examples, and secondly collect quality judgments for sound events resynthesized by the system. Although the resyntheses were rated as far from perfect, there was good agreement between the events detected by the model and by the listeners. In addition, the experimental procedure does not depend on special aspects of the algorithm (other than the generation of resyntheses), and is applicable to the assessment and comparison of other models of human auditory organization.

- - - - - - - - - - - - - - - - - - - - - - - - - - -
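[Editor's illustration - not part of the original message.] The reconciliation idea at the heart of the prediction-driven approach can be caricatured in a few lines. This is only a sketch of the principle, not the dissertation's blackboard system: the `SoundElement` class, the `reconcile` function, and the slack parameter are hypothetical. The key asymmetry it shows is that a predicted element is discarded only when the observation actively contradicts it; extra energy that merely masks it is still consistent with its presence.

```python
import numpy as np

class SoundElement:
    """A hypothesized source (e.g. a noise cloud) with a predicted spectrum."""
    def __init__(self, predicted):
        self.predicted = np.asarray(predicted, dtype=float)
        self.active = True

def reconcile(elements, observed, slack=0.1):
    """Keep each element while the observation could still contain it.

    An element is contradicted only where observed energy falls BELOW its
    prediction (beyond some slack); energy above the prediction is ambiguous,
    so masking by other sources is not evidence against the element.
    """
    observed = np.asarray(observed, dtype=float)
    for el in elements:
        if np.any(observed < el.predicted - slack):
            el.active = False  # the observation rules this element out
    return [el for el in elements if el.active]

# A steady hum (flat prediction) stays in the interpretation even when a
# loud transient masks its first two channels: the extra energy there is
# consistent with the hum plus something else.
hum = SoundElement([1.0, 1.0, 1.0, 1.0])
survivors = reconcile([hum], observed=[5.0, 4.0, 1.2, 1.0])
```

In a data-driven pipeline the masked hum would simply vanish from the output; here it survives because nothing in the observation contradicts the model's prediction of it.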

This message came from the mail archive
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University