Re: CASA problems and solutions (DeLiang Wang)

Subject: Re: CASA problems and solutions
From:    DeLiang Wang  <dwang(at)CIS.OHIO-STATE.EDU>
Date:    Tue, 30 Jan 2001 13:58:46 -0500

Dear Dr. Bates,

There is a distinction between auditory scene analysis (ASA) as defined by Bregman, and computational auditory scene analysis (CASA). From my reading of your writeup, your criticisms are targeted at ASA. To a CASA researcher like me, the problem is well-defined: how to extract auditory events (objects) from one sensor (monaural), two sensors (binaural), or more (sensor array). This definition is, in the Marrian framework, at the level of computational theory, independent of Fourier analysis, cochlear models, or even the auditory system. One may insist on biological plausibility, or one may pay close attention to how the auditory system solves the problem (which we don't know) in order to borrow ideas and solutions. But, regardless of approach, the CASA problem remains unsolved. I think that a lot of people are not really biased in terms of approaches, and some frankly don't care about human audition. This speaks to the challenge of the problem itself.

Moreover, one need not be too pessimistic. Think about computer vision, where A LOT more people have been working, and artificial intelligence.

What I am getting at is that, if you can manage to separate multiple sources spatially and reconstruct them, it will be a great technological breakthrough. I don't know if you can do it and I have doubts (see below), but we will study your approach.

Location-based source separation has been attempted before with some success. But it is far from solving the CASA problem. Successes in a few demos are far from a general solution. There are well-documented test databases to measure success in CASA in a systematic way (see references below). Results on these databases would be a lot more revealing.

Since you have made a connection with vision, spatial analysis in audition would roughly correspond to depth perception from two images, a problem that has been studied since the early days of computer vision. The challenge there, which remains to this day, is the correspondence problem: which pixels of one image correspond to which pixels of the other. It's hard to find a solution to the correspondence problem without image analysis (grouping and segmentation). Similarly, it's hard for me to imagine a solution to CASA purely on the basis of spatial analysis without other cues of ASA. The replies by Al Bregman and John Culling suggest that location may not even be a major cue for ASA. I'd like to be proven wrong, so that at least we have one solution to rely on.

Some recent references about CASA:

D.F. Rosenthal and H.G. Okuno, Eds. Computational auditory scene analysis. Mahwah NJ: Lawrence Erlbaum, 1998.

D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Net., vol. 10, pp. 684-697, 1999. (PDF file available on my web site.)

Cheers,

DeLiang Wang

--
------------------------------------------------------------
Dr. DeLiang Wang
Department of Computer and Information Science
The Ohio State University
2015 Neil Ave.
Columbus, OH 43210-1277, U.S.A.

Email: dwang(at)
Phone: 614-292-6827 (OFFICE); 614-292-7402 (LAB)
Fax: 614-292-2911
URL:

This message came from the mail archive
maintained by:
DAn Ellis
Electrical Engineering Dept., Columbia University