
From:    Dan Ellis  <dpwe(at)MEDIA.MIT.EDU>
Date:    Tue, 8 Dec 1992 12:23:18 -0500

In response to David Huron's interesting description of his project in computational stream formation: I have been working on a similar project for a while. To begin with, I focused on a categorical representation of the acoustic energy, breaking it up into clearly delimited sinusoid-like components. While that proved hard and is not yet solved to my satisfaction, I have recently been considering the higher-level problem of forming these into sources.

The way I tend to think about it is in terms of _fusion_ rather than _segregation_, at least in the first instance, although the distinction is subtle. I am considering fusion because it seems such a powerful and early process: my first level of analysis would analyse a periodic stimulus into harmonics and formant bursts, but it is, of course, very difficult to hear such a stimulus as anything but a single sound. Put another way, the impression of 'one sound' seems more solid, more distinct than the impression of 'more than one sound', so maybe the best way to make a computer recognize the latter condition is to make it do a good job of recognizing the former, and then see when it breaks or gets confused. (I see David's use of the Pitch Model as equivalently placing fusion in the first stage, in that pitch is an attribute of a fused stimulus.)

Again, as David points out, the problem at the higher level is `filling in gaps' and making the right kinds of heuristic associations between data distinct in time and frequency. I am working with the simultaneous use of rules such as harmonicity, common onset and common modulation to form networks of basic elements with a high likelihood of fusing. But I am getting a strong sense that, to get results of comparable robustness to human source formation, it is necessary to have very high-level models (hypotheses) of what the sound `is' or what has generated it.
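To make the "networks of basic elements" idea concrete, here is a minimal sketch (not the author's actual system) of grouping sinusoid-like components into fusion candidates using two of the cues named above, harmonicity and common onset. The `Component` fields, the tolerance values, and the pairwise-link-plus-connected-components formulation are all illustrative assumptions.

```python
# Illustrative sketch: link components that share an onset and stand in
# a near-harmonic frequency relation, then take connected components of
# the resulting network as candidate fused sources.
from dataclasses import dataclass

@dataclass
class Component:
    onset: float  # onset time in seconds (assumed field)
    freq: float   # frequency in Hz (assumed field)

def likely_to_fuse(a, b, onset_tol=0.03, harmonic_tol=0.03):
    """True if two components have a common onset (within onset_tol)
    and a near-integer frequency ratio (harmonicity). Thresholds are
    arbitrary illustrative values."""
    if abs(a.onset - b.onset) > onset_tol:
        return False
    ratio = max(a.freq, b.freq) / min(a.freq, b.freq)
    return abs(ratio - round(ratio)) < harmonic_tol * ratio

def fusion_groups(components):
    """Union-find over pairwise fusion links; each connected component
    of the link network is one candidate fused source."""
    parent = list(range(len(components)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if likely_to_fuse(components[i], components[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(components):
        groups.setdefault(find(i), []).append(c)
    return list(groups.values())

# Three harmonics of 220 Hz starting together fuse into one group;
# a later-onset 330 Hz component stays separate.
comps = [Component(0.0, 220.0), Component(0.0, 440.0),
         Component(0.0, 660.0), Component(0.5, 330.0)]
print([len(g) for g in sorted(fusion_groups(comps), key=len)])
```

Note that 440 Hz and 660 Hz are not directly linked (ratio 1.5), but they end up in the same group through their shared links to the 220 Hz component, which is the point of forming a network rather than demanding that every pair satisfy every rule.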
Such a model ("clarinet playing middle C") can be strongly supported in one region of time-frequency, and can then use the plausibility so established to mop up more weakly-associated data in a different region. But the problem of acquiring and triggering such high-level models seems very hard.

On the acoustic/auditory stream debate: I think that auditory streams are the only ones worth worrying about. I don't really see acoustic streams as being well defined, since the physical origins of different components of a sound can be arbitrarily close. The sound of a guitar string being plucked consists of an initial transient of the pluck-click, perhaps mainly radiated from the pick, followed by the periodic oscillation of the string: are these separate acoustic sources? For me, the only interesting definition of a source is the psychological one, i.e. an ensemble of acoustic energy perceived as a single event or entity.

I'd be very interested in any comments from readers of AUDITORY.

DAn Ellis, MIT Media Lab Perceptual Computing Group.
<dpwe(at)media.mit.edu>

This message came from the mail archive
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University