Re: CASA problems and solutions



Dear Al and List,
   In return, I appreciate your feedback on my essay.  It gives me some
feeling for points that need further explanation.  I'll address a few that
you mentioned, but I can see that it is really necessary for me to finish
the second part of the essay that will supply more detail on how I handle
the higher levels of perception while using the "waveform information
vector" concept.
  Briefly, this process breaks the waveform into particles of information
that can be manipulated according to common features, of which, at the
input level, instantaneous direction of arrival (DOA) is the predominant
one.  After being sorted into coarsely defined streams by DOA, the
particles may then be further separated according to their temporal
features (rather than spectral features).  From this point, the streams
are refined in subsequent stages according to how much attention each
one needs, as judged by its value to the listener's survival.
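
  A minimal sketch of the coarse DOA sort, purely for illustration.  It
assumes each particle has been reduced to a hypothetical (time, DOA) pair
and uses an arbitrary 30-degree bin width; the function name and the sample
events are invented for the example.  It shows only the sorting idea, not
the actual WIV processing.

```python
from collections import defaultdict


def sort_by_doa(events, bin_width_deg=30.0):
    """Group (time_s, doa_deg) events into coarse streams keyed by DOA bin.

    Hypothetical helper: the event format and bin width are assumptions.
    """
    streams = defaultdict(list)
    for time_s, doa_deg in events:
        # Wrap the angle into [0, 360) and quantize it to a coarse bin index.
        bin_index = int((doa_deg % 360.0) // bin_width_deg)
        streams[bin_index].append(time_s)
    # Keep each stream in time order so a later stage can examine its timing.
    return {index: sorted(times) for index, times in streams.items()}


# Two made-up sources, one near 10 degrees and one near 100 degrees.
events = [(0.000, 10.0), (0.001, 95.0), (0.004, 12.0),
          (0.005, 100.0), (0.008, 8.0)]
streams = sort_by_doa(events)
```

  Each resulting stream holds only event times, so a later stage can work
on temporal features alone, as described above.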
  Obviously, DOA is not a prerequisite for sorting sounds and removing
reverberations, but it helps a lot. (Consider the telephone.)  The key
point is that with the high time resolution of the WIVs, the sorting of
overlapping source streams is greatly improved over what can be done with
spectral processing.
   This point can be illustrated by an experiment in which I separate the
voice from the pulse train _without DOA selection_ by selecting only the
voice's periodicity spectrum.  I will see if I have the space on my site to
upload it.

>It is an excellent beginning.  The reason I use the word "beginning" is that
>for humans, and presumably other animals, the use of spatial position is

  Yes, I realize this is just a beginning.  It will be a long haul, but I
think it will be on the right path.

>  I suspect that to replicate the full range of
>human auditory scene analysis (ASA), the attempt to solve the problem
>computationally (CASA) will have to use the same range of environmental
>cues.

   I have a plan for this.  As I mentioned at the end of my essay, the
processing will be done in stages that extract intermediate levels of
meaning.

>Apart from spatial origin, the following sorts of information are used by
>humans:
>
>(A) For integrating components that arrive overlapped in time:
>
>    1.  harmonic relations
>    2.  asynchrony of onset and offset
>    3.  spectral separation
>    4.  Independence of amplitude changes in different
>         parts of the spectrum
>
>(B) For integrating components over time:
>
>    5.  Spectral separation
>    6.  Separation in time (interacts with other factors)
>    7.  Differences in spectral shape
>    8.  Differences in intensity (a weak effect)
>    9.  Abruptness/smoothness of transition from one sound
>         to the next
>
>(I have attached a 2-page summary of what is known about ASA in humans.  As
>well as mentioning factors 1 to 9, it describes the effects of ASA on the
>experience of the listener. I have used it as a handout in talks I have
>given. It is in RTF format which should be readable by most versions of
>Word.)

  Yes, I have read it.  It looks very familiar!

>I'm not sure whether your rejection of the Fourier method extends to all
>methods of decomposing the input into spectral components.  However if it
>does, we should bear in mind that factors 3, 4, and 5, 7, and probably 1,
>listed above, are most naturally stated on a frequency x time
>representation -- that is, on a spectrogram or something like it.
>
>Furthermore, when you look at a spectrographic representation of an auditory
>signal, the visual grouping that occurs is often directly analogous to the
>auditory organization (provided that the time and frequency axes are
>properly scaled).  Why would this be so if some sort of frequency axis were
>not central to auditory perception, playing a role analogous to a spatial
>dimension  in vision?  Perhaps the Fourier transform is not the best
>approach to forming this frequency dimension, but something that does a
>similar job is required.  Finally there is overwhelming physiological
>evidence that the human nervous system does a frequency analysis of the
>sound and retains separate frequency representations all the way to the
>brain.

   Although I do reject the Fourier method, I believe that I have addressed
your requirements in both practice and concept.  You might notice, in the
second set of experiments on my site, that the display format includes a
periodicity spectrum that is a time-domain version of the frequency
spectrum.  The difference is that the "periodicity sorting matrix"
processor instantaneously recognizes mixed periodic events in the stream of
WIVs even though they might be from different sources.  This avoids the
window problem of spectral methods.  Interestingly, it has an inherent
octave-related logarithmic scale that matches the tonotopic arrangement of
the ear.
  More on that in the second section of the essay.
  Again, this method addresses the requirement for periodicity/frequency
perception, but does it in a way that allows mixed sources to be sorted
into separate streams.  It's a trick I once devised for separating and
identifying radar pulse streams.
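
  To make the idea of pulling one periodic stream out of a mixture
concrete, here is a minimal sketch that uses simple interval matching on
event times.  This is a stand-in technique, not the periodicity sorting
matrix itself; the function name, the integer microsecond time stamps, and
the tolerance are all assumptions made for the example.

```python
def extract_periodic(times, period, tol):
    """Return the longest chain of events spaced by about `period`.

    Hypothetical helper: greedy interval matching, for illustration only.
    """
    times = sorted(times)
    best = []
    for start in times:
        chain = [start]
        for t in times:
            if t <= chain[-1]:
                continue  # only look forward in time from the last link
            # Accept the event if its spacing from the last link is
            # within `tol` of the target period.
            if abs((t - chain[-1]) - period) <= tol:
                chain.append(t)
        if len(chain) > len(best):
            best = chain
    return best


# Two interleaved made-up pulse trains (times in microseconds).
stream_a = [0, 100, 200, 300, 400]   # period 100
stream_b = [20, 150, 280, 410]       # period 130
mixed = sorted(stream_a + stream_b)

recovered_a = extract_periodic(mixed, period=100, tol=2)
recovered_b = extract_periodic(mixed, period=130, tol=2)
```

  The sketch works directly on event times, so no analysis window is
involved.  It is deliberately naive: a missed pulse breaks the chain, and
the period must be known in advance.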

>Perhaps I have missed some of the consequences of your method.  If so I
>would be happy to be corrected.

  I realize that the concept is radical, and that is what makes it hard to
translate.  I hope this explanation helps.

   Best wishes,

   John Bates
   Time/Space Systems
   Pleasantville, New York