[no subject]

system operates by maintaining a collection of 'world-model' hypotheses,
giving rise to an aggregate prediction for the observed features which is
then reconciled to actual observations by adjusting the model hypotheses.
With this kind of system, the way to detect speech against music is to run
the system: if the speech hypotheses are able to find enough support in
the observations, they will be invoked and pursued.  Never mind that there
may be lots of other stuff going on at the same time - the system would
generate other, independent hypotheses to account for that energy, and the
process of combined-prediction and reconciliation would absorb the
distortion and masking of overlapped features.
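
As a toy illustration of the predict-and-reconcile loop described above
(not the actual system, which is not yet implemented), the sketch below
makes some strong simplifying assumptions: each hypothesis is a fixed
feature template with a single non-negative gain, the aggregate prediction
is the plain sum of the scaled templates, and reconciliation is a
projected gradient step on the squared prediction error.  The names
reconcile, is_supported, SPEECH and MUSIC, the four-element templates and
the 0.1 threshold are all hypothetical, chosen only for this example.

```python
# Toy predict-and-reconcile loop. Every name and number here is an
# illustrative assumption, not part of any real system.

# Hypothetical feature templates (e.g. energy in four frequency bands).
SPEECH = [1.0, 0.8, 0.1, 0.0]
MUSIC = [0.0, 0.2, 1.0, 0.9]


def reconcile(observed, templates, steps=200, rate=0.1):
    """Fit non-negative gains so the summed templates approximate `observed`."""
    gains = {name: 0.0 for name in templates}
    for _ in range(steps):
        # Aggregate prediction: sum of every hypothesis's contribution.
        predicted = [
            sum(gains[name] * t[i] for name, t in templates.items())
            for i in range(len(observed))
        ]
        residual = [o - p for o, p in zip(observed, predicted)]
        # Reconcile: move each gain toward explaining the residual,
        # clipping at zero so a hypothesis can only add energy.
        for name, t in templates.items():
            grad = sum(r * x for r, x in zip(residual, t))
            gains[name] = max(0.0, gains[name] + rate * grad)
    return gains


def is_supported(gains, name, threshold=0.1):
    """A hypothesis counts as 'invoked' if its fitted gain is large enough."""
    return gains[name] > threshold


if __name__ == "__main__":
    templates = {"speech": SPEECH, "music": MUSIC}
    # Speech overlapped with louder music.
    mixture = [0.7 * s + 1.2 * m for s, m in zip(SPEECH, MUSIC)]
    gains = reconcile(mixture, templates)
    print(gains, is_supported(gains, "speech"))
```

In this sketch the speech hypothesis keeps its support even though the
music is louder, because the music hypothesis accounts for the
overlapping energy instead of that energy being counted against speech.
With a music-only input the speech gain falls back toward zero.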

My use of the conditional will have tipped you off that I'm not
particularly close to implementing this system.  I am, however, trying -
see, for instance, my paper from Mohonk last year which is available at
http://www.icsi.berkeley.edu/real/papers.html .

It seems a shame for the speech recognition community to put effort into
pattern-recognition solutions for detecting the music in signals so that
they can avoid running recognition over those episodes, when the development
of a more flexible and general-purpose model of the signal itself, one that
accepted that most sounds consist of more than one source, might solve not
only the problem of detecting when music is present, but also the
recognition of the simultaneous speech.  I don't think it's possible to
detect the presence of speech with anything less complex than a speech
recognizer, but I do feel that a speech recognizer is probably
three-quarters of the solution.

Not exactly an answer to the original question, but an alternative
perspective that I hope may be of interest!

-- DAn Ellis  <dpwe@icsi.berkeley.edu>  <http://www.icsi.berkeley.edu/~dpwe/>
   International Computer Science Institute  Berkeley  CA