
Re: speech/music



Sue/List -

I can't say I know a good answer to the speech/music problem, but having
been named by Al, I'll try to outline my take on its solution:

Sue Johnson asked:
>> Has anyone any ideas on how you might be able to recognise the presence
>> of speech when music is also present.  Does it have any special
>> psycho-acoustic properties?

I'm sure it does - the 'characteristic spectro-temporal properties' alluded
to in the sinewave speech work of Remez et al - but I don't believe that
they are simple, low-level features you can pick out by running a 4 Hz
bandpass filter over the subband envelopes (as might be suggested by
Houtgast & Steeneken).  Rather, I think a lot of our uncanny ability to
pick out speech from the most bizarrely-distorted signals comes from the
highly sophisticated speech-pattern decoders we have, which employ all
of the many levels of constraint applicable to speech percepts (from
low-level pitch continuity through to high-level semantic priors) to
create and develop plausible speech hypotheses that can account for
portions of the perceived auditory scene.
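
(For concreteness, here is roughly the kind of low-level feature I mean -
and consider insufficient on its own: the energy of ~4 Hz modulations in a
handful of subband envelopes, sketched in Python.  The band edges, rates
and filter orders are purely illustrative choices, not anyone's published
recipe.)

import numpy as np
from scipy.signal import butter, hilbert, sosfilt

def four_hz_modulation_energy(x, sr,
                              bands=((300, 800), (800, 2000), (2000, 4000))):
    """Mean normalized energy of ~4 Hz envelope modulation across subbands."""
    energies = []
    for lo, hi in bands:
        # Bandpass the waveform into one subband.
        sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
        sub = sosfilt(sos, x)
        # Amplitude envelope from the analytic signal.
        env = np.abs(hilbert(sub))
        # Crude decimation of the envelope to ~100 Hz (good enough here).
        step = max(1, int(sr // 100))
        env = env[::step]
        env_sr = sr / step
        # Bandpass the envelope around 4 Hz - roughly the syllable rate.
        sos_mod = butter(2, [2.0, 8.0], btype='bandpass', fs=env_sr,
                         output='sos')
        mod = sosfilt(sos_mod, env - env.mean())
        energies.append(np.mean(mod ** 2) / (np.mean(env ** 2) + 1e-12))
    return float(np.mean(energies))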

On the face of it, this might be quite a depressing conclusion for
researchers hoping to address the automatic distinction of speech and
music, since any model that incorporates these kinds of abstract
constraints must surely be extremely complex and 'intelligent'.  But in
fact we do already have models that do many of these things - namely,
current speech recognizers, whose language models embody all kinds of
abstract linguistic knowledge, and whose HMM-style acoustic models
provide a very effective first approximation to the kinds of flexible
spectro-temporal constraints that human listeners must use.
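
(To make that concrete, here is a toy of the machinery I'm referring to:
a Viterbi decode that, at each frame, adds an acoustic log-likelihood for
each state to a transition log-probability standing in for the linguistic
prior.  The numbers below are invented purely to exercise the function; a
real recognizer does the same thing at vastly larger scale, with the
transition scores coming from pronunciation and language models.)

import numpy as np

def viterbi(log_acoustic, log_trans, log_init):
    """log_acoustic: (T, S) frame-by-state scores; log_trans: (S, S);
    log_init: (S,).  Returns the highest-scoring state sequence."""
    T, S = log_acoustic.shape
    score = log_init + log_acoustic[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans     # previous state -> next state
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_acoustic[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two states over five frames, with made-up probabilities.
log_acoustic = np.log(np.array([[0.9, 0.1],
                                [0.8, 0.2],
                                [0.3, 0.7],
                                [0.2, 0.8],
                                [0.7, 0.3]]))
log_trans = np.log(np.array([[0.9, 0.1],
                             [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
print(viterbi(log_acoustic, log_trans, log_init))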

However, it's within the speech recognition community that much of the
interest in speech/music distinction emerges, showing that the current
level of technology certainly hasn't solved this problem.  If you look at
how speech recognition works, I think it's not hard to understand why: the
basic operation of hidden Markov model decoding is to score the likelihood
that a current acoustic 'feature vector' corresponds to a particular speech
sound (a phoneme in a given context, for example).  But this operation
doesn't account for the possibility that there might be *other* energy
represented by the vector in addition to the speech - the added sound,
be it white noise or background music, shifts the feature dimensions and
pushes the vector outside the region covered by the training examples,
so the recognizer concludes that the frame either looks like some other
phoneme or doesn't look like any phoneme at all.  (The
near-universal use of cepstral coefficients, which spread changes in
single spectral coefficients over every feature dimension, certainly
reinforces this problem.)
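
(A toy illustration of both points, with entirely made-up numbers: adding
energy in a single spectral channel moves every cepstral coefficient, and
the shifted vector typically scores far worse under a diagonal-Gaussian
state model trained only on clean frames.)

import numpy as np
from scipy.fftpack import dct
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Pretend log-spectral frames (20 channels) for one phoneme state.
clean_frames = rng.normal(size=(500, 20))
ceps = dct(clean_frames, type=2, axis=1, norm='ortho')[:, :13]

# Diagonal-Gaussian "acoustic model" fitted to the clean cepstra.
mean, var = ceps.mean(axis=0), ceps.var(axis=0) + 1e-6
model = multivariate_normal(mean=mean, cov=np.diag(var))

# A new clean frame, and the same frame with extra energy in one channel
# (say, a band of background music).
frame = rng.normal(size=20)
noisy = frame.copy()
noisy[7] += 5.0

c_clean = dct(frame, type=2, norm='ortho')[:13]
c_noisy = dct(noisy, type=2, norm='ortho')[:13]

print("cepstral change per dimension:", np.round(c_noisy - c_clean, 2))
print("log-likelihood, clean frame:  ", model.logpdf(c_clean))
print("log-likelihood, noisy frame:  ", model.logpdf(c_noisy))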

The work being done by Martin Cooke et al. at Sheffield on missing-data
recognition, where certain spectral regions may be excluded from the
likelihood calculation if they are known to be dominated by the nonspeech
components, is one way to address this.  The so-called "HMM decomposition"
first suggested by Roger Moore, which computes new models for the feature
vectors of each speech sound in combination with each possible 'state' of
the nonspeech interference, is another theoretically attractive solution,
with obvious practical difficulties if a wide range of interference is to
be covered.  I think some kind of approach which, like these two,
recognizes that a sound scene may consist of multiple sources, and then
attempts to account separately for each of them - i.e. computational
auditory scene analysis - is very clearly the most satisfying solution to
the problem.
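
(To show how simple the core likelihood change is under a diagonal-Gaussian
state model, here is a rough sketch of the missing-data idea: dimensions
flagged as dominated by interference are simply left out of - marginalised
from - the likelihood product.  The reliability mask is assumed to be
given, and estimating it is of course the hard part; the numbers are made
up.)

import numpy as np

def masked_log_likelihood(frame, mean, var, reliable):
    """Log-likelihood of `frame` under a diagonal Gaussian, using only the
    dimensions marked True in `reliable`."""
    d = frame[reliable] - mean[reliable]
    v = var[reliable]
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * v) + d * d / v))

# A 6-channel frame where channels 2 and 3 are judged to be dominated by
# background music.
mean = np.zeros(6)
var = np.ones(6)
frame = np.array([0.1, -0.2, 4.0, 3.5, 0.0, 0.3])  # big excursions in 2, 3
mask = np.array([True, True, False, False, True, True])

print("all channels:     ",
      masked_log_likelihood(frame, mean, var, np.ones(6, bool)))
print("reliable channels:", masked_log_likelihood(frame, mean, var, mask))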