Subject: Re: speech/music
From: Dan Ellis <dpwe(at)ICSI.BERKELEY.EDU>
Date: Mon, 30 Mar 1998 10:06:39 PST
Sue/List -

I can't say I know a good answer to the speech/music problem, but having been named by Al, I'll try to outline my take on its solution. Sue Johnson asked:

>> Has anyone any ideas on how you might be able to recognise the presence
>> of speech when music is also present. Does it have any special
>> psycho-acoustic properties?

I'm sure it does - the 'characteristic spectro-temporal properties' alluded to in the sinewave speech work of Remez et al - but I don't believe they are simple, low-level features you can pick out by running a 4 Hz bandpass filter over the subband envelopes (as might be suggested by Houtgast & Steeneken). Rather, I think much of our uncanny ability to pick speech out of the most bizarrely distorted signals comes from the very sophisticated speech-pattern decoders we carry, which employ all the many levels of constraint applicable to speech percepts (from low-level pitch continuity through to high-level semantic prior likelihoods) to create and develop plausible speech hypotheses that can account for portions of the perceived auditory scene.

On the face of it, this might be a depressing conclusion for researchers hoping to address the automatic distinction of speech and music, since any model that incorporates these kinds of abstract constraints must surely be extremely complex and 'intelligent'. But in fact we already have models that do many of these things - namely, current speech recognizers, whose language models amount to all kinds of abstract linguistic knowledge, and whose HMM-style acoustic models provide a very effective first approximation to the kinds of flexible spectro-temporal constraints that human listeners must use. However, it is within the speech recognition community that much of the interest in speech/music distinction emerges, showing that the current level of technology certainly hasn't solved this problem.
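To make the "4 Hz bandpass filter over the subband envelopes" idea concrete, here is a minimal sketch of that kind of low-level modulation-energy feature. Everything here is illustrative: the function name, the single 300-3000 Hz band (a real system would use a filterbank), and the 2-8 Hz modulation band are my own choices, not anything from Houtgast & Steeneken.

```python
import numpy as np

def modulation_energy(x, sr, band=(300.0, 3000.0), mod_band=(2.0, 8.0)):
    """Toy 'syllable-rate' feature: how strongly is the envelope of a
    speech-range subband modulated around ~4 Hz?  (Illustrative sketch;
    names and parameter values are invented for this example.)"""
    # Band-limit the signal by zeroing FFT bins (a crude filterbank stand-in).
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / sr)
    X[(f < band[0]) | (f > band[1])] = 0.0
    sub = np.fft.irfft(X, n=len(x))

    # Subband envelope: rectify, then smooth over ~10 ms.
    win = max(1, int(0.01 * sr))
    env = np.convolve(np.abs(sub), np.ones(win) / win, mode="same")

    # Relative modulation depth near the syllable rate.
    ac = env - env.mean()
    P = np.abs(np.fft.rfft(ac)) ** 2 / len(ac) ** 2
    mf = np.fft.rfftfreq(len(ac), 1.0 / sr)
    in_band = P[(mf >= mod_band[0]) & (mf <= mod_band[1])].sum()
    return np.sqrt(2.0 * in_band) / (env.mean() + 1e-12)

# A 4 Hz amplitude-modulated tone (speech-like syllable rhythm) scores
# higher than the same tone held steady (music-like sustained energy).
sr = 8000
t = np.arange(2 * sr) / sr
carrier = np.sin(2 * np.pi * 1000.0 * t)
speechy = (0.5 + 0.5 * np.sin(2 * np.pi * 4.0 * t)) * carrier
steady = carrier
```

Of course, this is exactly the kind of simple detector the paragraph above argues is insufficient: music with a strong beat near 4 Hz will score high too.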
If you look at how speech recognition works, it's not hard to understand why: the basic operation of hidden Markov model decoding is to score the likelihood that the current acoustic 'feature vector' corresponds to a particular speech sound (a phoneme in a given context, for example). But this operation doesn't account for the possibility that the vector also represents *other* energy in addition to the speech: the added sound, be it white noise or background music, pushes the feature vector outside the region covered by the training examples, and the recognizer concludes that the frame either looks like some other phoneme or doesn't look like any phoneme at all. (The near-universal use of cepstral coefficients, which spread a change in a single spectral coefficient over every feature dimension, certainly reinforces this problem.)

The missing-data recognition work of Martin Cooke et al at Sheffield, in which spectral regions known to be dominated by the nonspeech components may be excluded from the likelihood calculation, is one way to address this. The so-called "HMM decomposition" first suggested by Roger Moore, which computes new models for the feature vectors of each speech sound in combination with each possible 'state' of the nonspeech interference, is another theoretically attractive solution, though with obvious practical difficulties if a wide range of interference is to be covered. I think some kind of approach which, like these two, recognizes that a sound scene may consist of multiple sources, and then attempts to account for each of them separately - i.e. computational auditory scene analysis - is very clearly the most satisfying solution to the problem.
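The missing-data idea can be shown in a toy numerical example, assuming a diagonal-Gaussian acoustic model for one speech sound. The function names, the 4-channel 'spectral' feature space, and the numbers are all invented for illustration - a real system would derive the reliability mask from the signal, not assert it by hand.

```python
import numpy as np

def full_loglik(x, mean, var):
    """Standard diagonal-Gaussian frame log-likelihood: every feature
    dimension contributes to the score."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def missing_data_loglik(x, mean, var, reliable):
    """Missing-data variant: marginalize away the dimensions believed to
    be dominated by interference, scoring only the 'reliable' ones."""
    r = np.asarray(reliable, dtype=bool)
    return full_loglik(x[r], mean[r], var[r])

# Toy phone model in a 4-channel 'spectral' feature space.
mean = np.array([1.0, 2.0, 3.0, 4.0])
var = np.ones(4)
frame = mean.copy()
frame[2] += 5.0                 # background music swamps channel 2
mask = np.array([1, 1, 0, 1])   # channel 2 flagged as unreliable

print(full_loglik(frame, mean, var))                # heavily penalized
print(missing_data_loglik(frame, mean, var, mask))  # penalty removed
```

Note that the mask only makes sense for spectral features: after the cepstral transform mentioned above, the corrupted channel is smeared across every dimension and there is nothing left to exclude.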