
Re: speech/music characteristics

Sue Johnson wrote:
> Hi!
> I'm working in speech recognition, and am trying to be able to distinguish
> between speech and non-speech (especially music) sounds in an audio track.
> I wondered if anyone had any ideas (for example from a speech/music
> perception point of view) of the things that characterise music and
> speech. For example, is the periodicity important, or is it to do with
> continuity?

You might have a look at

 Scheirer and Slaney, "Construction and Evaluation of a Robust
   Multifeature Speech/Music Discriminator", Proc IEEE ICASSP

(It's also on Malcolm's WWW page at

While the research was pursued from an engineering viewpoint
rather than a perceptual/scientific viewpoint, some of the
results could be treated as perceptual hypotheses.

We tested 13 features and four classification frameworks; we got
about 1.4% error in the best condition over a broad database
(which is available for comparative testing).  As Prof. Todd
has suggested, rhythmicity is a useful feature, and features
based on the modulation rate of the signal are especially so.  In
speech, the energy and spectral centroid both bounce around
a lot with the rapid formant and voicing changes; music is
usually more static in this regard.
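
To make the last point concrete, here is a minimal sketch of how one
might measure how much the energy and spectral centroid "bounce
around" from frame to frame.  It is not the feature code from the
paper: the function names, the 128-sample frame length, and the use of
plain variance as the "bounciness" measure are all assumptions made
for illustration, and the naive DFT is only there to keep the example
free of dependencies.

```python
import cmath
import math


def frame_features(signal, sample_rate, frame_len=128):
    """Return a list of (rms_energy, spectral_centroid_hz), one per frame.

    Frames are non-overlapping; frame_len is an illustrative choice.
    """
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Naive DFT magnitudes for bins 0 .. frame_len/2 (slow but simple).
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))
        total = sum(mags)
        if total > 0:
            # Magnitude-weighted mean frequency of the frame.
            centroid = sum(k * sample_rate / frame_len * m
                           for k, m in enumerate(mags)) / total
        else:
            centroid = 0.0
        features.append((rms, centroid))
    return features


def variance(values):
    """Population variance, used here as a crude 'bounciness' measure."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def tone(freq_hz, n_samples, sample_rate, amplitude=1.0):
    """Synthetic sine tone, used only to exercise the functions above."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]
```

Under these assumptions, a signal whose spectrum and level change from
frame to frame yields a much larger variance in both features than a
steady tone does.  A real discriminator would need proper windowing,
an FFT, and a trained classifier on top of the features.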

The most promising feature I know of that we didn't test was
reported in Mike Hawley's unpublished dissertation -- he used a
measure of the "flatness" of harmonic partials to identify music.

> How do we know when something is music and something is just noise?
> How does the brain recognise music, how can you recognise both music and
> speech if they are played at the same time..

In both the speech/music case, and the music/noise case, there's
lots of philosophical gray area.  For example, "spoken word
poetry" is often unaccompanied, usually not rhythmic, but is
performed by musicians and sometimes filed under music in record
shops.  I think this is a case that is not particularly amenable
to pattern-recognition-style analysis.  There are many cases like
this along the music--noise continuum, as well (Cage, for

We explicitly tested our feature set and our classification
frameworks with a speech/music/noise/speech+music data set.  We
were hard-pressed to get much better than 50% accuracy (which
is still much better than chance).  In general, distinguishing
between "music" and "music + speech" seems very difficult.
Again, there are some difficulties of problem definition
(what if the speech is 10 dB down from the music?  60 dB down?).
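
To give a sense of what those levels mean, here is a small sketch that
converts "N dB down" into an amplitude gain and mixes speech under
music at that level.  The function names are hypothetical, and the
sketch assumes both signals are equal-length lists of samples already
normalized to comparable levels.

```python
def db_to_amplitude_ratio(db_down):
    """Amplitude gain for a signal that is db_down decibels below reference."""
    return 10.0 ** (-db_down / 20.0)


def mix(music, speech, speech_db_down):
    """Add speech to music with the speech attenuated by speech_db_down dB."""
    gain = db_to_amplitude_ratio(speech_db_down)
    return [m + gain * s for m, s in zip(music, speech)]
```

At 10 dB down the speech keeps roughly a third of its amplitude; at
60 dB down it keeps one thousandth, at which point calling the mixture
"music + speech" rather than "music" is largely a matter of definition.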

I've seen a reference to

  Spina and Zue, "Automatic transcription of general audio data:
    Preliminary analyses".  In Proc. IC on Spoken Lang Proc 1996

but haven't read the paper.  According to the citation (in
a forthcoming paper by J. Foote), they report 19.1% error on
a seven-way classification of clean speech/noisy speech/telephone

Best regards to all,

 -- Eric

|  Eric Scheirer  |A-7b5 D7b9|G-7 C7|Cb   C-7b5 F7#9|Bb  |B-7 E7|
|eds@media.mit.edu|      < http://sound.media.mit.edu/~eds >
|  617 253 0112   |A A/G# F#-7 F#-/E|Eb-7b5 D7b5|Db|C7b5 B7b5|Bb|