Re: speech/music characteristics (Eric Scheirer)

Subject: Re: speech/music characteristics
From:    Eric Scheirer  <eds(at)>
Date:    Thu, 26 Mar 1998 09:32:30 -0500

Sue Johnson wrote:
>
> Hi!
> I'm working in speech recognition, and am trying to be able to distinguish
> between speech and non-speech (especially music) sounds in an audio track.
> I wondered if anyone had any ideas (for example from a speech/music
> perception point of view) of the things that characterise music and
> speech. For example, is the periodicity important, or is it to do with
> continuity?

You might have a look at Scheirer and Slaney, "Construction and
Evaluation of a Robust Multifeature Speech/Music Discriminator",
Proc. IEEE ICASSP 1997. (It's also on Malcolm's WWW page at )
While the research was pursued from an engineering viewpoint rather
than a perceptual/scientific viewpoint, some of the results could be
treated as perceptual hypotheses. We tested 13 features and four
classification frameworks; we got about 1.4% error in the best
condition over a broad database (which is available for comparative
testing).

As Prof. Todd has suggested, rhythmicity is a useful feature; so,
especially, are features describing the modulation rate of the signal.
In speech, the energy and spectral centroid both bounce around a lot
with the rapid formant and voicing changes; music is usually more
static in this regard. The most promising feature I know of that we
didn't test was reported in Mike Hawley's unpublished dissertation --
he used a measure of the "flatness" of harmonic partials to identify
music.

> How do we know when something is music and something is just noise?
> How does the brain recognise music, how can you recognise both music and
> speech if they are played at the same time..

In both the speech/music case and the music/noise case, there's a lot
of philosophical gray area. For example, "spoken word poetry" is often
unaccompanied and usually not rhythmic, but is performed by musicians
and sometimes filed under music in record shops. I think this is a
case that is not particularly amenable to pattern-recognition-style
analysis.
There are many cases like this along the music--noise continuum as
well (Cage, for example). We explicitly tested our feature set and our
classification frameworks with a speech/music/noise/speech+music data
set. We were hard-pressed to get much better than 50% results (which
is still much better than chance). In general, distinguishing between
"music" and "music + speech" seems very difficult. Again, there are
some difficulties of problem definition here (what if the speech is
10 dB down from the music? 60 dB down?).

I've seen a reference to Spina and Zue, "Automatic transcription of
general audio data: Preliminary analyses", Proc. Int. Conf. on Spoken
Language Processing, 1996, but haven't read the paper. According to
the citation (in a forthcoming paper by J. Foote), they report 19.1%
error on a seven-way classification of clean speech/noisy
speech/telephone speech/silence/noise/music/music+speech.

Best regards to all,

-- Eric

+-----------------+
|  Eric Scheirer  |A-7b5 D7b9|G-7 C7|Cb C-7b5 F7#9|Bb   |B-7 E7|
|  eds(at)        | <        >                                  |
|  617 253 0112   |A A/G# F#-7 F#-/E|Eb-7b5 D7b5|Db|C7b5 B7b5|Bb|
+-----------------+

This message came from the mail archive
maintained by:
DAn Ellis <>
Electrical Engineering Dept., Columbia University