Subject: Re: speech/music -> speech/singing
From: Dan Ellis <dpwe(at)ICSI.BERKELEY.EDU>
Date: Fri, 3 Apr 1998 20:33:02 PST
Bruno offers three low-level features to distinguish speech from music: speech has smoothly-varying pitch, smoothly-varying formant structure, and is not very strictly rhythmic; in contrast, music tends to have piecewise-constant 'pitch', spectral changes that are abrupt when they occur, and much stricter rhythm. Bruno also pointed to a fascinating example, singing, which made me think: how can you tell the difference between someone talking and someone singing? My informal observation is that you *can* do this pretty easily, and I suspect that of the three cues, it's the *pitch* variation (or lack of it) that's the most important factor (although the other two certainly apply). Two observations support this.
Observation (1) is that when someone happens to hold a voiced sound in speech *with a pitch stability that would be acceptable in music*, it sounds out of place after a very short time. (I guess that, with the coupling of vocal-fold vibration rate to sub-glottal pressure, it actually requires quite a lot of 'trimming' to keep pitch constant as a syllable trails off.) Without having done the investigation, I believe that if you look at the pitch-tracks even of long filled pauses in speech, then compare them to sung vowels, you'll find that pitch is held markedly more constant in 'musical' voice sounds.
Observation (2) is that when looking at speech spectrograms that occasionally have music mixed in, it is often immediately obvious where the music appears, owing to the very 'flat', time-extended, fixed-frequency harmonics of the music, which appear as long horizontal striations. (I'm thinking here of wideband spectrograms, where the harmonics of the speech are in fact rarely visible at all.) This suggests that you can look for music just by looking for the extended (isolated, i.e. high-f0) harmonics that show an unnatural stability in frequency. This is what Eric mentioned in his original reply, referring to Mike Hawley's work.
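To make the pitch-stability comparison in observation (1) concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a tested speech/singing discriminator: it assumes some pitch tracker has already produced a list of f0 values (in Hz) for the voiced frames, the function names are hypothetical, the 30-cent tolerance is an arbitrary choice, and the two pitch tracks are synthetic stand-ins rather than measured data.

```python
import math
import statistics


def hz_to_cents(f0_hz, ref_hz):
    """Interval from ref_hz to f0_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def pitch_spread_cents(f0_track_hz):
    """Standard deviation of a voiced pitch track, in cents around its median."""
    ref = statistics.median(f0_track_hz)
    return statistics.pstdev(hz_to_cents(f, ref) for f in f0_track_hz)


def longest_stable_run(f0_track_hz, tolerance_cents=30.0):
    """Length in frames of the longest stretch whose total pitch range
    stays within tolerance_cents (the default is an assumed value)."""
    best = 0
    start = 0
    for end in range(len(f0_track_hz)):
        # Shrink the window from the left until its range fits the tolerance.
        while True:
            window = f0_track_hz[start:end + 1]
            if hz_to_cents(max(window), min(window)) <= tolerance_cents:
                break
            start += 1
        best = max(best, end - start + 1)
    return best


# Synthetic stand-ins (assumptions, not measurements):
# a held "sung" note with a slight wobble, and a "spoken" falling glide.
sung = [220.0 * (1 + 0.002 * math.sin(i / 3.0)) for i in range(50)]
spoken = [220.0 - 1.2 * i for i in range(50)]

for name, track in (("sung", sung), ("spoken", spoken)):
    print(name, round(pitch_spread_cents(track), 1), longest_stable_run(track))
```

On these made-up tracks the held note stays within tolerance for its whole length, while the glide only manages a few frames at a time. Whether real filled pauses and sung vowels separate this cleanly is exactly the investigation that has not been done.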
Personally, I like the idea of having a multiple-pitch-sensitive (i.e. polyphonic) model of pitch perception, and looking for music's unnaturally-stable pitch-tracks in the output of that. That would also give you a domain to spot the converse, the fluctuating pitch-tracks that might form the bottom-up starting point of extracting and recognizing speech.
[All this to show that I'm not *opposed* to bottom-up mechanisms; it's just that I've had a personal revelation as to the importance of the top-down processes that act in combination with them, and I now feel the obligation of a zealot to make sure that the expectation-based mechanisms are given their due consideration in any debate. But Malcolm has that covered in this instance ;-)]
DAn.