Re: speech/music (Neil Todd)

Subject: Re: speech/music
From:    Neil Todd  <TODD(at)FS4.PSY.MAN.AC.UK>
Date:    Tue, 31 Mar 1998 19:14:08 GMT

Hi Bruno/Malcolm/Sue/List

Repp wrote:
>Sue Johnson wrote:
>
>>I'm sure you must be able to detect the presence of speech independent of
>>being able to recognise it. If someone spoke to me in Finnish say, I would
>>be able to tell they were speaking (even in the presence of background
>>music/noise), even though I couldn't even segment the words, never mind
>>syntactically or semantically parse them.
>>I think there must be some way the brain splits up (deconvolves) the
>>signal before applying a speech recogniser.
>>(I have no proof of this of course, it's just a gut feeling)
>
> I am not sure the brain really deconvolves the signal completely.
>However, I agree that there must be a bottom-up way of recognizing the
>presence of speech in noise or music. One characteristic of speech that
>is not shared by music is the presence of smooth and fairly rapid
>changes in both fundamental frequency and formant frequencies. This is
>quite rare in music, which tends to proceed in stepwise changes. Therefore,
>some measure of the rate and/or continuity of spectral change should be
>relevant to detecting speech automatically. Another relevant feature is
>the amplitude envelope. Speech is organized syllabically and therefore
>alternates between periods of high and low amplitude at an average rate
>of about 4 Hz. Moreover, this alternation is not strictly periodic and
>often interrupted by pauses. Music tends to be more strictly periodic
>and has a much wider range of tempi than speech. Therefore, some measure
>of the distance and regularity of amplitude peaks in the signal would
>also seem to be a relevant measure.
>
> An interesting problem would be to try to automatically distinguish
>song from instrumental music. But perhaps the "easier" problem of separating
>music from unrelated speech should be tackled first (though not by me!).
>

I also agree that speech can be detected without recognition and that the rhythmic organisation is one of a number of important cues.
However, concentrating only on a narrow 4 Hz syllable band is too restrictive. For example, Chris Lee and I have recently been looking at cross-linguistic effects in the perception of speech rhythm. We obtained a set of English and French sentences spoken by native English and French speakers. The sentences were matched for number of syllables, number and location of stressed syllables, syntax and general meaning (loosely based on the sentences in Scott et al., 1985, but modified to avoid alliteration). E.g.

    Jerome et Marie ont pris le bus
    Jerome and Marie have caught the bus

The mean lengths of the sentences were very similar, so that on average they should yield similar tempo judgements if both were judged according to crude syllable or stressed-syllable rate. In order to test this we asked another set of native English and French speakers to judge how fast each utterance was spoken. We had predicted that syllable-timed French would be judged as faster than stress-timed English. However, the clear result was that the English native speakers judged French and English utterances as having the same mean tempo, whereas the French native speakers judged English to be faster.

One possible explanation for this is that listeners judge the rate of speech flow according to their metrical segmentation strategy (MSS, based on the mora, syllable or stress foot; see Cutler, 1996, for a review). On this account, an MSS based on the stress foot yields similar units in both French and English, whereas an MSS based on the syllable yields syllables in French but a mixture of syllabic and sub-syllabic events in English. Given the reality of stress in French and the extreme variability of the syllable in English (due to reduction and ambisyllabicity), this seems like a reasonable interpretation.

In order to test this further, Chris asked a set of native Italian speakers (who did not speak or comprehend either English or French) to judge the sentences.
The idea was that Italian is also supposed to be syllable-timed, and indeed he found that the native Italian speakers also judged English to be faster than French. This supports the view that syllable timers pick up on much of the sub-syllabic structure of English, which typically has rates of roughly 10-12 Hz.

We also noted that there are gender differences in both performance and perception. French women on average speak faster than French men. The typical syllable rate is about 6 Hz in our sample. Similarly, Italian women judge English to be faster than do Italian men, consistent with the sensory-motor theory that expected rates are partially informed by the dynamics of the motor production system, and on average, women have smaller jaws than men.

So, the moral of the story is that 4 Hz is very much a male Anglo-Saxon number.

Cheers

Neil
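
A minimal sketch of the point about the 4 Hz band, using synthetic amplitude envelopes rather than real speech. The modulation rates (4 Hz, about 6 Hz, and roughly 10-12 Hz) are taken from the discussion above. Everything else is an assumption made for illustration: the function names, the 50 Hz envelope sampling rate, the 3-5 Hz "narrow" band and the 2-16 Hz "wide" band. It is not the analysis used in the study described.

```python
import math


def modulation_power(envelope, fs, f_lo, f_hi):
    """Sum of squared DFT magnitudes of the mean-removed envelope
    over the bins whose frequency lies in [f_lo, f_hi] Hz."""
    n = len(envelope)
    mean = sum(envelope) / n
    x = [v - mean for v in envelope]
    power = 0.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n  # modulation frequency of bin k
        if f_lo <= f <= f_hi:
            re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
            im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
            power += re * re + im * im
    return power


def band_fraction(envelope, fs, f_lo, f_hi):
    """Fraction of the envelope's modulation power that falls in [f_lo, f_hi] Hz."""
    total = modulation_power(envelope, fs, 0.0, fs / 2)
    if total == 0:
        return 0.0
    return modulation_power(envelope, fs, f_lo, f_hi) / total


def toy_envelope(rates_hz, fs=50, seconds=4):
    """Hypothetical envelope: a sum of raised cosines at the given rates (Hz)."""
    n = int(fs * seconds)
    return [
        sum(1.0 + math.cos(2 * math.pi * r * i / fs) for r in rates_hz)
        for i in range(n)
    ]


if __name__ == "__main__":
    fs = 50
    slow = toy_envelope([4], fs)       # modulation at 4 Hz only
    fast = toy_envelope([6, 11], fs)   # 6 Hz plus an 11 Hz sub-syllabic rate
    for name, env in (("4 Hz", slow), ("6 + 11 Hz", fast)):
        narrow = band_fraction(env, fs, 3, 5)
        wide = band_fraction(env, fs, 2, 16)
        print(f"{name}: narrow 3-5 Hz = {narrow:.2f}, wide 2-16 Hz = {wide:.2f}")
```

On these toy signals, a detector that looks only at the narrow band sees the 4 Hz envelope but misses nearly all of the modulation in the faster one, while the wider band captures both. Real envelopes are not strictly periodic, so the fractions would be far less clear-cut.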

This message came from the mail archive
maintained by:
DAn Ellis
Electrical Engineering Dept., Columbia University