
Re: speech/music

Hi Bruno/Malcolm/Sue/List

Repp wrote:

>Sue Johnson wrote:
>>I'm sure you must be able to detect the presence of speech independent of
>>being able to recognise it. If someone spoke to me in Finnish say, I would
>>be able to tell they were speaking (even in the presence of background
>>music/noise), even though I couldn't even segment the words, never mind
>>syntactically or semantically parse them.
>>I think there must be some way the brain splits up (deconvolves) the
>>signal before applying a speech recogniser.
>>(I have no proof of this of course, it's just a gut feeling)
>        I am not sure the brain really deconvolves the signal completely.
>However, I agree that there must be a bottom-up way of recognizing the
>presence of speech in noise or music. One characteristic of speech that
>is not shared by music is the presence of smooth and fairly rapid
>changes in both fundamental frequency and formant frequencies. This is
>quite rare in music, which tends to proceed in stepwise changes. Therefore,
>some measure of the rate and/or continuity of spectral change should be
>relevant to detecting speech automatically. Another relevant feature is
>the amplitude envelope. Speech is organized syllabically and therefore
>alternates between periods of high and low amplitude at an average rate
>of about 4 Hz. Moreover, this alternation is not strictly periodic and
>often interrupted by pauses. Music tends to be more strictly periodic
>and has a much wider range of tempi than speech. Therefore, some measure
>of the distance and regularity of amplitude peaks in the signal would
>also seem to be a relevant measure.
>        An interesting problem would be to try to automatically distinguish
>song from instrumental music. But perhaps the "easier" problem of separating
>music from unrelated speech should be tackled first (though not by me!).

I also agree that speech can be detected without recognition and that the
organisation is one of a number of important cues. However, just
concentrating on a narrow 4 Hz syllable band is too restrictive. For example,
Chris Lee and I have recently been looking at cross-linguistic effects in the
perception of speech rhythm. We [...] a set of spoken English and French
sentences by native English and French speakers. The sentences were balanced
so that they were matched for number of syllables, number and location of
stressed syllables, syntax and general meaning (loosely based on the [...] in
Scott et al, 1985, but modified to avoid alliteration). E.g.

Jerome et Marie ont pris le bus

Jerome and Marie have caught the bus.

The mean lengths of the sentences were very similar, so that on average they
should yield similar tempo judgements if both were judged according to crude
syllable/stressed syllable rate. In order to test this we asked another set
of native English and French speakers to judge how fast each utterance was
spoken. We had predicted that syllable-timed French would be judged as faster
than stress-timed English. However, the clear result was that the English
native speakers judged French and English utterances as having the same mean
tempo, whereas the French native speakers judged English to be faster.

One possible explanation for this is that listeners judge the rate of speech
flow according to their metrical segmentation strategy (MSS, based on the
mora, syllable or stress foot; see Cutler, 1996, for review). Thus an MSS
based on the stress foot yields [...] units in both French and English, but
an MSS based on the syllable yields syllables in French but a mixture of
syllable and sub-syllabic events in English. Given the reality of stress in
French and the extreme variability of the syllable in English (due to [...]
and ambisyllabicity) this seems like a reasonable interpretation.

In order to test this further Chris asked a set of native Italian speakers
(who did not speak or comprehend either English or French) to judge the
sentences. The idea was that Italian is also supposed to be syllable-timed,
and indeed he found that the native Italian speakers also judged English to
be faster than French. This supports the view that syllable timers pick up on
much of the sub-syllabic structure of English, which typically has rates of
crudely 10-12 Hz. We also noted that there are gender differences in both
performance and perception. French women on average speak faster than French
men. The typical syllable rate is about 6 Hz in our sample. Similarly,
Italian women judge English to be faster than do Italian men, consistent with
the sensory-motor theory that expected rates are informed by the dynamics of
the motor production system, and on average, women have smaller jaws than
men.

So, the moral of the story is that 4 Hz is very much a male Anglo-Saxon number.
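
For anyone who wants to look at envelope rates over a wider band than 4 Hz,
here is a rough sketch (not the analysis we used). It takes a crude amplitude
envelope by rectifying and smoothing, then reports the strongest modulation
rate between 1 and 20 Hz. The function names, the 10 ms smoothing window, the
1-20 Hz band and the amplitude-modulated noise test signal are all
assumptions made for illustration.

```python
import numpy as np

def envelope_modulation_peak(signal, fs, lo=1.0, hi=20.0, smooth_ms=10.0):
    """Return the dominant amplitude-envelope modulation rate (Hz) in [lo, hi].

    Hypothetical helper: band limits and smoothing window are illustrative.
    """
    # Crude envelope: full-wave rectify, then smooth with a short moving average.
    win = max(1, int(fs * smooth_ms / 1000.0))
    env = np.convolve(np.abs(signal), np.ones(win) / win, mode="same")
    # Remove the mean so the DC term does not dominate the spectrum.
    env = env - env.mean()
    # Magnitude spectrum of the (Hann-windowed) envelope.
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    # Pick the strongest component inside the band of interest.
    band = (freqs >= lo) & (freqs <= hi)
    return float(freqs[band][np.argmax(spec[band])])

def am_noise(rate_hz, fs=8000, dur=4.0, seed=0):
    """Synthetic stand-in for speech: white noise amplitude-modulated at rate_hz."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    carrier = rng.standard_normal(t.size)
    modulator = 0.5 * (1.0 + np.sin(2.0 * np.pi * rate_hz * t))
    return carrier * modulator, fs

if __name__ == "__main__":
    for rate in (4.0, 6.0, 11.0):
        sig, fs = am_noise(rate)
        print(rate, envelope_modulation_peak(sig, fs))
```

With real speech the envelope spectrum is broad rather than a single peak, so
the energy distribution across the whole band is likely to be more
informative than the peak alone.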