>Sue Johnson wrote:
>>I'm sure you must be able to detect the presence of speech independent of
>>being able to recognise it. If someone spoke to me in Finnish say, I would
>>be able to tell they were speaking (even in the presence of background
>>music/noise), even though I couldn't even segment the words, never mind
>>syntactically or semantically parse them.
>>I think there must be some way the brain splits up (deconvolves) the
>>signal before applying a speech recogniser.
>>(I have no proof of this of course, it's just a gut feeling)
> I am not sure the brain really deconvolves the signal completely.
>However, I agree that there must be a bottom-up way of recognizing the
>presence of speech in noise or music. One characteristic of speech that
>is not shared by music is the presence of smooth and fairly rapid
>changes in both fundamental frequency and formant frequencies. This is
>quite rare in music, which tends to proceed in stepwise changes. Therefore,
>some measure of the rate and/or continuity of spectral change should be
>relevant to detecting speech automatically. Another relevant feature is
>the amplitude envelope. Speech is organized syllabically and therefore
>alternates between periods of high and low amplitude at an average rate
>of about 4 Hz. Moreover, this alternation is not strictly periodic and
>often interrupted by pauses. Music tends to be more strictly periodic
>and has a much wider range of tempi than speech. Therefore, some measure
>of the distance and regularity of amplitude peaks in the signal would
>also seem to be a relevant measure.
> An interesting problem would be to try to automatically distinguish
>song from instrumental music. But perhaps the "easier" problem of separating
>music from unrelated speech should be tackled first (though not by me!).
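
As a rough illustration of the amplitude-envelope cue described in the quoted message, the following sketch estimates the dominant modulation rate of a signal's envelope. It is only a minimal example under stated assumptions: the function names, the moving-average smoother, the 0.5-20 Hz search band and the synthetic 4 Hz test tone are all hypothetical choices made for illustration, not anything proposed in this thread. It measures only the rate of the amplitude alternation, not its regularity.

```python
import numpy as np

def amplitude_envelope(signal, fs, cutoff_hz=20.0):
    """Rectify the signal and smooth it with a moving average.

    The window is roughly 1/cutoff_hz seconds long (an assumed, crude
    stand-in for a proper low-pass filter).
    """
    win = max(1, int(round(fs / cutoff_hz)))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")

def dominant_modulation_hz(envelope, fs, fmin=0.5, fmax=20.0):
    """Return the frequency of the largest envelope-spectrum peak in [fmin, fmax]."""
    env = envelope - np.mean(envelope)           # remove the DC component
    windowed = env * np.hanning(len(env))        # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spectrum[band])]

# Synthetic check: a 200 Hz tone whose amplitude alternates at 4 Hz.
fs = 1000
t = np.arange(0, 4.0, 1.0 / fs)
carrier = np.sin(2 * np.pi * 200 * t)
modulator = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
test_signal = modulator * carrier

rate = dominant_modulation_hz(amplitude_envelope(test_signal, fs), fs)
print(f"dominant envelope modulation: {rate:.2f} Hz")
```

On real recordings a single spectral peak is unlikely to separate speech from music by itself; the spread of envelope energy around the peak would probably be at least as informative as its location.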
I also agree that speech can be detected without recognition and that the
organisation is one of a number of important cues. However, just concentrating on
a narrow 4 Hz syllable band is too restrictive. For example, Chris Lee and I have
been looking at cross-linguistic effects in the perception of speech rhythm. We
a set of spoken English and French sentences by native English and French speakers. The
sentences were balanced so that they were matched for number of syllables,
location of stressed syllables, syntax and general meaning (loosely based on the
in Scott et al, 1985, but modified to avoid alliteration). E.g.
Jerome et Marie ont pris le bus
Jerome and Marie have caught the bus.
The mean lengths of the sentences were very similar so that on average they
similar tempo judgements if both judged according to crude syllable/stressed
In order to test this we asked another set of native English and French speakers
how fast each utterance was spoken. We had predicted that syllable-timed French would be
judged as faster than stress-timed English. However, the clear result was that the English
native speakers judged French and English utterances as having the same mean
the French native speakers judged English to be faster.
One possible explanation for this is that listeners judge the rate of speech according
to their metrical segmentation strategy (MSS - based on the mora, syllable or stress foot)
(See Cutler, 1996, for review). So that an MSS based on the stress foot yields
units in both French and English, but an MSS based on the syllable yields
French but a mixture of syllable and sub-syllabic events in English. Given the
stress in French and the extreme variability of the syllable in English (due to
and ambisyllabicity) this seems like a reasonable interpretation.
In order to test this further Chris asked a set of native Italian speakers (who did not
speak or comprehend either English or French) to judge the sentences. The idea was that
Italian is also supposed to be syllable timed, and indeed he found that the Italian
speakers also judged English to be faster than French. This supports the view that syllable
timers pick up on much of the sub-syllabic structure of English, which typically
of crudely 10-12 Hz. We also noted that there are gender differences in both production and
perception. French women on average speak faster than French men. The typical
is about 6 Hz in our sample. Similarly, Italian women judge English to be faster than
Italian men, consistent with the sensory-motor theory that expected rates are
informed by the dynamics of the motor production system, and on average, women
jaws than men.
So, the moral of the story is that 4 Hz is very much a male Anglo-Saxon number.