Sue Johnson wrote:

>I'm sure you must be able to detect the presence of speech independent of
>being able to recognise it. If someone spoke to me in Finnish say, I would
>be able to tell they were speaking (even in the presence of background
>music/noise), even though I couldn't even segment the words, never mind
>syntactically or semantically parse them.
>I think there must be some way the brain splits up (deconvolves) the
>signal before applying a speech recogniser.
>(I have no proof of this of course, it's just a gut feeling)

        I am not sure the brain really deconvolves the signal completely.
However, I agree that there must be a bottom-up way of recognizing the
presence of speech in noise or music. One characteristic of speech that
music rarely shares is the presence of smooth and fairly rapid
changes in both fundamental frequency and formant frequencies. Such glides
are quite rare in music, which tends to proceed in stepwise changes. Therefore,
some measure of the rate and/or continuity of spectral change should be
relevant to detecting speech automatically. Another relevant feature is
the amplitude envelope. Speech is organized syllabically and therefore
alternates between periods of high and low amplitude at an average rate
of about 4 Hz. Moreover, this alternation is not strictly periodic and
is often interrupted by pauses. Music tends to be more strictly periodic
and has a much wider range of tempi than speech. Therefore, some measure
of the spacing and regularity of amplitude peaks in the signal would
also seem to be relevant.
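
        To make the amplitude-envelope idea a little more concrete, here is
a rough sketch in Python (standard library only). It is not a tested speech
detector. The function names, the 20 ms frame length and the 0.5 peak
threshold are all my own assumptions, chosen only for illustration. It
computes a frame-wise RMS envelope, picks out the envelope peaks, and reports
the mean peak rate together with the coefficient of variation of the
inter-peak intervals as a crude regularity measure.

```python
import math
import statistics


def frame_rms(samples, rate, frame_s=0.02):
    """RMS amplitude envelope, one value per non-overlapping frame.

    frame_s (frame length in seconds) is an illustrative assumption.
    """
    n = max(1, round(rate * frame_s))
    return [
        math.sqrt(sum(x * x for x in samples[i:i + n]) / n)
        for i in range(0, len(samples) - n + 1, n)
    ]


def peak_times(envelope, frame_s=0.02, floor=0.5):
    """Start times (s) of frames that are local envelope maxima.

    Only maxima at or above floor * max(envelope) are kept; the floor
    value is an assumption, not an empirically tuned threshold.
    """
    if not envelope:
        return []
    thresh = floor * max(envelope)
    return [
        i * frame_s
        for i in range(1, len(envelope) - 1)
        if envelope[i] >= thresh
        and envelope[i] > envelope[i - 1]
        and envelope[i] >= envelope[i + 1]
    ]


def peak_stats(times):
    """Return (mean peak rate in Hz, coefficient of variation of intervals).

    Returns None when there are fewer than two intervals to compare.
    """
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        return None
    mean_gap = statistics.mean(gaps)
    return 1.0 / mean_gap, statistics.pstdev(gaps) / mean_gap
```

Given a list of samples and a sampling rate, one would call
peak_stats(peak_times(frame_rms(samples, rate))). On the argument above,
speech should give a mean rate somewhere near 4 Hz with a fairly large
coefficient of variation, while strictly metrical music should give a value
close to zero. Where the dividing line falls would have to be determined
from real recordings.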

        An interesting problem would be to try to automatically distinguish
song from instrumental music. But perhaps the "easier" problem of separating
music from unrelated speech should be tackled first (though not by me!).

