Brian Karlsen: Re: speech/music
Dear List -
I'm forwarding this message on behalf of Brian Karlsen <firstname.lastname@example.org>
who is having problems with the listserver.
-- forwarded message
Date: Thu, 02 Apr 1998 09:30:45 +0200
From: Brian Lykkegaard Karlsen <email@example.com>
Organization: Aalborg University
Subject: Re: speech/music
To get back to Sue's original question, I have this late response:
Sue Johnson wrote:
> > Houtgast & Steeneken). Rather, I think a lot of our uncanny ability to
> > pick out speech from the most bizarrely-distorted signals comes from the
> > very highly sophisticated speech-pattern decoders we have, which employ all
> > of the many levels of constraints applicable to speech percepts (from
> > low-level pitch continuity through to high level semantic prior
> > likelihoods) to create and develop plausible speech hypotheses that can
> > account for portions of the perceived auditory scene.
> I have problems with this. (sorry)
> I'm sure you must be able to detect the presence of speech independent of
> being able to recognise it. If someone spoke to me in Finnish, say, I would
> be able to tell they were speaking (even in the presence of background
> music/noise), even though I couldn't even segment the words, never mind
> syntactically or semantically parse them.
> I think there must be some way the brain splits up (deconvolves) the
> signal before applying a speech recogniser.
> (I have no proof of this of course, it's just a gut feeling)
> I agree having a recogniser which would cope with speech would be the
> ideal solution, but there are problems in training appropriate models to
> recognise music you haven't seen before (the current HMM methods assume
> the training data represents in some way the same distribution as the test
> data), and, from a time standpoint, any removal of audio without relevant
> information content before recognition is helpful.
> I don't have the slightest idea of how the brain detects speech, but it
> would seem logical to me that it can do so on a very low-level acoustic
> basis. If this were true then, in theory, a front-end speech detector should
> be possible.
> I admit I know very little on this subject, so am looking forward to
> people correcting me.
> thanks for all your comments.
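As an aside, the kind of low-level front-end detector Sue has in mind could be sketched, purely for illustration, from two cheap frame-level features: short-time energy and zero-crossing rate. This is a hypothetical toy, not anything proposed on the list, and the thresholds are arbitrary assumptions; real speech/music discrimination needs far richer features.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=200):
    """Split a waveform into frames and compute two cheap acoustic
    features per frame: short-time energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # Each sign change contributes |diff| == 2, so divide by 2.
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
        feats.append((energy, zcr))
    return feats

def is_speech_like(feats, energy_thresh=1e-4, zcr_lo=0.01, zcr_hi=0.35):
    """Crude heuristic (thresholds are made up): voiced speech tends to
    pair audible energy with a low-to-moderate zero-crossing rate, while
    broadband noise has a high one. A clip 'passes' if most frames vote."""
    votes = sum(1 for e, z in feats if e > energy_thresh and zcr_lo < z < zcr_hi)
    return votes > len(feats) / 2

# Toy check at 8 kHz: a 100 Hz tone (voiced-like) vs. white noise.
sr = 8000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 100 * t)
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)
```

On this toy input the tone passes (moderate ZCR, high energy) and the noise does not (ZCR near 0.5), but of course such a detector says nothing about the /s/-versus-/u/ grouping problem discussed below.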
I think you're partially right about this. Of course it doesn't require
recognition to tell speech from other sounds, but I think the point Dan was
trying to make is that primitive processes (as referred to in Auditory Scene
Analysis) are not sufficient for making speech into a "stream" (also an ASA
term). Too many incoherent acoustic elements are involved - how, for example,
do we group an /s/ and a /u/ together? They have practically nothing in common
acoustically. So you need some kind of higher-order knowledge about the speech
signal to be able to segregate it from a mixture with an undesired sound. Of
course I'm not talking about conscious knowledge here. One way of thinking about it
is to picture a process hierarchy which is neither exclusively bottom-up nor
exclusively top-down. At the bottom you'll find the inner ear where mechanical
motion is transformed into neural spikes, and at the top you'll find some kind
of sound recognition engine. In between there will be all kinds of intermediate
levels which can interact with each other. At one of these levels individual
streams are separated from each other. This level is the one which is pertinent
to this discussion.
I have to say that many of these ideas are my interpretations of the ideas of
other people taken partially from Al Bregman's book on ASA and partially from
discussions with Phil Green, Martin Cooke and Guy Brown at Sheffield.
Center for Personkommunikation
-- end of forwarded message