Speech detection

I have run an fMRI experiment involving overt responses, which we recorded using equipment provided by opti-MRI that includes online noise cancelling. Response time is very important for our analysis, and I was wondering whether there is any literature on an optimal way to detect the onset and duration of the responses from the audio signal. Because of the nature of fMRI experiments, the recording is pretty noisy (with random spikes appearing here and there). Any help would be really appreciated.
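
To make the question concrete, here is a minimal sketch of the kind of naive baseline I am hoping to improve on: short-time energy thresholding with a minimum-duration rule to reject isolated spikes. It is not taken from any paper and has not been validated on our data. The function name `detect_responses` and all parameter values (10 ms frames, a threshold of 4 times the median frame energy, a 50 ms minimum duration) are assumptions for illustration, and the median-based noise floor assumes that speech takes up less than half of the recording.

```python
import math
from statistics import median


def detect_responses(samples, rate, frame_ms=10, k=4.0, min_ms=50):
    """Return a list of (onset_s, duration_s) for segments that look like speech.

    samples: sequence of floats (mono audio), rate: sampling rate in Hz.
    All defaults are illustrative guesses, not recommended values.
    """
    frame = max(1, int(rate * frame_ms / 1000))  # samples per frame
    n_frames = len(samples) // frame
    if n_frames == 0:
        return []

    # Short-time RMS energy, one value per non-overlapping frame.
    rms = []
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        rms.append(math.sqrt(sum(x * x for x in chunk) / frame))

    # Noise floor taken as the median frame energy (assumes mostly silence).
    threshold = k * median(rms)
    min_frames = math.ceil(min_ms / frame_ms)

    events = []
    start = None
    # The trailing 0.0 closes a segment that runs to the end of the recording.
    for i, value in enumerate(rms + [0.0]):
        active = value > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            # Runs shorter than min_frames are treated as spikes and dropped.
            if i - start >= min_frames:
                events.append((start * frame / rate, (i - start) * frame / rate))
            start = None
    return events
```

Called as `detect_responses(samples, rate)`, this gives onsets and durations in seconds with a resolution of one frame, which is probably too coarse for response-time work and is part of why I am looking for something better.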

