Dept. of Comput. and Inf. Sci., Ohio State Univ., 978 Afton Rd., Columbus, OH 43221
The suitability of the Patterson and Holdsworth auditory model as a front end for speech recognition was studied by measuring the ability of neural networks trained on input from the P&H model to distinguish between the plosive consonants, using networks trained on fast Fourier transform (FFT) input as a control. Numerous experiments were performed to control for speaker dependence, network weight and architecture bias, utterance ordering, and other variables. The results indicate that the P&H model provides an advantage over a plain FFT. In every experiment, networks trained on the P&H model categorized test stimuli more accurately than did FFT-trained networks, while maintaining comparable training times. A typical experiment gave a result of 66% correct stimulus response for the P&H-trained networks versus 12% for the FFT-trained networks. Moreover, the errors made by the P&H-trained networks were ``good'' ones, such as mistaking /b/ for /p/, indicating that learning had taken place. The Patterson and Holdsworth model processes the audio signal through (among other steps) adaptive thresholding and, especially, triggered temporal integration of the signal. The results of this experiment may be useful in focusing future research on the acoustic features made prominent by the P&H model and their importance to speech.
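To illustrate the adaptive thresholding and triggered temporal integration steps mentioned above, the following is a minimal sketch in Python. It is not the P&H implementation; the frame length, decay rate, and decaying-peak threshold rule are illustrative assumptions. A strobe fires when the signal exceeds a threshold that decays from the last peak, and the frame following each strobe is averaged into a running integrated frame.

```python
import numpy as np

def triggered_temporal_integration(x, frame_len=64, decay=0.99):
    """Hypothetical sketch: strobe-triggered temporal integration.

    When the signal exceeds an adaptive (decaying-peak) threshold,
    the frame that follows is accumulated into a running average.
    Parameter values are illustrative, not taken from the P&H model.
    """
    threshold = 0.0                      # adaptive threshold state
    image = np.zeros(frame_len)          # accumulated (integrated) frame
    n_strobes = 0
    i = 0
    while i < len(x) - frame_len:
        threshold *= decay               # threshold decays between strobes
        if x[i] > threshold:             # strobe: sample exceeds threshold
            threshold = x[i]             # reset threshold to the new peak
            image += x[i:i + frame_len]  # integrate the triggered frame
            n_strobes += 1
            i += frame_len               # skip past the integrated frame
        else:
            i += 1
    return image / max(n_strobes, 1), n_strobes
```

For a periodic input such as a voiced speech segment, the strobes tend to align with the signal's repeating peaks, so the averaged frame stabilizes into a representation of one period, which is the intuition behind why such integration can make phonetically relevant features more prominent.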