CAIP Ctr., Rutgers Univ., Piscataway, NJ 08855-1390
This paper describes algorithms for automatic end-point detection of microphone-array speech signals. Microphone arrays provide hands-free sound pickup. The captured sound typically has a higher signal-to-noise ratio (SNR) than sound captured with conventional microphones used at a distance, as in teleconferencing environments. However, because of multipath distortion (room reverberation) and ambient noise, detecting the starting and ending points of array speech is more difficult than detecting those of close-talking speech. In this paper, short-time energy and short-time zero-crossing rate are computed for the original speech waveform and for its high-pass filtered and low-pass filtered versions. These six functions are then used in different combinations to determine the end points. Speech data used in the experiments were collected with a one-dimensional beamforming line array in a hard-walled laboratory room having a reverberation time of approximately 0.5 s. The experiments show that the high-pass filtered signal gives a more reliable estimate of the end points than its low-pass filtered counterpart. This result is consistent with the fact that reverberation and noise in rooms are typically more prominent at low frequencies and relatively moderate at mid and high frequencies. The detection algorithms have been integrated into a dynamic-time-warping (DTW) based speech recognizer. Recognition performance of the system is evaluated for both array speech and close-talking speech.
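The six functions named above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the one-pole low-pass filter and its complementary high-pass, the frame length and hop size, the smoothing constant `alpha`, and the simple relative-energy threshold in `detect_endpoints`. The abstract does not say which filters, frame parameters, or combination rules the authors used.

```python
def frames(x, frame_len, hop):
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(x, frame_len, hop):
    """Mean squared amplitude of each frame."""
    return [sum(s * s for s in f) / frame_len for f in frames(x, frame_len, hop)]

def short_time_zcr(x, frame_len, hop):
    """Fraction of adjacent sample pairs in each frame that change sign."""
    rates = []
    for f in frames(x, frame_len, hop):
        crossings = sum(1 for a, b in zip(f, f[1:]) if (a >= 0) != (b >= 0))
        rates.append(crossings / (frame_len - 1))
    return rates

def low_pass(x, alpha):
    """One-pole low-pass (assumed filter): y[n] = alpha*x[n] + (1-alpha)*y[n-1]."""
    y, prev = [], 0.0
    for s in x:
        prev = alpha * s + (1 - alpha) * prev
        y.append(prev)
    return y

def high_pass(x, alpha):
    """Complementary high-pass (assumed filter): input minus its low-pass version."""
    return [s - l for s, l in zip(x, low_pass(x, alpha))]

def six_features(x, frame_len=256, hop=128, alpha=0.1):
    """Energy and zero-crossing rate for the original, high-pass, and low-pass signals."""
    versions = {"orig": list(x), "hp": high_pass(x, alpha), "lp": low_pass(x, alpha)}
    feats = {}
    for name, v in versions.items():
        feats[name + "_energy"] = short_time_energy(v, frame_len, hop)
        feats[name + "_zcr"] = short_time_zcr(v, frame_len, hop)
    return feats

def detect_endpoints(energy, ratio=0.1):
    """Hypothetical rule: first and last frame whose energy exceeds ratio * max energy."""
    if not energy:
        return None
    thresh = ratio * max(energy)
    active = [i for i, e in enumerate(energy) if e > thresh]
    return (active[0], active[-1]) if active else None
```

Calling `detect_endpoints(six_features(signal)["hp_energy"])` returns a pair of frame indices based on the high-pass energy contour alone. A detector closer to the one described would combine several of the six contours, but the abstract leaves those combinations unspecified, so none is guessed at here.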