ATR Human Information Processing Res. Labs., 2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
An auditory model front-end that reflects spectrotemporal masking characteristics is proposed. The model yields excellent performance in a multi-speaker word recognition system using a cochlear filter. Recent auditory perception research shows that the forward masking pattern spreads more widely over the frequency axis as the masker-signal interval increases [E. Miyasaka, J. Acoust. Soc. Jpn. 39, 614--623 (1983)]. This spectrotemporal masking characteristic appears effective both for eliminating the speaker-dependent spectral tilt that reflects individual source variation and for enhancing the spectral dynamics that convey phonological information in speech signals. The spectrotemporal masking characteristic is modeled and applied to a multi-speaker word recognition system. The current masking level is calculated as a weighted sum of the smoothed preceding spectra; the weights become smaller and the smoothing window on the frequency axis becomes wider as the masker-signal interval increases. Power spectra are extracted using a 64-channel fixed-Q cochlear filter (FQF), which covers the frequency range from 1.5 to 18.5 Bark. The masked spectrum for the current frame is obtained by subtracting the masking level from the current spectrum. Recognition experiments on 216 phonetically balanced Japanese words uttered by 10 male speakers demonstrate that introducing the spectrotemporal masking model improves recognition performance in the multi-speaker word recognition system.
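The masking computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exponential weight decay, the linear growth of the smoothing half-width with the masker-signal interval, and all numerical parameters (`n_past`, `decay`, `base_width`) are assumptions; the abstract specifies only that weights shrink and windows widen as the interval increases.

```python
import numpy as np

def masked_spectrum(spectra, n_past=5, decay=0.5, base_width=1):
    """Illustrative sketch of the spectrotemporal masking model.

    spectra : array of shape (T, n_channels), nonnegative power spectra
              over time (e.g. from a 64-channel cochlear filter bank).
    Returns the current frame with the accumulated masking level
    subtracted. Parameter values here are hypothetical.
    """
    current = spectra[-1]
    masking = np.zeros_like(current)
    # k is the masker-signal interval in frames (1 = immediately preceding)
    for k in range(1, min(n_past, len(spectra) - 1) + 1):
        past = spectra[-1 - k]
        # Smoothing window on the frequency axis widens with interval k
        # (assumed linear growth of the half-width, in channels).
        width = base_width + k
        kernel = np.ones(2 * width + 1) / (2 * width + 1)
        smoothed = np.convolve(past, kernel, mode="same")
        # Weight decreases as the masker-signal interval increases
        # (assumed exponential decay).
        masking += (decay ** k) * smoothed
    return current - masking
```

Because every weight and kernel entry is nonnegative, the subtracted masking level is nonnegative whenever the input spectra are, so the masked spectrum never exceeds the current spectrum in any channel.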