ATR Human Information Process. Res. Labs., 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-02 Japan
This paper proposes an onset-sensitive time-frequency masking mechanism to improve dynamic feature extraction. Application of the proposed mechanism to Japanese 23-phoneme recognition using hidden Markov models demonstrated that the onset-sensitive masked spectrum (MASP) outperforms the time-invariant MASP. MASP [Aikawa et al., Proc. ICASSP93 II, 668--671 (1993)] is a new spectral representation incorporating time-frequency forward masking and has been reported to provide excellent performance when used for speaker-dependent and speaker-independent speech recognition. The masking pattern production mechanism was previously modeled by a time-invariant time-frequency filter, but the masking level rises at the onsets and offsets in a speech sound [T. Hirahara, J. Acoust. Soc. Jpn. E12 (2), 57--68 (1991); E. Miyasaka, J. Acoust. Soc. Jpn. 39 (9), 614--623 (1983)]. This phenomenon suggests that an adaptive masking mechanism is effective for balancing instantaneous and transitional spectral features depending on whether the sound is a vowel or a consonant. The masking pattern is calculated as the weighted sum of the smoothed preceding spectra obtained by time--distance-dependent spectral smoothing lifters. The masking level is controlled by the slope of the temporal contour of the instantaneous sound energy. The masked spectrum is obtained by subtracting the masking pattern from the current spectrum. Onset--offset-sensitive masking models are also examined.
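The masking computation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each frame is a vector of cepstral coefficients (so subtraction takes place in the log-spectral domain), that the smoothing lifters are Gaussian windows over the coefficient index that narrow with time distance, and that the masking level grows linearly with the positive slope of the frame energy. All names and parameter values (`gaussian_lifter`, `onset_gain`, `masked_cepstra`, `alpha`, `decay`, `base_width`, `sensitivity`) are hypothetical.

```python
import math


def gaussian_lifter(num_coeffs, width):
    """Hypothetical smoothing lifter: Gaussian weights over the coefficient index.

    A smaller width attenuates higher-order coefficients more strongly,
    which corresponds to heavier spectral smoothing.
    """
    return [math.exp(-(k * k) / (2.0 * width * width)) for k in range(num_coeffs)]


def onset_gain(energy, t, sensitivity=1.0):
    """Masking level as a function of the energy-contour slope (assumed form).

    Returns 1.0 for a flat or falling contour and grows linearly with a
    rising contour, i.e. at onsets.
    """
    if t == 0:
        return 1.0
    slope = energy[t] - energy[t - 1]
    return 1.0 + sensitivity * max(slope, 0.0)


def masked_cepstra(cepstra, energy, num_prev=3, alpha=0.3, decay=0.7,
                   base_width=8.0, sensitivity=1.0):
    """Subtract a masking pattern from each frame.

    The masking pattern is the weighted sum of the liftered (smoothed)
    preceding frames; the lifter width depends on the time distance d.
    """
    num_coeffs = len(cepstra[0])
    masked = []
    for t, current in enumerate(cepstra):
        gain = onset_gain(energy, t, sensitivity)
        pattern = [0.0] * num_coeffs
        for d in range(1, num_prev + 1):
            if t - d < 0:
                break  # no earlier frames available
            lifter = gaussian_lifter(num_coeffs, base_width / d)
            weight = gain * alpha * decay ** (d - 1)
            for k in range(num_coeffs):
                pattern[k] += weight * lifter[k] * cepstra[t - d][k]
        masked.append([c - m for c, m in zip(current, pattern)])
    return masked
```

With `sensitivity=0.0` the gain is constant and the sketch reduces to a time-invariant masking filter. With a positive `sensitivity`, frames at a rising energy contour have a larger masking pattern subtracted, which is the onset-sensitive behaviour the text describes. An onset--offset-sensitive variant could, under the same assumptions, use the absolute value of the slope instead of its positive part.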