TNO Inst. for Perception, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands
The energy at any location (f_i, t_i) in the speech spectrogram can be derived by convolving the speech signal with appropriate Gaussian-shaped cosine and sine waves. This yields the so-called pixel energy, whose integration window is centered at (f_i, t_i) and is Gaussian-shaped in both frequency and time. Analyzing the correlation between pairs of pixels at a given mutual distance in the spectrogram provides a parametric representation of speech that shows interesting characteristics. One such representation is the spectral correlation matrix: the correlation between pairs of energy pixels taken at the same instant at two different frequencies f_i and f_j, with f_i and f_j chosen from a range of values at appropriate intervals along the frequency scale. It is found that, for a given talker, this spectral correlation matrix stabilizes after analysis of a speech token of about 10 to 15 s: its features are then typical of speech as such and do not depend on the actual content of the speech token. Details of such spectral correlation matrices appear to be characteristic of the talker.
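The procedure described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation: the function names, the window width `sigma_t`, the truncation of the Gaussian at four standard deviations, and the use of the Pearson correlation coefficient are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def pixel_energy(signal, fs, freq, sigma_t):
    """Energy over time at one frequency, using Gaussian-windowed
    cosine and sine kernels (assumed window width: sigma_t seconds)."""
    # Truncate the Gaussian window at +/- 4 standard deviations (assumption).
    half = int(np.ceil(4 * sigma_t * fs))
    t = np.arange(-half, half + 1) / fs
    window = np.exp(-0.5 * (t / sigma_t) ** 2)
    window /= window.sum()
    cos_kernel = window * np.cos(2 * np.pi * freq * t)
    sin_kernel = window * np.sin(2 * np.pi * freq * t)
    # Convolve the signal with both kernels and sum the squared outputs.
    c = np.convolve(signal, cos_kernel, mode="same")
    s = np.convolve(signal, sin_kernel, mode="same")
    return c ** 2 + s ** 2

def spectral_correlation_matrix(signal, fs, freqs, sigma_t=0.005):
    """Correlation between pixel energies taken at the same instant
    at each pair of frequencies in freqs (Pearson coefficient assumed)."""
    energies = np.array([pixel_energy(signal, fs, f, sigma_t) for f in freqs])
    # Rows are frequencies, columns are time samples.
    return np.corrcoef(energies)
```

Called as `spectral_correlation_matrix(x, fs, freqs)` with a one-dimensional signal `x` sampled at `fs` Hz, this returns a symmetric matrix with one row and one column per entry of `freqs`. A Gaussian window of width `sigma_t` in time corresponds to a Gaussian of width 1/(2π·`sigma_t`) in frequency, so a single parameter fixes the resolution in both dimensions.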