Re: MFCC method

Dear Arturo,

In your response to Dick Lyon you refer to the observation that the Mel Scale "approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum" and you make a reference to my frequency-position function of 1961, 1990, and 1991 as a potential substitute.

[Greenwood, D.D. (1961) Critical bandwidth and the frequency coordinates of the basilar membrane. J. Acoust. Soc. Am. 33, 1344-1356.

Greenwood, D.D. (1990) A cochlear frequency-position function for several species - 29 years later. J. Acoust. Soc. Am. 87, 2592-2605.

Greenwood, D.D. (1991) Critical bandwidth and consonance in relation to cochlear frequency-position coordinates, Hearing Res. 54, 165-208.]

As for abandoning the Mel Scale, I would agree - whatever replaces it. It has seemed to me since 1956 (for reasons appearing below) that there are good reasons not to use the Mel Scale at all any more. In 2006 I replied to that effect privately in a message responding to a post of Jim Beauchamp, at which time he suggested that my reply would make a good post to the list. Perhaps I should have done so then.

In any case, I do so now as an expedient alternative to composing another. Here (between the horizontal dashed lines) is Jim's message and my reply to him as of 2006. His message is indicated by marginal bars. My reply appears between and after them.
Date:         Wed, 17 May 2006 15:07:02 -0500
From: beaucham <beaucham@xxxxxxxxxxxxxxxxxxxxxx>
Subject: critical band vocoder
To: AUDITORY@xxxxxxxxxxxxxxx

For many years vocoders were used for data reduction of speech signals.
A vocoder separates the input signal into consecutive bands, .

Recently, mel-frequency cepstral coefficients have been popular for
speech recognition. Mel frequency spacing is approximately proportional
to critical-band frequency spacing.

Only approximately - and not closely proportional. Ironically, although Stevens stressed the purported proportionality, the more closely equal numbers of Mels had been proportional to either equal distances on Bekesy's map or to CB (the actual data is meant here - not Zwicker's CB curve, which differs in major respects from the CB data), the less justified would Stevens' other conclusion have been that equal pitch differences did NOT correspond to equal frequency ratios (which much annoyed musicians and John Pierce). An almost logarithmic map, as Bekesy's map obviously was, implies that equal distances correspond closely to equal frequency ratios except where the "almost" becomes relevant, i.e. below 400 to 500 Hz. But perhaps Stevens' "other conclusion" was mainly an unexamined hangover from the discarded and very different 1937 mel scale, obtained by a different method.

Furthermore, the mel scale's popularity has had little justification. There have been good reasons for not using the mel scale for many years. A major one was that it (the 1940 mel scale) was not replicated by Lewis (1942) nor checked by anyone else (so far as I know) until 1956 (at Stevens' behest). The full results of that 1956 check (not actually intended to be a check of the mel scale itself - though the results turned out that way) were published in Hearing Research in 1997:

Greenwood, D.D. (1997) The Mel Scale's disqualifying bias and a consistency of pitch-difference equisections in 1956 with equal cochlear distances and equal frequency ratios, Hearing Res. 103, 199-224.

A shorter paper of similar content was presented (and 'published') at the 1997 Fechner Society meeting in Poznan, Poland under the title: "THE MEL SCALE'S BIAS AND EQUAL PITCH-DIFFERENCES: IMPLICATIONS OF AN ALMOST LOGARITHMIC COCHLEA AND POSSIBLY SUBJECT-DEPENDENT CRITERIA".

[Stevens' reference to the 1956 methodological check (in a 1957 paper of his [Stevens, S.S. (1957) On the psychophysical law, Psychol. Rev., 64, 153-181]) was in relation to the distinction he conceived between different types of scales (metathetic vs prothetic) rather than to the further implication of the results in respect to the mel scale itself - which he may never have considered.]

My question is: Has anyone designed
and tested a vocoder using critical-band spacing of the filters?

Mine also. I suggested this for the sound spectrogram, and later vocoder, numerous times (starting in the 60s) to colleagues in speech, and to miscellaneous others, to no observed effect.

Cheers (and greetings to Jont if you see him),


[This greeting to Jont still applies.]

Part of the abstract of my 1997 Mel paper should provide the reasons Stevens wanted the experiment done and a brief statement of the outcomes.

"Abstract of first 1997 paper cited above
In 1956, Stevens "commissioned" an experiment to equisect a pitch difference between two tones. Results appear to reveal a methodological flaw that would invalidate the Mel Scale (Stevens and Volkmann, 1940). Stevens sought to distinguish sensory continua, e.g. loudness and pitch, on various criteria. He expected that the pitch continuum would not exhibit "hysteresis"; i.e., that subjects dividing a pitch difference (Δf) into equal-appearing parts would not set dividing frequencies higher when listening to notes in ascending order than in descending order. Seven subjects equisected a pitch difference, between tones of 400 and 7000 Hz, into equal-seeming parts by adjusting the frequencies of three intermediate tones. All seven exhibited hysteresis, contrary to expectation. This outcome bears on other issues: Years prior, Stevens suggested that equal pitch differences might correspond to equal cochlear distances, but not to equal frequency ratios nor to equal musical intervals (Stevens and Davis, 1938; Stevens and Volkmann, 1940). In 1960 (reported now), both the 1940 Mel scale and the equal-pitch differences of 1956 were compared to equal cochlear distances, using a frequency-position function that fitted Békésy's cochlear map (Greenwood, 1961; 1990). When ascending and descending settings were combined to contra-pose biases, equal pitch differences did coincide with equal distances - which the Mel Scale did not. Further, the biased ascending-order data coincided with the Mel scale, suggesting the Mel scale was similarly biased. Thus, the combined-order equal-pitch differences of 1956 - but not the Mel scale - are consistent with equal cochlear distances. But, since the map between 400 and 7000 Hz is nearly logarithmic, equal frequency ratios also approximate equal distances.
Ironically, above 400 Hz, Békésy's map and Stevens' equal-distance hypothesis jointly imply that musical intervals will nearly agree with equal pitch differences, which Stevens thought he had disconfirmed. But, given Békésy's map, only near the cochlear apex will equal distances not approximate equal frequency ratios; . . . "
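
As a rough illustration of the "nearly logarithmic" point in the abstract, the sketch below divides the 400 to 7000 Hz range into four equal cochlear distances (three intermediate frequencies, as in the 1956 task) and prints the frequency ratio across each part. It assumes the human parameters commonly quoted from Greenwood (1990), A = 165.4, a = 2.1, k = 0.88, with position x expressed as a proportion of cochlear length from the apex; the function names are illustrative only and are not taken from the papers.

```python
import math

# Assumed human parameters (Greenwood, 1990): F = A * (10**(a*x) - k),
# where x is the proportional distance from the cochlear apex (0..1).
A, a, k = 165.4, 2.1, 0.88

def position_to_hz(x):
    """Frequency (Hz) at proportional cochlear position x."""
    return A * (10 ** (a * x) - k)

def hz_to_position(f):
    """Proportional cochlear position for frequency f (Hz)."""
    return math.log10(f / A + k) / a

def equal_distance_points(f_lo, f_hi, n_parts):
    """Frequencies that split [f_lo, f_hi] into n_parts equal distances."""
    x_lo, x_hi = hz_to_position(f_lo), hz_to_position(f_hi)
    step = (x_hi - x_lo) / n_parts
    return [position_to_hz(x_lo + i * step) for i in range(n_parts + 1)]

# Four equal-distance parts between 400 and 7000 Hz.
points = equal_distance_points(400.0, 7000.0, 4)
ratios = [hi / lo for lo, hi in zip(points, points[1:])]

for lo, hi, r in zip(points, points[1:], ratios):
    print(f"{lo:8.1f} Hz -> {hi:8.1f} Hz   ratio {r:.3f}")
```

On a strictly logarithmic map all four ratios would be identical. Here they come out close to one another but not equal, with the largest departure in the lowest part, nearest 400 Hz, which is where the constant k starts to matter.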

I hope this belated 2006 "post" - and the 1997 paper - may be of interest.

- Donald

On 8 Jan, 2009, at 11:28 PM, Arturo Camacho wrote:

Dear Dick,

The Wikipedia page that you mention says that the Mel scale
"approximates the human auditory system's response more closely than
the linearly-spaced frequency bands used in the normal cepstrum." If
that means that the Mel scale approximates better the tonotopic
response of the cochlea than the linear scale, I wonder if it would
not be an even better idea to use the Greenwood function (see entry in
Wikipedia), which was explicitly created for that purpose. (Recall
that the Mel scale was designed to represent equidistant steps in
pitch, but that does not necessarily correspond to equidistant
tonotopic steps.)
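
A minimal sketch of what that substitution might look like in an MFCC-style front end: filter center frequencies spaced equally along the Greenwood map, next to centers spaced equally in mels. Everything here is an assumption for illustration: the Greenwood parameters are the human values usually quoted from the 1990 paper (A = 165.4, a = 2.1, k = 0.88), the mel conversion is the common 2595*log10(1 + f/700) formula used in speech software rather than the tabulated 1940 scale, and the 100 to 8000 Hz range and 20 filters are arbitrary choices.

```python
import math

def greenwood_hz(x, A=165.4, a=2.1, k=0.88):
    """Frequency (Hz) at proportional cochlear position x (0..1)."""
    return A * (10 ** (a * x) - k)

def greenwood_pos(f, A=165.4, a=2.1, k=0.88):
    """Proportional cochlear position for frequency f (Hz)."""
    return math.log10(f / A + k) / a

def hz_to_mel(f):
    """Common analytic mel approximation (an assumption, not the 1940 data)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def spaced_centers(f_lo, f_hi, n, to_scale, from_scale):
    """n center frequencies equally spaced on the given scale."""
    lo, hi = to_scale(f_lo), to_scale(f_hi)
    return [from_scale(lo + (hi - lo) * i / (n - 1)) for i in range(n)]

# Arbitrary illustrative range and filter count.
greenwood_centers = spaced_centers(100.0, 8000.0, 20, greenwood_pos, greenwood_hz)
mel_centers = spaced_centers(100.0, 8000.0, 20, hz_to_mel, mel_to_hz)

print(" idx  Greenwood (Hz)   mel formula (Hz)")
for i, (g, m) in enumerate(zip(greenwood_centers, mel_centers)):
    print(f"{i:4d}  {g:14.1f}  {m:17.1f}")
```

With the same end points, the two spacings differ mainly in where they turn from roughly linear to roughly logarithmic: the analytic mel formula bends around 700 Hz, while the Greenwood map with these parameters bends much lower, so it places more of its filters at low frequencies.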



On Thu, Jan 8, 2009 at 8:46 PM, Richard F. Lyon <DickLyon@xxxxxxx> wrote:
Thanks Malcolm; now that you've told us, it's in wikipedia:
Including the connection to earlier work by Pols; I can share
a copy of Plomp, Pols, and van de Geer (1967) on request.


At 2:07 PM -0800 1/7/09, Malcolm Slaney wrote:

On Jan 7, 2009, at 12:40 PM, James W. Beauchamp wrote:

I'm looking for a (the?) seminal article on the MFCC method of
coding spectral envelopes. It could be a journal paper or a chapter
in a book. Also, who was the first to publish on this idea?

These are the usual references, especially the 1980 paper.

P. Mermelstein, Distance measures for speech recognition, psychological and instrumental, in Pattern Recognition and Artificial Intelligence, C. H.
Chen, Ed., pp. 374-388. Academic, New York, 1976.

S.B. Davis, and P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28(4), 1980,
pp. 357-366.

But Mermelstein usually credits John Bridle's work for the idea
      JSRU Report No. 1003
      J. S. Bridle and M. D. Brown

I have copies of the early two if you need them.

- Malcolm


Arturo Camacho, PhD
Computer and Information Science and Engineering
University of Florida

E-mail: acamacho@xxxxxxxxxxxx
Web page: www.cise.ufl.edu/~acamacho