Re: MFCC method
In your response to Dick Lyon you refer to the observation that the
Mel Scale "approximates the human auditory system's response more
closely than the linearly-spaced frequency bands used in the normal
cepstrum" and you make a reference to my frequency-position function
of 1961, 1990, and 1991 as a potential substitute.
[Greenwood, D.D. (1961) Critical bandwidth and the frequency
coordinates of the basilar membrane. J. Acoust. Soc. Am. 33,
Greenwood, D.D. (1990) A cochlear frequency-position function for
several species - 29 years later. J. Acoust. Soc. Am. 87, 2592-2605.
Greenwood, D.D. (1991) Critical bandwidth and consonance in relation
to cochlear frequency-position coordinates, Hearing Res. 54, 165-208.]
As for abandoning the Mel Scale, I would agree - whatever replaces
it. It has seemed to me since 1956 (for reasons appearing below) that
there are good reasons not to use the Mel Scale at all any more. In
2006 I replied to that effect privately in a message responding to a
post of Jim Beauchamp, at which time he suggested that my reply would
make a good post to the list. Perhaps I should have done so then.
In any case, I do so now as an expedient alternative to composing
another. Here (between the horizontal dashed lines) is Jim's message
and my reply to him as of 2006. His message is indicated by marginal
bars. My reply appears between and after them.
Date: Wed, 17 May 2006 15:07:02 -0500
From: beaucham <beaucham@xxxxxxxxxxxxxxxxxxxxxx>
Subject: critical band vocoder
For many years vocoders were used for data reduction of speech
A vocoder separates the input signal into consecutive bands, .
Recently, mel-frequency cepstral coefficients have been popular for
speech recognition. Mel frequency spacing is approximately proportional
to critical-band frequency spacing.
Only approximately - and not closely proportional. Ironically,
although Stevens stressed the purported proportionality, if equal
numbers of Mels had been any more closely proportional to either equal
distances on Bekesy's map or to CB (the actual data is meant here -
not Zwicker's CB curve, which differs in major respects from the CB
data), the less justified would Stevens' other conclusion have been
that equal pitch differences did NOT correspond to equal frequency
ratios (which much annoyed musicians and John Pierce). An almost
logarithmic map, as Bekesy's map obviously was, implies that equal
distances correspond closely to equal frequency ratios except where
the "almost" becomes relevant, i.e. below 400 to 500 Hz. But perhaps
Stevens' "other conclusion" was mainly an unexamined hangover from the
discarded and very different 1937 mel scale, obtained by a different
Furthermore, the mel scale's popularity has had little justification.
There have been good reasons for not using the mel scale for many
years. A major one was that it (the 1940 mel scale) was not
replicated by Lewis (1942) nor checked by anyone else (so far as I
know) until 1956 (at Stevens' behest). The full results of that 1956
check (not actually intended to be a check of the mel scale itself -
though the results turned out that way) were published in Hearing
Research in 1997:
Greenwood, D.D. (1997) The Mel Scale's disqualifying bias and a
consistency of pitch-difference equisections in 1956 with equal
cochlear distances and equal frequency ratios, Hearing Res. 103,
A shorter paper of similar content was presented (and 'published') at
the 1997 Fechner Society meeting in Poznan, Poland under the title: "THE
MEL SCALE'S BIAS AND EQUAL PITCH-DIFFERENCES: IMPLICATIONS OF AN
ALMOST LOGARITHMIC COCHLEA AND POSSIBLY SUBJECT-DEPENDENT CRITERIA".
[Stevens' reference to the 1956 methodological check (in a 1957 paper
of his [Stevens, S.S. (1957) On the psychophysical law, Psychol. Rev.,
64, 153-181]) was in relation to the distinction he conceived between
different types of scales (metathetic vs prothetic) rather than to the
further implication of the results in respect to the mel scale itself
- which he may never have considered.]
My question is: Has anyone designed
and tested a vocoder using critical-band spacing of the filters?
Mine also. I suggested this for the sound spectrogram, and later
vocoder, numerous times (starting in the 60s) to colleagues in speech,
and to miscellaneous others, to no observed effect.
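As an illustration of what critical-band-like filter spacing could look like, here is a minimal sketch (not part of the original messages) that places band edges at equal distances along the cochlear map. It assumes the human parameters of the 1990 frequency-position function, F = A(10^(ax) - k) with A = 165.4, a = 2.1 (x taken as proportional distance from the apex), and k = 0.88. The function names, the 100-8000 Hz range, and the choice of 16 bands are hypothetical.

```python
import math

# Human parameters of the frequency-position function (Greenwood, 1990):
# F = A * (10**(a*x) - K), x = proportional distance from the apex (0..1).
A, a, K = 165.4, 2.1, 0.88

def position_to_hz(x):
    """Frequency (Hz) at proportional cochlear position x."""
    return A * (10.0 ** (a * x) - K)

def hz_to_position(f):
    """Proportional cochlear position for frequency f (Hz); inverse of the above."""
    return math.log10(f / A + K) / a

def band_edges(f_lo, f_hi, n_bands):
    """Edges of n_bands bands that span equal cochlear distances (assumed layout)."""
    x_lo, x_hi = hz_to_position(f_lo), hz_to_position(f_hi)
    step = (x_hi - x_lo) / n_bands
    return [position_to_hz(x_lo + i * step) for i in range(n_bands + 1)]

# Hypothetical example: 16 bands between 100 Hz and 8 kHz.
edges = band_edges(100.0, 8000.0, 16)
for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} - {hi:8.1f} Hz   ratio {hi / lo:.3f}")
```

The printed ratio of upper to lower edge settles toward a constant in the higher bands and grows in the lowest ones, which is the "almost" in an almost logarithmic map.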
Cheers (and greetings to Jont if you see him),
[This greeting to Jont still applies.]
A Part of the Abstract of my 1997 Mel paper should provide the reasons
Stevens wanted the experiment done and a brief statement of the
"Abstract of first 1997 paper cited above
In 1956, Stevens "commissioned" an experiment to equisect a pitch
difference between two tones. Results appear to reveal a
methodological flaw that would invalidate the Mel Scale (Stevens and
Volkmann, 1940). Stevens sought to distinguish sensory continua, e.g.
loudness and pitch, on various criteria. He expected that the pitch
continuum would not exhibit "hysteresis"; i.e., that subjects dividing
a pitch difference (Δf) into equal-appearing parts would not set
dividing frequencies higher when listening to notes in ascending order
than in descending order. Seven subjects equisected a pitch
difference, between tones of 400 and 7000 Hz, into equal-seeming parts
by adjusting the frequencies of three intermediate tones. All seven
exhibited hysteresis, contrary to expectation. This outcome bears on
other issues: Years prior, Stevens suggested that equal pitch
differences might correspond to equal cochlear distances, but not to
equal frequency ratios nor to equal musical intervals (Stevens and
Davis, 1938; Stevens and Volkmann, 1940). In 1960 (reported now), both
the 1940 Mel scale and the equal-pitch differences of 1956 were
compared to equal cochlear distances, using a frequency-position
function that fitted Békésy's cochlear map (Greenwood, 1961; 1990).
When ascending and descending settings were combined to contra-pose
biases, equal pitch differences did coincide with equal distances -
which the Mel Scale did not. Further, the biased ascending-order data
coincided with the Mel scale, suggesting the Mel scale was similarly
biased. Thus, the combined-order equal-pitch differences of 1956 -
but not the Mel scale - are consistent with equal cochlear distances.
But, since the map between 400 and 7000 Hz is nearly logarithmic,
equal frequency ratios also approximate equal distances. Ironically,
above 400 Hz, Békésy's map and Stevens' equal-distance hypothesis
jointly imply that musical intervals will nearly agree with equal
pitch differences, which Stevens thought he had disconfirmed. But,
given Békésy's map, only near the cochlear apex will equal distances
not approximate equal frequency ratios; . . . "
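The relation between equal distances and equal frequency ratios over the 400 to 7000 Hz range can be checked numerically. The sketch below (an editorial illustration, not taken from the 1997 paper) divides that range into four parts in two ways: by equal cochlear distance, assuming the human parameters of the frequency-position function (A = 165.4, a = 2.1, k = 0.88, x as proportional distance from the apex), and by equal frequency ratio. The function names are hypothetical, and no Mel values are computed because the 1940 scale was tabulated rather than given by a formula in this thread.

```python
import math

# Human parameters of the frequency-position function (Greenwood, 1990).
A, a, K = 165.4, 2.1, 0.88

def position_to_hz(x):
    return A * (10.0 ** (a * x) - K)

def hz_to_position(f):
    return math.log10(f / A + K) / a

def equisect_by_distance(f_lo, f_hi, parts):
    """Dividing frequencies that cut the interval into equal cochlear distances."""
    x_lo, x_hi = hz_to_position(f_lo), hz_to_position(f_hi)
    return [position_to_hz(x_lo + (x_hi - x_lo) * i / parts)
            for i in range(parts + 1)]

def equisect_by_ratio(f_lo, f_hi, parts):
    """Dividing frequencies that cut the interval into equal frequency ratios."""
    return [f_lo * (f_hi / f_lo) ** (i / parts) for i in range(parts + 1)]

# Four equal parts, i.e. three intermediate tones, as in the 1956 experiment.
by_distance = equisect_by_distance(400.0, 7000.0, 4)
by_ratio = equisect_by_ratio(400.0, 7000.0, 4)
for d, r in zip(by_distance, by_ratio):
    print(f"equal distance {d:7.1f} Hz   equal ratio {r:7.1f} Hz")
```

Under these assumptions the two sets of dividing frequencies stay within roughly ten percent of each other, with the equal-distance settings somewhat higher. Lowering the bottom tone well below 400 Hz widens the gap.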
I hope this belated 2006 "post" - and the 1997 paper - may be of
On 8 Jan, 2009, at 11:28 PM, Arturo Camacho wrote:
The Wikipedia page that you mention says that the Mel scale
"approximates the human auditory system's response more closely than
the linearly-spaced frequency bands used in the normal cepstrum." If
that means that the Mel scale approximates better the tonotopic
response of the cochlea than the linear scale, I wonder if it would
not be an even better idea to use the Greenwood function (see entry in
Wikipedia), which was explicitly created with that purpose. (Recall
that the Mel scale was designed to represent equidistant steps in
pitch, but that does not necessarily correspond with equidistant
On Thu, Jan 8, 2009 at 8:46 PM, Richard F. Lyon <DickLyon@xxxxxxx>
Thanks Malcolm; now that you've told us, it's in wikipedia:
Including the connection to earlier work by Pols; I can share
a copy of Plomp, Pols, and van de Geer (1967) on request.
At 2:07 PM -0800 1/7/09, Malcolm Slaney wrote:
On Jan 7, 2009, at 12:40 PM, James W. Beauchamp wrote:
I'm looking for a (the?) seminal article on the MFCC method of
coding spectral envelopes. It could be a journal paper or a chapter
in a book. Also, who was the first to publish on this idea?
These are the usual references, especially the 1980 paper.
P. Mermelstein, Distance measures for speech recognition,
and instrumental, in Pattern Recognition and Artificial
Intelligence, C. H.
Chen, Ed., pp. 374-388. Academic, New York, 1976.
S.B. Davis, and P. Mermelstein, Comparison of Parametric
for Monosyllabic Word Recognition in Continuously Spoken
Sentences, in IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol.
pp. 357-366.
But Mermelstein usually credits John Bridle's work for the idea
JSRU Report No. 1003
AN EXPERIMENTAL AUTOMATIC WORD-RECOGNITION SYSTEM:
J. S. Bridle and M. D. Brown
I have copies of the early two if you need them.
Arturo Camacho, PhD
Computer and Information Science and Engineering
University of Florida
Web page: www.cise.ufl.edu/~acamacho