# Re: frequency to mel formula

I'd still like to understand more of the history of the mel scale,
the formulas for it, and its relationship to other scales. Did
O'Shaughnessy come up with the 700, or did he get it from somewhere
else? Did someone figure the 1000 was just too high to be realistic?

I've been reviewing some of Don Greenwood's papers, the Wikipedia
article on his "Greenwood function" at
http://en.wikipedia.org/wiki/Greenwood_function , and Don's comments
from last January: http://www.auditory.org/postings/2009/53.html

Don says a good map of cochlear position x (from 0 at the apex to 1
at the base) to frequency f in hertz is f = 165.4*(10^(2.1x) - 1).
Solving for x and scaling so that m = 1000 at f = 1000 Hz, we get a
formula in the form of the mel-scale formula:

m = 512.18 * ln(f/165.4 + 1).
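
For anyone who wants to check the arithmetic, here is a minimal
Octave/MATLAB sketch of that step. The names greenwood_f, greenwood_x,
scale, and greenwood_m are mine, chosen only for illustration; the
numbers 165.4 and 2.1 are the ones quoted above.

```octave
% Greenwood map as quoted above: position x (0 = apex, 1 = base) to Hz.
greenwood_f = @(x) 165.4 * (10.^(2.1*x) - 1);

% Solving f = 165.4*(10^(2.1x) - 1) for x.
greenwood_x = @(f) log10(f/165.4 + 1) / 2.1;

% x is proportional to ln(f/165.4 + 1), so pick the constant that
% makes the rescaled value equal 1000 at f = 1000 Hz.
scale = 1000 / log(1000/165.4 + 1);
greenwood_m = @(f) scale * log(f/165.4 + 1);

fprintf('scale factor = %.2f\n', scale);
fprintf('m(1000 Hz)   = %.1f\n', greenwood_m(1000));
```

The scale factor should come out at about 512.18, which is where the
constant in the formula comes from.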

The key here is not the scale factor, but the "break frequency",
165.4 Hz, that separates the log-like high-frequency region from the
linear-like low-frequency region. Don finds that the data imply a
much lower break frequency than has traditionally been used; his
papers show that the higher values (700 or 1000) are too high to fit
the published data that they're supposed to be based on. That means
the map is logarithmic over a wider range than usually recognized,
and that the critical bands at the low end are much narrower than
some scales would imply.
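
To put rough numbers on the low-end claim, here is a small
Octave/MATLAB sketch. It assumes, as a simplification of mine, that a
band corresponds to a fixed step in m, so its width in Hz is
proportional to the local slope df/dm = (f + break)/A for
m = A*ln(f/break + 1), with A chosen so that m = 1000 at 1000 Hz. The
variable names are illustrative only.

```octave
breaks = [165.4, 700, 1000];            % break frequencies in Hz
A = 1000 ./ log(1 + 1000 ./ breaks);    % scale factors so m(1000 Hz) = 1000
f = 100;                                % a low frequency to examine, in Hz
hz_per_unit = (f + breaks) ./ A;        % df/dm at f for each break frequency

for k = 1:numel(breaks)
  fprintf('break %6.1f Hz: A = %7.2f, df/dm at %d Hz = %.3f\n', ...
          breaks(k), A(k), f, hz_per_unit(k));
end
```

Under that assumption, the 165.4 Hz break gives the smallest
Hz-per-unit value at 100 Hz of the three, that is, the narrowest
low-frequency bands.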

The ERB-rate scale based on Glasberg and Moore 1990 has a
corresponding break point at 228.8 Hz, much closer to Greenwood's
interpretation than to the mel-scale interpretations (this is from
ERB = 24.7 (4.37F/1000 + 1) with F in Hz, where 228.8 is 1000/4.37).
In the form of the mel-like formula:

m = 594.9 * ln(f/228.8 + 1)
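
The same check can be done for the ERB-rate version. This is a
minimal Octave/MATLAB sketch with variable names of my own choosing;
it confirms the break frequency and scale factor, and that a
log10-based ERB-rate normalized to 1000 at 1000 Hz is the same curve
as the natural-log formula above.

```octave
break_hz = 1000 / 4.37;                  % about 228.8 Hz
A = 1000 / log(1000/228.8 + 1);          % scale so that m(1000 Hz) = 1000

f = [100 500 1000 4000];                 % a few test frequencies in Hz
% ERB-rate shape in log10 form, normalized to 1000 at 1000 Hz.
erb_norm = 1000 * log10(1 + f/228.8) / log10(1 + 1000/228.8);
% Mel-like form with the natural log.
m = A * log(f/228.8 + 1);

fprintf('break = %.1f Hz, A = %.1f\n', break_hz, A);
fprintf('largest difference between the two forms: %g\n', ...
        max(abs(erb_norm - m)));
```

The base of the logarithm only changes the scale factor, so after
normalization the two forms agree to within rounding error.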

This is also very close to what I've been using in recent cochlear
models for machine hearing (used by Malcolm Slaney in the 1993
Auditory Toolbox; actually I'm using 245 Hz now, for some reason I
don't recall). So I guess it's time to take Don's suggestion
seriously and see whether a change away from the mel scale and closer
to reality would improve a speech system (vocoder or recognizer). But
I'm not in that business, so I'll have to bend some ears...toward a
more logarithmic scale.

Of course, with this relatively small deviation from logarithmic,
there's also not a lot of deviation from bandwidth being a "constant
Q" function of center frequency, so other simple parameterizations
are likely to fit as well. The Bark scale is an example of such a
thing, and there are others; the Bark scale is closer to mel than to
the Greenwood or ERB-rate scales.
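
To see how close to constant-Q this is, here is a short Octave/MATLAB
sketch using the Glasberg and Moore ERB formula quoted earlier.
Taking Q as center frequency divided by ERB is my own working
definition for this illustration, and the names erb_hz, Q, and
q_limit are hypothetical.

```octave
erb_hz = @(f) 24.7 * (4.37*f/1000 + 1);  % ERB in Hz, with f in Hz
f = [125 250 500 1000 2000 4000 8000];   % center frequencies in Hz
Q = f ./ erb_hz(f);                      % effective Q at each frequency
q_limit = 1000 / (24.7 * 4.37);          % value Q approaches as f grows

for k = 1:numel(f)
  fprintf('%5d Hz: ERB = %6.1f Hz, Q = %.2f\n', f(k), erb_hz(f(k)), Q(k));
end
fprintf('high-frequency limit of Q = %.2f\n', q_limit);
```

Q rises steadily with frequency and flattens toward its limit well
above the 228.8 Hz break; the departure from constant Q is
concentrated at the low end.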

If you want to look at the mappings, they are compared at
http://www.speech.kth.se/~giampi/auditoryscales/ ; but the
normalization there isn't at 1000 Hz, so it's hard to compare shapes,
and the plots are not on a log frequency scale, so it's hard to see
the predominantly log-like nature of the mappings. So I took the code
from there, modified it, and added Greenwood; you can run it if you
have MATLAB or Octave handy. It's clear that the Greenwood and
ERB-rate scales have a long "straight" log segment, and that the mel
and Bark scales break at too high a frequency.

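% Compare auditory scales, each normalized to 1000 at f = 1000 Hz.
f = 1000;
% ERB-rate (standard factor is 21.4; it cancels in the normalization).
erb_1k = 21.4 * log10(1 + f/228.8);
% Bark in the Zwicker and Terhardt form, which uses 7500 Hz in the second term.
bark_1k = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);
f = (10:10:20000)'; % column vector, 10 Hz to 20 kHz
erb = 21.4 * log10(1 + f/228.8); % very close to lyon w 245 Hz break
mel = 1127 * log(1 + f/700); % already 1000 at 1 kHz
bark = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);
greenwood = 512.18 * log(1 + f/165.4); % already 1000 at 1 kHz
% Log frequency axis, so a log-like mapping shows up as a straight line.
semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
xlabel('frequency (Hz)')
ylabel('normalized scales')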

Other things I found online include a study that evaluated different
pitch scales on a speech intonation application:
http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf Here the log
mapping (semitone scale) came out best, with ERB-rate not far behind
(and presumably Greenwood's would have been better than ERB-rate,
being a little closer to log). Mel and Bark were not much better
than linear; on this task, the frequency range of interest included
just the voice pitch range, up to 500 Hz, where these latter scales
are essentially linear. It's not clear whether this "repetition
pitch" task is very closely related to the "frequency" scaling that
the scales are designed to cover, but it's a step.

Here's another:
http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf
It concludes that Mel, ERB, and Bark are all significantly better
than either constant-Q (log) or linear scales for source separation
of stereo mixtures, but that the results are about the same for the
three "auditory" scales.
Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
Any other good comparisons?
Dick