Re: frequency to mel formula

On Thu, Jul 23, 2009 at 12:43 AM, Richard F. Lyon <DickLyon@xxxxxxx> wrote:

I'd still like to understand more of the history of the Mel scale, formulas for it, and its relationship to other scales; did O'Shaughnessy come up with the 700? Or did he get it from somewhere else? Someone figured the 1000 was just too high to be realistic?

I've been reviewing some of Don Greenwood's papers, and the Wikipedia article on his "Greenwood function" at http://en.wikipedia.org/wiki/Greenwood_function , along with Don's comments from last January: http://www.auditory.org/postings/2009/53.html

Don says a good map of cochlear position x (from 0 at apex to 1 at base) to frequency f in hertz is f = 165.4*(10^(2.1x) - 1). Solving for x and scaling to get 1000 at f = 1000, we get a formula in the form of the mel-scale formula:

m = 512.18 * ln(f/165.4 + 1).

The key here is not the scale factor, but the "break frequency", 165.4 Hz, that separates the log-like high-frequency region from the linear-like low-frequency region. Don finds that the data imply a much lower break frequency than has traditionally been used; his papers show that the higher values (700 or 1000) are too high to fit the published data that they're supposed to be based on. That means the map is logarithmic over a wider range than usually recognized, and that the critical bands at the low end are much narrower than some scales would imply.
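
One way to put a number on "log-like" (this framing is an illustration, not something taken from Don's papers): for any map of the form m = k*ln(1 + f/fb), the slope against log frequency is dm/d(ln f) = k*f/(f + fb), so the fraction f/(f + fb) says how close the map is to purely logarithmic at frequency f. A minimal Octave sketch, with illustrative variable names and arbitrarily chosen probe frequencies:

```octave
% Fraction of the asymptotic log slope reached at frequency f, for a map
% m = k*ln(1 + f/fb):  dm/d(ln f) = k * f/(f + fb).
fb = [165.4 700 1000];         % break frequencies discussed above (Hz)
f  = [100; 500; 1000; 4000];   % probe frequencies (Hz), chosen arbitrarily
% Rows follow f, columns follow fb. Implicit expansion needs Octave or
% MATLAB R2016b+; use bsxfun on older MATLAB.
log_fraction = f ./ (f + fb);
disp([f, log_fraction])
```

At 1 kHz this gives about 0.86 for the 165.4 Hz break, 0.59 for 700 Hz, and 0.50 for 1000 Hz.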

The ERB-rate scale based on Glasberg and Moore 1990 has a corresponding break point at 228.8 Hz, much closer to Greenwood's interpretation than to the mel-scale interpretations (this is from ERB = 24.7 (4.37F/1000 + 1), where 228.8 is 1000/4.37). In terms of a mel-like formula:

m = 594.9 * ln(f/228.8 + 1)
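
The scale factors in these formulas all come from the same normalization, m = 1000 at f = 1000 Hz, which gives k = 1000/ln(1 + 1000/fb) for a break frequency fb. A quick check in Octave, as a sketch (the variable names are illustrative, and the 700 Hz entry is the usual mel-formula break):

```octave
% Scale factor k such that m = k*ln(1 + f/fb) equals 1000 at f = 1000 Hz.
fb = [165.4 228.8 700];            % Greenwood, ERB-rate, mel breaks (Hz)
k  = 1000 ./ log(1 + 1000 ./ fb);  % log() is the natural log
% fprintf walks the 2x3 matrix column by column: fb(1), k(1), fb(2), ...
fprintf('break %6.1f Hz -> k = %7.2f\n', [fb; k]);
% k comes out near 512.2, 594.9 and 1127.0 respectively
```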

This is also very close to what I've been using in recent cochlear models for machine hearing (used by Malcolm Slaney in the 1993 auditory toolbox; actually I'm using 245 Hz now for some reason I don't recall). So I guess it's time to take Don seriously at his suggestion to see if such a change away from mel scale and closer to reality would improve a speech system (vocoder or recognizer). But I'm not in that business, so I'll have to bend some ears...toward a more logarithmic scale.

Of course, with this relatively small deviation from logarithmic, there's also not a lot of deviation from bandwidth being a "constant Q" function of center frequency, so other simple parameterizations are likely to fit as well. The Bark scale is an example of such a thing, and there are others; the Bark scale is closer to mel than to the Greenwood or ERB-rate scales.
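
To see how close to constant Q this is, the ERB formula quoted above can be evaluated directly. The sketch below assumes Q is defined as center frequency divided by ERB (my definition for illustration) and uses arbitrarily chosen frequencies:

```octave
% ERB from Glasberg and Moore 1990, as quoted above:
%   ERB = 24.7*(4.37*F/1000 + 1)
f   = [100 228.8 1000 4000 10000];   % center frequencies (Hz), arbitrary
erb = 24.7 * (4.37 * f / 1000 + 1);  % bandwidth in Hz
Q   = f ./ erb;                      % assumed definition: Q = f / ERB
Q_limit = 1000 / (24.7 * 4.37);      % high-frequency asymptote, about 9.26
fprintf('%7.1f Hz  ERB %7.1f Hz  Q %5.2f\n', [f; erb; Q]);
```

Since ERB is proportional to (f + 228.8), Q/Q_limit is just f/(f + 228.8): half the limit at the break frequency, and already about 81% of it at 1 kHz.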

If you want to look at the mappings, they are compared at http://www.speech.kth.se/~giampi/auditoryscales/ ; but the normalization isn't at 1000 Hz, so it's hard to compare shapes, and they're not on a log frequency scale, so it's hard to see the predominantly log-like nature of the mappings. So I took the code from there, modified it, and added Greenwood; you can run it if you have MATLAB or Octave handy. It's clear that the Greenwood and ERB-rate scales have a long "straight" log segment, and that the mel and Bark scales break at too high a frequency.

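% Values at 1 kHz, used to normalize ERB-rate and Bark to 1000 at f = 1000.
% (21.4 is the published ERB-rate constant; it cancels in the normalization.)
f = 1000;
erb_1k = 21.4 * log10(1 + f/228.8);
bark_1k = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);

f = (10:10:20000)';  % 10 Hz to 20 kHz, as a column vector
erb = 21.4 * log10(1 + f/228.8); % very close to lyon w 245 Hz break
mel = 1127 * log(1 + f/700);     % already 1000 at 1 kHz
bark = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);
greenwood = 512.18 * log(1 + f/165.4);  % already 1000 at 1 kHz

% One curve per column, on a log frequency axis
semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
xlabel('frequency (Hz)')
ylabel('normalized scales')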

Other things I found online include a study that evaluated different pitch scales on a speech intonation application: http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf Here the log mapping (semitone scale) came out best, with ERB-rate not far behind (and presumably Greenwood's would have been better than ERB-rate, being a little closer to log). Mel and Bark were not much better than linear; on this task, the frequency range of interest included just voice pitch range, up to 500 Hz, where these latter scales are essentially just linear. It's not clear if this "repetition pitch" task is very closely related to the "frequency" scaling that the scales are designed to cover, but it's a step.

Here's one: http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf that concludes that Mel, ERB, and Bark are all significantly better than either constant-Q (log) or linear scales, for source separation of stereo mixtures. But the results are about the same for the three "auditory" scales.

Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf

Any other good comparisons?

Dick