Re: frequency to mel formula
Adding one more item to Richard's list: in a paper John Harris and I published last year (see reference below) we proposed a new pitch estimator named SWIPE, which was shown to outperform 12 other estimators. SWIPE takes samples of the spectrum using steps uniformly distributed in a scale of the form log(1+f/s), where f is frequency in Hertz. The values of s explored to create SWIPE were 0, 229, 700, and infinity (with 0 and infinity understood as limiting cases). They produce the logarithmic (semitone), ERB, mel, and Hertz scales, respectively. We also explored the Bark scale z = C/(1+s/f), with s = 1960. (This scale is similar to the previous ones in that f is scaled by a constant, in this case 1960.) The paper reports results using all these scales, and from those results it can be observed that (1) the ERB scale was the one that produced the best results, and (2) the further s gets from 229 (in either direction), the worse the results. More recently I explored the Greenwood scale (s = 165) and found that, consistent with our previous results, this scale performed better than the logarithmic scale (s = 0) but not as well as the ERB scale (s = 229) (results not published yet).
The previous results show that for SWIPE the best sampling of the spectrum is produced with a value of s close to 229 (at least within the range [165,700]). However, this is very different from saying that this is the scale that produces equidistant steps in pitch. As a musician, my personal experience is that equidistant steps in pitch are produced by a logarithmic scale, at least within the range in which I can perceive pitch (approx. 30 to 6000 Hz).
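As a rough illustration of what "steps uniformly distributed in a scale of the form log(1+f/s)" means, here is a minimal Python sketch. It is not SWIPE's actual code; the function names (`warp`, `unwarp`, `uniform_samples`) are made up for this example, and it only handles finite, positive s (the limiting cases s = 0 and s = infinity would need separate treatment as purely geometric and purely linear spacing).

```python
import math

def warp(f, s):
    """Map frequency f (Hz) to the scale log(1 + f/s); s must be finite and > 0."""
    return math.log1p(f / s)

def unwarp(z, s):
    """Inverse of warp: map a scale value z back to frequency in Hz."""
    return s * math.expm1(z)

def uniform_samples(f_lo, f_hi, n, s):
    """Return n frequencies between f_lo and f_hi, equally spaced on the warped scale."""
    z_lo, z_hi = warp(f_lo, s), warp(f_hi, s)
    step = (z_hi - z_lo) / (n - 1)
    return [unwarp(z_lo + i * step, s) for i in range(n)]

# Example: 5 sample points between 30 Hz and 6 kHz on an ERB-like scale (s = 229).
points = uniform_samples(30.0, 6000.0, 5, 229.0)
```

The resulting points are packed more densely at low frequencies and spread out at high frequencies; a smaller s pushes the spacing closer to geometric (semitone-like), a larger s closer to linear.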
Camacho, A., Harris, J. G., “A sawtooth waveform inspired pitch estimator for speech and music”, Journal of the Acoustical Society of America, vol. 124, pp. 1638-1652, September 2008
On Thu, Jul 23, 2009 at 12:43 AM, Richard F. Lyon <DickLyon@xxxxxxx> wrote:
I'd still like to understand more of the history of the Mel scale, formulas for it, and its relationship to other scales; did O'Shaughnessy come up with the 700? Or did he get it from somewhere else? Someone figured the 1000 was just too high to be realistic?
I've been reviewing some of Don Greenwood's papers, and the wikipedia article on his "Greenwood function" at http://en.wikipedia.org/wiki/Greenwood_function . And Don's comments from last Jan: http://www.auditory.org/postings/2009/53.html
Don says a good map of cochlear position x (from 0 at apex to 1 at base) to frequency f in hertz is f = 165.4*(10^(2.1x) - 1). Solving for x and scaling to get 1000 at f = 1000, we get a formula in the form of the mel-scale formula:
m = 512.18 * ln(f/165.4 + 1).
The key here is not the scale factor, but the "break frequency", 165.4 Hz, that separates the log-like high-frequency region from the linear-like low-frequency region. Don finds that the data imply a much lower break frequency than has traditionally been used; his papers show that the higher values (700 or 1000) are too high to fit the published data that they're supposed to be based on. That means the map is logarithmic over a wider range than usually recognized, and that the critical bands at the low end are much narrower than some scales would imply.
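One way to make the role of the break frequency concrete: for any scale of the form m = k * ln(f/s + 1), the slope with respect to ln(f) is k * f/(f + s), so the fraction f/(f + s) says how close the scale is to purely logarithmic at frequency f (0 = linear-like, 1 = log-like, exactly 0.5 at f = s). The Python sketch below tabulates that fraction; the name `log_fraction` is invented for this illustration and does not come from any of the cited papers.

```python
def log_fraction(f, s):
    """Slope of ln(1 + f/s) with respect to ln(f): f / (f + s).

    Approaches 0 where the scale is linear-like (f << s) and 1 where it is
    log-like (f >> s); equals 0.5 exactly at the break frequency f = s.
    """
    return f / (f + s)

# Break frequencies discussed in this thread (Hz).
BREAKS = {"Greenwood": 165.4, "ERB-rate": 228.8, "mel": 700.0}

for f in (100.0, 300.0, 1000.0, 3000.0):
    row = ", ".join(f"{name} {log_fraction(f, s):.2f}" for name, s in BREAKS.items())
    print(f"{f:6.0f} Hz: {row}")
```

At any given frequency the scale with the lowest break frequency has the largest fraction, which is another way of saying it stays logarithmic over a wider range.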
The ERB-rate scale based on Glasberg and Moore 1990 has a corresponding break point at 228.8 Hz, much closer to Greenwood's interpretation than to the mel-scale interpretations (this is from ERB = 24.7 (4.37F/1000 + 1), where 228.8 is 1000/4.37). In terms of a mel-like formula:
m = 594.9 * ln(f/228.8 + 1)
This is also very close to what I've been using in recent cochlear models for machine hearing (used by Malcolm Slaney in the 1993 auditory toolbox; actually I'm using 245 Hz now for some reason I don't recall). So I guess it's time to take Don seriously at his suggestion to see if such a change away from mel scale and closer to reality would improve a speech system (vocoder or recognizer). But I'm not in that business, so I'll have to bend some ears...toward a more logarithmic scale.
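The scale factors in the mel-like formulas above all follow from the same normalization, m = 1000 at f = 1000 Hz, which gives k = 1000 / ln(1000/s + 1) for a break frequency s. A short Python check (the function name `scale_factor` is just for this sketch):

```python
import math

def scale_factor(s, f_ref=1000.0, m_ref=1000.0):
    """Factor k such that k * ln(f_ref/s + 1) == m_ref, for break frequency s in Hz."""
    return m_ref / math.log1p(f_ref / s)

def mel_like(f, s):
    """Mel-like scale k * ln(f/s + 1), normalized to 1000 at 1000 Hz."""
    return scale_factor(s) * math.log1p(f / s)

for name, s in (("Greenwood", 165.4), ("ERB-rate", 228.8), ("mel", 700.0)):
    print(f"{name}: s = {s} Hz, k = {scale_factor(s):.2f}")
```

With s = 700 this gives a factor of about 1127, the familiar constant of the natural-log mel formula.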
Of course, with this relatively small deviation from logarithmic, there's also not a lot of deviation from bandwidth being a "constant Q" function of center frequency, so other simple parameterizations are likely to fit as well. The Bark scale is an example of such a thing, and there are others; the Bark scale is closer to mel than to the Greenwood or ERB-rate scales.
If you want to look at the mappings, they are compared at http://www.speech.kth.se/~giampi/auditoryscales/ ; but the normalization isn't at 1000 Hz, so it's hard to compare shapes, and they're not on a log frequency scale, so it's hard to see the predominantly log-like nature of the mappings. So I took and modified the code from there, added Greenwood, and you can run it if you have matlab or octave handy. It's clear that the Greenwood and ERB-rate scales have a long "straight" log segment, and that the mel and Bark scales break at too high a frequency.
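% Reference values at 1 kHz, used to normalize ERB and Bark to 1000 at 1000 Hz
f = 1000;
erb_1k = 214 * log10(1 + f/228.8);
bark_1k = 13*atan(0.00076*f) + 3.5*atan((f/7000).^2);
% Frequency axis: 10 Hz to 20 kHz in 10 Hz steps (column vector)
f = (10:10:20000)';
erb = 214 * log10(1 + f/228.8); % very close to lyon w 245 Hz break
mel = 1127 * log(1 + f/700); % already 1000 at 1000 Hz
bark = 13*atan(0.00076*f) + 3.5*atan((f/7000).^2);
greenwood = 512.18 * log(1 + f/165.4); % already 1000 at 1000 Hz
% Plot all four on a log frequency axis, normalized to cross at (1000, 1000)
semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')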
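For anyone without MATLAB or Octave, the same comparison can be tabulated instead of plotted. The following is a plain-Python sketch using the formulas from the script above, with every scale normalized to 1000 at 1000 Hz; the helper names are invented here, and the mel factor is computed from the normalization (about 1127.01) rather than hard-coded as 1127.

```python
import math

def mel_like(f, s):
    """k * ln(1 + f/s), with k chosen so that 1000 Hz maps to 1000."""
    k = 1000.0 / math.log1p(1000.0 / s)
    return k * math.log1p(f / s)

def bark_raw(f):
    """Bark formula as used in the script above (not normalized)."""
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7000.0) ** 2)

def bark_norm(f):
    """Bark scale rescaled so that 1000 Hz maps to 1000."""
    return 1000.0 * bark_raw(f) / bark_raw(1000.0)

SCALES = {
    "ERB": lambda f: mel_like(f, 228.8),
    "Mel": lambda f: mel_like(f, 700.0),
    "Bark": bark_norm,
    "Greenwood": lambda f: mel_like(f, 165.4),
}

def compare(f):
    """Return {scale name: normalized value} at frequency f in Hz."""
    return {name: fn(f) for name, fn in SCALES.items()}

for f in (50, 100, 200, 500, 1000, 2000, 5000, 10000):
    row = "  ".join(f"{name}={value:7.1f}" for name, value in compare(f).items())
    print(f"{f:6d} Hz  {row}")
```

Below 1000 Hz the Greenwood and ERB-rate values sit well above the mel and Bark values, which reflects their lower break frequencies.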
Other things I found online include a study that evaluated different pitch scales on a speech intonation application: http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf Here the log mapping (semitone scale) came out best, with ERB-rate not far behind (and presumably Greenwood's would have been better than ERB-rate, being a little closer to log). Mel and Bark were not much better than linear; on this task, the frequency range of interest included just voice pitch range, up to 500 Hz, where these latter scales are essentially just linear. It's not clear if this "repetition pitch" task is very closely related to the "frequency" scaling that the scales are designed to cover, but it's a step.
Here's one: http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf that concludes that Mel, ERB, and Bark are all significantly better than either constant-Q (log) or linear scales, for source separation of stereo mixtures. But the results are about the same for the three "auditory" scales.
Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
Any other good comparisons?