
Re: frequency to mel formula

Dear members of the list:

Following the interesting observations made by Dick, I want to say
that some years ago I proposed a pitch estimator named SWIPE [A.
Camacho and J. Harris, J. Acoust. Soc. Am. 124, 1638-1652 (2008)] that
computes the spectrum using scales of the form x(f) = C log(1 + f/s).
I tested SWIPE with values of s equal to 0, 229, 700, and infinity
(the first and last understood as limits), corresponding to the log,
ERB, mel, and Hertz scales, respectively.
The one that produced the best results overall was s=229 (ERB scale).
Interestingly, the results were monotonic in s over that range: s=229
produced better results than s=700, and s=700 produced better results
than s=infinity. On the other hand, s=0 was even worse than
s=infinity. Unfortunately, I did not know about Greenwood's scale at
that time.
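
For readers who want to play with this family of scales, here is a
minimal Python sketch (not SWIPE's actual code; the function names
warp and scale_constant are made up for illustration). It assumes the
form C log(1 + f/s) used in the mel-like formulas quoted below, with C
chosen so that 1000 Hz maps to 1000, and it shows the two limits: a
very large s behaves like the Hertz scale and a very small s like the
log scale.

```python
import math


def scale_constant(s, f_ref=1000.0):
    """Constant C such that C * ln(1 + f_ref/s) equals 1000."""
    return 1000.0 / math.log1p(f_ref / s)


def warp(f, s, f_ref=1000.0):
    """Map frequency f (Hz) onto the scale C * ln(1 + f/s).

    s is the break frequency in Hz; it must be positive, so the
    s = 0 and s = infinity cases are approached as limits.
    """
    return scale_constant(s, f_ref) * math.log1p(f / s)


# Break frequencies mentioned in this thread (Hz).
for name, s in [("Greenwood", 165.4), ("ERB", 229.0), ("mel", 700.0)]:
    print(name, s, scale_constant(s), warp(440.0, s))

# Very large s: the mapping is nearly linear in f (Hertz-like).
print(warp(500.0, 1e9))

# Very small s: successive octaves are (almost) equally spaced (log-like).
print(warp(200.0, 1e-6) - warp(100.0, 1e-6),
      warp(400.0, 1e-6) - warp(200.0, 1e-6))
```

The normalization at 1000 Hz is only a convenience for comparing
shapes; it does not change which s is being used.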

I just finished another study in which I added to SWIPE a
preprocessing stage based on an auditory model. This time I
included s=150, which corresponds to Greenwood's scale. (In his
1990 paper, Greenwood does not specify an exact value for k in eq.
[1], but suggests values between 0.8 and 1, which give values of s
= 165.4 k between 132.3 and 165.4 Hz.) The results obtained were not
as good as with s=229 and not as bad as with s=0, which suggests that
for SWIPE the optimal value of s must be between 150 and 700, and
maybe very close to 230.
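
To make the relation s = 165.4 k explicit, here is a small Python
sketch. It assumes Greenwood's map has the form f = 165.4 (10^(2.1 x)
- k), which reduces to the k = 1 version quoted below; the function
names are hypothetical. Solving for x gives
x = [log10(k) + log10(1 + f/(165.4 k))] / 2.1, so k only shifts the
scale by a constant and sets the break frequency to 165.4 k.

```python
import math

A = 165.4  # Hz, scale factor of the map quoted below
a = 2.1    # exponent slope, with x running from 0 (apex) to 1 (base)


def greenwood_f(x, k=1.0):
    """Frequency (Hz) at relative position x, assuming f = A*(10**(a*x) - k)."""
    return A * (10.0 ** (a * x) - k)


def greenwood_x(f, k=1.0):
    """Inverse map: relative position x for frequency f (Hz)."""
    return math.log10(f / A + k) / a


def break_frequency(k):
    """Break frequency s (Hz) of the equivalent log(1 + f/s) scale."""
    return A * k


def k_for_break(s):
    """Value of k that yields break frequency s (Hz)."""
    return s / A


# Range of s for the suggested range of k, and the k implied by s = 150 Hz.
print(break_frequency(0.8), break_frequency(1.0), k_for_break(150.0))
```

With these assumptions, s = 150 falls inside the suggested range of k.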


On Thu, Jul 23, 2009 at 12:43 AM, Richard F. Lyon <DickLyon@xxxxxxx> wrote:
> I'd still like to understand more of the history of the Mel scale, formulas
> for it, and its relationship to other scales; did O'Shaughnessy come up with
> the 700?  Or did he get it from somewhere else?  Someone figured the 1000
> was just too high to be realistic?
> I've been reviewing some of Don Greenwood's papers, and the wikipedia
> article on his "Greenwood function" at
> http://en.wikipedia.org/wiki/Greenwood_function .  And Don's comments from
> last Jan: http://www.auditory.org/postings/2009/53.html
> Don says a good map of cochlear position x (from 0 at apex to 1 at base) to
> frequency f in hertz is f = 165.4*(10^(2.1x) - 1).  Solving for x and
> scaling to get 1000 at f = 1000, we get a formula in the form of the
> mel-scale formula:
>  m = 512.18 * ln(f/165.4 + 1).
> The key here is not the scale factor, but the "break frequency", 165.4 Hz,
> that separates the log-like high-frequency region from the linear-like
> low-frequency region.  Don finds that the data imply a much lower break
> frequency than has traditionally been used; his papers show that the higher
> values (700 or 1000) are too high to fit the published data that they're
> supposed to be based on.  That means the map is logarithmic over a wider
> range than usually recognized, and that the critical bands at the low end
> are much narrower than some scales would imply.
> The ERB-rate scale based on Glasberg and Moore 1990 has a corresponding
> break point at 228.8 Hz, much closer to Greenwood's interpretation than to
> the mel-scale interpretations (this is from ERB = 24.7 (4.37F/1000 + 1),
> where 228.8 is 1000/4.37).  In terms of mel-like formula:
>  m = 594.9 * ln(f/228.8 + 1)
> This is also very close to what I've been using in recent cochlear models
> for machine hearing (used by Malcolm Slaney in the 1993 auditory toolbox;
> actually I'm using 245 Hz now for some reason I don't recall).  So I guess
> it's time to take Don seriously at his suggestion to see if such a change
> away from mel scale and closer to reality would improve a speech system
> (vocoder or recognizer).  But I'm not in that business, so I'll have to bend
> some ears...toward a more logarithmic scale.
> Of course, with this relatively small deviation from logarithmic, there's
> also not a lot of deviation from bandwidth being a "constant Q" function of
> center frequency, so other simple parameterizations are likely to fit as
> well.  The Bark scale is an example of such a thing, and there are others;
> the Bark scale is closer to mel than to the Greenwood or ERB-rate scales.
> If you want to look at the mappings, they are compared at
> http://www.speech.kth.se/~giampi/auditoryscales/ ; but the normalization
> isn't at 1000 Hz, so it's hard to compare shapes, and they're not on a log
> frequency scale, so it's hard to see the predominantly log-like nature of
> the mappings.  So I took and modified the code from there, added Greenwood,
> and you can run it if you have matlab or octave handy.  It's clear that the
> Greenwood and ERB-rate scales have a long "straight" log segment, and that
> the mel and Bark scales break at too high a frequency.
> f = 1000;
> erb_1k = 214 * log10(1 + f/228.8);
> bark_1k = 13*atan(0.00076*f)+3.5*atan((f/7000).^2);
> f = (10:10:20000)';
> erb = 214 * log10(1 + f/228.8);  % very close to lyon w 245 Hz break
> mel = 1127 * log(1 + f/700);
> bark = 13*atan(0.00076*f) + 3.5*atan((f/7000).^2);
> greenwood = 512.18 * log(1 + f/165.4);
> semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
> legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
> xlabel('frequency (Hz)')
> ylabel('normalized scales')
> Other things I found online include a study that evaluated different pitch
> scales on a speech intonation application:
> http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf  Here the log
> mapping (semitone scale) came out best, with ERB-rate not far behind (and
> presumably Greenwood's would have been better than ERB-rate, being a little
> closer to log).  Mel and Bark were not much better than linear; on this
> task, the frequency range of interest included just voice pitch range, up to
> 500 Hz, where these latter scales are essentially just linear.  It's not
> clear if this "repetition pitch" task is very closely related to the
> "frequency" scaling that the scales are designed to cover, but it's a step.
> Here's one:
> http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf
> that concludes that Mel, ERB, and Bark are all significantly better than
> either constant-Q (log) or linear scales, for source separation of stereo
> mixtures.  But the results are about the same for the three "auditory"
> scales.
> Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
> ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
> Any other good comparisons?
> Dick

Arturo Camacho Lozano
Visiting Professor
Escuela de Ciencias de la Computación e Informática
Universidad de Costa Rica