Re: frequency to mel formula (Arturo Camacho)

Subject: Re: frequency to mel formula
From: Arturo Camacho <acamacho@xxxxxxxx>
Date: Thu, 23 Jul 2009 09:47:31 -0600
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Adding one more item to Richard's list: in a paper John Harris and I
published last year (see reference below) we proposed a new pitch estimator
named SWIPE, which was shown to outperform 12 other estimators. SWIPE takes
samples of the spectrum using steps uniformly distributed in a scale of the
form log(1+f/s), where f is frequency in Hertz. The values of s explored to
create SWIPE were 0, 229, 700, and infinity. They produce the logarithmic
(semitone), ERB, mel, and Hertz scales, respectively. We also explored the
Bark scale z = C/(1+s/f), with s = 1960. (This scale is similar to the
previous ones in that f is scaled, in this case by 1960.) The paper reports
results using all these scales, and from those results it can be observed
that (1) the ERB scale was the one that produced the best results, and (2)
the further s gets from 229 (in both directions), the worse the results.
More recently I explored the Greenwood scale (s = 165) and found that,
consistent with our previous results, this scale performed better than the
logarithmic scale (s = 0) but not as well as the ERB scale (s = 229)
(results not published yet).

The previous results show that for SWIPE the best sampling of the spectrum
is produced with a value of s close to 229 (at least within the range
[165,700]). However, this is very different from saying that the scale that
produces equidistant steps in pitch is this scale. As a musician, my
personal experience is that equidistant steps in pitch are produced by a
logarithmic scale, at least within the range in which I can perceive pitch
(approx. 30 to 6000 Hz).

References:

Camacho, A., Harris, J.
G., "A sawtooth waveform inspired pitch estimator for speech and music",
Journal of the Acoustical Society of America, vol. 124, pp. 1638-1652,
September 2008

Arturo

On Thu, Jul 23, 2009 at 12:43 AM, Richard F. Lyon <DickLyon@xxxxxxxx> wrote:

> I'd still like to understand more of the history of the Mel scale, formulas
> for it, and its relationship to other scales; did O'Shaughnessy come up with
> the 700? Or did he get it from somewhere else? Someone figured the 1000
> was just too high to be realistic?
>
> I've been reviewing some of Don Greenwood's papers, and the wikipedia
> article on his "Greenwood function" at
> http://en.wikipedia.org/wiki/Greenwood_function . And Don's comments from
> last Jan: http://www.auditory.org/postings/2009/53.html
>
> Don says a good map of cochlear position x (from 0 at apex to 1 at base) to
> frequency f in hertz is f = 165.4*(10^(2.1x) - 1). Solving for x and
> scaling to get 1000 at f = 1000, we get a formula in the form of the
> mel-scale formula:
>
> m = 512.18 * ln(f/165.4 + 1).
>
> The key here is not the scale factor, but the "break frequency", 165.4 Hz,
> that separates the log-like high-frequency region from the linear-like
> low-frequency region. Don finds that the data imply a much lower break
> frequency than has traditionally been used; his papers show that the higher
> values (700 or 1000) are too high to fit the published data that they're
> supposed to be based on. That means the map is logarithmic over a wider
> range than usually recognized, and that the critical bands at the low end
> are much narrower than some scales would imply.
>
> The ERB-rate scale based on Glasberg and Moore 1990 has a corresponding
> break point at 228.8 Hz, much closer to Greenwood's interpretation than to
> the mel-scale interpretations (this is from ERB = 24.7 (4.37F/1000 + 1),
> where 228.8 is 1000/4.37).
> In terms of mel-like formula:
>
> m = 594.9 * ln(f/228.8 + 1)
>
> This is also very close to what I've been using in recent cochlear models
> for machine hearing (used by Malcolm Slaney in the 1993 auditory toolbox;
> actually I'm using 245 Hz now for some reason I don't recall). So I guess
> it's time to take Don seriously at his suggestion to see if such a change
> away from mel scale and closer to reality would improve a speech system
> (vocoder or recognizer). But I'm not in that business, so I'll have to bend
> some ears...toward a more logarithmic scale.
>
> Of course, with this relatively small deviation from logarithmic, there's
> also not a lot of deviation from bandwidth being a "constant Q" function of
> center frequency, so other simple parameterizations are likely to fit as
> well. The Bark scale is an example of such a thing, and there are others;
> the Bark scale is closer to mel than to the Greenwood or ERB-rate scales.
>
> If you want to look at the mappings, they are compared at
> http://www.speech.kth.se/~giampi/auditoryscales/ ; but the normalization
> isn't at 1000 Hz, so it's hard to compare shapes, and they're not on a log
> frequency scale, so it's hard to see the predominantly log-like nature of
> the mappings. So I took and modified the code from there, added Greenwood,
> and you can run it if you have matlab or octave handy. It's clear that the
> Greenwood and ERB-rate scales have a long "straight" log segment, and that
> the mel and Bark scales break at too high a frequency.
>
> f = 1000;
> erb_1k = 214 * log10(1 + f/228.8);
> bark_1k = 13*atan(0.00076*f)+3.5*atan((f/7000).^2);
>
> f = (10:10:20000)';
> erb = 214 * log10(1 + f/228.8); % very close to lyon w 245 Hz break
> mel = 1127 * log(1 + f/700);
> bark = 13*atan(0.00076*f) + 3.5*atan((f/7000).^2);
> greenwood = 512.18 * log(1 + f/165.4);
>
> semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
> legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
> xlabel('frequency (Hz)')
> ylabel('normalized scales')
>
> Other things I found online include a study that evaluated different pitch
> scales on a speech intonation application:
> http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf Here the log
> mapping (semitone scale) came out best, with ERB-rate not far behind (and
> presumably Greenwood's would have been better than ERB-rate, being a little
> closer to log). Mel and Bark were not much better than linear; on this
> task, the frequency range of interest included just voice pitch range, up to
> 500 Hz, where these latter scales are essentially just linear. It's not
> clear if this "repetition pitch" task is very closely related to the
> "frequency" scaling that the scales are designed to cover, but it's a step.
>
> Here's one:
> http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf
> that concludes that Mel, ERB, and Bark are all significantly better than
> either constant-Q (log) or linear scales, for source separation of stereo
> mixtures. But the results are about the same for the three "auditory"
> scales.
>
> Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
> ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
>
> Any other good comparisons?
>
> Dick
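
The mel-like formulas quoted in this thread all have the form
m = k * ln(1 + f/s), with the scale factor k chosen so that m = 1000 at
f = 1000 Hz. A minimal sketch of that normalization follows, written in
Python rather than the matlab/octave of the quoted message. The function
names scale_factor and warp and the BREAKS table are illustrative
assumptions, not taken from SWIPE or from any toolbox mentioned above.

```python
import math


def scale_factor(s, f_ref=1000.0, m_ref=1000.0):
    """Return k such that k * ln(1 + f_ref/s) equals m_ref.

    s is the break frequency in Hz. The name is hypothetical,
    chosen only for this illustration.
    """
    return m_ref / math.log1p(f_ref / s)


def warp(f, s):
    """Map frequency f (Hz) to a mel-like value for break frequency s (Hz)."""
    return scale_factor(s) * math.log1p(f / s)


# Break frequencies quoted in the thread, in Hz.
BREAKS = {"Greenwood": 165.4, "ERB-rate": 228.8, "mel": 700.0}

if __name__ == "__main__":
    for name, s in BREAKS.items():
        # Print the scale factor and the warped value at 100 Hz for comparison.
        print(f"{name:10s} s = {s:6.1f} Hz  k = {scale_factor(s):8.2f}  "
              f"m(100 Hz) = {warp(100.0, s):7.2f}")
```

Under these assumptions the computed factors should come out close to the
512.18, 594.9, and 1127 quoted in the messages, so only the break frequency
s distinguishes the three scales once they are normalized at 1000 Hz.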

This message came from the mail archive
http://www.auditory.org/postings/2009/
maintained by:

DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University