Subject: Re: [MUSIC-IR] Re: Re: How to compare classification results
From: Geoffroy Peeters <Geoffroy.Peeters@xxxxxxxx>
Date: Fri, 29 Feb 2008 16:01:51 +0100
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Thanks all for your replies, and I completely agree on the importance of
taking into account the specific statistical properties of the data sets
and of using the F-measure.

In fact, what I am interested in is precisely the case where the
experiments are comparable (i.e. the statistical properties of the
underlying classes are the same: same separability, etc.) but the
numbers of classes differ.
The question I mentioned is the same whether you use recall or the
F-measure.

Example: I use the same test set with the same algorithm and the same
measure, but in one case I consider a two-class problem and in the other
a three-class problem; how do I compare the results?

Best regards
Geoffroy Peeters

Xavier Amatriain wrote:
> Hi,
>
> When evaluating classification methods, especially if the classes are
> imbalanced, recognition rate is not a good measure.
> Some common measures are recall = TP / (TP + FN) and precision =
> TP / (TP + FP),
>
> with TP = true positives, FN = false negatives and FP = false positives.
>
> Better still, the "F measures" are able to summarize both recall and
> precision in a single number.
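[The measures quoted above can be checked with a few lines of code. This is an editorial sketch of the standard formulas, not part of the original mail; the confusion counts are invented for illustration.]

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall,
    # summarizing both in a single number
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Invented counts: 40 true positives, 10 false positives, 20 false negatives
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```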
>
> You can find more on this, for instance, in the paper "Evaluating
> metrics for Hard Classifiers":
>
> www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps

> Kris West wrote:
>> Hi Geoffroy,
>>
>> My two pence:
>>
>> The number of times better than random is a reasonable statistic in
>> machine learning. However, it is never truly possible to compare
>> classification experiments on completely different datasets, hence
>> people usually report accuracy statistics on a single dataset and use
>> 'times better than random' to gauge how powerful the learning
>> technique was. It is not a great statistic, though, and cannot really
>> be used to compare systems across different datasets. If you have a
>> number of measurements of it (across multiple algorithms, all tested
>> on the same datasets), they can be used as data points to estimate the
>> significance of the difference between algorithms and variants of them
>> (e.g. I might do a Student's t-test to determine whether my variant of
>> C4.5 was always significantly better than the standard version).
>> However, you still have to measure the statistic on the same datasets,
>> and you are really just using this stat as a normalisation of the
>> accuracy scores.
>>
>> To better understand why you cannot compare these scores across
>> datasets, consider the situation where one dataset is close to
>> linearly separable, while the other is non-linearly separable. These
>> properties may arise from different feature sets, different example
>> tracks, or a combination of the two. The linearly separable case will
>> get good results with linear classifiers (e.g. LDA, or SMO with a
>> first-order polynomial kernel). The second dataset might not be
>> linearly separable but may contain nice contiguous regions of
>> particular classes. Hence, a decision tree model or lazy classifier
>> might do really well here, while the linear classifiers do not.
>> Such situations are possible in genre classification experiments, for
>> example (a dataset of Classical, Electronic and Heavy Metal tracks
>> might be linearly separable where Jazz, Blues and Country is not;
>> alternatively, you might switch from an MFCC-based feature set to a
>> beat histogram and achieve similar effects). In this situation, using
>> the scores of the linear classifiers on the first dataset, you would
>> significantly overestimate their performance on the second dataset.
>>
>> So to summarise, you have to fix at least one variable to make
>> comparisons, and the comparisons you can make depend on the variable
>> fixed. This can mean fixing the dataset (and learning about
>> algorithms), fixing the algorithm (and learning about datasets), or
>> some other suitable setup. Hence, if you use the same algorithm in
>> tests on two different datasets, 'times better than random' can tell
>> you how hard the algorithm found each dataset... but not really much
>> more than the accuracy told you. An algorithm A that did well in a
>> small test might be outperformed by algorithm B on the larger test;
>> the only useful thing 'times better than random' can tell you here is
>> whether it stayed constant despite the change in data size (hence the
>> algorithm might keep scaling up with fairly constant performance).
>>
>> K
>>
>> Geoffroy Peeters wrote:
>>> Dear all,
>>>
>>> Has anyone already dealt with the comparison of classification
>>> results coming from experiments using various numbers of classes?
>>> In other words: how does one compare a recognition rate X coming from
>>> an experiment with N classes to a recognition rate Y coming from an
>>> experiment with M classes?
>>>
>>> I guess one possibility is to compute, for both, the ratio of the
>>> obtained recognition rate to the random recognition rate (which
>>> depends on the number of classes).
>>> Example:
>>> - a recognition rate of 50% for 2 classes would give 1 (50%/50%);
>>> - a recognition rate of 50% for 4 classes would give 2 (50%/25%).
>>> So this would lead to the conclusion that the second system performs
>>> better.
>>>
>>> However, this measure has the drawback that it favours experiments
>>> with a large number of classes:
>>> a 2-class problem will never exceed a ratio of 2 (100%/50%)!
>>>
>>> Thanks for any suggestions of references.
>>>
>>> Best regards
>>> Geoffroy Peeters

--
Geoffroy Peeters
Ircam - R&D
tel: +33/1/44.78.14.22
email: peeters@xxxxxxxx
<http://www.ircam.fr>
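[The 'times better than random' normalisation discussed in the thread can be sketched in a few lines. This is an editorial illustration, not from the original posts, and the function name is invented; it assumes uniform random guessing over balanced classes, so chance level is 1 / n_classes.]

```python
def times_better_than_random(recognition_rate, n_classes):
    """Ratio of the observed recognition rate to the chance level
    (1 / n_classes), assuming balanced classes and uniform guessing."""
    return recognition_rate * n_classes

# Geoffroy's example: the same 50% recognition rate scores differently
# depending on the number of classes
print(times_better_than_random(0.50, 2))  # 1.0
print(times_better_than_random(0.50, 4))  # 2.0

# The drawback he notes: a 2-class problem can never exceed a ratio of 2,
# while problems with more classes have higher ceilings
print(times_better_than_random(1.00, 2))  # 2.0
print(times_better_than_random(1.00, 4))  # 4.0
```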

This message came from the mail archive

http://www.auditory.org/postings/2008/

maintained by: DAn Ellis <dpwe@ee.columbia.edu>

Electrical Engineering Dept., Columbia University