Subject: Re: [MUSIC-IR] Re: Re: How to compare classification results
From: Geoffroy Peeters <Geoffroy.Peeters@xxxxxxxx>
Date: Fri, 29 Feb 2008 16:01:51 +0100
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Thanks all for your replies, and I completely agree on the importance of
taking into account the specific statistical properties of the data sets
and of using the F-measure.

In fact, what I am interested in is precisely the case where the
experiments are comparable (i.e. the statistical properties of the
underlying classes are the same: same separability, etc.) but the
numbers of classes differ.
The question I mentioned is the same whether you use recall or the
F-measure.

Example: I use the same test set with the same algorithm and the same
measure, but in one case I consider a two-class problem and in the other
a three-class problem; how do I compare the results?

Best regards
Geoffroy Peeters

Xavier Amatriain wrote:
> Hi,
>
> When evaluating classification methods, especially if the classes are
> imbalanced, recognition rate is not a good measure.
> Some common measures are recall = TP / (TP + FN) and precision =
> TP / (TP + FP),
>
> with TP = true positives, FN = false negatives and FP = false positives.
>
> Better still, the "F measures" are able to summarize both recall and
> precision in a single number.
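[The measures quoted above can be checked with a few lines of code. This is an editorial sketch of the standard formulas, not part of the original mail; the confusion counts are invented for illustration.]

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall,
    # summarizing both in a single number
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Invented counts: 40 true positives, 10 false positives, 20 false negatives
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```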
>
> You can find more on this, for instance, in the paper "Evaluating
> metrics for Hard Classifiers":
>
> www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps

> Kris West wrote:
>> Hi Geoffroy,
>>
>> My two pence:
>>
>> The number of times better than random is a reasonable statistic in
>> machine learning. However, it is never truly possible to compare
>> classification experiments on completely different datasets, hence
>> people usually report accuracy statistics on a single dataset and use
>> 'times better than random' to gauge how powerful the learning
>> technique was. It is not a great statistic, though, and cannot really
>> be used to compare systems across different datasets. If you have a
>> number of measurements of it (across multiple algorithms, all tested
>> on the same datasets), they can be used as data points to estimate the
>> significance of the difference between algorithms and variants of them
>> (e.g. I might do a Student's t-test to determine whether my variant of
>> C4.5 was always significantly better than the standard version).
>> However, you still have to measure the statistic on the same datasets,
>> and you are really just using this stat as a normalisation of the
>> accuracy scores.
>>
>> To better understand why you cannot compare these scores across
>> datasets, consider the situation where one dataset is close to
>> linearly separable, while the other is non-linearly separable. These
>> properties may arise from different feature sets, different example
>> tracks, or a combination of the two. The linearly separable case will
>> get good results with linear classifiers (e.g. LDA, or SMO with a
>> first-order polynomial kernel). The second dataset might not be
>> linearly separable but may contain nice contiguous regions of
>> particular classes. Hence, a decision tree model or lazy classifier
>> might do really well here, while the linear classifiers do not.
>> Such situations are possible in genre classification experiments, for
>> example (a dataset of Classical, Electronic and Heavy Metal tracks
>> might be linearly separable where Jazz, Blues and Country is not;
>> alternatively, you might switch from an MFCC-based feature set to a
>> beat histogram and achieve similar effects). In this situation, using
>> the scores of the linear classifiers on the first dataset, you would
>> significantly overestimate their performance on the second dataset.
>>
>> So to summarise, you have to fix at least one variable to make
>> comparisons, and the comparisons you can make depend on the variable
>> fixed. This can mean fixing the dataset (and learning about
>> algorithms), fixing the algorithm (and learning about datasets), or
>> some other suitable setup. Hence, if you use the same algorithm in
>> tests on two different datasets, 'times better than random' can tell
>> you how hard the algorithm found each dataset... but not really much
>> more than the accuracy told you. An algorithm A that did well in a
>> small test might be outperformed by algorithm B on the larger test;
>> the only useful thing 'times better than random' can tell you here is
>> whether it stayed constant despite the change in data size (hence the
>> algorithm might keep scaling up with fairly constant performance).
>>
>> K
>>
>> Geoffroy Peeters wrote:
>>> Dear all,
>>>
>>> Has anyone already dealt with the comparison of classification
>>> results coming from experiments using various numbers of classes?
>>> In other words: how does one compare a recognition rate X coming from
>>> an experiment with N classes to a recognition rate Y coming from an
>>> experiment with M classes?
>>>
>>> I guess one possibility is to compute, for both, the ratio of the
>>> obtained recognition rate to the random recognition rate (which
>>> depends on the number of classes).
>>> Example:
>>> - a recognition rate of 50% for 2 classes would give 1 (50%/50%);
>>> - a recognition rate of 50% for 4 classes would give 2 (50%/25%).
>>> So this would lead to the conclusion that the second system performs
>>> better.
>>>
>>> However, this measure has the drawback that it favours experiments
>>> with a large number of classes:
>>> a 2-class problem will never exceed a ratio of 2 (100%/50%)!
>>>
>>> Thanks for any suggestions of references.
>>>
>>> Best regards
>>> Geoffroy Peeters

--
Geoffroy Peeters
Ircam - R&D
tel: +33/1/44.78.14.22
email: peeters@xxxxxxxx
<http://www.ircam.fr>
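[The 'times better than random' normalisation discussed in the thread can be sketched in a few lines. This is an editorial illustration, not from the original posts, and the function name is invented; it assumes uniform random guessing over balanced classes, so chance level is 1 / n_classes.]

```python
def times_better_than_random(recognition_rate, n_classes):
    """Ratio of the observed recognition rate to the chance level
    (1 / n_classes), assuming balanced classes and uniform guessing."""
    return recognition_rate * n_classes

# Geoffroy's example: the same 50% recognition rate scores differently
# depending on the number of classes
print(times_better_than_random(0.50, 2))  # 1.0
print(times_better_than_random(0.50, 4))  # 2.0

# The drawback he notes: a 2-class problem can never exceed a ratio of 2,
# while problems with more classes have higher ceilings
print(times_better_than_random(1.00, 2))  # 2.0
print(times_better_than_random(1.00, 4))  # 4.0
```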

This message came from the mail archive

http://www.auditory.org/postings/2008/

maintained by: DAn Ellis <dpwe@ee.columbia.edu>

Electrical Engineering Dept., Columbia University