
Re: [MUSIC-IR] Re: Re: How to compare classification results



Thanks all for your replies. I completely agree on the importance of taking into account the specific statistical properties of the datasets, and on the use of the F-measure.

In fact, what I am interested in is precisely the case where the experiments are comparable (i.e. the statistical properties of the underlying classes are the same: same separability, etc.) but the number of classes differs.
The question I raised is the same whether you use recall or the F-measure.

Example: I use the same test set with the same algorithm and the same measure, but in one case I consider a two-class problem and in the other a three-class problem. How do I compare the results?

Best regards
Geoffroy Peeters

Xavier Amatriain wrote:
Hi,

When evaluating classification methods, especially if the classes are imbalanced, recognition rate is not a good measure.
Some common measures are recall = TP/(TP+FN) and precision = TP/(TP+FP)

with TP = true positives, FN = false negatives and FP = false positives

Even better, the "F measures" summarize both recall and precision in a single number.
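As a rough illustration of the definitions above, here is a minimal Python sketch. The function name `precision_recall_f`, the `beta` parameter and the counts are assumptions made for this example; the F measure is written in its usual F-beta form, (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to the harmonic mean of precision and recall when beta = 1.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) for one class from raw counts.

    tp, fp, fn are the true positive, false positive and false negative
    counts. An empty denominator is treated as a score of 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=8)
print(p, r, f1)
```

With these made-up counts, precision is 8/10 and recall is 8/16, so the F1 value falls between the two, closer to the lower one.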

You can find more on this, for instance, in the paper "Evaluating metrics for Hard Classifiers"

www.inference.phy.cam.ac.uk/hmw26/papers/evaluation.ps

Kris West wrote:
Hi Geoffroy,

My two pence:

The number of times better than random is a reasonable statistic in
machine learning. However, it's never truly possible to compare
classification experiments on completely different datasets, hence people
usually report accuracy statistics on a single dataset and use the
'times better than random' to look at how powerful the learning
technique was. However, it's not a great statistic and can't really be
used to compare systems across different datasets. If you have a number
of measurements of it (across multiple algorithms all tested on the same
datasets) they can be used as data points to estimate the significance
of the difference between algorithms and variants of them (e.g. I might
do a Student's t-test to determine if my variant of C4.5 was always
significantly better than the standard version). However, you still have
to measure the statistic on the same datasets and are really just using
this stat as a normalisation of the accuracy scores.
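The kind of test mentioned above could be sketched as a paired Student's t-test over per-dataset scores. This is only an assumption about how one might set it up: the function name `paired_t_statistic` and all the score values below are hypothetical, and the sketch returns only the t statistic and the degrees of freedom, leaving the p-value lookup to a t table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two algorithms scored on the same datasets.

    scores_a[i] and scores_b[i] must come from the same dataset i.
    Returns (t, degrees_of_freedom).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical 'times better than random' scores on four shared datasets.
variant = [2.0, 3.0, 2.5, 3.5]
standard = [1.5, 2.5, 2.5, 2.5]
t, dof = paired_t_statistic(variant, standard)
print(t, dof)
```

The pairing is the point of the design: each difference is taken on one dataset, so the dataset's own difficulty cancels out and only the difference between the two algorithms is tested.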

To better understand why you can't compare these scores across datasets,
consider the situation where one dataset is close to linearly separable,
while the other is non-linearly separable. These properties may arise
from different feature sets, different example tracks or a combination
of the two. The linearly separable case will get good results using
linear classifiers (e.g. LDA or SMO with a first order polynomial
kernel). The second dataset might not be linearly separable but contain
nice contiguous regions of particular classes. Hence, a decision tree
model or lazy classifier might do really well here, while the linear
classifiers do not. Such situations are possible in genre classification
experiments, for example: a dataset of Classical, Electronic and
Heavy Metal tracks might be linearly separable where Jazz, Blues and
Country is not; alternatively, you might switch from an MFCC-based
feature set to a beat histogram and achieve similar effects. In this
situation, using the scores for the linear classifiers on the first
dataset you would significantly overestimate their performance on the
second dataset.

So to summarise, you have to fix at least one variable to make
comparisons and the comparisons you can make depend on the variable
fixed. This can mean fixing the dataset (and learning about algorithms),
fixing the algorithm (and learning about datasets) or some other
suitable situation.  Hence, if you use the same algorithm in tests on
two different datasets 'times better than random' can tell you how hard
the algorithm found the dataset... but not really much more than the
accuracy told you. An algorithm A that did well in a small test might be
outperformed by algorithm B on the larger test; the only useful thing
that 'times better than random' can tell you here is whether it
stayed constant despite the change in data size (in which case the
algorithm might keep scaling up with fairly constant performance).

K



Geoffroy Peeters wrote:
 
Dear all,

Has anyone already dealt with the comparison of classification results
coming from experiments using different numbers of classes?
In other words: how to compare a recognition rate X coming from an
experiment with N classes to a recognition rate Y coming from an
experiment with M classes?

I guess one possibility is to compute, for both, the ratio of the
obtained recognition rate to the random recognition rate (which
depends on the number of classes).
Example:
- recognition-rate of 50% for 2 classes would give 1 (50%/50%)
- recognition-rate of 50% for 4 classes would give 2 (50%/25%);
So this would lead to the conclusion that the second system performs
better.

However, this measure has the drawback that it favors experiments with
a large number of classes:
a 2-class problem will never exceed a ratio of 2 (100%/50%)!
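The ratio and its ceiling can be checked with a few lines of Python. This is a minimal sketch: the function name `times_better_than_random` is made up for the example, and it assumes the random recognition rate is 1/N, i.e. N equally likely classes.

```python
def times_better_than_random(recognition_rate, n_classes):
    """Ratio of a recognition rate (0..1) to the chance rate 1/n_classes.

    Assumes equally likely classes, so chance level is 1/n_classes.
    """
    chance_rate = 1.0 / n_classes
    return recognition_rate / chance_rate


# The two examples from the message: 50% on 2 classes and on 4 classes.
ratio_2 = times_better_than_random(0.5, 2)
ratio_4 = times_better_than_random(0.5, 4)

# A perfect system reaches at most n_classes, so the ceiling grows with N.
max_2 = times_better_than_random(1.0, 2)
max_4 = times_better_than_random(1.0, 4)
print(ratio_2, ratio_4, max_2, max_4)
```

Since the ratio equals the recognition rate multiplied by N, its maximum is N itself, which is the bias toward large numbers of classes described above.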

Thanks for any suggestions of references

Best regards
Geoffroy Peeters

   

 



--
Geoffroy Peeters
Ircam - R&D
tel: +33/1/44.78.14.22
email: peeters@xxxxxxxx
