Re: [MUSIC-IR] Re: Re: How to compare classification results
A late reply.
I agree that it is important to take the chance classification rate into
account. I do not, however, agree with computing it as 100 / number of
classes.
The first thing you should take into account is the ratio of the number
of objects to the number of classes. Think about the case where you have
10 objects to classify into 10 classes: any decent algorithm would do a
perfect job of separating the classes. Clearly, a 100% classification
rate in this case is not as good as a 100% classification rate when you
have 10 objects and 2 classes.
Also, you should consider the complexity of your classifier: the more
flexible the boundaries it draws (e.g., quadratic vs. linear), the more
likely it is to achieve good classification performance.
What I would suggest is that you compute the chance classification rate
using a resampling approach (msg me for details), and adjust the
observed classification rate so that 0 = chance and 1 = perfect.
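A minimal sketch of one such resampling scheme (the actual procedure may differ; this version simply permutes the labels and measures how often a permuted label matches the true one, then rescales the observed rate):

```python
import numpy as np

def chance_rate(labels, n_resamples=1000, seed=0):
    """Estimate the chance classification rate by resampling:
    repeatedly shuffle the labels and score how often a permuted
    label agrees with the true one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    rates = [np.mean(rng.permutation(labels) == labels)
             for _ in range(n_resamples)]
    return float(np.mean(rates))

def adjusted_rate(observed, chance):
    """Rescale the observed rate so that chance -> 0 and perfect -> 1."""
    return (observed - chance) / (1.0 - chance)
```

For a balanced two-class set of 10 objects, `chance_rate` comes out near 0.5, so an observed rate of 0.9 adjusts to 0.8 rather than looking like "90%".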
I am not an expert in machine learning, but I hope this helps.
Bruno L. Giordano, Ph.D.
Music Perception and Cognition Laboratory
Schulich School of Music
555 Sherbrooke Street West
Montréal, QC H3A 1E3
Office: +1 514 398 4535 ext. 00900
Geoffroy Peeters wrote:
Thanks all for your replies. I completely agree on the importance of
taking into account the specific statistical properties of the data sets,
and on the use of the F-measure.
In fact, what I am interested in is precisely the case where the
experiments are comparable (i.e., the statistical properties of the
underlying classes are the same: same separability, etc.) but the number
of classes differs.
The question I mentioned is the same whether you use recall or the F-measure.
Example: I use the same test set with the same algorithm and the same
measure, but in one case I consider a two-class problem and in the other
a three-class problem; how do I compare the results?
Xavier Amatriain wrote:
When evaluating classification methods, especially if the classes are
imbalanced, the raw recognition rate is not a good measure.
Two common measures are recall = TP / (TP + FN) and precision =
TP / (TP + FP),
with TP = true positives, FN = false negatives and FP = false positives.
Better still, the "F-measures" summarize both recall and precision in a
single number.
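The definitions above fit in a few lines (using the balanced F1, i.e. the harmonic mean of precision and recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the balanced F1 measure
    from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 2/3, F1 = 8/11
```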
You can find more on this, for instance, in the paper "Evaluating
metrics for Hard Classifiers"
Kris West wrote:
My two pence:
The number of times better than random is a reasonable statistic in
machine learning. However, it is never truly possible to compare
classification experiments on completely different datasets, so people
usually report accuracy statistics on a single dataset and use 'times
better than random' to gauge how powerful the learning technique was.
It is not a great statistic, though, and can't really be used to compare
systems across different datasets. If you have a number of measurements
of it (across multiple algorithms, all tested on the same datasets),
they can be used as data points to estimate the significance of the
difference between algorithms and variants of them (e.g., I might run a
Student's t-test to determine whether my variant of C4.5 was
consistently significantly better than the standard version). However,
you still have to measure the statistic on the same datasets, and you
are really just using this stat as a normalisation of the accuracy scores.
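As an illustration of that kind of significance test, here is a sketch of a paired t statistic over per-dataset (or per-fold) scores; it assumes both algorithms were scored on the same datasets, and omits the lookup against a t-table:

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic for two algorithms' scores measured on the
    same datasets (or folds). Large |t| suggests a real difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return mean(diffs) / (sd / math.sqrt(n))
```

The resulting t would then be compared to the critical value for n - 1 degrees of freedom (2.776 at the 5% level for n = 5).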
To better understand why you can't compare these scores across datasets,
consider the situation where one dataset is close to linearly separable
while the other is non-linearly separable. These properties may arise
from different feature sets, different example tracks, or a combination
of the two. The linearly separable case will get good results with
linear classifiers (e.g., LDA, or SMO with a first-order polynomial
kernel). The second dataset might not be linearly separable but might
contain nice contiguous regions of particular classes; hence, a
decision-tree model or a lazy classifier might do really well there
while the linear classifiers do not. Such situations are possible in
genre classification experiments, for example (a dataset of Classical,
Electronic and Heavy Metal tracks might be linearly separable where
Jazz, Blues and Country is not; alternatively, you might switch from an
MFCC-based feature set to a beat histogram and achieve similar effects).
In this situation, using the scores for the linear classifiers on the
first dataset, you would significantly overestimate their performance on
the second.
So, to summarise, you have to fix at least one variable to make
comparisons, and the comparisons you can make depend on the variable
fixed. This can mean fixing the dataset (and learning about algorithms),
fixing the algorithm (and learning about datasets), or some other
suitable setup. Hence, if you use the same algorithm in tests on two
different datasets, 'times better than random' can tell you how hard the
algorithm found each dataset... but not really much more than the
accuracy told you. An algorithm A that did well in a small test might be
outperformed by algorithm B on the larger test; the only useful thing
'times better than random' can tell you here is whether it stayed
constant despite the change in data size (i.e., whether performance
keeps scaling up fairly constantly).
Geoffroy Peeters wrote:
Has anyone already dealt with comparing classification results coming
from experiments that use different numbers of classes?
In other words: how do you compare a recognition rate X coming from an
experiment with N classes to a recognition rate Y coming from an
experiment with M classes?
I guess one possibility is to compute, for both, the ratio of the
obtained recognition rate to the random recognition rate (which
depends on the number of classes):
- a recognition rate of 50% for 2 classes would give 1 (50%/50%);
- a recognition rate of 50% for 4 classes would give 2 (50%/25%).
This would lead to the conclusion that the second system performs better.
However, this measure has the drawback that it favours experiments with
a large number of classes:
a 2-class problem can never exceed a ratio of 2 (100%/50%)!
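The ratio and its ceiling can be made explicit in a couple of lines (assuming balanced classes, so the random rate is 1/N):

```python
def times_better_than_random(accuracy, n_classes):
    """Ratio of observed accuracy to the uniform chance rate 1/N.
    Since accuracy cannot exceed 1.0, the ratio is capped at N,
    which is what biases it toward many-class experiments."""
    return accuracy / (1.0 / n_classes)
```

So 50% accuracy gives a ratio of 1 on 2 classes but 2 on 4 classes, while even perfect accuracy on 2 classes can only ever reach a ratio of 2.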
Thanks for any suggestions of references
Ircam - R&D