Evaluating Hypotheses (8/29/03)
How good is my classifier?
We have seen the accuracy metric: classifier performance on a test set.
First and Foremost…
If we are to trust a classifier's results, we must keep the classifier blindfolded: make sure the classifier never sees the test data. Be suspicious when things seem too good to be true…
Confusion Matrix
We could collect more information than a single accuracy number:

              Predicted pos   Predicted neg
Actual pos    true pos        false neg
Actual neg    false pos       true neg
Sensitivity vs. Specificity
Sensitivity: out of the things that actually are positive, how many did we correctly identify?
Specificity: out of the things that actually are negative, how many did we correctly identify?

$Sensitivity = \frac{tp}{tp + fn}$

$Specificity = \frac{tn}{tn + fp}$
              Predicted pos   Predicted neg
Actual pos    true pos        false neg
Actual neg    false pos       true neg
• A classifier becomes less sensitive when it starts missing the cases it is trying to detect.
• If it labels more and more things as the target class, it becomes less specific.
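These ratios fall straight out of the confusion matrix counts. A minimal sketch in R (the counts below are assumed example values, not numbers from the slides):

tp = 40; fn = 10   # actual positives: detected vs. missed
tn = 45; fp = 5    # actual negatives: correctly rejected vs. false alarms

sensitivity = tp / (tp + fn)   # fraction of actual positives caught
specificity = tn / (tn + fp)   # fraction of actual negatives caught
accuracy = (tp + tn) / (tp + fn + tn + fp)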
Once we're sure no cheating is going on…
Can we quantify our uncertainty? Will the accuracy hold with brand new, never before seen data?
Binomial Distribution
The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments.
Successes or failures: just what we're looking for!
$\Pr(R = r) = \frac{n!}{r!(n-r)!} \, p^r (1-p)^{n-r}$
This is the probability that the random variable R takes on a specific value r. R could count errors or positives; since we have been working with accuracy, let's count positives (the book works with errors).
The calculations are very simple.
[Figure: binomial PDF with p = .5; x-axis: quantile (0 to 100), y-axis: density]
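In R, dbinom evaluates this probability mass function directly; a minimal sketch reproducing the figure above (the n, p, and r values are assumed for illustration):

dbinom(50, size = 100, prob = 0.5)   # probability of exactly 50 successes in 100 trials, about 0.08
plot(0:100, dbinom(0:100, size = 100, prob = 0.5), type = "h",
     xlab = "quantile", ylab = "Density", main = "PDF with p = .5")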
What Does This Mean?
We can use the observed accuracy on the test set as an estimator of p. We now have an estimate of p and the distribution given p, so we have the tools to figure out how confident we should be in that estimate.
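As a one-line sketch (counts assumed for illustration): if the classifier gets 83 of 100 test examples right, then

p_hat = 83 / 100   # observed accuracy, our estimator of p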
The Question
How confident should I be in the accuracy measure? If we can live with statements like "95% of the accuracy measures will fall in the range of 94% to 97%," life is good. That range is a confidence interval.
How to Calculate
We want the quantiles where the area outside is 5%. We can estimate p, and there are tools available in most programming languages.
Example
[Figures: binomial PDFs with p = .95 and p = .5; x-axis: quantile, y-axis: density]
In R (p is the estimated accuracy and n the number of test examples; the values below are assumed for illustration):

n = 100                   # assumed test set size
p = 0.95                  # assumed accuracy estimate
lb = qbinom(.025, n, p)   # 2.5% quantile, in number of successes
ub = qbinom(.975, n, p)   # 97.5% quantile

The lower and upper bounds constitute the confidence interval.
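Dividing by n converts the bounds back to accuracy proportions; with the assumed values above:

c(lb, ub) / n   # roughly 0.90 and 0.99, i.e. a 95% interval of about 90% to 99% accuracy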
Still, Are We Really This Confident?
[Figure: scatter plot of Ys vs. Xs showing a small, isolated cluster of blue points]
What if none of the small cluster of Blues were in the training set?
All of them would be in the test set
How well would it do?
This is the difference between sample error and true error. The split might have been an accident: a pathological case.
Cross-Validation
What if we could test the classifier several times with different test sets? If it performed well each time, wouldn't we be more confident in the results? Reproducibility and consistency.
K-fold Cross-Validation
Usually we have one big chunk of training data. If we split it up into randomly drawn chunks, we can train on the remainder and test with the held-out chunk.
[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]
With 10 chunks we train 10 times and end up with performance data on ten completely different test datasets (see the sketch below).
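A minimal sketch of the loop in R; data, train_model, and accuracy_of are hypothetical stand-ins for your dataset, learner, and scoring function:

k = 10
folds = sample(rep(1:k, length.out = nrow(data)))   # random, equally sized fold assignment
accuracies = numeric(k)
for (i in 1:k) {
  train = data[folds != i, ]                 # train on the remainder
  test  = data[folds == i, ]                 # test with the held-out chunk
  model = train_model(train)                 # hypothetical learner
  accuracies[i] = accuracy_of(model, test)   # hypothetical scorer
}
mean(accuracies)   # average accuracy across the k runs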
Evaluating Hypotheses 188/29/03
Must stay blindfolded while training Must discard all lessons after each fold
Remember, No Cheating
10-fold Appears to be the Most Common Default
Weka and DataMiner both default to 10-fold. It could just as easily be 20-fold or 25-fold; with 20-fold each run would be a 95-5 train/test split.
Performance is reported as the average accuracy across the K runs
What is the Best K?
This is related to the question of how large the training set should be. It should be large enough to support a test set of size n that satisfies the rule of thumb: at least 30 examples, with p not too close to 0 or 1. For ten-fold, if each tenth must hold 30 examples, the training set must have 300. If 10-fold satisfies this, you should be in good shape.
Can Even Use Folds of Size One (K = n)
This is called leave-one-out. Disadvantage: it is slow. It gives the largest possible training set and the smallest possible test set. It has been promoted as an unbiased estimator of error, but recent studies indicate that there is no unbiased estimator.
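In the cross-validation sketch above, leave-one-out corresponds to setting k = nrow(data), so that each held-out test set contains exactly one example.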
Recap
We can calculate a confidence interval with a single test set. More runs (K-fold) give us more confidence that we didn't just get lucky in test set selection.
Do these runs help narrow the confidence interval?
When we average the performance…
The central limit theorem applies: as the number of runs grows, the distribution of the mean accuracy approaches a normal distribution. With a reasonably large number of runs we can derive a more trustworthy confidence interval. With 30 test runs (30-fold) we can use traditional approaches to calculating means and standard deviations, and therefore confidence intervals.
Central Limit Theorem
Consider a set of independent, identically distributed random variables $Y_1 \dots Y_n$ governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Then as $n \to \infty$, the distribution governing $\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$ approaches a Normal distribution with zero mean and standard deviation equal to 1.
Book: This is quite a surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual $Y_i$.
Checking accuracy in R

meanAcc = mean(accuracies)
sdAcc = sd(accuracies)
qnorm(.975, meanAcc, sdAcc)   # 0.9980772
qnorm(.025, meanAcc, sdAcc)   # 0.8169336
[Figure: distribution of accuracies; x-axis: accuracy (0 to 1)]
My Classifier's Better than Yours
Can we say that one classifier is significantly better than another? Use a t-test. Null hypothesis: the two sets of results are from the same distribution.
[Figure: two accuracy distributions; x-axis: accuracy (0 to 1)]
T-test in R

t.test(distOne, distTwo, paired = TRUE)

        Paired t-test

data:  distOne and distTwo
t = -55.8756, df = 29, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2052696 -0.1907732
sample estimates:
mean of the differences
             -0.1980214
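A note on the setup (assumed here rather than stated on the slide): distOne and distTwo would each hold the per-fold accuracies of one classifier measured on the same folds, so paired = TRUE compares fold i of one classifier directly against fold i of the other.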
[Figure: Student's t distribution of the differences]
T-test in Perl

use Statistics::TTest;

my $ttest = new Statistics::TTest;
$ttest->load_data(\@r1, \@r2);
$ttest->set_significance(95);
$ttest->print_t_test();
print "\n\nt statistic is " . $ttest->t_statistic . "\n";
print "p val " . $ttest->{t_prob} . "\n";
t_prob: 0
significance: 95
…
df1: 29
alpha: 0.025
t_statistic: 12.8137016607408
null_hypothesis: rejected

t statistic is 12.8137016607408
p val 0
Example: would you trust this classifier?
The classifier performed exceptionally well achieving 99.9% classifier accuracy on the 1,000 member training set.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training-set of size 1,000.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training-set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training-set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.
A Useful Technique
Randomly permute an array (from the Perl Cookbook):
http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

sub fisher_yates_shuffle {
    my $array = shift;                     # array reference, shuffled in place
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand ($i + 1);         # random index from 0 to $i
        next if $i == $j;
        @$array[$i, $j] = @$array[$j, $i]; # swap elements $i and $j
    }
}
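Called on an array reference, e.g. fisher_yates_shuffle(\@examples) for a hypothetical @examples array, it permutes the elements in place; this is handy for drawing the randomly assigned, equally sized folds used in cross-validation above.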