Evaluating Hypotheses (8/29/03)
How good is my classifier?
We have seen the accuracy metric: classifier performance on a test set.
First and Foremost…
If we are to trust a classifier's results, we must keep the classifier blindfolded: make sure the classifier never sees the test data. Be suspicious when things seem too good to be true…
Confusion Matrix
We could collect more information than a single accuracy number:

              Predicted pos   Predicted neg
Actual pos    true pos        false neg
Actual neg    false pos       true neg
Sensitivity vs. Specificity
Sensitivity: out of the things that actually are positive, how many did we correctly identify?
Specificity: out of the things that actually are negative, how many did we correctly identify?

$Sensitivity = \frac{tp}{tp + fn}$

$Specificity = \frac{tn}{tn + fp}$
              Predicted pos   Predicted neg
Actual pos    true pos        false neg
Actual neg    false pos       true neg
• A classifier becomes less sensitive when it starts missing the cases it is trying to detect.
• If it labels more and more things as the target class, it becomes less specific.
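These ratios fall straight out of the confusion matrix counts. A minimal sketch in R (the counts below are assumed example values, not numbers from the slides):

tp = 40; fn = 10   # actual positives: detected vs. missed
tn = 45; fp = 5    # actual negatives: correctly rejected vs. false alarms

sensitivity = tp / (tp + fn)   # fraction of actual positives caught
specificity = tn / (tn + fp)   # fraction of actual negatives caught
accuracy = (tp + tn) / (tp + fn + tn + fp)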
Once we're sure no cheating is going on…
Can we quantify our uncertainty? Will the accuracy hold with brand new, never before seen data?
Binomial Distribution
The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments.
Successes or failures: just what we're looking for!
$\Pr(R = r) = \frac{n!}{r!(n-r)!} \, p^r (1-p)^{n-r}$
This is the probability that the random variable R takes on a specific value r. R could count errors or positives; since we have been working with accuracy, let's count positives (the book works with errors).
The calculations are very simple.
[Figure: binomial PDF with p = .5; x-axis: quantile (0 to 100), y-axis: density]
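In R, dbinom evaluates this probability mass function directly; a minimal sketch reproducing the figure above (the n, p, and r values are assumed for illustration):

dbinom(50, size = 100, prob = 0.5)   # probability of exactly 50 successes in 100 trials, about 0.08
plot(0:100, dbinom(0:100, size = 100, prob = 0.5), type = "h",
     xlab = "quantile", ylab = "Density", main = "PDF with p = .5")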
What Does This Mean?
We can use the observed accuracy on the test set as an estimator of p. We now have an estimate of p and the distribution given p, so we have the tools to figure out how confident we should be in that estimate.
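As a one-line sketch (counts assumed for illustration): if the classifier gets 83 of 100 test examples right, then

p_hat = 83 / 100   # observed accuracy, our estimator of p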
The Question
How confident should I be in the accuracy measure? If we can live with statements like "95% of the accuracy measures will fall in the range of 94% to 97%," life is good. That range is a confidence interval.
How to Calculate
We want the quantiles where the area outside is 5%. We can estimate p, and there are tools available in most programming languages.
Example
[Figures: binomial PDFs with p = .95 and p = .5; x-axis: quantile, y-axis: density]
In R (p is the estimated accuracy and n the number of test examples; the values below are assumed for illustration):

n = 100                   # assumed test set size
p = 0.95                  # assumed accuracy estimate
lb = qbinom(.025, n, p)   # 2.5% quantile, in number of successes
ub = qbinom(.975, n, p)   # 97.5% quantile

The lower and upper bounds constitute the confidence interval.
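Dividing by n converts the bounds back to accuracy proportions; with the assumed values above:

c(lb, ub) / n   # roughly 0.90 and 0.99, i.e. a 95% interval of about 90% to 99% accuracy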
Still, Are We Really This Confident?
[Figure: scatter plot of Ys vs. Xs showing a small, isolated cluster of blue points]
What if none of the small cluster of Blues were in the training set?
All of them would be in the test set
How well would it do?
This is the difference between sample error and true error. The split might have been an accident: a pathological case.
Cross-Validation
What if we could test the classifier several times with different test sets? If it performed well each time, wouldn't we be more confident in the results? Reproducibility and consistency.
K-fold Cross-Validation
Usually we have one big chunk of training data. If we split it up into randomly drawn chunks, we can train on the remainder and test with the held-out chunk.
[Figure: training data segregated into ten equally sized random sets, numbered 1 through 10]
With 10 chunks we train 10 times and end up with performance data on ten completely different test datasets (see the sketch below).
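A minimal sketch of the loop in R; data, train_model, and accuracy_of are hypothetical stand-ins for your dataset, learner, and scoring function:

k = 10
folds = sample(rep(1:k, length.out = nrow(data)))   # random, equally sized fold assignment
accuracies = numeric(k)
for (i in 1:k) {
  train = data[folds != i, ]                 # train on the remainder
  test  = data[folds == i, ]                 # test with the held-out chunk
  model = train_model(train)                 # hypothetical learner
  accuracies[i] = accuracy_of(model, test)   # hypothetical scorer
}
mean(accuracies)   # average accuracy across the k runs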
Evaluating Hypotheses 188/29/03
Must stay blindfolded while training Must discard all lessons after each fold
Remember, No Cheating
10-fold Appears to be the Most Common Default
Weka and DataMiner both default to 10-fold. It could just as easily be 20-fold or 25-fold; with 20-fold each run would be a 95-5 train/test split.
Performance is reported as the average accuracy across the K runs
What is the Best K?
This is related to the question of how large the training set should be. It should be large enough to support a test set of size n that satisfies the rule of thumb: at least 30 examples, with p not too close to 0 or 1. For ten-fold, if each tenth must hold 30 examples, the training set must have 300. If 10-fold satisfies this, you should be in good shape.
Can Even Use Folds of Size One (K = n)
This is called leave-one-out. Disadvantage: it is slow. It gives the largest possible training set and the smallest possible test set. It has been promoted as an unbiased estimator of error, but recent studies indicate that there is no unbiased estimator.
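In the cross-validation sketch above, leave-one-out corresponds to setting k = nrow(data), so that each held-out test set contains exactly one example.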
Recap
We can calculate a confidence interval with a single test set. More runs (K-fold) give us more confidence that we didn't just get lucky in test set selection.
Do these runs help narrow the confidence interval?
When we average the performance…
The central limit theorem applies: as the number of runs grows, the distribution of the mean accuracy approaches a normal distribution. With a reasonably large number of runs we can derive a more trustworthy confidence interval. With 30 test runs (30-fold) we can use traditional approaches to calculating means and standard deviations, and therefore confidence intervals.
Central Limit Theorem
Consider a set of independent, identically distributed random variables $Y_1 \dots Y_n$ governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Then as $n \to \infty$, the distribution governing $\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$ approaches a Normal distribution with zero mean and standard deviation equal to 1.
Book: This is quite a surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual $Y_i$.
Checking accuracy in R

meanAcc = mean(accuracies)
sdAcc = sd(accuracies)
qnorm(.975, meanAcc, sdAcc)   # 0.9980772
qnorm(.025, meanAcc, sdAcc)   # 0.8169336
[Figure: distribution of accuracies; x-axis: accuracy (0 to 1)]
My Classifier's Better than Yours
Can we say that one classifier is significantly better than another? Use a t-test. Null hypothesis: the two sets of results are from the same distribution.
[Figure: two accuracy distributions; x-axis: accuracy (0 to 1)]
T-test in R

t.test(distOne, distTwo, paired = TRUE)

        Paired t-test

data:  distOne and distTwo
t = -55.8756, df = 29, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2052696 -0.1907732
sample estimates:
mean of the differences
             -0.1980214
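A note on the setup (assumed here rather than stated on the slide): distOne and distTwo would each hold the per-fold accuracies of one classifier measured on the same folds, so paired = TRUE compares fold i of one classifier directly against fold i of the other.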
[Figure: Student's t distribution of the differences]
T-test in Perl

use Statistics::TTest;

my $ttest = new Statistics::TTest;
$ttest->load_data(\@r1, \@r2);
$ttest->set_significance(95);
$ttest->print_t_test();
print "\n\nt statistic is " . $ttest->t_statistic . "\n";
print "p val " . $ttest->{t_prob} . "\n";
t_prob: 0
significance: 95
…
df1: 29
alpha: 0.025
t_statistic: 12.8137016607408
null_hypothesis: rejected

t statistic is 12.8137016607408
p val 0
Example: would you trust this classifier?
The classifier performed exceptionally well achieving 99.9% classifier accuracy on the 1,000 member training set.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training-set of size 1,000.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training-set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.
The classifier performed exceptionally well achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training-set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.
A Useful Technique
Randomly permute an array (from the Perl Cookbook):
http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

sub fisher_yates_shuffle {
    my $array = shift;                     # array reference, shuffled in place
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand ($i + 1);         # random index from 0 to $i
        next if $i == $j;
        @$array[$i, $j] = @$array[$j, $i]; # swap elements $i and $j
    }
}
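Called on an array reference, e.g. fisher_yates_shuffle(\@examples) for a hypothetical @examples array, it permutes the elements in place; this is handy for drawing the randomly assigned, equally sized folds used in cross-validation above.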