Evaluating Classifier Performance
Vasant G. Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program, Computer Science and Engineering Graduate Program,
Bioinformatics and Genomics Graduate Program, Neuroscience Graduate Program
Center for Big Data Analytics and Discovery Informatics, Huck Institutes of the Life Sciences,
Institute for Cyberscience, Clinical and Translational Sciences Institute,
Northeast Big Data Hub, Pennsylvania State University
Fall 2018
[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu
Why evaluate classifiers?
• To know how well a classifier can be expected to perform when it is put to use
• To choose the best model from among a set of alternatives
Evaluating a Classifier
• How can we measure the performance of classifiers?
• How well can a classifier be expected to perform on novel data, i.e., data not seen during training?
• We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using an evaluation data set (not used for training)
• How close is the estimated performance to the true performance?
Classification error
• Error = classifying a record as belonging to one class when it belongs to another class.
• Error rate = percent of misclassified samples out of the total samples in the validation data
Naïve Baseline
• Naïve baseline: classify all samples as belonging to the most prevalent class
• We hope to do better than the naïve baseline
• When the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve baseline in terms of accuracy
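As a minimal sketch (with made-up labels), the naïve baseline amounts to predicting the most prevalent training class for every test sample:

```python
from collections import Counter

def naive_baseline_predict(train_labels, n_test):
    """Classify every test sample as the most prevalent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

# 90% of the (hypothetical) training labels are 0, so the baseline predicts 0
train = [0] * 90 + [1] * 10
preds = naive_baseline_predict(train, n_test=5)
```

On data this skewed, the baseline is 90% accurate while ignoring the rare class entirely, which is exactly why accuracy alone can mislead.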
Estimating Classifier Performance
N: total number of instances in the data set
TPj: number of true positives for class j
FPj: number of false positives for class j
TNj: number of true negatives for class j
FNj: number of false negatives for class j

Accuracy_j = (TP_j + TN_j) / N = P(label = c_j ∧ class = c_j) + P(label ≠ c_j ∧ class ≠ c_j)

Perfect classifier ⟷ Accuracy = 1
Popular measure, but biased in favor of the majority class; should be used with caution!
Classifier Learning: Measuring Performance

Class label:
                Actual C1    Actual ¬C1
Predicted C1    TP = 55      FP = 5
Predicted ¬C1   FN = 10      TN = 30

N = TP + FN + TN + FP = 55 + 10 + 30 + 5 = 100
accuracy = (TP + TN) / N = (55 + 30) / 100 = 85/100
sensitivity = TP / (TP + FN) = 55 / (55 + 10) = 55/65
specificity = TP / (TP + FP) = 55 / (55 + 5) = 55/60
false alarm = FP / (TN + FP) = 5 / (30 + 5) = 5/35
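The worked example above can be checked with a short sketch; the formulas follow these slides' definitions, where "specificity" denotes TP/(TP + FP):

```python
def confusion_measures(tp, fp, fn, tn):
    """Performance measures as defined in these slides
    (note: 'specificity' here is TP/(TP+FP), i.e. precision)."""
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # a.k.a. recall
        "specificity": tp / (tp + fp),   # a.k.a. precision in these slides
        "false_alarm": fp / (fp + tn),
    }

m = confusion_measures(tp=55, fp=5, fn=10, tn=30)
```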
When One Class Is More Important than Another
In many cases it is more important to identify members of a specific target class:
– Tax fraud
– Credit default
– Response to a promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights
– Diagnosing cancer
– Predicting nuclear reactor meltdown
In such cases, we may tolerate greater overall error in return for better predictions of the more important class.
Measuring Classifier Performance: Sensitivity

Sensitivity_j = TP_j / (TP_j + FN_j) = Count(label = c_j ∧ class = c_j) / Count(class = c_j) = P(label = c_j | class = c_j)

Perfect classifier → Sensitivity = 1
Probability of correctly labeling members of the target class
Also called recall or hit rate
Measuring Classifier Performance: Specificity

Specificity_j = TP_j / (TP_j + FP_j) = Count(label = c_j ∧ class = c_j) / Count(label = c_j) = P(class = c_j | label = c_j)

Perfect classifier → Specificity = 1
Also called precision
Probability that a positive prediction is correct
Measuring Performance: Precision, Recall, and False Alarm Rate

Precision_j = Specificity_j = TP_j / (TP_j + FP_j)
Recall_j = Sensitivity_j = TP_j / (TP_j + FN_j)
FalseAlarm_j = FP_j / (TN_j + FP_j) = Count(label = c_j ∧ class = ¬c_j) / Count(class = ¬c_j) = P(label = c_j | class = ¬c_j)

Perfect classifier → Precision = 1
Perfect classifier → Recall = 1
Perfect classifier → False Alarm Rate = 0
Measuring Performance: Correlation Coefficient

CC_j = (TP_j × TN_j − FP_j × FN_j) / √((TP_j + FN_j)(TP_j + FP_j)(TN_j + FP_j)(TN_j + FN_j)),  with −1 ≤ CC_j ≤ 1

Equivalently, CC_j is the correlation between the predicted and true class indicators:

CC_j = Σ_{d_i ∈ D} (jlabel_i − mean(jlabel))(jclass_i − mean(jclass)) / (|D| σ_jlabel σ_jclass)

where jlabel_i = 1 iff the classifier assigns d_i to class c_j, and jclass_i = 1 iff the true class of d_i is class c_j
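A direct computation of CC_j from the counts (a sketch; the square root in the denominator follows the standard Matthews correlation form):

```python
import math

def correlation_coefficient(tp, fp, fn, tn):
    """CC_j = (TP*TN - FP*FN) / sqrt((TP+FN)(TP+FP)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

cc_perfect = correlation_coefficient(tp=50, fp=0, fn=0, tn=50)  # 1.0
cc_example = correlation_coefficient(tp=55, fp=5, fn=10, tn=30)
```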
Beware of terminological confusion in the literature!
• Some bioinformatics authors use "accuracy" incorrectly to refer to recall (i.e., sensitivity) or precision (i.e., specificity)
• In medical statistics, specificity sometimes refers to sensitivity for the negative class, i.e., TN_j / (TN_j + FP_j)
• Some authors use false alarm rate to refer to the probability that a positive prediction is incorrect, i.e., FP_j / (FP_j + TP_j) = 1 − Precision_j
When you write:
• provide the formula in terms of TP, TN, FP, FN
When you read:
• check the formula in terms of TP, TN, FP, FN
Measuring Classifier Performance
• TP, FP, TN, FN provide the relevant information
• No single measure tells the whole story
• A classifier with 98% accuracy can be useless if 98% of the population does not have cancer and the 2% that do are misclassified by the classifier
• Use of multiple measures is recommended
• Beware of terminological confusion!
Micro-averaged performance measures
Performance on a random sample:

MicroAverage Precision = Σ_j TP_j / Σ_j (TP_j + FP_j)
MicroAverage Recall = Σ_j TP_j / Σ_j (TP_j + FN_j)
MicroAverage FalseAlarm = 1 − MicroAverage Precision
MicroAverage Accuracy = Σ_j TP_j / N   (etc.)
MicroAverage CC = ((Σ_j TP_j)(Σ_j TN_j) − (Σ_j FP_j)(Σ_j FN_j)) / √((Σ_j TP_j + Σ_j FN_j)(Σ_j TP_j + Σ_j FP_j)(Σ_j TN_j + Σ_j FP_j)(Σ_j TN_j + Σ_j FN_j))

• Micro-averaging gives equal importance to each sample
• Classes with a large number of instances dominate
Macro-averaged performance measures

MacroAverage Sensitivity = (1/M) Σ_j Sensitivity_j
MacroAverage Specificity = (1/M) Σ_j Specificity_j
MacroAverage CorrelationCoeff = (1/M) Σ_j CorrelationCoeff_j

Macro-averaging gives equal importance to each of the M classes
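The contrast between the two averages can be sketched with per-class precision (hypothetical counts; note how the large class dominates the micro average):

```python
def micro_macro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one entry per class."""
    micro = sum(tp for tp, fp in per_class) / sum(tp + fp for tp, fp in per_class)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    return micro, macro

# a big class with precision 0.5 and a small class with precision 1.0
micro, macro = micro_macro_precision([(90, 90), (10, 0)])
```

Here the micro average (100/190 ≈ 0.53) is pulled toward the big class, while the macro average treats both classes equally (0.75).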
Cutoff for classification
Most machine learning algorithms classify via a 2-step process. For each sample:
1. Compute the probability of belonging to class "1"
2. Compare to the cutoff value, and classify accordingly
• The default cutoff value is 0.50: if ≥ 0.50, classify as "1"; if < 0.50, classify as "0"
• Different cutoff values can be used to trade off one measure against another (more on this later)
• Question: how would this work in the case of K nearest neighbors?
Cutoff Table
• If the cutoff is 0.50: 12 samples are classified as "1"
• If the cutoff is 0.80: seven samples are classified as "1"

Actual class   Prob. of "1"      Actual class   Prob. of "1"
1              0.996             1              0.506
1              0.988             0              0.471
1              0.984             0              0.337
1              0.980             1              0.218
1              0.948             0              0.199
1              0.889             0              0.149
1              0.848             0              0.048
0              0.762             0              0.038
1              0.707             0              0.025
1              0.681             0              0.022
1              0.656             0              0.016
0              0.622             0              0.004
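Applying a cutoff to the estimated probabilities is a one-line rule; a sketch on the table's probabilities, checking the 0.80 case:

```python
def classify_with_cutoff(probs, cutoff=0.5):
    """Label a sample '1' when its estimated P(class = 1) meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]
n_ones_at_80 = sum(classify_with_cutoff(probs, cutoff=0.80))  # 7 samples
```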
Receiver Operating Characteristic (ROC) Curve
• The confusion matrix, and hence the preceding measures of classifier performance, are threshold dependent
• We can often trade off recall against precision, e.g., by adjusting the classification threshold θ
• Is there a threshold-independent measure of classifier performance?
– The ROC curve is a plot of sensitivity against the false alarm rate, which is the same as (1 − specificity); it characterizes this tradeoff for a given classifier
– The ROC curve is obtained by plotting sensitivity against (1 − specificity) while varying the classification threshold
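A minimal sketch of tracing out ROC points by sweeping the threshold over toy scores (here "specificity" is the standard TN/(TN + FP), so the x axis is the false alarm rate):

```python
def roc_points(scores, labels):
    """(false alarm rate, sensitivity) pairs, one per candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for theta in sorted(set(scores)) + [float("inf")]:
        preds = [1 if s >= theta else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        points.append((fp / neg, tp / pos))
    return points

pts = roc_points([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
```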
[Figure: a receiver operating characteristic (ROC) curve.]
Measuring Performance of Classifiers: ROC Curves
• ROC curves offer a more complete picture of the performance of the classifier as a function of the classification threshold
• A classifier h is better than another classifier g if ROC(h) dominates ROC(g)
• ROC(h) dominates ROC(g) → Area under ROC(h) > Area under ROC(g)
[Figure: dominating and dominated ROC curves, on axes from 0 to 1.]
Misclassification Costs May Differ
• The cost of making a misclassification error may be higher for one class than for the other(s)
• Looked at another way, the benefit of making a correct classification may be higher for one class than for the other(s)
Example: Response to Promotional Offer
• Suppose we send an offer to 1000 people, with a 1% average response rate
• "1" = response, "0" = nonresponse
• The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good)
• Using machine learning, suppose we can correctly classify eight 1's as 1's
• But at the cost of misclassifying twenty 0's as 1's and two 1's as 0's
Confusion Matrix

            Predict as 1    Predict as 0
Actual 1          8               2
Actual 0         20             970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits
Suppose:
• Profit from a "1" is $10
• Cost of sending an offer is $1
Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under the DM predictions, 28 offers are sent:
– 8 respond, with a profit of $10 each
– 20 fail to respond, at a cost of $1 each
– 972 receive nothing (no cost, no profit)
Profit Matrix

            Predict as 1    Predict as 0
Actual 1        $80               0
Actual 0       ($20)              0
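The profit matrix above can be reproduced with a small sketch (responders predicted "1" net $10 each; wasted offers cost $1 each):

```python
def campaign_profit(tp, fp, profit_per_response=10, cost_per_offer=1):
    """Net profit of a mailing: tp responders net $10 each,
    fp non-responders cost $1 each; everyone else costs nothing."""
    return tp * profit_per_response - fp * cost_per_offer

ml_profit = campaign_profit(tp=8, fp=20)     # $80 - $20 = $60
naive_profit = campaign_profit(tp=0, fp=0)   # naive rule sends no offers
```

So the model that looked worse on error rate (2.2% vs. 1%) is strictly better on profit.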
Evaluating a Classifier
• What we have done so far is to estimate the classifier's performance on some available data.
• How well can a classifier be expected to perform on novel data?
• Performance estimated on training data is often optimistic relative to performance on novel data
• We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using evaluation data (not used for training)
• How close is the estimated performance to the true performance?
Evaluation of a classifier with limited data
• Holdout method: use part of the data for training, and the rest for testing
• We may be lucky or unlucky: the training data or the test data may not be representative
• Solution: run multiple experiments with disjoint training and test data sets in which each class is represented in roughly the same proportion as in the entire data set
Classifier evaluation
[Diagram sequence:
1. The labeled data (data/label pairs) is split into training data and testing data.
2. A classifier is trained on the training data, producing a model.
3. For the test data, pretend we don't know the labels.
4. The model classifies the test examples.
5. Compare the predicted labels to the actual labels.]
Comparing algorithms
[Diagram: the same test data is classified by model 1 and model 2; each model's predicted labels are compared against the actual labels by an evaluation step, yielding score 1 and score 2. Model 2 is better if score 2 > score 1.]
Is model 2 better than model 1? When would we want to do this type of comparison?
Is model 2 better?
• Model 1: 85% accuracy vs. Model 2: 80% accuracy
• Model 1: 85.5% accuracy vs. Model 2: 85.0% accuracy
• Model 1: 0% accuracy vs. Model 2: 100% accuracy
Comparing scores: significance
• Just comparing scores on one data set isn't enough!
• We don't just want to know which system is better on one particular data set; we want to know whether model 1 is better than model 2 in general
• Put another way, we want to be confident that the difference is real and not just due to random chance
How do we know how variable a model's accuracy is? Variance.

Variance of performance
• We need multiple accuracy scores!
• How can we get them?
Repeated experimentation
Instead of one evaluation with a particular split of training and test data, run multiple evaluations, with different splits of training and test data.
[Diagram: labeled data split into training data and testing data.]
Repeated experimentation
[Diagram: the labeled data is split several times; in each split a different subset of the data/label pairs is used for training and the rest for evaluation.]
K-fold cross validation
Break the training data into K equal-sized parts. Then, for each of the K parts/splits: train on the other K − 1 parts and evaluate on the held-out part.
[Diagram: split 1, split 2, split 3, …]
K-fold cross validation
[Diagram: evaluating on the held-out part of each split yields score 1, score 2, score 3, …]
K-fold cross validation
• Better utilization of the labeled data
• More robust: we don't just rely on one evaluation set to evaluate the approach (or for optimizing parameters)
• Multiplies the computational overhead by K (we have to train K models instead of just one)
• 10 is the most common choice of K
Estimating the performance of a classifier: K-fold cross-validation
Partition the data (multi)set S into K equal parts S_1 … S_K, with roughly the same class distribution as S.
ErrorC ← 0
For i = 1 to K do {
    S_Test ← S_i
    S_Train ← S − S_i
    α ← Learn(S_Train)
    ErrorC ← ErrorC + Error(α, S_Test)
}
Error ← ErrorC / K
Output(Error)
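The pseudocode above translates directly into a short sketch (unstratified splits for brevity, although the pseudocode asks for folds that preserve the class distribution; the toy learner and error function are hypothetical):

```python
import random

def k_fold_cv(xs, ys, learn, error, k=5, seed=0):
    """ErrorC accumulates Error(alpha, S_Test) over K held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    error_c = 0.0
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        alpha = learn(train_x, train_y)
        error_c += error(alpha, [xs[i] for i in fold], [ys[i] for i in fold])
    return error_c / k

def learn_majority(xs, ys):          # toy learner: predict the majority class
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

def error_rate(h, xs, ys):
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

# 15 zeros and 5 ones: the majority learner's CV error is the rate of ones
est = k_fold_cv(list(range(20)), [0] * 15 + [1] * 5, learn_majority, error_rate)
```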
Estimating classifier performance: recommended procedure
• Use K-fold cross-validation (K = 5 or 10) for estimating performance measures (accuracy, precision, recall, points on the ROC curve, etc.)
• Compute the mean values and standard deviations of the performance estimates
• Report the mean values of the performance estimates and their standard deviations or 95% confidence intervals around the mean
• Be skeptical: repeat the experiments several times with different random splits of the data into K folds!
Leave-one-out cross validation
• K-fold cross validation where K = the number of samples
• Also known as "jackknifing"
• Pros/cons?
• When would we use this?
Leave-one-out cross-validation
• K-fold cross validation with K = n, where n is the total number of samples available
• n experiments, each using n − 1 samples for training and the remaining sample for testing
• Leave-one-out cross-validation does not guarantee the same class distribution in the training and test data!
Extreme case: 50% class 1, 50% class 2; the classifier predicts the majority class label in the training data.
True error: 50%; leave-one-out error estimate: 100%!
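The extreme case above is easy to reproduce; a sketch with a majority-vote classifier on a perfectly balanced sample (tie-breaking toward class 1 is an arbitrary assumption):

```python
def loocv_error_majority(labels):
    """Leave-one-out error of a classifier that predicts the training majority."""
    n = len(labels)
    errors = 0
    for i in range(n):
        rest = labels[:i] + labels[i + 1:]
        majority = 1 if rest.count(1) >= rest.count(0) else 0
        errors += int(majority != labels[i])
    return errors / n

# balanced data: removing a point always makes its class the training minority,
# so every held-out point is misclassified
est = loocv_error_majority([0] * 10 + [1] * 10)
```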
Leave-one-out cross validation
• Can be very expensive if training is slow and/or there are a large number of examples
• Useful in domains with limited training data: it maximizes the data we can use for training
• Some classifiers permit the estimation of the leave-one-out performance measure without actually having to train K models
Comparing systems: sample 1

split     model 1   model 2
1           87        88
2           85        84
3           83        84
4           80        79
5           88        89
6           85        85
7           83        81
8           87        86
9           88        89
10          84        85
average:    85        85

Is model 2 better than model 1?
Comparing systems: sample 2

split     model 1   model 2
1           87        87
2           92        88
3           74        79
4           75        86
5           82        84
6           79        87
7           83        81
8           83        92
9           88        81
10          77        85
average:    82        85

Is model 2 better than model 1?
Comparing systems: sample 3

split     model 1   model 2
1           84        87
2           83        86
3           78        82
4           80        86
5           82        84
6           79        87
7           83        84
8           83        86
9           85        83
10          83        85
average:    82        85

Is model 2 better than model 1?
Comparing systems
Sample 3 and sample 2 side by side: both average 82 for model 1 and 85 for model 2.
What's the difference?
Comparing systems
Sample 3: averages 82 vs. 85, standard deviations 2.3 vs. 1.7.
Sample 2: averages 82 vs. 85, standard deviations 5.9 vs. 3.9.
Comparing systems: sample 4

split     model 1   model 2
1           80        82
2           84        87
3           89        90
4           78        82
5           90        91
6           81        83
7           80        80
8           88        89
9           76        77
10          86        88
average:    83        85
stddev:     4.9       4.7

Is model 2 better than model 1?
Comparing systems: sample 4

split     model 1   model 2   model 2 − model 1
1           80        82            2
2           84        87            3
3           89        90            1
4           78        82            4
5           90        91            1
6           81        83            2
7           80        80            0
8           88        89            1
9           76        77            1
10          86        88            2
average:    83        85
stddev:     4.9       4.7

Model 2 is ALWAYS better. How do we decide if model 2 is better than model 1?
Statistical tests
Setup:
– Assume some default hypothesis about the data that you'd like to disprove, called the null hypothesis
– e.g., model 1 and model 2 are not statistically different in performance
Test:
– Calculate a test statistic from the data (often assuming something about the data)
– Based on this statistic, with some probability we can reject the null hypothesis, that is, show that it does not hold
t-test
Determines whether two samples come from the same underlying distribution or not.
Null hypothesis: the model 1 and model 2 accuracies are no different, i.e., they come from the same distribution.
Result: the probability that the difference in accuracies is due to random chance (low values are better).
Calculating the t-test
For our setup, we'll do what's called a "paired t-test":
– The values can be thought of as pairs, since they were calculated under the same conditions
– In our case, the same train/test split
– This gives more power than the unpaired t-test (we have more information)
For almost all experiments, we'll do a "two-tailed" version of the t-test.
http://en.wikipedia.org/wiki/Student's_t-test
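A sketch of the paired t statistic on sample 4's scores (the critical value 2.262 for 9 degrees of freedom at the two-tailed 0.05 level is quoted from standard t tables):

```python
import math

def paired_t_statistic(a, b):
    """t = mean(d) / sqrt(var(d)/n) for paired differences d = b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

model1 = [80, 84, 89, 78, 90, 81, 80, 88, 76, 86]  # sample 4 scores
model2 = [82, 87, 90, 82, 91, 83, 80, 89, 77, 88]
t = paired_t_statistic(model1, model2)
significant = abs(t) > 2.262  # two-tailed 0.05 critical value, df = 9
```

The small, consistent per-split differences yield a large t even though the averages (83 vs. 85) are close.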
p-value
• The result of a statistical test is often a p-value
• p-value: the probability of observing a difference at least this large if the null hypothesis holds. Specifically, if we re-ran this experiment multiple times (say on different data), it is the probability that we would reject the null hypothesis incorrectly (i.e., the probability we'd be wrong)
• Common values considered "significant": 0.05 (95% confident), 0.01 (99% confident), and 0.001 (99.9% confident)
Comparing systems: sample 1 (revisited)
Averages: 85 vs. 85. Is model 2 better than model 1?
They are the same, with p = 1.
Comparing systems: sample 2 (revisited)
Averages: 82 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.15.
Comparing systems: sample 3 (revisited)
Averages: 82 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.007.
Comparing systems: sample 4 (revisited)
Averages: 83 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.001.
Statistical tests on test data
[Diagram: the labeled data (data with labels) is split into all training data and test data; the training data is further split into training data and development data. Cross-validation with a t-test operates on the training/development side.]
Can we do that here, on the test data?
Bootstrap resampling
Given a test set t with n samples, do m times:
– sample n examples with replacement from the test set to create a new test set t′
– evaluate the model(s) on t′
Then calculate a t-test (or other statistical test) on the collection of m results.
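A sketch of the resampling loop (the accuracy metric and the synthetic prediction/label pairs are illustrative):

```python
import random

def bootstrap_scores(test_set, metric, m=200, seed=0):
    """Evaluate `metric` on m resamples of the test set, drawn with replacement."""
    rng = random.Random(seed)
    n = len(test_set)
    scores = []
    for _ in range(m):
        resample = [test_set[rng.randrange(n)] for _ in range(n)]
        scores.append(metric(resample))
    return scores

def accuracy(pairs):  # pairs of (predicted, actual) labels
    return sum(p == y for p, y in pairs) / len(pairs)

test = [(1, 1)] * 80 + [(0, 1)] * 20   # a model that is right 80% of the time
scores = bootstrap_scores(test, accuracy)
```

The spread of the m scores gives an estimate of the variability of the accuracy on this one test set.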
Bootstrap resampling
[Diagram sequence: the test data is sampled with replacement to create Test′1, Test′2, …, Test′m. Model A is evaluated on each resample, yielding A score 1, A score 2, …, A score m; model B is evaluated likewise, yielding B score 1, …, B score m. A paired t-test (or other analysis) is then applied to the paired score lists.]
Experimentation good practices
Never look at your test data!
During development:
– Compare different models/hyperparameters on development data
– Use cross-validation to get more consistent results
– If you want to be confident in the results, use a t-test and look for p = 0.05 (or even better)
For the final evaluation, use bootstrap resampling combined with a t-test to compare the models.
Estimating the performance of a classifier
The true error of a hypothesis h with respect to a target function f and an instance distribution D is

Error_D(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]

The sample error of a binary classifier h with respect to a target function f and a sample S is

Error_S(h) ≡ (1/|S|) Σ_{x ∈ S} δ(f(x), h(x)), where δ(a, b) = 1 if a ≠ b and δ(a, b) = 0 otherwise
Estimating classifier performance: example

Domain(X) = {a, b, c, d}
D(X) = {1/8, 1/2, 1/8, 1/4} for x = a, b, c, d
h(x) = {0, 1, 1, 0} and f(x) = {1, 1, 0, 0}, so h(x) ≠ f(x) for x ∈ {a, c}

error_D(h) = Pr_D[h(X) ≠ f(X)] = D(X = a) + D(X = c) = 1/8 + 1/8 = 1/4
Evaluating the performance of a classifier
• The sample error estimated from training data is an optimistic estimate:
Bias ≡ E[Error_S(h)] − Error_D(h)
• For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
• Even when the estimate is unbiased, it can vary across samples!
• If h misclassifies 8 out of 100 samples: Error_S(h) = 8/100 = 0.08
How close is the sample error to the true error?
How close is the estimated error to the true error?
• Choose a sample S of size n according to the distribution D
• Measure Error_S(h)
• Error_S(h) is a random variable (the outcome of a random experiment)
• Given Error_S(h), what can we conclude about Error_D(h)?
More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?
Evaluating performance when we can afford to test on a large independent test set
Recall:
Error_D(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]
Error_S(h) ≡ (1/|S|) Σ_{x ∈ S} δ(f(x), h(x)), where δ(a, b) = 1 if a ≠ b and δ(a, b) = 0 otherwise
How close is the estimated accuracy to its true value?
Question: how close is p (the true probability) to p̂?
This problem is an instance of a well-studied problem in statistics:
• estimating the proportion of a population that exhibits some property, given the observed proportion over a random sample of the population.
• In our case, the property of interest is that h correctly (or incorrectly) classifies a sample.
• Testing h on a single random sample x drawn according to D amounts to performing a random experiment which succeeds if h correctly classifies x and fails otherwise.
How close is the estimated accuracy to its true value?
The output of a classifier on a random instance is a binary random variable: the outcome of a Bernoulli trial with success rate p.
The number of successes r observed in n trials is a random variable Y which follows the binomial distribution:

P(r) = n! / (r!(n − r)!) · p^r (1 − p)^(n − r)
Error_S(h) is a random variable
The probability of observing r misclassified examples in a sample of size n (with true error p) is

P(r) = n! / (r!(n − r)!) · p^r (1 − p)^(n − r)
Fall2018 VasantGHonavar
Recallbasicstatistics
ConsiderarandomexperimentwithdiscretevaluedoutcomesTheexpectedvalueofthecorrespondingrandomvariableYisThevarianceofYisThestandarddeviationofYis
Myyy ,..., 21
)Pr()( i
M
ii yYyYE =≡ ∑
=1
[ ]2])[()( YEYEYVar −≡
)(YVarY ≡σ
How close is the estimated accuracy to its true value?
The mean of a Bernoulli trial with success rate p is p; its variance is p(1 − p).
If n trials are taken from the same Bernoulli process, the observed success rate p̂ has the same mean p and variance p(1 − p)/n.
For large n, the distribution of p̂ follows a Gaussian distribution.
Fall2018 VasantGHonavar
BinomialProbabilityDistribution
rnr pprnr
nrP −−−
= )()!(!
!)( 1
ProbabilityP(r)ofrheadsinncoinflips,ifp=Pr(heads)• Expected,ormeanvalueofX,E[X],is
∑=
=≡N
inpiiPXE
0)(][
• VarianceofXis
• StandarddeviationofX,σX,is
)(]])[[()( pnpXEXEXVar −=−≡ 12
)(]])[[( pnpXEXEX −=−≡ 12σ
Estimators, Bias, Variance, Confidence Intervals

Error_S(h) = r/n    Error_D(h) = p

σ_{Error_S(h)} = √(p(1 − p)/n) = √(Error_D(h)(1 − Error_D(h))/n) ≈ √(Error_S(h)(1 − Error_S(h))/n)

An N% confidence interval for some parameter p is the interval which is expected, with probability N%, to contain p.
The normal distribution approximates the binomial
Error_S(h) follows a binomial distribution, with
• mean μ_{Error_S(h)} = Error_D(h)
• standard deviation σ_{Error_S(h)} = √(Error_D(h)(1 − Error_D(h))/n)
We can approximate this by a normal distribution with the same mean and variance when np(1 − p) ≥ 5.
Normal distribution

p(x) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}

The expected, or mean, value of X is E[X] = μ
The variance of X is Var(X) = σ²
The standard deviation of X is σ_X = σ
The probability that X will fall in the interval (a, b) is ∫_a^b p(x) dx
How close is the estimated accuracy to its true value?
Let c be the probability that a Gaussian random variable X with zero mean takes a value between −z and z: Pr[−z ≤ X ≤ z] = c.

Pr[X ≥ z]    z
0.001        3.09
0.005        2.58
0.01         2.33
0.05         1.65
0.10         1.28
How close is the estimated accuracy to its true value?
But p̂ does not have zero mean and unit variance, so we normalize to get:

Pr[ −z < (p̂ − p)/√(p(1 − p)/n) < z ] = c
How close is the estimated accuracy to its true value?
To find the confidence limits: given a particular confidence figure c, use the table to find the z corresponding to the probability (1 − c)/2. Use linear interpolation for values not in the table.
Solving the inequality above for p gives the confidence interval

p = ( p̂ + z²/(2n) ± z √( p̂(1 − p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )
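A sketch evaluating this interval for the earlier example (h misclassifies 8 of 100 samples; z = 1.96 for a 95% interval):

```python
import math

def confidence_interval(p_hat, n, z=1.96):
    """Interval for the true proportion p given observed proportion p_hat."""
    center = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo, hi = confidence_interval(p_hat=0.08, n=100)  # roughly (0.041, 0.150)
```

So an observed error of 0.08 on 100 samples is compatible with a true error anywhere from about 4% to 15%, which is why reporting the estimate alone is not enough.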