Evaluating Classifier Performance
Vasant G. Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program, Computer Science and Engineering Graduate Program,
Bioinformatics and Genomics Graduate Program, Neuroscience Graduate Program
Center for Big Data Analytics and Discovery Informatics, Huck Institutes of the Life Sciences,
Institute for Cyberscience, Clinical and Translational Sciences Institute,
Northeast Big Data Hub, Pennsylvania State University
Fall 2018
[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu
Why evaluate classifiers?
• To know how well a classifier can be expected to perform when it is put to use
• To choose the best model from among a set of alternatives
Evaluating a Classifier
• How can we measure the performance of classifiers?
• How well can a classifier be expected to perform on novel data, i.e., data not seen during training?
• We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using an evaluation data set (not used for training)
• How close is the estimated performance to the true performance?
Classification error
• Error = classifying a record as belonging to one class when it belongs to another class.
• Error rate = percent of misclassified samples out of the total samples in the validation data
Naïve Baseline
• Naïve baseline: classify all samples as belonging to the most prevalent class
• We hope to do better than the naïve baseline
• When the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve baseline in terms of accuracy
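As a minimal sketch (with made-up labels), the naïve baseline amounts to predicting the most prevalent training class for every test sample:

```python
from collections import Counter

def naive_baseline_predict(train_labels, n_test):
    """Classify every test sample as the most prevalent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

# 90% of the (hypothetical) training labels are 0, so the baseline predicts 0
train = [0] * 90 + [1] * 10
preds = naive_baseline_predict(train, n_test=5)
```

On data this skewed, the baseline is 90% accurate while ignoring the rare class entirely, which is exactly why accuracy alone can mislead.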
Estimating Classifier Performance
N: total number of instances in the data set
TPj: number of true positives for class j
FPj: number of false positives for class j
TNj: number of true negatives for class j
FNj: number of false negatives for class j

Accuracy_j = (TP_j + TN_j) / N = P(label = c_j ∧ class = c_j) + P(label ≠ c_j ∧ class ≠ c_j)

Perfect classifier ⟷ Accuracy = 1
Popular measure, but biased in favor of the majority class; should be used with caution!
Classifier Learning: Measuring Performance

Class label:
                Actual C1    Actual ¬C1
Predicted C1    TP = 55      FP = 5
Predicted ¬C1   FN = 10      TN = 30

N = TP + FN + TN + FP = 55 + 10 + 30 + 5 = 100
accuracy = (TP + TN) / N = (55 + 30) / 100 = 85/100
sensitivity = TP / (TP + FN) = 55 / (55 + 10) = 55/65
specificity = TP / (TP + FP) = 55 / (55 + 5) = 55/60
false alarm = FP / (TN + FP) = 5 / (30 + 5) = 5/35
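The worked example above can be checked with a short sketch; the formulas follow these slides' definitions, where "specificity" denotes TP/(TP + FP):

```python
def confusion_measures(tp, fp, fn, tn):
    """Performance measures as defined in these slides
    (note: 'specificity' here is TP/(TP+FP), i.e. precision)."""
    n = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # a.k.a. recall
        "specificity": tp / (tp + fp),   # a.k.a. precision in these slides
        "false_alarm": fp / (fp + tn),
    }

m = confusion_measures(tp=55, fp=5, fn=10, tn=30)
```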
When One Class Is More Important than Another
In many cases it is more important to identify members of a specific target class:
– Tax fraud
– Credit default
– Response to a promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights
– Diagnosing cancer
– Predicting nuclear reactor meltdown
In such cases, we may tolerate greater overall error in return for better predictions of the more important class.
Measuring Classifier Performance: Sensitivity

Sensitivity_j = TP_j / (TP_j + FN_j) = Count(label = c_j ∧ class = c_j) / Count(class = c_j) = P(label = c_j | class = c_j)

Perfect classifier → Sensitivity = 1
Probability of correctly labeling members of the target class
Also called recall or hit rate
Measuring Classifier Performance: Specificity

Specificity_j = TP_j / (TP_j + FP_j) = Count(label = c_j ∧ class = c_j) / Count(label = c_j) = P(class = c_j | label = c_j)

Perfect classifier → Specificity = 1
Also called precision
Probability that a positive prediction is correct
Measuring Performance: Precision, Recall, and False Alarm Rate

Precision_j = Specificity_j = TP_j / (TP_j + FP_j)
Recall_j = Sensitivity_j = TP_j / (TP_j + FN_j)
FalseAlarm_j = FP_j / (TN_j + FP_j) = Count(label = c_j ∧ class = ¬c_j) / Count(class = ¬c_j) = P(label = c_j | class = ¬c_j)

Perfect classifier → Precision = 1
Perfect classifier → Recall = 1
Perfect classifier → False Alarm Rate = 0
Measuring Performance: Correlation Coefficient

CC_j = (TP_j × TN_j − FP_j × FN_j) / √((TP_j + FN_j)(TP_j + FP_j)(TN_j + FP_j)(TN_j + FN_j)),  with −1 ≤ CC_j ≤ 1

Equivalently, CC_j is the correlation between the predicted and true class indicators:

CC_j = Σ_{d_i ∈ D} (jlabel_i − mean(jlabel))(jclass_i − mean(jclass)) / (|D| σ_jlabel σ_jclass)

where jlabel_i = 1 iff the classifier assigns d_i to class c_j, and jclass_i = 1 iff the true class of d_i is class c_j
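A direct computation of CC_j from the counts (a sketch; the square root in the denominator follows the standard Matthews correlation form):

```python
import math

def correlation_coefficient(tp, fp, fn, tn):
    """CC_j = (TP*TN - FP*FN) / sqrt((TP+FN)(TP+FP)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

cc_perfect = correlation_coefficient(tp=50, fp=0, fn=0, tn=50)  # 1.0
cc_example = correlation_coefficient(tp=55, fp=5, fn=10, tn=30)
```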
Beware of terminological confusion in the literature!
• Some bioinformatics authors use "accuracy" incorrectly to refer to recall (i.e., sensitivity) or precision (i.e., specificity)
• In medical statistics, specificity sometimes refers to sensitivity for the negative class, i.e., TN_j / (TN_j + FP_j)
• Some authors use false alarm rate to refer to the probability that a positive prediction is incorrect, i.e., FP_j / (FP_j + TP_j) = 1 − Precision_j
When you write:
• provide the formula in terms of TP, TN, FP, FN
When you read:
• check the formula in terms of TP, TN, FP, FN
Measuring Classifier Performance
• TP, FP, TN, FN provide the relevant information
• No single measure tells the whole story
• A classifier with 98% accuracy can be useless if 98% of the population does not have cancer and the 2% that do are misclassified by the classifier
• Use of multiple measures is recommended
• Beware of terminological confusion!
Micro-averaged performance measures
Performance on a random sample:

MicroAverage Precision = Σ_j TP_j / Σ_j (TP_j + FP_j)
MicroAverage Recall = Σ_j TP_j / Σ_j (TP_j + FN_j)
MicroAverage FalseAlarm = 1 − MicroAverage Precision
MicroAverage Accuracy = Σ_j TP_j / N   (etc.)
MicroAverage CC = ((Σ_j TP_j)(Σ_j TN_j) − (Σ_j FP_j)(Σ_j FN_j)) / √((Σ_j TP_j + Σ_j FN_j)(Σ_j TP_j + Σ_j FP_j)(Σ_j TN_j + Σ_j FP_j)(Σ_j TN_j + Σ_j FN_j))

• Micro-averaging gives equal importance to each sample
• Classes with a large number of instances dominate
Macro-averaged performance measures

MacroAverage Sensitivity = (1/M) Σ_j Sensitivity_j
MacroAverage Specificity = (1/M) Σ_j Specificity_j
MacroAverage CorrelationCoeff = (1/M) Σ_j CorrelationCoeff_j

Macro-averaging gives equal importance to each of the M classes
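The contrast between the two averages can be sketched with per-class precision (hypothetical counts; note how the large class dominates the micro average):

```python
def micro_macro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one entry per class."""
    micro = sum(tp for tp, fp in per_class) / sum(tp + fp for tp, fp in per_class)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    return micro, macro

# a big class with precision 0.5 and a small class with precision 1.0
micro, macro = micro_macro_precision([(90, 90), (10, 0)])
```

Here the micro average (100/190 ≈ 0.53) is pulled toward the big class, while the macro average treats both classes equally (0.75).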
Cutoff for classification
Most machine learning algorithms classify via a 2-step process. For each sample:
1. Compute the probability of belonging to class "1"
2. Compare to the cutoff value, and classify accordingly
• The default cutoff value is 0.50: if ≥ 0.50, classify as "1"; if < 0.50, classify as "0"
• Different cutoff values can be used to trade off one measure against another (more on this later)
• Question: how would this work in the case of K nearest neighbors?
Cutoff Table
• If the cutoff is 0.50: 12 samples are classified as "1"
• If the cutoff is 0.80: seven samples are classified as "1"

Actual class   Prob. of "1"      Actual class   Prob. of "1"
1              0.996             1              0.506
1              0.988             0              0.471
1              0.984             0              0.337
1              0.980             1              0.218
1              0.948             0              0.199
1              0.889             0              0.149
1              0.848             0              0.048
0              0.762             0              0.038
1              0.707             0              0.025
1              0.681             0              0.022
1              0.656             0              0.016
0              0.622             0              0.004
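Applying a cutoff to the estimated probabilities is a one-line rule; a sketch on the table's probabilities, checking the 0.80 case:

```python
def classify_with_cutoff(probs, cutoff=0.5):
    """Label a sample '1' when its estimated P(class = 1) meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]
n_ones_at_80 = sum(classify_with_cutoff(probs, cutoff=0.80))  # 7 samples
```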
Receiver Operating Characteristic (ROC) Curve
• The confusion matrix, and hence the preceding measures of classifier performance, are threshold dependent
• We can often trade off recall against precision, e.g., by adjusting the classification threshold θ
• Is there a threshold-independent measure of classifier performance?
– The ROC curve is a plot of sensitivity against the false alarm rate, which is the same as (1 − specificity); it characterizes this tradeoff for a given classifier
– The ROC curve is obtained by plotting sensitivity against (1 − specificity) while varying the classification threshold
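A minimal sketch of tracing out ROC points by sweeping the threshold over toy scores (here "specificity" is the standard TN/(TN + FP), so the x axis is the false alarm rate):

```python
def roc_points(scores, labels):
    """(false alarm rate, sensitivity) pairs, one per candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for theta in sorted(set(scores)) + [float("inf")]:
        preds = [1 if s >= theta else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        points.append((fp / neg, tp / pos))
    return points

pts = roc_points([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
```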
[Figure: a receiver operating characteristic (ROC) curve.]
Measuring Performance of Classifiers: ROC Curves
• ROC curves offer a more complete picture of the performance of the classifier as a function of the classification threshold
• A classifier h is better than another classifier g if ROC(h) dominates ROC(g)
• ROC(h) dominates ROC(g) → Area under ROC(h) > Area under ROC(g)
[Figure: dominating and dominated ROC curves, on axes from 0 to 1.]
Misclassification Costs May Differ
• The cost of making a misclassification error may be higher for one class than for the other(s)
• Looked at another way, the benefit of making a correct classification may be higher for one class than for the other(s)
Example: Response to Promotional Offer
• Suppose we send an offer to 1000 people, with a 1% average response rate
• "1" = response, "0" = nonresponse
• The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good)
• Using machine learning, suppose we can correctly classify eight 1's as 1's
• But at the cost of misclassifying twenty 0's as 1's and two 1's as 0's
Confusion Matrix

            Predict as 1    Predict as 0
Actual 1          8               2
Actual 0         20             970

Error rate = (2 + 20) / 1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits
Suppose:
• Profit from a "1" is $10
• Cost of sending an offer is $1
Then:
• Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
• Under the DM predictions, 28 offers are sent:
– 8 respond, with a profit of $10 each
– 20 fail to respond, at a cost of $1 each
– 972 receive nothing (no cost, no profit)
Profit Matrix

            Predict as 1    Predict as 0
Actual 1        $80               0
Actual 0       ($20)              0
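The profit matrix above can be reproduced with a small sketch (responders predicted "1" net $10 each; wasted offers cost $1 each):

```python
def campaign_profit(tp, fp, profit_per_response=10, cost_per_offer=1):
    """Net profit of a mailing: tp responders net $10 each,
    fp non-responders cost $1 each; everyone else costs nothing."""
    return tp * profit_per_response - fp * cost_per_offer

ml_profit = campaign_profit(tp=8, fp=20)     # $80 - $20 = $60
naive_profit = campaign_profit(tp=0, fp=0)   # naive rule sends no offers
```

So the model that looked worse on error rate (2.2% vs. 1%) is strictly better on profit.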
Evaluating a Classifier
• What we have done so far is to estimate the classifier's performance on some available data.
• How well can a classifier be expected to perform on novel data?
• Performance estimated on training data is often optimistic relative to performance on novel data
• We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using evaluation data (not used for training)
• How close is the estimated performance to the true performance?
Evaluation of a classifier with limited data
• Holdout method: use part of the data for training, and the rest for testing
• We may be lucky or unlucky: the training data or the test data may not be representative
• Solution: run multiple experiments with disjoint training and test data sets in which each class is represented in roughly the same proportion as in the entire data set
Classifier evaluation
[Diagram sequence:
1. The labeled data (data/label pairs) is split into training data and testing data.
2. A classifier is trained on the training data, producing a model.
3. For the test data, pretend we don't know the labels.
4. The model classifies the test examples.
5. Compare the predicted labels to the actual labels.]
Comparing algorithms
[Diagram: the same test data is classified by model 1 and model 2; each model's predicted labels are compared against the actual labels by an evaluation step, yielding score 1 and score 2. Model 2 is better if score 2 > score 1.]
Is model 2 better than model 1? When would we want to do this type of comparison?
Is model 2 better?
• Model 1: 85% accuracy vs. Model 2: 80% accuracy
• Model 1: 85.5% accuracy vs. Model 2: 85.0% accuracy
• Model 1: 0% accuracy vs. Model 2: 100% accuracy
Comparing scores: significance
• Just comparing scores on one data set isn't enough!
• We don't just want to know which system is better on one particular data set; we want to know whether model 1 is better than model 2 in general
• Put another way, we want to be confident that the difference is real and not just due to random chance
How do we know how variable a model's accuracy is? Variance.

Variance of performance
• We need multiple accuracy scores!
• How can we get them?
Repeated experimentation
Instead of one evaluation with a particular split of training and test data, run multiple evaluations, with different splits of training and test data.
[Diagram: labeled data split into training data and testing data.]
Repeated experimentation
[Diagram: the labeled data is split several times; in each split a different subset of the data/label pairs is used for training and the rest for evaluation.]
K-fold cross validation
Break the training data into K equal-sized parts. Then, for each of the K parts/splits: train on the other K − 1 parts and evaluate on the held-out part.
[Diagram: split 1, split 2, split 3, …]
K-fold cross validation
[Diagram: evaluating on the held-out part of each split yields score 1, score 2, score 3, …]
K-fold cross validation
• Better utilization of the labeled data
• More robust: we don't just rely on one evaluation set to evaluate the approach (or for optimizing parameters)
• Multiplies the computational overhead by K (we have to train K models instead of just one)
• 10 is the most common choice of K
Estimating the performance of a classifier: K-fold cross-validation
Partition the data (multi)set S into K equal parts S_1 … S_K, with roughly the same class distribution as S.
ErrorC ← 0
For i = 1 to K do {
    S_Test ← S_i
    S_Train ← S − S_i
    α ← Learn(S_Train)
    ErrorC ← ErrorC + Error(α, S_Test)
}
Error ← ErrorC / K
Output(Error)
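The pseudocode above translates directly into a short sketch (unstratified splits for brevity, although the pseudocode asks for folds that preserve the class distribution; the toy learner and error function are hypothetical):

```python
import random

def k_fold_cv(xs, ys, learn, error, k=5, seed=0):
    """ErrorC accumulates Error(alpha, S_Test) over K held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    error_c = 0.0
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        alpha = learn(train_x, train_y)
        error_c += error(alpha, [xs[i] for i in fold], [ys[i] for i in fold])
    return error_c / k

def learn_majority(xs, ys):          # toy learner: predict the majority class
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

def error_rate(h, xs, ys):
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

# 15 zeros and 5 ones: the majority learner's CV error is the rate of ones
est = k_fold_cv(list(range(20)), [0] * 15 + [1] * 5, learn_majority, error_rate)
```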
Estimating classifier performance: recommended procedure
• Use K-fold cross-validation (K = 5 or 10) for estimating performance measures (accuracy, precision, recall, points on the ROC curve, etc.)
• Compute the mean values and standard deviations of the performance estimates
• Report the mean values of the performance estimates and their standard deviations or 95% confidence intervals around the mean
• Be skeptical: repeat the experiments several times with different random splits of the data into K folds!
Leave-one-out cross validation
• K-fold cross validation where K = the number of samples
• Also known as "jackknifing"
• Pros/cons?
• When would we use this?
Leave-one-out cross-validation
• K-fold cross validation with K = n, where n is the total number of samples available
• n experiments, each using n − 1 samples for training and the remaining sample for testing
• Leave-one-out cross-validation does not guarantee the same class distribution in the training and test data!
Extreme case: 50% class 1, 50% class 2; the classifier predicts the majority class label in the training data.
True error: 50%; leave-one-out error estimate: 100%!
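The extreme case above is easy to reproduce; a sketch with a majority-vote classifier on a perfectly balanced sample (tie-breaking toward class 1 is an arbitrary assumption):

```python
def loocv_error_majority(labels):
    """Leave-one-out error of a classifier that predicts the training majority."""
    n = len(labels)
    errors = 0
    for i in range(n):
        rest = labels[:i] + labels[i + 1:]
        majority = 1 if rest.count(1) >= rest.count(0) else 0
        errors += int(majority != labels[i])
    return errors / n

# balanced data: removing a point always makes its class the training minority,
# so every held-out point is misclassified
est = loocv_error_majority([0] * 10 + [1] * 10)
```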
Leave-one-out cross validation
• Can be very expensive if training is slow and/or there are a large number of examples
• Useful in domains with limited training data: it maximizes the data we can use for training
• Some classifiers permit the estimation of the leave-one-out performance measure without actually having to train K models
Comparing systems: sample 1

split     model 1   model 2
1           87        88
2           85        84
3           83        84
4           80        79
5           88        89
6           85        85
7           83        81
8           87        86
9           88        89
10          84        85
average:    85        85

Is model 2 better than model 1?
Comparing systems: sample 2

split     model 1   model 2
1           87        87
2           92        88
3           74        79
4           75        86
5           82        84
6           79        87
7           83        81
8           83        92
9           88        81
10          77        85
average:    82        85

Is model 2 better than model 1?
Comparing systems: sample 3

split     model 1   model 2
1           84        87
2           83        86
3           78        82
4           80        86
5           82        84
6           79        87
7           83        84
8           83        86
9           85        83
10          83        85
average:    82        85

Is model 2 better than model 1?
Comparing systems
Sample 3 and sample 2 side by side: both average 82 for model 1 and 85 for model 2.
What's the difference?
Comparing systems
Sample 3: averages 82 vs. 85, standard deviations 2.3 vs. 1.7.
Sample 2: averages 82 vs. 85, standard deviations 5.9 vs. 3.9.
Comparing systems: sample 4

split     model 1   model 2
1           80        82
2           84        87
3           89        90
4           78        82
5           90        91
6           81        83
7           80        80
8           88        89
9           76        77
10          86        88
average:    83        85
stddev:     4.9       4.7

Is model 2 better than model 1?
Comparing systems: sample 4

split     model 1   model 2   model 2 − model 1
1           80        82            2
2           84        87            3
3           89        90            1
4           78        82            4
5           90        91            1
6           81        83            2
7           80        80            0
8           88        89            1
9           76        77            1
10          86        88            2
average:    83        85
stddev:     4.9       4.7

Model 2 is ALWAYS better. How do we decide if model 2 is better than model 1?
Statistical tests
Setup:
– Assume some default hypothesis about the data that you'd like to disprove, called the null hypothesis
– e.g., model 1 and model 2 are not statistically different in performance
Test:
– Calculate a test statistic from the data (often assuming something about the data)
– Based on this statistic, with some probability we can reject the null hypothesis, that is, show that it does not hold
t-test
Determines whether two samples come from the same underlying distribution or not.
Null hypothesis: the model 1 and model 2 accuracies are no different, i.e., they come from the same distribution.
Result: the probability that the difference in accuracies is due to random chance (low values are better).
Calculating the t-test
For our setup, we'll do what's called a "paired t-test":
– The values can be thought of as pairs, since they were calculated under the same conditions
– In our case, the same train/test split
– This gives more power than the unpaired t-test (we have more information)
For almost all experiments, we'll do a "two-tailed" version of the t-test.
http://en.wikipedia.org/wiki/Student's_t-test
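A sketch of the paired t statistic on sample 4's scores (the critical value 2.262 for 9 degrees of freedom at the two-tailed 0.05 level is quoted from standard t tables):

```python
import math

def paired_t_statistic(a, b):
    """t = mean(d) / sqrt(var(d)/n) for paired differences d = b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

model1 = [80, 84, 89, 78, 90, 81, 80, 88, 76, 86]  # sample 4 scores
model2 = [82, 87, 90, 82, 91, 83, 80, 89, 77, 88]
t = paired_t_statistic(model1, model2)
significant = abs(t) > 2.262  # two-tailed 0.05 critical value, df = 9
```

The small, consistent per-split differences yield a large t even though the averages (83 vs. 85) are close.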
p-value
• The result of a statistical test is often a p-value
• p-value: the probability of observing a difference at least this large if the null hypothesis holds. Specifically, if we re-ran this experiment multiple times (say on different data), it is the probability that we would reject the null hypothesis incorrectly (i.e., the probability we'd be wrong)
• Common values considered "significant": 0.05 (95% confident), 0.01 (99% confident), and 0.001 (99.9% confident)
Comparing systems: sample 1 (revisited)
Averages: 85 vs. 85. Is model 2 better than model 1?
They are the same, with p = 1.
Comparing systems: sample 2 (revisited)
Averages: 82 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.15.
Comparing systems: sample 3 (revisited)
Averages: 82 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.007.
Comparing systems: sample 4 (revisited)
Averages: 83 vs. 85. Is model 2 better than model 1?
They are the same, with p = 0.001.
Statistical tests on test data
[Diagram: the labeled data (data with labels) is split into all training data and test data; the training data is further split into training data and development data. Cross-validation with a t-test operates on the training/development side.]
Can we do that here, on the test data?
Bootstrap resampling
Given a test set t with n samples, do m times:
– sample n examples with replacement from the test set to create a new test set t′
– evaluate the model(s) on t′
Then calculate a t-test (or other statistical test) on the collection of m results.
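A sketch of the resampling loop (the accuracy metric and the synthetic prediction/label pairs are illustrative):

```python
import random

def bootstrap_scores(test_set, metric, m=200, seed=0):
    """Evaluate `metric` on m resamples of the test set, drawn with replacement."""
    rng = random.Random(seed)
    n = len(test_set)
    scores = []
    for _ in range(m):
        resample = [test_set[rng.randrange(n)] for _ in range(n)]
        scores.append(metric(resample))
    return scores

def accuracy(pairs):  # pairs of (predicted, actual) labels
    return sum(p == y for p, y in pairs) / len(pairs)

test = [(1, 1)] * 80 + [(0, 1)] * 20   # a model that is right 80% of the time
scores = bootstrap_scores(test, accuracy)
```

The spread of the m scores gives an estimate of the variability of the accuracy on this one test set.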
Bootstrap resampling
[Diagram sequence: the test data is sampled with replacement to create Test′1, Test′2, …, Test′m. Model A is evaluated on each resample, yielding A score 1, A score 2, …, A score m; model B is evaluated likewise, yielding B score 1, …, B score m. A paired t-test (or other analysis) is then applied to the paired score lists.]
Experimentation good practices
Never look at your test data!
During development:
– Compare different models/hyperparameters on development data
– Use cross-validation to get more consistent results
– If you want to be confident in the results, use a t-test and look for p = 0.05 (or even better)
For the final evaluation, use bootstrap resampling combined with a t-test to compare the models.
Estimating the performance of a classifier
The true error of a hypothesis h with respect to a target function f and an instance distribution D is

Error_D(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]

The sample error of a binary classifier h with respect to a target function f and a sample S is

Error_S(h) ≡ (1/|S|) Σ_{x ∈ S} δ(f(x), h(x)), where δ(a, b) = 1 if a ≠ b and δ(a, b) = 0 otherwise
Estimating classifier performance: example

Domain(X) = {a, b, c, d}
D(X) = {1/8, 1/2, 1/8, 1/4} for x = a, b, c, d
h(x) = {0, 1, 1, 0} and f(x) = {1, 1, 0, 0}, so h(x) ≠ f(x) for x ∈ {a, c}

error_D(h) = Pr_D[h(X) ≠ f(X)] = D(X = a) + D(X = c) = 1/8 + 1/8 = 1/4
Evaluating the performance of a classifier
• The sample error estimated from training data is an optimistic estimate:
Bias ≡ E[Error_S(h)] − Error_D(h)
• For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
• Even when the estimate is unbiased, it can vary across samples!
• If h misclassifies 8 out of 100 samples: Error_S(h) = 8/100 = 0.08
How close is the sample error to the true error?
How close is the estimated error to the true error?
• Choose a sample S of size n according to the distribution D
• Measure Error_S(h)
• Error_S(h) is a random variable (the outcome of a random experiment)
• Given Error_S(h), what can we conclude about Error_D(h)?
More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?
Evaluating performance when we can afford to test on a large independent test set
Recall:
Error_D(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]
Error_S(h) ≡ (1/|S|) Σ_{x ∈ S} δ(f(x), h(x)), where δ(a, b) = 1 if a ≠ b and δ(a, b) = 0 otherwise
How close is the estimated accuracy to its true value?
Question: how close is p (the true probability) to p̂?
This problem is an instance of a well-studied problem in statistics:
• estimating the proportion of a population that exhibits some property, given the observed proportion over a random sample of the population.
• In our case, the property of interest is that h correctly (or incorrectly) classifies a sample.
• Testing h on a single random sample x drawn according to D amounts to performing a random experiment which succeeds if h correctly classifies x and fails otherwise.
How close is the estimated accuracy to its true value?
The output of a classifier on a random instance is a binary random variable: the outcome of a Bernoulli trial with success rate p.
The number of successes r observed in n trials is a random variable Y which follows the binomial distribution:

P(r) = n! / (r!(n − r)!) · p^r (1 − p)^(n − r)
Error_S(h) is a random variable
The probability of observing r misclassified examples in a sample of size n (with true error p) is

P(r) = n! / (r!(n − r)!) · p^r (1 − p)^(n − r)
Fall2018 VasantGHonavar
Recallbasicstatistics
ConsiderarandomexperimentwithdiscretevaluedoutcomesTheexpectedvalueofthecorrespondingrandomvariableYisThevarianceofYisThestandarddeviationofYis
Myyy ,..., 21
)Pr()( i
M
ii yYyYE =≡ ∑
=1
[ ]2])[()( YEYEYVar −≡
)(YVarY ≡σ
How close is the estimated accuracy to its true value?
The mean of a Bernoulli trial with success rate p is p; its variance is p(1 − p).
If n trials are taken from the same Bernoulli process, the observed success rate p̂ has the same mean p and variance p(1 − p)/n.
For large n, the distribution of p̂ follows a Gaussian distribution.
Fall2018 VasantGHonavar
BinomialProbabilityDistribution
rnr pprnr
nrP −−−
= )()!(!
!)( 1
ProbabilityP(r)ofrheadsinncoinflips,ifp=Pr(heads)• Expected,ormeanvalueofX,E[X],is
∑=
=≡N
inpiiPXE
0)(][
• VarianceofXis
• StandarddeviationofX,σX,is
)(]])[[()( pnpXEXEXVar −=−≡ 12
)(]])[[( pnpXEXEX −=−≡ 12σ
Estimators, Bias, Variance, Confidence Intervals

Error_S(h) = r/n    Error_D(h) = p

σ_{Error_S(h)} = √(p(1 − p)/n) = √(Error_D(h)(1 − Error_D(h))/n) ≈ √(Error_S(h)(1 − Error_S(h))/n)

An N% confidence interval for some parameter p is the interval which is expected, with probability N%, to contain p.
The normal distribution approximates the binomial
Error_S(h) follows a binomial distribution, with
• mean μ_{Error_S(h)} = Error_D(h)
• standard deviation σ_{Error_S(h)} = √(Error_D(h)(1 − Error_D(h))/n)
We can approximate this by a normal distribution with the same mean and variance when np(1 − p) ≥ 5.
Normal distribution

p(x) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}

The expected, or mean, value of X is E[X] = μ
The variance of X is Var(X) = σ²
The standard deviation of X is σ_X = σ
The probability that X will fall in the interval (a, b) is ∫_a^b p(x) dx
How close is the estimated accuracy to its true value?
Let c be the probability that a Gaussian random variable X with zero mean takes a value between −z and z: Pr[−z ≤ X ≤ z] = c.

Pr[X ≥ z]    z
0.001        3.09
0.005        2.58
0.01         2.33
0.05         1.65
0.10         1.28
How close is the estimated accuracy to its true value?
But p̂ does not have zero mean and unit variance, so we normalize to get:

Pr[ −z < (p̂ − p)/√(p(1 − p)/n) < z ] = c
How close is the estimated accuracy to its true value?
To find the confidence limits: given a particular confidence figure c, use the table to find the z corresponding to the probability (1 − c)/2. Use linear interpolation for values not in the table.
Solving the inequality above for p gives the confidence interval

p = ( p̂ + z²/(2n) ± z √( p̂(1 − p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )
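A sketch evaluating this interval for the earlier example (h misclassifies 8 of 100 samples; z = 1.96 for a 95% interval):

```python
import math

def confidence_interval(p_hat, n, z=1.96):
    """Interval for the true proportion p given observed proportion p_hat."""
    center = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo, hi = confidence_interval(p_hat=0.08, n=100)  # roughly (0.041, 0.150)
```

So an observed error of 0.08 on 100 samples is compatible with a true error anywhere from about 4% to 15%, which is why reporting the estimate alone is not enough.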