Data Analysis
Mark Stamp
Topics
- Experimental design
  o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.
- Accuracy
  o False positive, false negative, etc.
- ROC curves
  o Area under the ROC curve (AUC)
  o Partial AUC (sometimes written as AUCp)
Objective
- Assume that we have a proposed method for detecting malware
- We want to determine how well it performs on a specific dataset
  o We want to quantify effectiveness
- Ideally, compare to previous work
  o But it is often difficult to compare directly
  o Comparisons to AV products?
Basic Assumptions
- We have a set of known malware
  o All from a single (metamorphic) "family"...
  o ...or, at least, all of a similar type
  o For broader "families", more difficult
- Also, a representative non-family set
  o Often assumed to be benign files
  o The more diverse, the more difficult
- Much depends on problem specifics
Experimental Design
- Want to test a malware detection score
  o Refer to the malware dataset as the match set
  o And the benign dataset as the nomatch set
- Partition the match set into...
  o Training set, used to determine parameters of the scoring function
  o Test set, reserved to test the scoring function generated from the training set
- Note: Cannot test on the training set
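The partition above can be sketched in a few lines of Python; the helper name and the 80/20 split fraction are illustrative choices, not from the slides:

```python
import random

def split_match_set(match_set, test_fraction=0.2, seed=0):
    """Shuffle the match set and split it into disjoint training and test sets."""
    files = list(match_set)
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    return files[n_test:], files[:n_test]  # (training set, test set)

# Hypothetical match set of 10 samples, split 80/20
train, test = split_match_set([f"sample{i}" for i in range(10)], test_fraction=0.2)
```

Because the two sets are disjoint, the "cannot test on the training set" rule is enforced by construction.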
Training and Scoring
- Two phases: training and scoring
- Training phase
  o Train a model using the training set
- Scoring phase
  o Score the data in the test set, and score the nomatch (benign) set
- Analyze results from the scoring phase
  o Assume results are representative of the general case
Scatterplots
- Train a model on the training set
- Apply the score to the test and nomatch sets
  o Can visualize the result as a scatterplot

[Figure: scatterplot of score vs. test case, with match scores and nomatch scores plotted separately]
Experimental Design
- A couple of potential problems...
  o How to partition the match set?
  o How to get the most out of a limited dataset?
- Why are these things concerns?
  o When we partition the match set, we might get biased training/test sets, and...
  o ...more data points is "more better"
- Cross validation solves these problems
n-fold Cross Validation
- Partition the match set into n equal subsets
  o Denote the subsets as S1, S2, ..., Sn
- Let the training set be S2 ∪ S3 ∪ ... ∪ Sn
  o And the test set is S1
- Repeat with training set S1 ∪ S3 ∪ ... ∪ Sn
  o And test set S2
- And so on, for each of the n "folds"
  o In our work, we usually select n = 5
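A minimal sketch of the n-fold procedure described above; the function name is ours, and forming subsets by striding is just one of several reasonable ways to partition:

```python
def n_fold_cross_validation(match_set, n=5):
    """Yield (training set, test set) pairs for each of the n folds.

    Fold i holds out subset S_i for testing and trains on the union
    of the remaining n-1 subsets, as described above.
    """
    subsets = [match_set[i::n] for i in range(n)]  # n roughly equal subsets
    for i in range(n):
        test_set = subsets[i]
        train_set = [x for j, s in enumerate(subsets) if j != i for x in s]
        yield train_set, test_set

# 20 match samples, n = 5: each fold trains on 16 and tests on 4
folds = list(n_fold_cross_validation(list(range(20)), n=5))
```

Every match sample is scored exactly once as test data, which is how cross validation yields many more match scores from a limited match set.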
n-fold Cross Validation
- Benefits of cross validation?
- Any bias in the match data is smoothed out
  o Since bias only affects one (or a few) of the Si
- Obtain lots more match scores
  o Usually, there is no shortage of nomatch data
  o But match data can be very limited
- And it's easy to do, so why not?
  o Best of all, it sounds so fancy...
Thresholding
- Threshold based on test vs nomatch scores
  o After the training and scoring phases
- Ideal is complete separation
  o I.e., no overlap in the scatterplot
  o Usually, that doesn't happen
  o So, where to set the threshold?
- In practical use, thresholding is critical
  o At the research stage, it's more of a distraction
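As one illustration (not a method from the slides), a naive heuristic places the threshold halfway between the two score populations; a deployed system would tune this choice far more carefully:

```python
def midpoint_threshold(match_scores, nomatch_scores):
    """Place the threshold halfway between the mean match score and the
    mean nomatch score (assumes match scores tend to be higher)."""
    mean_match = sum(match_scores) / len(match_scores)
    mean_nomatch = sum(nomatch_scores) / len(nomatch_scores)
    return (mean_match + mean_nomatch) / 2

# Well-separated toy scores: the midpoint lands cleanly between the groups
t = midpoint_threshold([8.0, 9.0, 10.0], [1.0, 2.0, 3.0])  # 5.5
```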
Thresholding
- Where to set the threshold?
  o The left case is easy; the right case, not so much

[Figure: two scatterplots of score vs. test case, one with well-separated match/nomatch scores, one with overlapping scores]
Quantifying Success
- We need a way to quantify "better"
  o Ideas?

[Figure: the same two scatterplots of score vs. test case]
Accuracy
- Given a scatterplot and a threshold...
- We have the following 4 cases
  o True positive: correctly classified as +
  o False positive: incorrectly classified as +
  o True negative: correctly classified as −
  o False negative: incorrectly classified as −
- TP, FP, TN, FN, respectively, for short
  o Append "R" to each for "rate"
Sensitivity and Specificity
- The TPR is also known as sensitivity, while the TNR is known as specificity
- Consider a medical test
  o Sensitivity is the percentage of sick people who "pass" the test (as they should)
  o Specificity is the percentage of healthy people who "fail" the test (as they should)
- There is an inherent tradeoff between TPR and TNR
  o Note that these depend on the threshold
Accuracy
- Let P be the number of positive cases tested and N the number of negative cases tested
  o Note: P is the size of the test set, N the size of the nomatch set
  o Also, P = TP + FN and N = TN + FP
- Finally, accuracy = (TP + TN) / (P + N)
  o Note that accuracy ranges from 0 to 1
  o Accuracy of 1 is the ideal case
  o Accuracy of 0? Don't give up your day job...
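The four counts and the accuracy formula above can be computed directly from the two score sets; the helper names and the convention that a score at or above the threshold counts as positive are our assumptions:

```python
def confusion_counts(match_scores, nomatch_scores, threshold):
    """Count TP, FP, TN, FN, treating a score at or above the
    threshold as positive (higher score = more malware-like)."""
    tp = sum(s >= threshold for s in match_scores)
    fn = len(match_scores) - tp          # P = TP + FN
    fp = sum(s >= threshold for s in nomatch_scores)
    tn = len(nomatch_scores) - fp        # N = TN + FP
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)   # (TP + TN) / (P + N)

# Toy scores: 4 match cases, 4 nomatch cases, threshold = 5
tp, fp, tn, fn = confusion_counts([7, 8, 9, 3], [1, 2, 6, 4], threshold=5)
acc = accuracy(tp, fp, tn, fn)   # 3 TP, 1 FP, 3 TN, 1 FN -> accuracy 0.75
```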
Balanced Accuracy
- Often, there is a large imbalance between the test set and the nomatch set
  o The test set is small relative to the nomatch set
- Define balanced accuracy = (TPR + TNR) / 2 = 0.5 TP/P + 0.5 TN/N
  o Errors on both sets are weighted the same
- We consider the imbalance issue again later
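A small sketch of why balanced accuracy matters on imbalanced data; the counts below are made-up illustrative numbers, not from the slides:

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Balanced accuracy = (TPR + TNR) / 2 = 0.5*TP/P + 0.5*TN/N."""
    tpr = tp / (tp + fn)   # P = TP + FN
    tnr = tn / (tn + fp)   # N = TN + FP
    return (tpr + tnr) / 2

# Imbalanced example: 10 positives vs 990 negatives.
# A classifier that finds only half the positives but nearly all negatives:
tp, fn = 5, 5
tn, fp = 980, 10
plain = (tp + tn) / 1000                       # 0.985 -- looks great
balanced = balanced_accuracy(tp, fp, tn, fn)   # about 0.745 -- less rosy
```

Plain accuracy is dominated by the huge negative set, while balanced accuracy exposes that half the positives were missed.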
Accuracy
- Accuracy tells us something...
  o But it depends on where the threshold is set
  o How should we set the threshold?
  o Seems we are going around in circles, like a dog chasing its tail
- Bottom line? We still don't have a good way to compare different techniques
  o Next slide, please...
ROC Curves
- Receiver Operating Characteristic
  o Originated in electrical engineering
  o But now widely used in many fields
- What is an ROC curve?
  o Plot TPR vs FPR by varying the threshold through the range of scores
  o That is, FPR on the x-axis, TPR on the y-axis
  o Equivalently, 1 − specificity vs sensitivity
  o What the ... ?
ROC Curve
- Suppose the threshold is set at the yellow line in the scatterplot
  o Above yellow, classified as positive
  o Below yellow, classified as negative
- Each threshold position gives one point, with TPR and FPR = 1.0 − TNR
- Sweeping the threshold from below all scores to above all scores gives the following sequence of points in this example (one slide per row):

    TNR    FPR = 1.0 − TNR    TPR
    0.0    1.0                1.0
    0.2    0.8                1.0
    0.4    0.6                1.0
    0.6    0.4                1.0
    0.6    0.4                0.8
    0.6    0.4                0.6
    0.8    0.2                0.6
    0.8    0.2                0.4
    1.0    0.0                0.4
    1.0    0.0                0.2
    1.0    0.0                0.0

[Figure on each slide: scatterplot of score vs. test case with the yellow threshold line, and the corresponding point plotted on TPR vs. FPR axes]
ROC Curve
- Connect the dots... this is the ROC curve
- What good is it?
  o Captures info with respect to all possible thresholds
  o Removes the threshold as a factor in the analysis
- What does it all mean?

[Figure: the resulting ROC curve plotted on TPR vs. FPR axes]
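The dot-generating process can be sketched directly: sweep the threshold through the observed scores and record each (FPR, TPR) pair. The function name and the tie-handling convention (score ≥ threshold is positive) are our assumptions:

```python
def roc_points(match_scores, nomatch_scores):
    """Sweep the threshold through all observed scores and return the
    (FPR, TPR) points of the ROC curve, sorted by FPR."""
    points = set()
    thresholds = sorted(set(match_scores) | set(nomatch_scores))
    # Use each score as a threshold, plus one above every score
    # (which yields TPR = FPR = 0.0).
    for t in thresholds + [max(thresholds) + 1]:
        tpr = sum(s >= t for s in match_scores) / len(match_scores)
        fpr = sum(s >= t for s in nomatch_scores) / len(nomatch_scores)
        points.add((fpr, tpr))
    return sorted(points)

# Perfectly separated toy scores: the curve passes through
# (0, 0), (0, 1), and (1, 1), i.e., the perfect classifier
pts = roc_points([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])
```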
ROC Curve
- Random classifier?
  o Yellow 45-degree line
- Perfect classifier?
  o Red lines (Why?)
- Above the 45-degree line?
  o Better than random
  o The closer to the red, the closer to perfect

[Figure: ROC plot with the 45-degree diagonal (random) and the left/top edges (perfect)]
Area Under the Curve (AUC)
- The ROC curve lives within a 1x1 square
- Random classifier?
  o AUC ≈ 0.5
- Perfect classifier (red)?
  o AUC = 1.0
- Example curve (blue)?
  o AUC = 0.8

[Figure: ROC plot showing the area under an example curve]
Area Under the Curve (AUC)
- The area under the ROC curve quantifies success
  o 0.5 is like flipping a coin
  o 1.0 is perfection achieved
- The AUC of the ROC curve
  o Enables us to compare different techniques
  o And no need to worry about the threshold

[Figure: ROC plot with the area under the curve shaded]
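The AUC can be computed from the ROC points with the trapezoidal rule; the points below are the eleven (FPR, TPR) pairs from the threshold walkthrough earlier, sorted by FPR:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points sorted by FPR,
    using the trapezoidal rule on each segment."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# The stepwise example curve from the slides
example = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.2, 0.6),
           (0.4, 0.6), (0.4, 0.8), (0.4, 1.0), (0.6, 1.0), (0.8, 1.0),
           (1.0, 1.0)]
area = auc(example)   # about 0.8, matching the example AUC above
```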
Partial AUC
- Might only consider cases where FPR < p
- "Partial" AUC is AUCp
  o Area up to an FPR of p
  o Normalized by p
- In this example, AUC0.4 = 0.2 / 0.4 = 0.5 and AUC0.2 = 0.08 / 0.2 = 0.4

[Figure: ROC plot with the region FPR < p shaded]
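A sketch of AUCp over the same example ROC points: clip the curve at FPR = p, take the area of the clipped region, and normalize by p. The function name is ours:

```python
def partial_auc(points, p):
    """AUC_p: area under the ROC curve restricted to FPR <= p,
    normalized by p; points are (FPR, TPR) sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= p:
            break
        x1c = min(x1, p)
        # Linearly interpolate TPR if the segment crosses FPR = p
        y1c = y1 if x1 <= p else y0 + (y1 - y0) * (p - x0) / (x1 - x0)
        area += (x1c - x0) * (y0 + y1c) / 2
    return area / p

# The same stepwise example curve as before
example = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.2, 0.6),
           (0.4, 0.6), (0.4, 0.8), (0.4, 1.0), (0.6, 1.0), (0.8, 1.0),
           (1.0, 1.0)]
a04 = partial_auc(example, 0.4)   # 0.2 / 0.4 = 0.5
a02 = partial_auc(example, 0.2)   # 0.08 / 0.2 = 0.4
```

These reproduce the two AUCp values quoted on the slide.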
Imbalance Problem
- Suppose we train a model for a given malware family
- In practice, we expect to score many more non-family files than family files
  o The number of negative cases is large
  o The number of positive cases is small
- So what? Let's consider an example
Imbalance Problem
- In practice, we need a threshold
- For a given threshold, suppose sensitivity = 0.99 and specificity = 0.98
  o Then TPR = 0.99 and FPR = 0.02
- Assume 1 in 1000 files tested is malware
  o Of the type our model is trained to detect
- Suppose we scan, say, 100k files
  o What do we find?
Imbalance Problem
- Assuming TPR = 0.99 and FPR = 0.02
  o And 1 in 1000 is malware
- After scanning 100k files...
  o Detect 99 of 100 actual malware (TP)
  o Misclassify 1 malware as benign (FN)
  o Correctly classify 97,902 (out of 99,900) benign as benign (TN)
  o Misclassify 1998 benign as malware (FP)
Imbalance Problem
- We have 97,903 files classified as benign
  o Of those, 97,902 are actually benign
  o And 97902/97903 > 0.9999
- We classified 2097 files as malware
  o Of these, only 99 are actual malware
  o But 99/2097 < 0.05
- Remember the "boy who cried wolf"?
  o Here, we have a detector that cries wolf...
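The arithmetic on these two slides is easy to verify in code; the helper name is ours, and the counts are expected values rounded to whole files:

```python
def expected_counts(total, prevalence, tpr, fpr):
    """Expected TP, FN, TN, FP when scanning `total` files, a fraction
    `prevalence` of which are malware of the type the model detects."""
    positives = round(total * prevalence)
    negatives = total - positives
    tp = round(tpr * positives)
    fn = positives - tp
    fp = round(fpr * negatives)
    tn = negatives - fp
    return tp, fn, tn, fp

# The slide's scenario: 100k files, 1 in 1000 malware, TPR 0.99, FPR 0.02
tp, fn, tn, fp = expected_counts(100_000, 1 / 1000, tpr=0.99, fpr=0.02)
precision = tp / (tp + fp)   # 99 / 2097, i.e., less than 0.05
```

Fewer than 1 in 20 files flagged as malware actually is malware, which is the "crying wolf" effect.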
Imbalance Solution?
- What to do?
- There is an inherent tradeoff between sensitivity and specificity
- Suppose we can adjust the threshold so that
  o TPR = 0.92 and FPR = 0.0003
- As before...
  o 1 in 1000 files is malware
  o And we test 100k files
Imbalance Solution?
- Assuming TPR = 0.92 and FPR = 0.0003
  o And 1 in 1000 is malware
- After scanning 100k files...
  o Detect 92 of 100 actual malware (TP)
  o Misclassify 8 malware as benign (FN)
  o Correctly classify 99,870 (out of 99,900) benign as benign (TN)
  o Misclassify 30 benign as malware (FP)
Imbalance Solution?
- We have 99,878 files classified as benign
  o Of those, all but 8 are actually benign
  o And 99870/99878 > 0.9999
- We classified 122 files as malware
  o Of these, 92 are actual malware
  o And 92/122 > 0.75
- Can adjust the threshold to further reduce the "crying wolf" effect
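The two thresholds can be compared on precision, the fraction of flagged files that really are malware; the helper below is illustrative:

```python
def precision(total, prevalence, tpr, fpr):
    """Expected fraction of files flagged as malware that are malware."""
    positives = round(total * prevalence)
    negatives = total - positives
    tp = tpr * positives   # expected true positives
    fp = fpr * negatives   # expected false positives
    return tp / (tp + fp)

# Original threshold vs the adjusted one, same 100k files, 1/1000 prevalence
loose = precision(100_000, 1 / 1000, tpr=0.99, fpr=0.02)     # under 0.05
strict = precision(100_000, 1 / 1000, tpr=0.92, fpr=0.0003)  # over 0.75
```

Giving up a little TPR (0.99 to 0.92) buys a large reduction in false alarms, exactly the tradeoff the slides describe.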
Imbalance Problem
- A better alternative?
- Instead of lowering the TPR to reduce the FPR...
  o Perform secondary testing on files that are initially classified as malware
  o We can thus weed out most FP cases
- This gives us the best of both worlds
  o High TPR, yet few benign files ultimately reported as malware
- No free lunch, so what's the cost?
Bottom Line
- Design your experiments properly
  o Use n-fold cross validation (e.g., n = 5)
  o Generally, cross validation is important
- Thresholding is important in practice
  o But not so useful for analyzing results
  o Accuracy is not so informative either
- Use ROC curves and compute the AUC
  o Sometimes, partial AUC is better
- The imbalance problem may be a significant issue
References
- A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30:1145-1159, 1997