
Page 1: Data Analysis

Mark Stamp

Page 2: Topics

- Experimental design
  - Training set, test set, n-fold cross validation, thresholding, imbalance, etc.
- Accuracy
  - False positive, false negative, etc.
- ROC curves
  - Area under the ROC curve (AUC)
  - Partial AUC (sometimes written as AUC_p)

Page 3: Objective

- Assume that we have a proposed method for detecting malware
- We want to determine how well it performs on a specific dataset
  - We want to quantify effectiveness
- Ideally, compare to previous work
  - But often difficult to compare directly
- Comparisons to AV products?

Page 4: Basic Assumptions

- We have a set of known malware
  - All from a single (metamorphic) “family”…
  - …or, at least, all of a similar type
  - For broader “families”, more difficult
- Also, a representative non-family set
  - Often assumed to be benign files
  - The more diverse, the more difficult
- Much depends on problem specifics

Page 5: Experimental Design

- Want to test a malware detection score
  - Refer to the malware dataset as the match set
  - And the benign dataset is the nomatch set
- Partition the match set into… (see the sketch below)
  - Training set, used to determine parameters of the scoring function
  - Test set, reserved to test the scoring function generated from the training set
- Note: Cannot test on the training set
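A minimal Python sketch of this partition step. It is illustrative only: the function name, the 80/20 split ratio, and the fixed shuffle seed are assumptions, not prescribed by the slides.

```python
import random

def partition_match_set(match_set, test_fraction=0.2, seed=0):
    """Split the match set into a training set and a held-out test set.
    The 80/20 ratio is an arbitrary illustrative choice."""
    files = list(match_set)
    random.Random(seed).shuffle(files)  # shuffle to avoid an ordering bias
    n_test = int(len(files) * test_fraction)
    return files[n_test:], files[:n_test]  # (training set, test set)
```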

Page 6: Training and Scoring

- Two phases: training and scoring (sketched below)
- Training phase
  - Train a model using the training set
- Scoring phase
  - Score data in the test set, and score the nomatch (benign) set
- Analyze results from the scoring phase
  - Assume they are representative of the general case
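A sketch of the two phases; train_model and score are hypothetical placeholders for whatever detection technique is under study.

```python
def run_experiment(train_set, test_set, nomatch_set, train_model, score):
    """Training phase, then scoring phase.
    train_model and score stand in for the technique being evaluated."""
    model = train_model(train_set)                            # training phase
    match_scores = [score(model, f) for f in test_set]        # score the test set
    nomatch_scores = [score(model, f) for f in nomatch_set]   # and the benign set
    return match_scores, nomatch_scores
```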

Page 7: Scatterplots

- Train a model on the training set
- Apply the score to the test and nomatch sets
  - Can visualize the result as a scatterplot (sketched below)

[Figure: scatterplot of score vs. test case, with match scores and nomatch scores plotted separately]
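One way to produce such a scatterplot with matplotlib; the styling choices are arbitrary.

```python
import matplotlib.pyplot as plt

def scatterplot(match_scores, nomatch_scores):
    """Plot score vs. test case for the match and nomatch sets."""
    plt.scatter(range(len(match_scores)), match_scores, label="match scores")
    plt.scatter(range(len(nomatch_scores)), nomatch_scores, label="nomatch scores")
    plt.xlabel("test case")
    plt.ylabel("score")
    plt.legend()
    plt.show()
```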

Page 8: Experimental Design

- A couple of potential problems…
  - How to partition the match set?
  - How to get the most out of a limited data set?
- Why are these things concerns?
  - When we partition the match set, we might get biased training/test sets, and…
  - …more data points is “more better”
- Cross validation solves these problems

Page 9: n-fold Cross Validation

- Partition the match set into n equal subsets
  - Denote the subsets as S1, S2, …, Sn
- Let the training set be S2 ∪ S3 ∪ … ∪ Sn
  - And the test set is S1
- Repeat with training set S1 ∪ S3 ∪ … ∪ Sn
  - And test set S2
- And so on, for each of the n “folds” (see the sketch below)
  - In our work, we usually select n = 5
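A minimal sketch of the fold construction. The shuffle (and its fixed seed) is an assumption added to avoid ordering bias; with n = 5, each subset holds roughly 20% of the match set.

```python
import random

def n_fold_splits(match_set, n=5, seed=0):
    """Yield (training set, test set) pairs: partition into subsets S_1..S_n,
    then use each S_i in turn as the test set and the rest for training."""
    files = list(match_set)
    random.Random(seed).shuffle(files)
    folds = [files[i::n] for i in range(n)]  # n roughly equal subsets
    for i in range(n):
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train, folds[i]
```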

Page 10: n-fold Cross Validation

- Benefits of cross validation?
- Any bias in the match data is smoothed out
  - Since bias only affects one (or a few) of the Si
- Obtain lots more match scores
  - Usually, no shortage of nomatch data
  - But match data can be very limited
- And it’s easy to do, so why not?
  - Best of all, it sounds so fancy…

Page 11: Thresholding

- Threshold based on test vs nomatch scores
  - After the training and scoring phases
- Ideal is complete separation
  - I.e., no overlap in the scatterplot
  - Usually, that doesn’t happen
  - So, where to set the threshold?
- In practical use, thresholding is critical
  - At the research stage, more of a distraction

Page 12: Thresholding

- Where to set the threshold?
  - Left case is easy; right case, not so much

[Figure: two score-vs-test-case scatterplots: on the left, match and nomatch scores fully separated; on the right, overlapping]

Page 13: Quantifying Success

- We need a way to quantify “better”
  - Ideas?

[Figure: the same two score-vs-test-case scatterplots as on the previous slide]

Page 14: Accuracy

- Given a scatterplot and a threshold…
- We have the following 4 cases (counted in the sketch below)
  - True positive: correctly classified as +
  - False positive: incorrectly classified as +
  - True negative: correctly classified as −
  - False negative: incorrectly classified as −
- TP, FP, TN, FN, respectively, for short
  - Append “R” to each for “rate”
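Counting the four cases is mechanical once the scores and a threshold are fixed. A sketch, assuming the convention that a higher score means malware:

```python
def confusion_counts(match_scores, nomatch_scores, threshold):
    """Tally TP, FP, TN, FN for a given threshold (higher score = positive)."""
    tp = sum(s >= threshold for s in match_scores)    # malware flagged as malware
    fn = len(match_scores) - tp                       # malware missed
    fp = sum(s >= threshold for s in nomatch_scores)  # benign flagged as malware
    tn = len(nomatch_scores) - fp                     # benign passed
    return tp, fp, tn, fn
```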

Page 15: Sensitivity and Specificity

- The TPR is also known as sensitivity, while the TNR is known as specificity (computed in the sketch below)
- Consider a medical test
  - Sensitivity is the percentage of sick people who “pass” the test (as they should)
  - Specificity is the percentage of healthy people who “fail” the test (as they should)
- Inherent tradeoff between TPR and TNR
  - Note that these depend on the threshold
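In code, the two rates fall directly out of the counts above (a sketch):

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity (TPR) and specificity (TNR) from the four counts."""
    sensitivity = tp / (tp + fn)  # fraction of positives correctly flagged
    specificity = tn / (tn + fp)  # fraction of negatives correctly passed
    return sensitivity, specificity
```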

Page 16: Accuracy

- Let P be the number of positive cases tested, and N the number of negative cases tested
  - Note: P is the size of the test set, N the size of the nomatch set
  - Also, P = TP + FN and N = TN + FP
- Finally, accuracy = (TP + TN) / (P + N)
  - Note that accuracy ranges from 0 to 1
  - Accuracy of 1 is the ideal case
  - Accuracy 0? Don’t give up your day job…

Page 17: Balanced Accuracy

- Often, there is a large imbalance between the test set and the nomatch set
  - Test set is small relative to the nomatch set
- Define balanced accuracy = (TPR + TNR) / 2 = 0.5 TP/P + 0.5 TN/N
  - Errors on both sets are weighted the same (see the sketch below)
- Consider the imbalance issue again later
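Both measures in code, following the formulas above; a sketch whose inputs are the counts from confusion_counts.

```python
def accuracy(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP."""
    return (tp + tn) / (tp + fn + tn + fp)

def balanced_accuracy(tp, fp, tn, fn):
    """Balanced accuracy = (TPR + TNR) / 2 = 0.5 TP/P + 0.5 TN/N."""
    return 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)
```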

Page 18: Accuracy

- Accuracy tells us something…
  - But it depends on where the threshold is set
  - How should we set the threshold?
  - Seems we are going around in circles, like a dog chasing its tail
- Bottom line? Still don’t have a good way to compare different techniques
  - Next slide, please…

Page 19: ROC Curves

- Receiver Operating Characteristic
  - Originated in electrical engineering
  - But now widely used in many fields
- What is an ROC curve?
  - Plot TPR vs FPR by varying the threshold through the range of scores
  - That is, FPR on the x-axis, TPR on the y-axis
  - Equivalently, 1 – specificity vs sensitivity
  - What the … ?

Page 20: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 1.0
  - FPR = 1.0 – TNR = 1.0 – 0.0 = 1.0

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (1.0, 1.0) added to the ROC plot]

Page 21: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 1.0
  - FPR = 1.0 – TNR = 1.0 – 0.2 = 0.8

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.8, 1.0) added to the ROC plot]

Page 22: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 1.0
  - FPR = 1.0 – TNR = 1.0 – 0.4 = 0.6

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.6, 1.0) added to the ROC plot]

Page 23: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 1.0
  - FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.4, 1.0) added to the ROC plot]

Page 24: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.8
  - FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.4, 0.8) added to the ROC plot]

Page 25: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.6
  - FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.4, 0.6) added to the ROC plot]

Page 26: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.6
  - FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.2, 0.6) added to the ROC plot]

Page 27: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.4
  - FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.2, 0.4) added to the ROC plot]

Page 28: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.4
  - FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.0, 0.4) added to the ROC plot]

Page 29: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.2
  - FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.0, 0.2) added to the ROC plot]

Page 30: ROC Curve

- Suppose the threshold is set at the yellow line
  - Above yellow, classified as positive
  - Below yellow, negative
- In this case,
  - TPR = 0.0
  - FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0

[Figure: scatterplot with the threshold (yellow line), and the point (FPR, TPR) = (0.0, 0.0) added to the ROC plot]

Page 31: ROC Curve

- Connect the dots… This is the ROC curve
- What good is it?
  - Captures info with respect to all possible thresholds
  - Removes the threshold as a factor in the analysis
- What does it all mean? (see the sketch below)

[Figure: the ROC curve traced through the (FPR, TPR) points from the preceding slides]
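A sketch of the construction the last several slides stepped through: sweep the threshold through the range of scores and record an (FPR, TPR) point at each step, again assuming higher score means positive.

```python
def roc_points(match_scores, nomatch_scores):
    """Return the ROC curve as a sorted list of (FPR, TPR) points."""
    points = {(0.0, 0.0), (1.0, 1.0)}  # endpoints: threshold above/below all scores
    for t in set(match_scores) | set(nomatch_scores):
        tpr = sum(s >= t for s in match_scores) / len(match_scores)
        fpr = sum(s >= t for s in nomatch_scores) / len(nomatch_scores)
        points.add((fpr, tpr))
    return sorted(points)
```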

Page 32: ROC Curve

- Random classifier?
  - Yellow 45-degree line
- Perfect classifier?
  - Red lines (Why?)
- Above the 45-degree line?
  - Better than random
  - The closer to the red, the closer to perfect

[Figure: ROC axes with the 45-degree diagonal (yellow) and the perfect classifier along the left and top edges (red)]

Page 33: Area Under the Curve (AUC)

- The ROC curve lives within a 1×1 square
- Random classifier?
  - AUC ≈ 0.5
- Perfect classifier (red)?
  - AUC = 1.0
- Example curve (blue)?
  - AUC = 0.8

[Figure: ROC axes showing the perfect classifier (red) and an example curve (blue)]

Page 34: Area Under the Curve (AUC)

- Area under the ROC curve quantifies success
  - 0.5 is like flipping a coin
  - 1.0 is perfection achieved
- AUC of the ROC curve
  - Enables us to compare different techniques (see the sketch below)
  - And no need to worry about the threshold

[Figure: ROC axes with an example curve]
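The AUC can be computed from those points with the trapezoid rule; a sketch, assuming the sorted output of roc_points above.

```python
def auc(points):
    """Area under the ROC curve via the trapezoid rule over (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return area
```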

Page 35: Partial AUC

- Might only consider cases where FPR < p
- “Partial” AUC is AUC_p (sketched below)
  - Area up to FPR of p
  - Normalized by p
- In this example, AUC_0.4 = 0.2 / 0.4 = 0.5 and AUC_0.2 = 0.08 / 0.2 = 0.4

[Figure: example ROC curve used to compute the partial AUC values above]
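Partial AUC is the same computation cut off at FPR = p and divided by p. A sketch, again assuming the sorted (FPR, TPR) list from roc_points, which starts at FPR = 0.

```python
def partial_auc(points, p):
    """AUC_p: area under the ROC curve for FPR <= p, normalized by p."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 >= p:  # this segment crosses FPR = p: interpolate, then stop
            y_p = y0 + (y1 - y0) * (p - x0) / (x1 - x0)
            area += (p - x0) * (y0 + y_p) / 2.0
            break
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / p
```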

Page 36: Imbalance Problem

- Suppose we train a model for a given malware family
- In practice, we expect to score many more non-family files than family files
  - Number of negative cases is large
  - Number of positive cases is small
- So what? Let’s consider an example

Page 37: Imbalance Problem

- In practice, we need a threshold
- For a given threshold, suppose sensitivity = 0.99 and specificity = 0.98
  - Then TPR = 0.99 and FPR = 0.02
- Assume 1 in 1000 files tested is malware
  - Of the type our model is trained to detect
- Suppose we scan, say, 100k files
  - What do we find?

Page 38: Imbalance Problem

- Assuming TPR = 0.99 and FPR = 0.02
  - And 1 in 1000 is malware
- After scanning 100k files…
  - Detect 99 of 100 actual malware (TP)
  - Misclassify 1 malware as benign (FN)
  - Correctly classify 97902 (out of 99900) benign as benign (TN)
  - Misclassify 1998 benign as malware (FP)

Page 39: Imbalance Problem

- We have 97903 files classified as benign
  - Of those, 97902 are actually benign
  - And 97902/97903 > 0.9999
- We classified 2097 files as malware
  - Of these, only 99 are actual malware
  - But 99/2097 < 0.05
- Remember the “boy who cried wolf”?
  - Here, we have a detector that cries wolf…

Page 40: Imbalance Solution?

- What to do?
- There is an inherent tradeoff between sensitivity and specificity
- Suppose we can adjust the threshold so that
  - TPR = 0.92 and FPR = 0.0003
- As before…
  - We have 1 in 1000 is malware
  - And we test 100k files

Page 41: Imbalance Solution?

- Assuming TPR = 0.92 and FPR = 0.0003
  - And 1 in 1000 is malware
- After scanning 100k files…
  - Detect 92 of 100 actual malware (TP)
  - Misclassify 8 malware as benign (FN)
  - Correctly classify 99870 (out of 99900) benign as benign (TN)
  - Misclassify 30 benign as malware (FP)

Page 42: Imbalance Solution?

- We have 99878 files classified as benign
  - Of those, all but 8 are actually benign
  - And 99870/99878 > 0.9999
- We classified 122 files as malware
  - Of these, 92 are actual malware
  - And 92/122 > 0.75
- Can adjust the threshold to further reduce the “crying wolf” effect (both scenarios are checked in the sketch below)
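The arithmetic in slides 38–39 and 41–42 can be reproduced with a short sketch; the last value returned is the fraction of “malware” alarms that are real.

```python
def imbalance_example(tpr, fpr, n_files=100_000, malware_rate=1 / 1000):
    """Expected outcome of scanning n_files when 1 in 1000 is malware."""
    positives = round(n_files * malware_rate)  # actual malware: 100
    negatives = n_files - positives            # actual benign: 99900
    tp = tpr * positives                       # malware detected
    fn = positives - tp                        # malware missed (FN)
    fp = fpr * negatives                       # benign flagged as malware (FP)
    tn = negatives - fp                        # benign passed (TN)
    return tp, fn, tn, fp, tp / (tp + fp)

print(imbalance_example(0.99, 0.02))    # 99, 1, 97902, 1998; 99/2097 < 0.05
print(imbalance_example(0.92, 0.0003))  # 92, 8, ~99870, ~30; 92/122 > 0.75
```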

Page 43: Imbalance Problem

- A better alternative?
- Instead of sacrificing TPR to lower the FPR
  - Perform secondary testing on files that are initially classified as malware
  - We can thus weed out most FP cases
- This gives us the best of both worlds
  - Low FPR, few benign files reported as malware
- No free lunch, so what’s the cost?

Page 44: Bottom Line

- Design your experiments properly
  - Use n-fold cross validation (e.g., n = 5)
  - Generally, cross validation is important
- Thresholding is important in practice
  - But not so useful for analyzing results
  - Accuracy is not so informative either
- Use ROC curves and compute the AUC
  - Sometimes, partial AUC is better
- The imbalance problem may be a significant issue

Page 45: References

- A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30:1145-1159, 1997