Data Analysis
Mark Stamp
Topics
- Experimental design
  o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.
- Accuracy
  o False positive, false negative, etc.
- ROC curves
  o Area under the ROC curve (AUC)
  o Partial AUC (sometimes written as AUCp)
Objective
- Assume that we have a proposed method for detecting malware
- We want to determine how well it performs on a specific dataset
  o We want to quantify effectiveness
- Ideally, compare to previous work
  o But it is often difficult to compare directly
  o Comparisons to AV products?
Basic Assumptions
- We have a set of known malware
  o All from a single (metamorphic) "family"...
  o ...or, at least, all of a similar type
  o For broader "families", more difficult
- Also, a representative non-family set
  o Often assumed to be benign files
  o The more diverse, the more difficult
- Much depends on problem specifics
Experimental Design
- Want to test a malware detection score
  o Refer to the malware dataset as the match set
  o And the benign dataset as the nomatch set
- Partition the match set into...
  o Training set, used to determine parameters of the scoring function
  o Test set, reserved to test the scoring function generated from the training set
- Note: Cannot test on the training set
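The partition above can be sketched in a few lines of Python; the helper name and the 80/20 split fraction are illustrative choices, not from the slides:

```python
import random

def split_match_set(match_set, test_fraction=0.2, seed=0):
    """Shuffle the match set and split it into disjoint training and test sets."""
    files = list(match_set)
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    return files[n_test:], files[:n_test]  # (training set, test set)

# Hypothetical match set of 10 samples, split 80/20
train, test = split_match_set([f"sample{i}" for i in range(10)], test_fraction=0.2)
```

Because the two sets are disjoint, the "cannot test on the training set" rule is enforced by construction.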
Training and Scoring
- Two phases: training and scoring
- Training phase
  o Train a model using the training set
- Scoring phase
  o Score the data in the test set, and score the nomatch (benign) set
- Analyze results from the scoring phase
  o Assume results are representative of the general case
Scatterplots
- Train a model on the training set
- Apply the score to the test and nomatch sets
  o Can visualize the result as a scatterplot

[Figure: scatterplot of score vs. test case, with match scores and nomatch scores plotted separately]
Experimental Design
- A couple of potential problems...
  o How to partition the match set?
  o How to get the most out of a limited dataset?
- Why are these things concerns?
  o When we partition the match set, we might get biased training/test sets, and...
  o ...more data points is "more better"
- Cross validation solves these problems
n-fold Cross Validation
- Partition the match set into n equal subsets
  o Denote the subsets as S1, S2, ..., Sn
- Let the training set be S2 ∪ S3 ∪ ... ∪ Sn
  o And the test set is S1
- Repeat with training set S1 ∪ S3 ∪ ... ∪ Sn
  o And test set S2
- And so on, for each of the n "folds"
  o In our work, we usually select n = 5
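A minimal sketch of the n-fold procedure described above; the function name is ours, and forming subsets by striding is just one of several reasonable ways to partition:

```python
def n_fold_cross_validation(match_set, n=5):
    """Yield (training set, test set) pairs for each of the n folds.

    Fold i holds out subset S_i for testing and trains on the union
    of the remaining n-1 subsets, as described above.
    """
    subsets = [match_set[i::n] for i in range(n)]  # n roughly equal subsets
    for i in range(n):
        test_set = subsets[i]
        train_set = [x for j, s in enumerate(subsets) if j != i for x in s]
        yield train_set, test_set

# 20 match samples, n = 5: each fold trains on 16 and tests on 4
folds = list(n_fold_cross_validation(list(range(20)), n=5))
```

Every match sample is scored exactly once as test data, which is how cross validation yields many more match scores from a limited match set.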
n-fold Cross Validation
- Benefits of cross validation?
- Any bias in the match data is smoothed out
  o Since bias only affects one (or a few) of the Si
- Obtain lots more match scores
  o Usually, there is no shortage of nomatch data
  o But match data can be very limited
- And it's easy to do, so why not?
  o Best of all, it sounds so fancy...
Thresholding
- Threshold based on test vs nomatch scores
  o After the training and scoring phases
- Ideal is complete separation
  o I.e., no overlap in the scatterplot
  o Usually, that doesn't happen
  o So, where to set the threshold?
- In practical use, thresholding is critical
  o At the research stage, it's more of a distraction
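As one illustration (not a method from the slides), a naive heuristic places the threshold halfway between the two score populations; a deployed system would tune this choice far more carefully:

```python
def midpoint_threshold(match_scores, nomatch_scores):
    """Place the threshold halfway between the mean match score and the
    mean nomatch score (assumes match scores tend to be higher)."""
    mean_match = sum(match_scores) / len(match_scores)
    mean_nomatch = sum(nomatch_scores) / len(nomatch_scores)
    return (mean_match + mean_nomatch) / 2

# Well-separated toy scores: the midpoint lands cleanly between the groups
t = midpoint_threshold([8.0, 9.0, 10.0], [1.0, 2.0, 3.0])  # 5.5
```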
Thresholding
- Where to set the threshold?
  o The left case is easy; the right case, not so much

[Figure: two scatterplots of score vs. test case, one with well-separated match/nomatch scores, one with overlapping scores]
Quantifying Success
- We need a way to quantify "better"
  o Ideas?

[Figure: the same two scatterplots of score vs. test case]
Accuracy
- Given a scatterplot and a threshold...
- We have the following 4 cases
  o True positive: correctly classified as +
  o False positive: incorrectly classified as +
  o True negative: correctly classified as −
  o False negative: incorrectly classified as −
- TP, FP, TN, FN, respectively, for short
  o Append "R" to each for "rate"
Sensitivity and Specificity
- The TPR is also known as sensitivity, while the TNR is known as specificity
- Consider a medical test
  o Sensitivity is the percentage of sick people who "pass" the test (as they should)
  o Specificity is the percentage of healthy people who "fail" the test (as they should)
- There is an inherent tradeoff between TPR and TNR
  o Note that these depend on the threshold
Accuracy
- Let P be the number of positive cases tested and N the number of negative cases tested
  o Note: P is the size of the test set, N the size of the nomatch set
  o Also, P = TP + FN and N = TN + FP
- Finally, accuracy = (TP + TN) / (P + N)
  o Note that accuracy ranges from 0 to 1
  o Accuracy of 1 is the ideal case
  o Accuracy of 0? Don't give up your day job...
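The four counts and the accuracy formula above can be computed directly from the two score sets; the helper names and the convention that a score at or above the threshold counts as positive are our assumptions:

```python
def confusion_counts(match_scores, nomatch_scores, threshold):
    """Count TP, FP, TN, FN, treating a score at or above the
    threshold as positive (higher score = more malware-like)."""
    tp = sum(s >= threshold for s in match_scores)
    fn = len(match_scores) - tp          # P = TP + FN
    fp = sum(s >= threshold for s in nomatch_scores)
    tn = len(nomatch_scores) - fp        # N = TN + FP
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)   # (TP + TN) / (P + N)

# Toy scores: 4 match cases, 4 nomatch cases, threshold = 5
tp, fp, tn, fn = confusion_counts([7, 8, 9, 3], [1, 2, 6, 4], threshold=5)
acc = accuracy(tp, fp, tn, fn)   # 3 TP, 1 FP, 3 TN, 1 FN -> accuracy 0.75
```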
Balanced Accuracy
- Often, there is a large imbalance between the test set and the nomatch set
  o The test set is small relative to the nomatch set
- Define balanced accuracy = (TPR + TNR) / 2 = 0.5 TP/P + 0.5 TN/N
  o Errors on both sets are weighted the same
- We consider the imbalance issue again later
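A small sketch of why balanced accuracy matters on imbalanced data; the counts below are made-up illustrative numbers, not from the slides:

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Balanced accuracy = (TPR + TNR) / 2 = 0.5*TP/P + 0.5*TN/N."""
    tpr = tp / (tp + fn)   # P = TP + FN
    tnr = tn / (tn + fp)   # N = TN + FP
    return (tpr + tnr) / 2

# Imbalanced example: 10 positives vs 990 negatives.
# A classifier that finds only half the positives but nearly all negatives:
tp, fn = 5, 5
tn, fp = 980, 10
plain = (tp + tn) / 1000                       # 0.985 -- looks great
balanced = balanced_accuracy(tp, fp, tn, fn)   # about 0.745 -- less rosy
```

Plain accuracy is dominated by the huge negative set, while balanced accuracy exposes that half the positives were missed.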
Accuracy
- Accuracy tells us something...
  o But it depends on where the threshold is set
  o How should we set the threshold?
  o Seems we are going around in circles, like a dog chasing its tail
- Bottom line? We still don't have a good way to compare different techniques
  o Next slide, please...
ROC Curves
- Receiver Operating Characteristic
  o Originated in electrical engineering
  o But now widely used in many fields
- What is an ROC curve?
  o Plot TPR vs FPR by varying the threshold through the range of scores
  o That is, FPR on the x-axis, TPR on the y-axis
  o Equivalently, 1 − specificity vs sensitivity
  o What the ... ?
ROC Curve
- Suppose the threshold is set at the yellow line in the scatterplot
  o Above yellow, classified as positive
  o Below yellow, classified as negative
- Each threshold position gives one point, with TPR and FPR = 1.0 − TNR
- Sweeping the threshold from below all scores to above all scores gives the following sequence of points in this example (one slide per row):

    TNR    FPR = 1.0 − TNR    TPR
    0.0    1.0                1.0
    0.2    0.8                1.0
    0.4    0.6                1.0
    0.6    0.4                1.0
    0.6    0.4                0.8
    0.6    0.4                0.6
    0.8    0.2                0.6
    0.8    0.2                0.4
    1.0    0.0                0.4
    1.0    0.0                0.2
    1.0    0.0                0.0

[Figure on each slide: scatterplot of score vs. test case with the yellow threshold line, and the corresponding point plotted on TPR vs. FPR axes]
ROC Curve
- Connect the dots... this is the ROC curve
- What good is it?
  o Captures info with respect to all possible thresholds
  o Removes the threshold as a factor in the analysis
- What does it all mean?

[Figure: the resulting ROC curve plotted on TPR vs. FPR axes]
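The dot-generating process can be sketched directly: sweep the threshold through the observed scores and record each (FPR, TPR) pair. The function name and the tie-handling convention (score ≥ threshold is positive) are our assumptions:

```python
def roc_points(match_scores, nomatch_scores):
    """Sweep the threshold through all observed scores and return the
    (FPR, TPR) points of the ROC curve, sorted by FPR."""
    points = set()
    thresholds = sorted(set(match_scores) | set(nomatch_scores))
    # Use each score as a threshold, plus one above every score
    # (which yields TPR = FPR = 0.0).
    for t in thresholds + [max(thresholds) + 1]:
        tpr = sum(s >= t for s in match_scores) / len(match_scores)
        fpr = sum(s >= t for s in nomatch_scores) / len(nomatch_scores)
        points.add((fpr, tpr))
    return sorted(points)

# Perfectly separated toy scores: the curve passes through
# (0, 0), (0, 1), and (1, 1), i.e., the perfect classifier
pts = roc_points([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])
```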
ROC Curve
- Random classifier?
  o Yellow 45-degree line
- Perfect classifier?
  o Red lines (Why?)
- Above the 45-degree line?
  o Better than random
  o The closer to the red, the closer to perfect

[Figure: ROC plot with the 45-degree diagonal (random) and the left/top edges (perfect)]
Area Under the Curve (AUC)
- The ROC curve lives within a 1x1 square
- Random classifier?
  o AUC ≈ 0.5
- Perfect classifier (red)?
  o AUC = 1.0
- Example curve (blue)?
  o AUC = 0.8

[Figure: ROC plot showing the area under an example curve]
Area Under the Curve (AUC)
- The area under the ROC curve quantifies success
  o 0.5 is like flipping a coin
  o 1.0 is perfection achieved
- The AUC of the ROC curve
  o Enables us to compare different techniques
  o And no need to worry about the threshold

[Figure: ROC plot with the area under the curve shaded]
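The AUC can be computed from the ROC points with the trapezoidal rule; the points below are the eleven (FPR, TPR) pairs from the threshold walkthrough earlier, sorted by FPR:

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points sorted by FPR,
    using the trapezoidal rule on each segment."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# The stepwise example curve from the slides
example = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.2, 0.6),
           (0.4, 0.6), (0.4, 0.8), (0.4, 1.0), (0.6, 1.0), (0.8, 1.0),
           (1.0, 1.0)]
area = auc(example)   # about 0.8, matching the example AUC above
```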
Partial AUC
- Might only consider cases where FPR < p
- "Partial" AUC is AUCp
  o Area up to an FPR of p
  o Normalized by p
- In this example, AUC0.4 = 0.2 / 0.4 = 0.5 and AUC0.2 = 0.08 / 0.2 = 0.4

[Figure: ROC plot with the region FPR < p shaded]
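A sketch of AUCp over the same example ROC points: clip the curve at FPR = p, take the area of the clipped region, and normalize by p. The function name is ours:

```python
def partial_auc(points, p):
    """AUC_p: area under the ROC curve restricted to FPR <= p,
    normalized by p; points are (FPR, TPR) sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= p:
            break
        x1c = min(x1, p)
        # Linearly interpolate TPR if the segment crosses FPR = p
        y1c = y1 if x1 <= p else y0 + (y1 - y0) * (p - x0) / (x1 - x0)
        area += (x1c - x0) * (y0 + y1c) / 2
    return area / p

# The same stepwise example curve as before
example = [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.2, 0.6),
           (0.4, 0.6), (0.4, 0.8), (0.4, 1.0), (0.6, 1.0), (0.8, 1.0),
           (1.0, 1.0)]
a04 = partial_auc(example, 0.4)   # 0.2 / 0.4 = 0.5
a02 = partial_auc(example, 0.2)   # 0.08 / 0.2 = 0.4
```

These reproduce the two AUCp values quoted on the slide.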
Imbalance Problem
- Suppose we train a model for a given malware family
- In practice, we expect to score many more non-family files than family files
  o The number of negative cases is large
  o The number of positive cases is small
- So what? Let's consider an example
Imbalance Problem
- In practice, we need a threshold
- For a given threshold, suppose sensitivity = 0.99 and specificity = 0.98
  o Then TPR = 0.99 and FPR = 0.02
- Assume 1 in 1000 files tested is malware
  o Of the type our model is trained to detect
- Suppose we scan, say, 100k files
  o What do we find?
Imbalance Problem
- Assuming TPR = 0.99 and FPR = 0.02
  o And 1 in 1000 is malware
- After scanning 100k files...
  o Detect 99 of 100 actual malware (TP)
  o Misclassify 1 malware as benign (FN)
  o Correctly classify 97,902 (out of 99,900) benign as benign (TN)
  o Misclassify 1998 benign as malware (FP)
Imbalance Problem
- We have 97,903 files classified as benign
  o Of those, 97,902 are actually benign
  o And 97902/97903 > 0.9999
- We classified 2097 files as malware
  o Of these, only 99 are actual malware
  o But 99/2097 < 0.05
- Remember the "boy who cried wolf"?
  o Here, we have a detector that cries wolf...
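The arithmetic on these two slides is easy to verify in code; the helper name is ours, and the counts are expected values rounded to whole files:

```python
def expected_counts(total, prevalence, tpr, fpr):
    """Expected TP, FN, TN, FP when scanning `total` files, a fraction
    `prevalence` of which are malware of the type the model detects."""
    positives = round(total * prevalence)
    negatives = total - positives
    tp = round(tpr * positives)
    fn = positives - tp
    fp = round(fpr * negatives)
    tn = negatives - fp
    return tp, fn, tn, fp

# The slide's scenario: 100k files, 1 in 1000 malware, TPR 0.99, FPR 0.02
tp, fn, tn, fp = expected_counts(100_000, 1 / 1000, tpr=0.99, fpr=0.02)
precision = tp / (tp + fp)   # 99 / 2097, i.e., less than 0.05
```

Fewer than 1 in 20 files flagged as malware actually is malware, which is the "crying wolf" effect.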
Imbalance Solution?
- What to do?
- There is an inherent tradeoff between sensitivity and specificity
- Suppose we can adjust the threshold so that
  o TPR = 0.92 and FPR = 0.0003
- As before...
  o 1 in 1000 files is malware
  o And we test 100k files
Imbalance Solution?
- Assuming TPR = 0.92 and FPR = 0.0003
  o And 1 in 1000 is malware
- After scanning 100k files...
  o Detect 92 of 100 actual malware (TP)
  o Misclassify 8 malware as benign (FN)
  o Correctly classify 99,870 (out of 99,900) benign as benign (TN)
  o Misclassify 30 benign as malware (FP)
Imbalance Solution?
- We have 99,878 files classified as benign
  o Of those, all but 8 are actually benign
  o And 99870/99878 > 0.9999
- We classified 122 files as malware
  o Of these, 92 are actual malware
  o And 92/122 > 0.75
- Can adjust the threshold to further reduce the "crying wolf" effect
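The two thresholds can be compared on precision, the fraction of flagged files that really are malware; the helper below is illustrative:

```python
def precision(total, prevalence, tpr, fpr):
    """Expected fraction of files flagged as malware that are malware."""
    positives = round(total * prevalence)
    negatives = total - positives
    tp = tpr * positives   # expected true positives
    fp = fpr * negatives   # expected false positives
    return tp / (tp + fp)

# Original threshold vs the adjusted one, same 100k files, 1/1000 prevalence
loose = precision(100_000, 1 / 1000, tpr=0.99, fpr=0.02)     # under 0.05
strict = precision(100_000, 1 / 1000, tpr=0.92, fpr=0.0003)  # over 0.75
```

Giving up a little TPR (0.99 to 0.92) buys a large reduction in false alarms, exactly the tradeoff the slides describe.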
Imbalance Problem
- A better alternative?
- Instead of lowering the TPR to reduce the FPR...
  o Perform secondary testing on files that are initially classified as malware
  o We can thus weed out most FP cases
- This gives us the best of both worlds
  o High TPR, yet few benign files ultimately reported as malware
- No free lunch, so what's the cost?
Bottom Line
- Design your experiments properly
  o Use n-fold cross validation (e.g., n = 5)
  o Generally, cross validation is important
- Thresholding is important in practice
  o But not so useful for analyzing results
  o Accuracy is not so informative either
- Use ROC curves and compute the AUC
  o Sometimes, partial AUC is better
- The imbalance problem may be a significant issue
References
- A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30:1145-1159, 1997