Detection of Metamorphic Malware Variants Using Global Control Flow Analysis



Post on 24-Feb-2016


TRANSCRIPT

Page 1: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis

© 2013 Applied Communication Sciences. A Business of the SI. All Rights Reserved. Registered Trademark of TT Government Solutions, Inc.

Detection of Metamorphic Malware Variants Using Global Control Flow Analysis

Shane R Snyder, CERDEC, US Army, Aberdeen Proving Ground, MD

Hira Agrawal, Lisa Bahler, Mike Little, Josephine Micallef, Systems & Security Research, Applied Communication Sciences, Basking Ridge, NJ

Page 2: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


The Problem

• Detect zero-day vulnerabilities arising from new variants of old malware
− A vast majority of existing malware is made up of variations of old malware
− There is a substantial lag between the time a new variant is discovered and the time its signature is added to the local AV signature database
− Machines remain vulnerable during this time window, which may often be long

Page 3: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Current Solutions are Inadequate

• Current techniques derive syntactic malware signatures based on specific byte sequences and on control structures such as flow graphs and call graphs

• These signatures are easily defeated by automated program diversification techniques, such as those employed by new metamorphic transformation engines:
− Adding spurious subgraphs within flow graphs
− Adding spurious functions and calls to those functions
− Inlining and outlining code into and out of existing functions

Page 4: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Approach

[Figure: Malware Family A (Variants A1, A2, Aa), Malware Family B (Variants B1, B2, Bb), and Malware Family X (Variants X1, X2, Xx).]

Current Approaches: one signature per variant (Signature A1, A2, Aa; Signature B1, B2, Bb; Signature X1, X2, Xx)

Proposed Approach: one abstract signature per family (Abstract Signature A, Abstract Signature B, Abstract Signature X)

Even though the variants differ in their control and data flow, they share a common set of core elements!

Page 5: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Abstract, Semantic Signatures

Original Version (nodes a, b, c, d): if x S1; S2; else S3; S4; S5;

A Sample Variant (nodes l through y): main(): if x f1(); else f3(); S5;   f1(): S1; f2();   f2(): S2;   f3(): S3; f4();   f4(): S4;

[Figure panels: (1) Flow Graph of the Original Version; (2) Flow Graph of the Sample Variant, with one flow graph each for main(), f1(), f2(), f3(), and f4(); (3) Signature of the Original Version: S5 with children "s1; s2" and "s3; s4"; (4) Signature of the Sample Variant: S5 with children "s1; s2" and "s3; s4".]

Page 6: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Construction of Local Signatures

(i) Malware (nodes a, b, c, d): if x S1; S2; else S3; S4; S5;

[Figure panels: (ii) Flow Graph; (iii) Pre-Dominator Tree; (iv) Post-Dominator Tree; (v) Merged-Dominator Graph; (vi) Super-block Dominator Graph and Tree, with super-block "a, d" (x, s5) and children b (s1; s2) and c (s3; s4); (vii) Malware Signature (Super-block Dominator Tree Projected Over Non-Predicate Nodes): d (s5) with children b (s1; s2) and c (s3; s4). Key: Predicate Node, Non-Predicate Node.]
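The pre- and post-dominator panels can be illustrated on the four-node flow graph of the example. The following is a minimal sketch, not the authors' implementation: it assumes the flow graph is a -> b, a -> c, b -> d, c -> d, uses a simple iterative dominator computation, and groups two nodes into one super-block whenever one dominates the other and is post-dominated by it. That pairwise test is a simplification chosen for illustration; the names `dominators` and `super_blocks` are hypothetical.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it}.

    succ maps each node to its set of successors. Assumes every node is
    reachable from entry.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if pred[n]:
                common = set.intersection(*(dom[p] for p in pred[n]))
            else:
                common = set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def super_blocks(succ, entry, exit_node):
    """Group nodes u, v when u dominates v and v post-dominates u
    (assumed simplification of the super-block construction)."""
    dom = dominators(succ, entry)
    reverse = {n: set() for n in succ}
    for n, targets in succ.items():
        for t in targets:
            reverse[t].add(n)
    # Post-dominators are dominators of the reversed graph, rooted at the exit.
    pdom = dominators(reverse, exit_node)
    block = {n: {n} for n in succ}
    for u in succ:
        for v in succ:
            if u != v and u in dom[v] and v in pdom[u]:
                merged = block[u] | block[v]
                for n in merged:
                    block[n] = merged
    return {frozenset(b) for b in block.values()}


# Assumed flow graph of "if x S1; S2; else S3; S4; S5;"
flow_graph = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
blocks = super_blocks(flow_graph, "a", "d")
```

Under these assumptions, `blocks` groups a with d and leaves b and c on their own, which agrees with the "a, d" super-block shown in panel (vi).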

Page 7: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Construction of Global Signatures

(i) Malware Variant: main: if x f1(); else f3(); S5;   f1(): S1; f2();   f2(): S2;   f3(): S3; f4();   f4(): S4;

[Figure panels: (ii) Flow Graphs of main, f1, f2, f3, and f4 (nodes l through y); (iv) Call Graph: main calls f1 and f3, f1 calls f2, f3 calls f4; (v) Intermediate Dominator Graph with blocks q; m,n; o,p; r,s,t; x; u,v,w; y; (vi) Mega Block Dominator Graph and Tree with blocks q; m,n,r,s,t,x; o,p,u,v,w,y; (vii) Variant Signature (Projected Mega Block Dominator Tree): q (S5) with children r,x (s1; s2) and u,y (s3; s4).]

Page 8: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Flow Graph Projection

[Figure: Original Flow Graph (nodes a, b, c, d, e, f, g, h, i, j, r) and Projected Flow Graph (nodes a, b, g, j, r). KEY: System or library call node; Entry or exit node; Other nodes.]

Page 9: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Flow Graph Projection Steps

1. Identify and mark all nodes that represent system or library calls
2. Mark the entry and exit nodes, if not already marked, as relevant
3. Merge all cycles made up entirely of irrelevant nodes (resulting nodes: "a,d,e", b, c, f, g, h, i, j, r)
4. Merge any irrelevant node that has a single predecessor node with that predecessor (resulting nodes: "a,d,e,f", "b,c", "g,h", i, j, r)
5. Merge any irrelevant node that has a single successor node with that successor (resulting nodes: "a,d,e,f", "b,c", "g,h", "j,i", r)
6. Assign the label of the first instruction of each node as the label of that node (resulting nodes: a, b, g, j, r)
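The two merge steps (single predecessor, single successor) can be sketched as a small fixpoint loop. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `project`, the graph encoding (node to set of successors), and the example graph are all hypothetical. It does not implement the cycle-merging step separately, and a merged node simply keeps the name of the node that absorbed it rather than a first-instruction label.

```python
def project(succ, relevant):
    """Merge irrelevant nodes into a single predecessor or a single
    successor until no such merge applies.

    succ maps node -> set of successor nodes; relevant is the set of
    nodes to preserve (system/library calls, entry, exit).
    """
    graph = {n: set(s) for n, s in succ.items()}
    changed = True
    while changed:
        changed = False
        for n in sorted(graph):
            if n in relevant:
                continue
            preds = {p for p in graph if n in graph[p] and p != n}
            succs = graph[n] - {n}
            if len(preds) == 1:
                # Fold n into its only predecessor, dropping self-loops.
                (p,) = preds
                graph[p] = (graph[p] | succs) - {n, p}
            elif len(succs) == 1:
                # Fold n into its only successor: redirect incoming edges.
                (s,) = succs
                for p in preds:
                    graph[p] = (graph[p] - {n}) | ({s} - {p})
            else:
                continue
            del graph[n]
            changed = True
            break
    return graph


# Hypothetical flow graph: node "b" stands for a system or library call.
example = {
    "entry": {"a"}, "a": {"b", "c"}, "b": {"d"},
    "c": {"d"}, "d": {"exit"}, "exit": set(),
}
projected = project(example, relevant={"entry", "b", "exit"})
```

In this example the irrelevant nodes a and c fold into entry and d folds into exit, leaving only entry, b, and exit with the paths between them preserved.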

Page 10: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Call Graph Projection

[Figure: Original Call Graph (functions f1, f2, f3, f4, f5, f6) and Projected Call Graph (functions f1, f4, f5). KEY: Relevant function, which contains a system/library call; Function that directly or indirectly calls a relevant function; Other functions.]

Page 11: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Call Graph Projection Steps

• Mark the root node of the call graph, if not already marked, as relevant
• Identify and mark any node that represents a function containing a relevant node as relevant
• Mark all predecessors of all marked nodes, as well as all their call sites, as relevant
• Remove all irrelevant nodes from which no relevant nodes may be reached
• Merge any irrelevant node that has a single predecessor node with that node
• Make the label of the first instruction in each node the label of that node

[Figure: the call graph with functions f1 through f6 after each step; intermediate graphs contain the merged node "f5,f6"; the final projected graph contains f1, f4, and f5.]
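The marking and removal steps amount to keeping every function that contains a system or library call, or that can reach such a function through calls. The sketch below illustrates only that part (no merging or relabeling). The function name `project_call_graph` and the call edges are assumptions for illustration; the slide does not list the edges of its example graph.

```python
def project_call_graph(calls, has_syscall, root):
    """Keep the root, every function with a system/library call, and every
    function that directly or indirectly calls one of those.

    calls maps function -> set of callees.
    """
    keep = set(has_syscall) | {root}
    changed = True
    while changed:
        changed = False
        for f, callees in calls.items():
            if f not in keep and callees & keep:
                keep.add(f)
                changed = True
    # Drop removed functions and any edges that pointed at them.
    return {f: calls[f] & keep for f in calls if f in keep}


# Hypothetical edges; only f5 is assumed to contain a system/library call.
calls = {
    "f1": {"f2", "f4"}, "f2": {"f3"}, "f3": set(),
    "f4": {"f5"}, "f5": {"f6"}, "f6": set(),
}
projected_calls = project_call_graph(calls, has_syscall={"f5"}, root="f1")
```

With these assumed edges, f2, f3, and f6 are removed because no relevant function is reachable from them, leaving the chain f1, f4, f5.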

Page 12: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Concept of Operations

1. Offline Analysis: each instance of a known malware binary is (a) analyzed and transformed into a graph-based representation, from which (b) an abstract, graph-based signature is generated and stored in the malware library.

2. Online Analysis: a new, unknown binary instance goes through the same steps (a) and (b) to produce the abstract signature of the new binary. Then (c) compare signatures: find the nearest neighbor of the new binary in the malware library and compute the distance between them. Finally, (d) test whether distance < neighbor’s threshold: if yes, malware instance; if no, benign instance.

For efficiency, the chosen “distance” measure must satisfy the triangle inequality!
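The online decision (steps c and d) reduces to a nearest-neighbor lookup followed by a threshold test. The sketch below assumes a library of (name, signature, threshold) entries and takes the distance function as a parameter; the toy integer "signatures" and absolute-difference distance are stand-ins chosen only so the example runs, not the tree-based signatures of the slides.

```python
def classify(new_signature, library, distance):
    """Return (verdict, nearest name or None, distance).

    library is a list of (name, signature, threshold) tuples; each malware
    entry carries its own threshold, as in step (d).
    """
    name, signature, threshold = min(
        library, key=lambda entry: distance(new_signature, entry[1])
    )
    d = distance(new_signature, signature)
    if d < threshold:
        return ("malware", name, d)
    return ("benign", None, d)


# Toy stand-ins: integers as signatures, absolute difference as distance.
library = [("A1", 10, 3), ("B1", 50, 5)]
toy_distance = lambda x, y: abs(x - y)

verdict_near = classify(12, library, toy_distance)
verdict_far = classify(35, library, toy_distance)
```

This brute-force version computes the distance to every library entry; the triangle-inequality requirement on the slide exists precisely so that most of those computations can be skipped.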

Page 13: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Tree Edit Distance

T1: u has children v and x; v has child w; x has child y
T2: T1 after insert (z,1,w), where z becomes the first child of w
T3: T2 after delete (v,1,u), so u has children w and x; w has child z; x has child y

Edit Distance, De(Ti,Tj) = the length of the shortest Edit Script (Ti,Tj)

Edit Script (T1,T3): (i) insert z as the first child of w, (ii) delete v, the 1st child of u
or, (i) delete v, the 1st child of u, (ii) insert z as the first child of w
or, (i) relabel v to w, (ii) relabel w to z

De(T1,T3) = 2
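The first edit script can be replayed mechanically to confirm that it turns T1 into T3. This sketch only applies a given script; it does not search for the shortest one. The tree encoding (label to ordered child list) and the helper names are assumptions, and `insert_leaf` covers only the leaf-insertion case used on the slide.

```python
def insert_leaf(tree, node, position, parent):
    """Insert node as the position-th child (1-based) of parent."""
    tree[parent].insert(position - 1, node)
    tree[node] = []


def delete_node(tree, node, parent):
    """Delete node; its children take its place under parent, in order."""
    i = tree[parent].index(node)
    tree[parent][i:i + 1] = tree.pop(node)


# T1 from the slide: u -> (v, x), v -> w, x -> y
t1 = {"u": ["v", "x"], "v": ["w"], "x": ["y"], "w": [], "y": []}

tree = {label: list(children) for label, children in t1.items()}
insert_leaf(tree, "z", 1, "w")   # (i) insert z as the first child of w
delete_node(tree, "v", "u")      # (ii) delete v, the 1st child of u

# T3 from the slide: u -> (w, x), w -> z, x -> y
t3 = {"u": ["w", "x"], "w": ["z"], "x": ["y"], "y": [], "z": []}
```

After the two operations `tree` equals `t3`, matching the script length of 2.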

Page 14: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Tree Edit Distance (cont’d)

[Figure: trees T1, T2, and T3]

De(T1,T2) = 1
De(T2,T3) = 1
De(T1,T3) = 2

De satisfies the triangle inequality!

De(T1,T3) ≤ De(T1,T2) + De(T2,T3)

Page 15: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Tree Edit Distance (cont’d)

[Figure: T1 (5 nodes) and T2 (6 nodes) are one edit apart; T3 (105 nodes) and T4 (106 nodes), each with 100 additional nodes, are also one edit apart.]

De(T1,T2) = 1
De(T3,T4) = 1 !!!

Page 16: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Normalized Tree Edit Distance

Dn(Ti,Tj) = De(Ti,Tj) / (|Ti|+|Tj|)

[Figure: T1 (5 nodes), T2 (6 nodes), T3 (105 nodes), T4 (106 nodes)]

Dn(T1,T2) = 1/11
Dn(T3,T4) = 1/211 « 1/11

Page 17: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Normalized Tree Edit Distance (cont’d)

[Figure: T1 (5 nodes), T2 (6 nodes), T3 (5 nodes)]

Dn(T1,T2) = 1/11 = 10/110
Dn(T2,T3) = 1/11 = 10/110
Dn(T1,T3) = 2/10 = 22/110 > 20/110

Dn(T1,T3) > Dn(T1,T2) + Dn(T2,T3)

Dn does NOT satisfy the triangle inequality!

Page 18: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Normalized, Metric Tree Edit Distance

Dnm(T1,T2) = 2*De(T1,T2) / (|T1|+|T2| + De(T1,T2))

[Figure: T1 (5 nodes), T2 (6 nodes), T3 (5 nodes)]

Dnm(T1,T2) = 2*1/(11+1) = 2/12
Dnm(T2,T3) = 2*1/(11+1) = 2/12
Dnm(T1,T3) = 2*2/(10+2) = 4/12

Dnm(T1,T3) ≤ Dnm(T1,T2) + Dnm(T2,T3)

Dnm satisfies the triangle inequality!
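The arithmetic on the last three slides can be checked with exact fractions. The sketch below takes the edit distances and tree sizes from the slides as given inputs (it does not compute tree edit distance) and tests the triangle inequality for the one triple shown; the function names `dn`, `dnm`, and `triangle_holds` are illustrative.

```python
from fractions import Fraction


def dn(de, size_i, size_j):
    """Normalized distance: De / (|Ti| + |Tj|)."""
    return Fraction(de, size_i + size_j)


def dnm(de, size_i, size_j):
    """Normalized metric distance: 2*De / (|Ti| + |Tj| + De)."""
    return Fraction(2 * de, size_i + size_j + de)


# Values taken from the slides.
sizes = {"T1": 5, "T2": 6, "T3": 5}
edit = {("T1", "T2"): 1, ("T2", "T3"): 1, ("T1", "T3"): 2}


def triangle_holds(measure):
    """Check d(T1,T3) <= d(T1,T2) + d(T2,T3) for the slide's three trees."""
    d = {
        pair: measure(edit[pair], sizes[pair[0]], sizes[pair[1]])
        for pair in edit
    }
    return d[("T1", "T3")] <= d[("T1", "T2")] + d[("T2", "T3")]


dn_ok = triangle_holds(dn)    # 1/5 versus 2/11
dnm_ok = triangle_holds(dnm)  # 1/3 versus 1/6 + 1/6
```

For this triple, `dn_ok` is False and `dnm_ok` is True (with equality), in line with the slides. A single triple is only a spot check, not a proof that Dnm is a metric.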

Page 19: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Detecting Variants Missed by AV

• We found a malware family with five variants
− Virus.Win32.Thorin
− Virus.Win32.Thorin.b
− Virus.Win32.Thorin.c
− Virus.Win32.Thorin.d
− Virus.Win32.Thorin.e

• A major AV product failed to detect the last variant, Virus.Win32.Thorin.e

• MAA correctly flags it as malware, based on the abstract signature it derives from the first variant

• It detects the other three variants as well, without requiring a separate, dedicated signature for each of them, as AV products often do

Page 20: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Checking for False Positives & Negatives

Platform   Number of Families   Number of Malware   Largest Family Size   Smallest Family Size
DOS        2                    2                   1                     1
FreeBSD    2                    7                   6                     1
id1242     1                    1                   1                     1
IIS        1                    4                   4                     4
IRC        2                    2                   1                     1
Linux      48                   67                  5                     1
MSIL       2                    2                   1                     1
MSWord     3                    3                   1                     1
Multi      11                   21                  4                     1
QNX        1                    2                   2                     2
Svat       1                    1                   1                     1
VBS        2                    2                   1                     1
Win32      822                  2742                89                    1
Win9x      133                  473                 62                    1
WinHLP     3                    5                   3                     1
Unknown    39                   39                  1                     1
Total      1034                 3373                89                    1

We chose ~500 Win32 samples from families of size 10 or more.

Page 21: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Fivefold Cross Validation Test

Test Run   # of Test Binaries   True Positives (Count)   True Positives (%)   False Negatives (Count)   False Negatives (%)
1          104                  103                      99%                  1                         1%
2          109                  103                      94%                  6                         6%
3          104                  94                       90%                  10                        10%
4          106                  99                       93%                  7                         7%
5          105                  105                      100%                 0                         0%
Total      528                  504                      95%                  24                        5%

Page 22: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Checking Against Benign Samples

• We randomly picked a Windows 7 system folder containing over 400 executables, ranging in size from 8 KB to over 70 KB.

• The false positive rate decreases predictably as the distance threshold is reduced.

Distance Threshold   Misclassifications (Count)   Misclassifications (%)
0.20                 103                          < 25%
0.19                 63                           < 16%
0.17                 28                           < 7%
0.15                 15                           < 4%
0.10                 4                            < 1%
0.05                 1                            < 0.25%

Page 23: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


The Need for Identifying Sub Families

[Figure: Malware Library Distance Matrix, with the families Bagle, Klez, Mydoom, Mimail, Netsky, and Roron marked.]

Page 24: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Request Latency

[Figure: latency by Request # (1 to 103), vertical axis from 0 to 20,000. Series: Total, Profiling, Uncompression, Unpacking, Disassembly, SigGeneration, SigMatching.]

Page 25: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Request Latency (cont’d)

Latency                Time (s)   Fraction
Total                  5,427      100%
Disassembly            3,425      63%
Profiling              594        11%
Unpacking              240        4%
Signature Matching     205        4%
Signature Generation   185        3%
Uncompression          26         0.5%

Page 26: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


System Overhead

[Figure: CPU Utilization and Memory Utilization (%, 0 to 100) by Measurement Point in Time (1 to 267).]

Average Utilization   Idle   Active
CPU                   2%     54%
Memory                30%    35%

Page 27: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Next Steps

• Linear AESA for Faster Signature Library Generation

• Identification of Malware Sub Families

• Automated Removal of “Redundant” Variants

• Family Specific Distance Thresholds

• Malware Fragment Matching

• Smart Label Matching

Page 28: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Powerhouse Research. Practical Solutions.

Applied Communication Sciences and design logo is a registered trademark of TT Government Solutions, Inc.

Page 29: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Backup Slides

Page 30: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Incorporate Data Flow in Signatures

• Some malware obfuscations inject redundant, unrelated instructions, which do not affect the malware's external behavior but change its abstract signature

• Data flow analysis can help detect and eliminate such instructions
− It can help determine direct and indirect data dependencies of relevant instructions, i.e., system and library calls
− Programs make such calls to affect their environment: files, network, registry, other processes, etc.
− Malware must make such calls to accomplish its goals
− Nodes that do not have an influence on any system/library call can then be removed from abstract signatures

Page 31: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


The Need for Data Flow Analysis

Original Malware (nodes a, b, c, d): if x u = ...; v = ...; else u = -u; v = -v; write(u*v);

Control Flow Based Abstract Signature of Original Malware: d with children b and c

Obfuscated Code (nodes a, b, c, d, m, n, p, q): if x u = ...; v = ...; y = 1024; if !x u = -u; v = -v; while (y > 0) y -= u*v; write(u*v);

Control Flow Based Abstract Signature of Malware Variant: m,d with children b, c, and q

Page 32: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


The Need for Data Flow Analysis (cont’d)

Obfuscated Code (nodes a, b, c, d, m, n, p, q): if x u = ...; v = ...; y = 1024; if !x u = -u; v = -v; while (y > 0) y -= u*v; write(u*v);

Control Flow Based Abstract Signature of Malware Variant: m,d with children b, c, and q

Control & Data Flow Based Abstract Signature of Original Malware: d with children b and c

Control & Data Flow Based Abstract Signature of Malware Variant: d with children b and c
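The idea of keeping only statements that influence a system or library call can be sketched as a backward closure over definitions and uses. This is a deliberately coarse, flow-insensitive approximation that ignores control dependence, offered as an illustration rather than the analysis used by the authors. The statement names and the (name, defs, uses) encoding of the obfuscated code are hypothetical.

```python
def influences_calls(statements, call_names):
    """Return the names of statements that may influence a call.

    statements is a list of (name, defined variables, used variables).
    A statement is kept if it defines a variable that a kept statement uses.
    """
    relevant = set(call_names)
    needed = set()
    for name, _, uses in statements:
        if name in relevant:
            needed |= uses
    changed = True
    while changed:
        changed = False
        for name, defs, uses in statements:
            if name not in relevant and defs & needed:
                relevant.add(name)
                needed |= uses
                changed = True
    return relevant


# Hypothetical encoding of the obfuscated code from the slide.
obfuscated = [
    ("u_init", {"u"}, set()),            # u = ...
    ("v_init", {"v"}, set()),            # v = ...
    ("y_init", {"y"}, set()),            # y = 1024 (injected)
    ("u_neg", {"u"}, {"u"}),             # u = -u
    ("v_neg", {"v"}, {"v"}),             # v = -v
    ("y_loop", {"y"}, {"y", "u", "v"}),  # while (y > 0) y -= u*v (injected)
    ("write", set(), {"u", "v"}),        # write(u*v), the system call
]
kept = influences_calls(obfuscated, call_names={"write"})
```

The injected statements on y are dropped because nothing that reaches write ever reads y, which is the effect the slide attributes to adding data flow to the signature.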

Page 33: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Determining the Nearest Neighbor

Goal: Given (1) a database, D, of malware binaries, whose distances from one another are known, and (2) a new, query binary, q, which is not in D, find the nearest neighbor, n, of q in D.

Naïve Method: (1) Compute q’s distance from every malware binary in D. (2) The one with the shortest distance is the nearest neighbor of q.

Problem: Computing edit distances between graphical structures, including trees, is an expensive operation.

Solution: Exploit the triangle inequality to avoid many distance computations.

Page 34: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Determining the Nearest Neighbor (cont’d)

KEY

• New, query binary
• Malware binary whose distance from the query binary has been computed
• Malware binary whose distance from the query binary is, currently, unknown
• Currently known nearest malware neighbor of the query binary
• Malware binary that has been removed from further consideration as it cannot possibly be the nearest neighbor of the query binary in D
• Malware binary whose distance from the query binary is to be computed next

Page 35: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Exploiting Triangle Inequality

Each side of a triangle is ≤ the sum of the other two sides!

Each side of a triangle is ≥ the difference between the other two sides!

[Figure: triangle with vertices p, q, r and sides a, b, c]

Pick any two sides, say a & b

Suppose b ≥ a: b ≤ a + c, so b – a ≤ c, so c ≥ b – a, so c ≥ | b – a |

Otherwise (b < a): a ≤ b + c, so a – b ≤ c, so c ≥ a – b, so c ≥ | b – a |

[Figure: query q, candidate r, nearest neighbor n, and malware binaries p1, p2, ..., pi]

qr ≥ | q pi – pi r | for all pi
qr ≥ max | q pi – pi r | over all pi = lowerbound (qr)

If lowerbound (q r) ≥ q n, then q r ≥ q n. There is no need to compute q r!

Page 36: Detection of Metamorphic Malware Variants Using Global Control Flow Analysis


Exploiting Triangle Inequality (cont’d)

Associate a LowerBound with each pi, and initialize it to zero.

Every time pi q is computed:
• Update nearest neighbor, n, and the corresponding d
• Update LowerBound(pj q) for all j as: LowerBound(pj q) = max(LowerBound(pj q), | pj n – n q | )
• if (LowerBound(pj q) > d) then remove pj from further consideration!

[Figure: query q, current nearest neighbor n at distance d, and candidates pi, pj, pk, with pj eliminated]
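The elimination procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation. Assumptions: library distances are available as a precomputed matrix, the expensive query distance is a callback, the next candidate is the one with the smallest current lower bound, and bounds are updated using the binary whose distance was just computed as the pivot (the triangle inequality holds for any pivot, so this is a valid bound). The names `nearest_neighbor`, `library_dist`, and `query_dist` are hypothetical, and the one-dimensional example data is made up.

```python
import math


def nearest_neighbor(library_dist, query_dist):
    """Find the library item nearest to the query.

    library_dist[i][j] is the known distance between library items i and j.
    query_dist(i) computes the (expensive) distance from the query to item i.
    Returns (index of nearest item, its distance, number of query_dist calls).
    """
    n = len(library_dist)
    lower = [0.0] * n          # LowerBound for each library item
    alive = set(range(n))      # items not yet computed or eliminated
    best, best_d, computed = None, math.inf, 0
    while alive:
        # Assumed heuristic: try the candidate with the smallest lower bound.
        i = min(sorted(alive), key=lambda j: lower[j])
        alive.remove(i)
        d = query_dist(i)
        computed += 1
        if d < best_d:
            best, best_d = i, d
        for j in list(alive):
            # Triangle inequality: dist(q, j) >= |dist(i, j) - dist(q, i)|
            lower[j] = max(lower[j], abs(library_dist[i][j] - d))
            if lower[j] > best_d:
                alive.discard(j)  # cannot be the nearest neighbor
    return best, best_d, computed


# Made-up metric space: points on a line with absolute difference.
points = [0, 10, 50, 90, 100]
library_dist = [[abs(a - b) for b in points] for a in points]
result = nearest_neighbor(library_dist, lambda i: abs(points[i] - 86))
```

In this toy run the nearest item is found after two distance computations instead of five, because the bounds derived from the first two eliminate the remaining three candidates.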
