Detection of Metamorphic Malware Variants Using Global Control Flow Analysis

Post on 24-Feb-2016

DESCRIPTION

Detection of Metamorphic Malware Variants Using Global Control Flow Analysis. Hira Agrawal Lisa Bahler Mike Little Josephine Micallef Systems & Security Research Applied Communication Sciences Basking Ridge, NJ. Shane R Snyder CERDEC US Army Aberdeen Proving Ground, MD. The Problem. - PowerPoint PPT Presentation

TRANSCRIPT

1© 2013 Applied Communication Sciences. A Business of the SI

All Rights Reserved.Registered Trademark of TT Government Solutions, Inc.

Detection of Metamorphic Malware Variants Using Global Control Flow Analysis

Shane R Snyder, CERDEC, US Army, Aberdeen Proving Ground, MD

Hira Agrawal, Lisa Bahler, Mike Little, Josephine Micallef, Systems & Security Research, Applied Communication Sciences, Basking Ridge, NJ

The Problem

• Detect zero-day vulnerabilities arising from new variants of old malware

− A vast majority of existing malware is made up of variations of old malware

− There is a substantial lag between the time a new variant is discovered and the time its signature is added to the local AV signature database

− Machines remain vulnerable during this time window, which may often be long

Current Solutions are Inadequate

• Current techniques derive syntactic malware signatures—based on control structures such as specific byte sequences, flow graphs, and call graphs

• These signatures are easily defeated using automated program diversification techniques such as those employed by new metamorphic transformation engines

− Adding spurious subgraphs within flow graphs

− Adding spurious functions and calls to those functions

− Inlining and outlining code into and out of existing functions

Approach

[Figure: Malware Family A (Variants A1, A2, ..., Aa), Malware Family B (Variants B1, B2, ..., Bb), and Malware Family X (Variants X1, X2, ..., Xx). Current Approaches: one signature per variant (Signature A1, A2, ..., Aa; Signature B1, B2, ..., Bb; Signature X1, X2, ..., Xx). Proposed Approach: one abstract signature per family (Abstract Signature A, Abstract Signature B, Abstract Signature X).]

Even though the variants differ in their control and data flow, they share a common set of core elements!

Abstract, Semantic Signatures

Original Version:

if x S1; S2; else S3; S4; S5;

A Sample Variant:

if x f1(); else f3(); S5;
f1() S1; f2();
f2() S2;
f3() S3; f4();
f4() S4;

[Figure: (1) Flow Graph of the Original Version, with nodes a, b, c, and d; (2) Flow Graph of the Sample Variant, covering main(), f1(), f2(), f3(), and f4(), with nodes l to y; (3) Signature of the Original Version; (4) Signature of the Sample Variant. Each signature shows S5 above s1; s2 and s3; s4.]

Construction of Local Signatures

(i) Malware:

if x S1; S2; else S3; S4; S5;

[Figure panels, over flow graph nodes a (the predicate x), b (s1; s2), c (s3; s4), and d (s5): (ii) Flow Graph; (iii) Pre-Dominator Tree; (iv) Post-Dominator Tree; (v) Merged-Dominator Graph; (vi) Super-block Dominator Graph and Tree, with super-blocks a,d and b and c; (vii) Malware Signature (Super-block Dominator Tree Projected Over Non-Predicate Nodes), with s5 above s1; s2 and s3; s4. Key: Predicate Node, Non-Predicate Node.]
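The panels of this slide can be approximated in a few lines of Python. This is a minimal sketch, not the authors' implementation: it computes pre-dominator and post-dominator sets for the four-node flow graph with the textbook iterative algorithm, then groups nodes into super-blocks under the assumption (our reading of panel vi, where a and d are merged) that two nodes share a super-block when one dominates the other and is post-dominated by it. The function names are hypothetical.

```python
def dominators(succ, entry):
    """Iterative dominator sets: dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def reverse(succ):
    """Reverse every edge; dominators of the reversed graph are post-dominators."""
    rev = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            rev[t].append(n)
    return rev


def super_blocks(dom, pdom):
    """Assumed grouping rule: m and n share a block if m dominates n and n post-dominates m."""
    blocks = []
    for n in sorted(dom):
        for block in blocks:
            m = block[0]
            if (m in dom[n] and n in pdom[m]) or (n in dom[m] and m in pdom[n]):
                block.append(n)
                break
        else:
            blocks.append([n])
    return blocks


# Flow graph of: if x S1; S2; else S3; S4; S5;
flow = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
dom = dominators(flow, "a")
pdom = dominators(reverse(flow), "d")
print(super_blocks(dom, pdom))
```

On this graph the grouping yields the blocks a,d and b and c, which matches the super-blocks listed for panel (vi).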

Construction of Global Signatures

(i) Malware Variant:

if x f1(); else f3(); S5;
f1() S1; f2();
f2() S2;
f3() S3; f4();
f4() S4;

[Figure panels: (ii) Flow Graphs of main, f1, f2, f3, and f4, with nodes l to y; (iv) Call Graph over main, f1, f2, f3, and f4; (v) Intermediate Dominator Graph, with blocks q; m,n; o,p; r,s,t; x; u,v,w; y; (vi) Mega Block Dominator Graph and Tree, with blocks q; m,n,r,s,t,x; o,p,u,v,w,y; (vii) Variant Signature (Projected Mega Block Dominator Tree), with q (S5) above r,x (s1; s2) and u,y (s3; s4).]

Flow Graph Projection

[Figure: Original Flow Graph, with nodes a, b, c, d, e, f, g, h, i, j, and r, and Projected Flow Graph, with nodes a, b, g, j, and r. Key: system or library call node; entry or exit node; other nodes.]

Flow Graph Projection Steps

1. Identify and mark all nodes that represent system or library calls

2. Mark the entry and exit nodes, if not already marked, as relevant

3. Merge all cycles made up entirely of irrelevant nodes (in the example, a, d, and e become the node a,d,e)

4. Merge any irrelevant node that has a single predecessor node with the latter node (in the example, this yields the nodes a,d,e,f and g,h and b,c)

5. Merge any irrelevant node that has a single successor node with that node (in the example, i and j become the node j,i)

6. Assign the label of the first instruction of each node as the label of that node (in the example, the final nodes are a, r, j, g, and b)

Call Graph Projection

[Figure: Original Call Graph, with functions f1 to f6, and Projected Call Graph, with f1, f4, and f5. Key: relevant function, which contains a system/library call; function that directly or indirectly calls a relevant function; other functions.]

Call Graph Projection Steps

• Mark the root node of the call graph, if not already marked, as relevant

• Identify and mark as relevant any node that represents a function containing a relevant node

• Mark all predecessors of all marked nodes, as well as all their call sites, as relevant

• Remove all irrelevant nodes from which no relevant nodes may be reached

• Merge any irrelevant node that has a single predecessor node with that node (in the example, f5 and f6 become the node f5,f6)

• Make the label of the first instruction in each node the label of that node (in the example, the final nodes are f1, f4, and f5)
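One part of the projection, dropping functions from which no relevant function can be reached, can be sketched as a reverse reachability walk. This is only an illustration under stated assumptions: the call graph below is hypothetical (the edges of the slide's f1 to f6 example are not recoverable from the transcript), the function name `project_call_graph` is ours, and the merging and relabeling steps are deliberately left out.

```python
def project_call_graph(calls, root, makes_syscall):
    """Keep the root, the functions that make system/library calls, and every
    function from which such a function can be reached; drop the rest."""
    relevant = set(makes_syscall) | {root}
    # Reverse the edges so we can walk from callees back to their callers.
    callers = {f: set() for f in calls}
    for f, callees in calls.items():
        for g in callees:
            callers[g].add(f)
    keep, stack = set(relevant), list(relevant)
    while stack:
        g = stack.pop()
        for f in callers[g]:
            if f not in keep:
                keep.add(f)
                stack.append(f)
    # Rebuild the graph restricted to the kept functions.
    return {f: [g for g in calls[f] if g in keep] for f in calls if f in keep}


# Hypothetical call graph: only "send_data" contains a system/library call.
calls = {
    "main": ["helper", "junk"],
    "helper": ["send_data"],
    "junk": ["more_junk"],
    "more_junk": [],
    "send_data": [],
}
projected = project_call_graph(calls, "main", {"send_data"})
print(projected)
```

Spurious functions such as `junk` and `more_junk`, which a metamorphic engine might add, never reach a system or library call and so disappear from the projected graph.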

Concept of Operations

1. Offline Analysis

Instance of known malware binary → (a) Analyze and transform → Graph-based representation → (b) Generate abstract signatures → Abstract signatures (graph-based signatures)

2. Online Analysis

New, unknown binary instance → Graph-based representation → Abstract signature of the new binary → (c) Compare signatures → (d) distance < neighbor’s threshold? Yes: Malware instance! No: Benign instance!

(c) Find the nearest neighbor of the new binary in the malware library and compute the distance between them.

For efficiency, the chosen “distance” measure must satisfy the triangle inequality!
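The online decision in steps (c) and (d) can be sketched as follows. Everything concrete here is an assumption made for illustration: signatures are modeled as sets of labels rather than trees, Jaccard distance stands in for the tree-based distance defined on later slides, and the library entries, names, and thresholds are invented.

```python
def jaccard_distance(a, b):
    """Toy stand-in for the signature distance (Jaccard distance is a metric)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def classify(query_sig, library, distance=jaccard_distance):
    """Steps (c) and (d): find the nearest library entry, then compare the
    distance to it against that entry's own threshold."""
    nearest = min(library, key=lambda e: distance(query_sig, e["signature"]))
    d = distance(query_sig, nearest["signature"])
    verdict = "malware" if d < nearest["threshold"] else "benign"
    return verdict, nearest["name"], d


# Hypothetical malware library with per-entry thresholds.
library = [
    {"name": "family_A", "signature": frozenset({"open", "write", "connect"}), "threshold": 0.3},
    {"name": "family_B", "signature": frozenset({"regset", "spawn"}), "threshold": 0.3},
]
print(classify(frozenset({"open", "write", "connect", "sleep"}), library))
print(classify(frozenset({"print"}), library))
```

This version scans the whole library, which is the naive search; the backup slides describe how the triangle inequality avoids many of those distance computations.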

Tree Edit Distance

Edit Distance, De(Ti,Tj) = the length of the shortest Edit Script(Ti,Tj)

[Figure: trees T1 = u(v(w), x(y)), T2 = u(v(w(z)), x(y)), and T3 = u(w(z), x(y)). T2 is obtained from T1 by insert(z,1,w); T3 is obtained from T2 by delete(v,1,u).]

Edit Script(T1,T3): (i) insert z as the first child of w (ii) delete v, the 1st child of u

or, (i) delete v, the 1st child of u (ii) insert z as the first child of w

or, (i) relabel v to w (ii) relabel w to z

De(T1,T3) = 2

Tree Edit Distance (cont’d)

De(T1,T2) = 1

De(T2,T3) = 1

De(T1,T3) = 2

De satisfies the triangle inequality!

De(T1,T3) ≤ De(T1,T2) + De(T2,T3)
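The three distances can be checked with a small program. The sketch below uses the textbook recursion for unit-cost edit distance between ordered forests (delete, insert, relabel), memoized with `functools.lru_cache`; it is not claimed to be the algorithm the authors use, and it is far too slow for large trees. The shapes of T1, T2, and T3 are our reading of the figure and the edit scripts: trees are written as (label, children) tuples.

```python
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)   # insert everything that is left in f2
    if not f2:
        return size(f1)   # delete everything that is left in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + c1, f2) + 1,   # delete the rightmost root of f1
        forest_dist(f1, f2[:-1] + c2) + 1,   # insert the rightmost root of f2
        # match the two rightmost roots, relabeling if their labels differ
        forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1]) + (l1 != l2),
    )


def tree_dist(t1, t2):
    return forest_dist((t1,), (t2,))


leaf = lambda label: (label, ())
T1 = ("u", (("v", (leaf("w"),)), ("x", (leaf("y"),))))
T2 = ("u", (("v", (("w", (leaf("z"),)),)), ("x", (leaf("y"),))))
T3 = ("u", (("w", (leaf("z"),)), ("x", (leaf("y"),))))

print(tree_dist(T1, T2), tree_dist(T2, T3), tree_dist(T1, T3))
```

With these trees the program reproduces the values on the slide, and the triangle inequality holds with equality.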

Tree Edit Distance (cont’d)

[Figure: trees T1 (5 nodes), T2 (6 nodes), T3 (105 nodes), and T4 (106 nodes).]

De(T1,T2) = 1

De(T3,T4) = 1 !!!

Normalized Tree Edit Distance

[Figure: trees T1 (5 nodes), T2 (6 nodes), T3 (105 nodes), and T4 (106 nodes).]

Dn(Ti,Tj) = De(Ti,Tj) / (|Ti| + |Tj|)

Dn(T1,T2) = 1/11

Dn(T3,T4) = 1/211 ≪ 1/11

Normalized Tree Edit Distance (cont’d)

[Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes).]

Dn(T1,T2) = 1/11 = 10/110

Dn(T2,T3) = 1/11 = 10/110

Dn(T1,T3) = 2/10 = 22/110 > 20/110

Dn(T1,T3) > Dn(T1,T2) + Dn(T2,T3)

Dn does NOT satisfy the triangle inequality!
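The counterexample is easy to verify with exact arithmetic. This short check uses Python's `fractions` module and takes the edit distances and node counts directly from the slide; the helper name `dn` is ours.

```python
from fractions import Fraction


def dn(de, size_i, size_j):
    """Dn(Ti,Tj) = De(Ti,Tj) / (|Ti| + |Tj|), kept as an exact fraction."""
    return Fraction(de, size_i + size_j)


d12 = dn(1, 5, 6)   # T1 (5 nodes), T2 (6 nodes), De = 1
d23 = dn(1, 6, 5)   # T2 (6 nodes), T3 (5 nodes), De = 1
d13 = dn(2, 5, 5)   # T1 (5 nodes), T3 (5 nodes), De = 2

print(d12, d23, d13)       # 1/11 1/11 1/5
print(d13 > d12 + d23)     # True: the triangle inequality is violated
```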

Normalized, Metric Tree Edit Distance

[Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes).]

Dnm(T1,T2) = 2*De(T1,T2) / (|T1|+|T2| + De(T1,T2))

Dnm(T1,T2) = 2*1/(11+1) = 2/12

Dnm(T2,T3) = 2*1/(11+1) = 2/12

Dnm(T1,T3) = 2*2/(10+2) = 4/12

Dnm(T1,T3) ≤ Dnm(T1,T2) + Dnm(T2,T3)

Dnm satisfies the triangle inequality!
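The same three trees can be rechecked under the metric normalization. As before, this is only an arithmetic check of the slide's numbers with a hypothetical helper name, `dnm`; it illustrates the inequality on this example and does not prove it in general.

```python
from fractions import Fraction


def dnm(de, size_i, size_j):
    """Dnm(Ti,Tj) = 2*De(Ti,Tj) / (|Ti| + |Tj| + De(Ti,Tj)), as an exact fraction."""
    return Fraction(2 * de, size_i + size_j + de)


m12 = dnm(1, 5, 6)   # 2/12
m23 = dnm(1, 6, 5)   # 2/12
m13 = dnm(2, 5, 5)   # 4/12

print(m12, m23, m13)        # 1/6 1/6 1/3
print(m13 <= m12 + m23)     # True: the triangle inequality holds (with equality)
```

The measure still shrinks as trees grow: a single edit between the 105-node and 106-node trees of the earlier slide gives 2*1/(211+1) = 1/106.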

Detecting Variants Missed by AV

• We found a malware family with five variants

− Virus.Win32.Thorin

− Virus.Win32.Thorin.b

− Virus.Win32.Thorin.c

− Virus.Win32.Thorin.d

− Virus.Win32.Thorin.e

• A major AV product failed to detect the last variant, Virus.Win32.Thorin.e

• MAA correctly flags it as malware, based on the abstract signature it derives from the first variant

• It detects the other three variants as well, without requiring a separate, dedicated signature for each of them, as AV products often do

Checking for False Positives & Negatives

Platform   Number of Families   Number of Malware   Largest Family Size   Smallest Family Size
DOS        2                    2                   1                     1
FreeBSD    2                    7                   6                     1
id1242     1                    1                   1                     1
IIS        1                    4                   4                     4
IRC        2                    2                   1                     1
Linux      48                   67                  5                     1
MSIL       2                    2                   1                     1
MSWord     3                    3                   1                     1
Multi      11                   21                  4                     1
QNX        1                    2                   2                     2
Svat       1                    1                   1                     1
VBS        2                    2                   1                     1
Win32      822                  2742                89                    1
Win9x      133                  473                 62                    1
WinHLP     3                    5                   3                     1
Unknown    39                   39                  1                     1
Total      1034                 3373                89                    1

We chose ~500 Win32 samples from families of size 10 or more.

Fivefold Cross Validation Test

Test Run   # of Test Binaries   True Positives (Count)   True Positives (%)   False Negatives (Count)   False Negatives (%)
1          104                  103                      99%                  1                         1%
2          109                  103                      94%                  6                         6%
3          104                  94                       90%                  10                        10%
4          106                  99                       93%                  7                         7%
5          105                  105                      100%                 0                         0%
Total      528                  504                      95%                  24                        5%

Checking Against Benign Samples

• We randomly picked a Windows 7 system folder containing over 400 executables, ranging in size from 8 KB to over 70 KB.

• The false-positive rate decreases predictably as the distance threshold is reduced.

Distance Threshold   Misclassifications (Count)   Misclassifications (%)
0.20                 103                          < 25%
0.19                 63                           < 16%
0.17                 28                           < 7%
0.15                 15                           < 4%
0.10                 4                            < 1%
0.05                 1                            < 0.25%

The Need for Identifying Sub Families

[Figure: Malware Library Distance Matrix, with the families Bagle, Klez, Mydoom, Mimail, Netsky, and Roron labeled.]

Request Latency

[Chart: latency per request for Request # 1 to 103, on a vertical axis from 0 to 20,000. Series: Total, Profiling, Uncompression, Unpacking, Disassembly, SigGeneration, SigMatching.]

Request Latency (cont’d)

Latency Time(s) Fraction

Total 5,427 100%

Disassembly 3,425 63%

Profiling 594 11%

Unpacking 240 4%

Signature Matching 205 4%

Signature Generation 185 3%

Uncompression 26 0.5%

System Overhead

[Chart: CPU Utilization and Memory Utilization (%), on a vertical axis from 0 to 100, over Measurement Point in Time 1 to 267.]

Average Utilization   Idle   Active
CPU                   2%     54%
Memory                30%    35%

Next Steps

• Linear AESA for Faster Signature Library Generation

• Identification of Malware Sub Families

• Automated Removal of “Redundant” Variants

• Family Specific Distance Thresholds

• Malware Fragment Matching

• Smart Label Matching

Powerhouse Research. Practical Solutions.

Applied Communication Sciences and design logo is a registered trademark of TT Government Solutions, Inc.

Backup Slides

Incorporate Data Flow in Signatures

• Some malware obfuscations inject redundant, unrelated instructions, which do not affect the malware's external behavior but change its abstract signature

• Data flow analysis can help detect and eliminate such instructions

− It can help determine direct and indirect data dependencies of relevant instructions, that is, system and library calls

− Programs make such calls to affect their environment: files, network, registry, other processes, etc.

− Malware must make such calls to accomplish its goals

− Nodes that do not have an influence on any system/library call can then be removed from abstract signatures

The Need for Data Flow Analysis

Original Malware:

if x u = ...; v = ...; else u = -u; v = -v; write(u*v);

Obfuscated Code:

if x u = ...; v = ...;
y = 1024;
if !x u = -u; v = -v;
while (y > 0) y -= u*v;
write(u*v);

[Figure: flow graph of the original malware, with nodes a, b, c, and d, and its Control Flow Based Abstract Signature, d above b and c; flow graph of the obfuscated code, with nodes a, b, c, d, m, n, p, and q, and the Control Flow Based Abstract Signature of the Malware Variant, m,d above b, c, and q.]

The Need for Data Flow Analysis (cont’d)

[Figure: the obfuscated code and its flow graph, repeated from the previous slide. The Control Flow Based Abstract Signature of the Malware Variant is m,d above b, c, and q. The Control & Data Flow Based Abstract Signature of the Original Malware and the Control & Data Flow Based Abstract Signature of the Malware Variant are both d above b and c.]
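The effect of the data flow step on the obfuscated code can be illustrated with a very simplified dependence computation. This is a sketch under strong assumptions, not the authors' analysis: it is flow-insensitive, it ignores control dependences (the predicates on x), and the defs/uses tables are written by hand. It starts from the system call `write(u*v)` and keeps only statements that define a variable the call transitively depends on.

```python
def relevant_statements(stmts, sink):
    """Return the statements the sink (a system/library call) depends on,
    following variable definitions backwards until nothing new is found."""
    keep = {sink}
    needed = set(stmts[sink]["uses"])
    changed = True
    while changed:
        changed = False
        for name, s in stmts.items():
            if name not in keep and needed & set(s["defs"]):
                keep.add(name)
                needed |= set(s["uses"])
                changed = True
    return keep


# Statements of the obfuscated code, with hand-written defs and uses.
obfuscated = {
    "u = ...":    {"defs": ["u"], "uses": []},
    "v = ...":    {"defs": ["v"], "uses": []},
    "y = 1024":   {"defs": ["y"], "uses": []},
    "u = -u":     {"defs": ["u"], "uses": ["u"]},
    "v = -v":     {"defs": ["v"], "uses": ["v"]},
    "y -= u*v":   {"defs": ["y"], "uses": ["y", "u", "v"]},
    "write(u*v)": {"defs": [], "uses": ["u", "v"]},
}
kept = relevant_statements(obfuscated, "write(u*v)")
print(sorted(kept))
```

The two statements that touch only y have no influence on the call and are dropped, which is why the variant's signature collapses back to that of the original malware.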

Determining the Nearest Neighbor

Goal: Given (1) a database, D, of malware binaries, whose distances from one another are known, and (2) a new, query binary, q, which is not in D, find the nearest neighbor, n, of q in D.

Naïve Method: (1) Compute q’s distance from every malware in D. (2) The one with the shortest distance is the nearest neighbor of q.

Problem: Computing edit distances between graphical structures, including trees, is an expensive operation.

Solution: Exploit the triangle inequality to avoid many distance computations.

Determining the Nearest Neighbor (cont’d)

New, query binary

Malware binary whose distance from the query binary has been computed

Malware binary whose distance from the query binary is, currently, unknown

Currently known nearest malware neighbor of the query binary

Malware binary that has been removed from further consideration as it cannot possibly be the nearest neighbor of the query binary in D

Malware binary whose distance from the query binary is to be computed next

Exploiting Triangle Inequality

Each side of a triangle is ≤ the sum of the other two sides!

Each side of a triangle is ≥ the difference between the other two sides!

In a triangle with sides a, b, and c, pick any two sides, say a & b:

Suppose b ≥ a: b ≤ a + c, so b – a ≤ c, so c ≥ b – a, so c ≥ | b – a |

Otherwise (b < a): a ≤ b + c, so a – b ≤ c, so c ≥ a – b, so c ≥ | b – a |

For a query q, a database element r, and elements p1, p2, ..., pi, ... whose distances from q are known:

qr ≥ | q pi – pi r | for all pi

qr ≥ max | q pi – pi r | over all pi = lowerbound(qr)

If lowerbound(qr) ≥ qn, where n is the currently known nearest neighbor, then qr ≥ qn. There is no need to compute qr!

Exploiting Triangle Inequality (cont’d)

Associate a LowerBound with each pi, and initialize it to zero.

Every time pi q is computed:

− Update the nearest neighbor, n, and the corresponding distance, d

− Update LowerBound(pj q) for all j as: LowerBound(pj q) = max(LowerBound(pj q), | pj n – n q | )

− If LowerBound(pj q) > d, then remove pj from further consideration!

[Figure: the query q, its current nearest neighbor n at distance d, and database elements pi, pj, and pk as candidates are eliminated.]
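The elimination loop can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the function name and the rule for choosing the next candidate (smallest lower bound first) are ours, and the bound is tightened with the element whose distance was just computed, which is the form of the inequality given on the previous slide. The demo uses points on a line with absolute difference as the metric, purely so the result is easy to check by hand.

```python
def nearest_neighbor(db_dist, query_dist):
    """db_dist[i][j]: known distances inside the database D.
    query_dist(i): expensive distance from the query q to element i.
    Returns (index of nearest element, its distance, number of distances computed)."""
    n = len(db_dist)
    lower = [0.0] * n               # LowerBound(pj q), initialized to zero
    alive = set(range(n))           # elements still under consideration
    best, best_d, computed = None, float("inf"), 0
    while alive:
        # Assumed selection rule: try the candidate with the smallest lower bound.
        i = min(alive, key=lambda k: (lower[k], k))
        alive.remove(i)
        d = query_dist(i)
        computed += 1
        if d < best_d:
            best, best_d = i, d
        for j in list(alive):
            # Triangle inequality: d(q, pj) >= | d(pj, pi) - d(pi, q) |
            lower[j] = max(lower[j], abs(db_dist[j][i] - d))
            if lower[j] > best_d:
                alive.remove(j)     # pj cannot be the nearest neighbor
    return best, best_d, computed


# Demo data: points on a line, distance = absolute difference (a metric).
points = [0, 10, 20, 30]
query = 12
db_dist = [[abs(a - b) for b in points] for a in points]
result = nearest_neighbor(db_dist, lambda i: abs(points[i] - query))
print(result)
```

In this demo only two of the four query distances are computed before the remaining candidates are ruled out by their lower bounds.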
