intrusion detection and malware analysis · a whirlwind tour of string matching today: naive string...

29
Intrusion Detection and Malware Analysis String matching algorithms Pavel Laskov Wilhelm Schickard Institute for Computer Science

Upload: others

Post on 08-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Intrusion Detection and Malware AnalysisString matching algorithms

Pavel LaskovWilhelm Schickard Institute for Computer Science

Page 2: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Exact string matching

Problem (Exact string matching)Given a string P called a pattern and a longer string T called text,find all occurrences, if any, of pattern P in text T.

ExampleLet P = aba and T = bbabaxababay, then P occurs in T starting atpositions 3, 7 and 9 (notice the possible overlap of P in T).

Remark (Substrings vs. subsequences)It is a general convention to denote contiguous patterns asstrings whereas non-contiguous patterns (in the left-to-rightorder) are referred to as sequences.

Page 3: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Inexact matching and alignment

Problem (Inexact string matching)Given a string P and a text T find all strings S in T that contain atmost k “errors” with respect to P.

ExampleLet P = aba and T = bbabaxababay, then inexact match of P in Twith respect to 1 character substitution occurs at positions 1, 3, 5,7 and 9.

Problem (Sequence alignment)Alignment of two strings S1 and S2 is obtained by inserting spaceseither into or at the ends of S1 and S2 so that every character ofone string corresponds to exactly one character of the other one.

Example

q a c _ d b dq a w x _ b _

Page 4: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

A whirlwind tour of string matching

Today:Naive string matchingFundamentalpreprocessingKnuth-Morris-PrattSet matching

Other problems/algorithms:Regular expression matchingRabin-Karp fingerprintsSuffix trees/arraysInexact matchingSequence alignment

Page 5: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Naive string matching

Align left end of P with the left end of T and comparecharacters of P and T until a mismatch is found or P isexhausted.Shift P by one character and restart from the left of P.Continue until the right end of P shifts past the right end of T.Running time: O(mn)

Page 6: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Speeding up the naive method

Naive algorithm:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

*

abxyabxz

*

abxyabxz

*

abxyabxz

^^^^^^^^

Larger shifts:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^^^^

Saving partial matches:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^

Page 7: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Speeding up the naive method

Naive algorithm:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

*

abxyabxz

*

abxyabxz

*

abxyabxz

^^^^^^^^

Larger shifts:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^^^^

Saving partial matches:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^

Page 8: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Speeding up the naive method

Naive algorithm:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

*

abxyabxz

*

abxyabxz

*

abxyabxz

^^^^^^^^

Larger shifts:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^^^^

Saving partial matches:

T: xabxyabxyabxz

P: abxyabxz

*

abxyabxz

^^^^^^^*

abxyabxz

^^^^^

Page 9: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Preprocessing

General idea: spend some modest amount of time onlearning about the internal structure of P or T in order tosave some time during search.Different preprocessing techniques were employed invarious original string matching algorithms.Similarity of preprocessing techniques can be expressed interms of a fundamental preprocessing which is independentof a search algorithm.

Page 10: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Fundamental preprocessing

DefinitionGiven a string S and a position i > 1, let Zi(S) be the length ofthe longest substring of S that starts at i and matches a prefix ofS. This substring is called a Z-box.

� �

�-boxes

DefinitionFor any index i, ri is the right-most end of Z-boxes beginning at orbefore i; li is the left end of the corresponding Z-box.

Page 11: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Fundamental preprocessing

DefinitionGiven a string S and a position i > 1, let Zi(S) be the length ofthe longest substring of S that starts at i and matches a prefix ofS. This substring is called a Z-box.

� �

�-boxes

� ���� �

DefinitionFor any index i, ri is the right-most end of Z-boxes beginning at orbefore i; li is the left end of the corresponding Z-box.

Page 12: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Linear time fundamental preprocessing

Goal: Compute Zi for each successive position i startingfrom i = 2.All Z-values are kept, as well as the recent pair r, l.Initialization: explicitly compute Z2 by scanning S[2 . . . |S|]and comparing it with S[1 . . . |S|].Recursion: given Z2, . . . , Zk−1, rk−1, lk−1, compute Zk, rk, lk.

Page 13: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Recursive step of the Z-algorithm

Key insight: character in the position k also appears in theposition k′ = k− lk−1 + 1; this also holds for entire substringsS[k . . . rk−1] and S[k′ . . . Zlk−1

].

S :α α

β β

k ′ Zlk−1 lk−1 k rk−1

If Zk′ ≤ |β|, then Zk = Zk′ , and r, l remain unchanged.

S :α α

β βγ γ γ

k ′ Zlk−1 lk−1 k rk−1Zk ′

If Zk′ ≥ |β| then the entire β is a prefix of S. Keep scanninguntil a mismatch occurs, set l to k and r to the characterbefore a mismatch.

S :α α

β β βγ γ ?k ′ Zlk−1 lk−1 k rk−1Zk ′

Page 14: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Recursive step of the Z-algorithm

Key insight: character in the position k also appears in theposition k′ = k− lk−1 + 1; this also holds for entire substringsS[k . . . rk−1] and S[k′ . . . Zlk−1

].

S :α α

β β

k ′ Zlk−1 lk−1 k rk−1

If Zk′ ≤ |β|, then Zk = Zk′ , and r, l remain unchanged.

S :α α

β βγ γ γ

k ′ Zlk−1 lk−1 k rk−1Zk ′

If Zk′ ≥ |β| then the entire β is a prefix of S. Keep scanninguntil a mismatch occurs, set l to k and r to the characterbefore a mismatch.

S :α α

β β βγ γ ?k ′ Zlk−1 lk−1 k rk−1Zk ′

Page 15: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Recursive step of the Z-algorithm

Key insight: character in the position k also appears in theposition k′ = k− lk−1 + 1; this also holds for entire substringsS[k . . . rk−1] and S[k′ . . . Zlk−1

].

S :α α

β β

k ′ Zlk−1 lk−1 k rk−1

If Zk′ ≤ |β|, then Zk = Zk′ , and r, l remain unchanged.

S :α α

β βγ γ γ

k ′ Zlk−1 lk−1 k rk−1Zk ′

If Zk′ ≥ |β| then the entire β is a prefix of S. Keep scanninguntil a mismatch occurs, set l to k and r to the characterbefore a mismatch.

S :α α

β β βγ γ ?k ′ Zlk−1 lk−1 k rk−1Zk ′

Page 16: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Running time of the Z-algorithm

TheoremAll values Zi(S) can be computed in O(|S|) time.

Proof.For each “≤”-iteration k, a constant-time work is needed foreach iteration.In each “≥”-iteration, the value rk strictly increases, howevernot beyond the end of S. Hence the overall amount of work isbound by |S|.

Page 17: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Knuth-Morris-Pratt algorithm

General idea:

Keep scanning the pattern P and text T left-to-right untilmismatch is found.Shift P such that its prefix overlaps with its suffix before themismatch position.

��

��

�before shift

�after shift

Continue scanning from the mismatching position.

Page 18: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Shifts in KMP

Let the shift pointer spi be the length of the longest prefix of Pwhich is the suffix of P[1 . . . i] (0 if no such prefix exists).In KMP, if a mismatch is found in position i + 1 in P, P can beshifted by i− spi positions to the right.For each i in P, spi is the length of a Z-box ending at i.Computing spi: after initializing all sp’s to 0, loop backwardsover j and set spj+Zj−1 = Zj.

Page 19: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Correctness of the shift rule

TheoremFor any alignment of P and T, if characters 1 through i of P matchtheir counterparts in T but character i + 1 mismatches T(k), thenP can be shifted by i− spi positions to the right without passingany occurrence of P in T.

Proof.Let β denote the prefix of P after shift of length sp. By definition ofsp, β matches its counterpart in T. Suppose there exist a missedoccurrence of P in T which begins earlier than a shifted P, i.e. itbegins with a prefix αβ. Then αβ is a suffix of P before shift whichis a prefix of (missed) P. However, the longest such suffix musthave been of |β|, which is a contradiction.

Page 20: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Complete KMP pseudocode

procedure KNUTH-MORRIS-PRATT(T, P) . |P| = n; |T| = mPreprocess P to find F(k) = spk−1 + 1 for all k from 1 to n + 1.c← 1 . Text running indexp← 1 . Pattern running indexwhile c + (n− p) ≤ m do

while P(p) = T(c) and p ≤ n do . Scan until mismatchp← p + 1c← c + 1

end whileif p = n + 1 then . End of p

Report an occurrence of P at the position c− n in Tend ifif p = 1 then . Mismatch at position 1 of p

c← c + 1end ifp← F(p) . Shift P by p− sp

end whileend procedure

Page 21: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Exact pattern set matching

ProblemGiven a set of patterns P = {P1, . . . , Pz}, find all occurrences ofsome pattern from P in text T.

Possible solutions:Run a standard single pattern matching algorithm (e.g.KMP) z times: O(z(m + n)).Build a suffix tree for T and scan each pattern in P against it:O(m + zn).Build a keyword tree for P and run the Aho-Corasickalgorithm; O(m + n + k), where k is the number of matches.

Page 22: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Keyword tree (trie)

A keyword tree K corresponding to a set of patterns P is a treesatisfying the following conditions:

Each edge is labeled with exactly one character.Any two edges out of the same node have different labels.Every pattern in P maps to some node v in K such thispattern is spelled out by edge labels on the path from theroot to v.Every leaf in K corresponds to some pattern in P .

Page 23: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Keyword tree example

Keyword tree for P = {“error”,“potato”,“pottery”,“other”,“otter”}:

p

o

t

a

t

o

2

o

t

h

e

r

4

t

e

r

y

3

t

e

r

5

er

ro

r

1

Page 24: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Naive set matching using keyword tree

Follow a unique path in K that matches a substring of Tstarting from a fixed position l until either a marked node or amismatch is encountered.Move to the next position l and repeat until T is exhausted.Running time: O(mn).

Page 25: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Failure links

DefinitionFor any node v in a keyword tree, letL(v) denote the label sequence on the path from root to v,lp(v) be the longest suffix of L(v) which is a prefix of somepattern in P , andnv be the unique node in K labeled with the suffix of L(v) oflength lp(v).

DefinitionA failure link is a pair of nodes (v, nv).

Page 26: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Failure link example

Keyword tree for P = {“error”,“potato”,“pottery”,“other”,“otter”}augmented with failure links:

p

o

t

a

t

o

2

o

t

h

e

r

4

t

e

r

y

3

t

e

r

5

er

ro

r

1

Page 27: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Aho-Corasick algorithm

procedure AHO-CORASICK SEARCH(T, P,K)c← 1 . Text running indexl← 1 . Starting position of current matchw← root of K . Running keyword tree node pointerrepeat

while there is an edge (w, w′) labeled with T(c) doif w′ is marked with i then

Report an occurrence of Pi at the position l in Tend ifw← w′

c← c + 1end whilew← nw . Follow a failure linkl← c− lp(w) . Jump to the next search start

until c > mend procedure

Page 28: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Computing failure links

function FAILURE_LINK(v, K)v′ ← parent of vx← character on the edge (v′, v)w← nv′ . Begin with the failure link of v’s parentwhile there is no edge of w labeled with x, and w 6= root do

w← nw . Follow failure links until an edge labeled x foundend whileif there is an edge (w, w′) out of w labeled with x then

nv ← w′

elsenv ← root

end ifend function

Page 29: Intrusion Detection and Malware Analysis · A whirlwind tour of string matching Today: Naive string matching Fundamental preprocessing Knuth-Morris-Pratt Set matching Other problems/algorithms:

Recommended reading

D. Gusfield.Algorithms on strings, trees, and sequences.Cambridge University Press, 1997.