suffix tree applications

Post on 06-Apr-2018

8/3/2019 Suffix Tree Applications

http://slidepdf.com/reader/full/suffix-tree-applications 1/48

Applications

• Exact string and substring matching

• Longest common substrings

• Finding and representing repeated substrings efficiently
• Applications that lead to alternative, space-efficient implementations
   – Matching statistics
   – Suffix arrays

String and substrings

• Exact string matching
   – Input
      • Pattern P of length n
      • Text T of length m
   – Output
      • Positions of all occurrences of P in T
• Solution method
   – Preprocess to create suffix tree for T
      • O(m) time, O(m) space
   – Maximally match P in suffix tree
   – Output all leaf positions below match point
      • O(n+k) time where k is number of matches
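The match-then-collect-leaves idea can be sketched with a naive, uncompressed suffix trie. This is only an illustration under stated assumptions: the trie takes O(m^2) time and space to build rather than the O(m) of a true suffix tree, the names `Node`, `build_suffix_trie` and `find_all` are hypothetical, and `$` is assumed not to occur in the text.

```python
class Node:
    """Trie node; `start` holds the 1-based suffix start if a suffix ends here."""
    def __init__(self):
        self.children = {}
        self.start = None


def build_suffix_trie(text):
    """Naive O(m^2) stand-in for a suffix tree of text + '$'."""
    padded = text + "$"          # terminator makes every suffix end at a leaf
    root = Node()
    for i in range(len(padded)):
        node = root
        for ch in padded[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    return root


def find_all(root, pattern):
    """Match pattern from the root, then report every leaf below the match point."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []            # pattern falls off the trie: no occurrence
    positions, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.start is not None:
            positions.append(cur.start)
        stack.extend(cur.children.values())
    return sorted(positions)     # sorted only for readable output


trie = build_suffix_trie("xabcyiiizabcqabcyrxar")
print(find_all(trie, "abc"))
```

Building the trie once and calling `find_all` for each pattern in a set mirrors the exact set matching setting as well.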

String and substrings

• Exact set matching
   – Input
      • Set of patterns {Pi} of total length n
      • Text T of length m
   – Output
      • Positions of all occurrences of each pattern Pi in T
• Solution method
   – Preprocess to create suffix tree for T
      • O(m) time, O(m) space
   – Maximally match each Pi in suffix tree
   – Output all leaf positions below match point
      • O(n+k) time where k is number of total matches

Comparison with Aho-Corasick 

• Aho-Corasick
   – O(n) preprocess time and space
      • to build keyword tree of set of patterns P
   – O(m+k) search time
• Suffix tree approach
   – O(m) preprocess time and space
      • to build suffix tree of T
   – O(n+k) search time
   – Using matching statistics (defined later), can make this tradeoff similar to that of Aho-Corasick

String and substrings

• Substring problem
   – Input
      • Set of patterns {Pi} of total length n
      • Text T of length m (m < n now)
   – Output
      • Positions of all occurrences of T in each pattern Pi
• Solution method
   – Preprocess to create generalized suffix tree for {Pi}
      • O(n) time, O(n) space
   – Maximally match T in generalized suffix tree
   – Output all leaf positions below match point
      • O(m+k) time where k is number of total matches

Common Substrings

• Longest common substring problem
   – Input
      • Strings S and T
   – Output
      • Longest common substring of S and T (and its position in S and T)
• Solution method
   – Preprocess to create generalized suffix tree for {S,T}
   – Mark each node by whether its subtree contains a leaf node of S, of T, or of both
      • A simple postorder tree traversal does this
   – Among the nodes marked with both S and T, the path label of the node with greatest string depth is the longest common substring of S and T
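The marking step can be illustrated with a naive generalized suffix trie in place of a linear-time generalized suffix tree. This is a sketch under assumptions: construction is quadratic, the recursion is only suitable for short strings, and `Node` and `longest_common_substring` are hypothetical names.

```python
class Node:
    """Trie node; bit 1 of `mask` means a suffix of S lies below, bit 2 a suffix of T."""
    def __init__(self):
        self.children = {}
        self.mask = 0


def longest_common_substring(s, t):
    root = Node()
    for bit, string in ((1, s), (2, t)):
        for i in range(len(string)):
            node = root
            for ch in string[i:]:
                node = node.children.setdefault(ch, Node())
            node.mask |= bit         # a suffix of this string ends here

    best = ""

    def visit(node, label):
        """Postorder: OR the children's marks into the parent, then inspect it."""
        nonlocal best
        for ch, child in node.children.items():
            visit(child, label + ch)
            node.mask |= child.mask
        if node.mask == 3 and len(label) > len(best):
            best = label             # deepest node so far marked with both strings

    visit(root, "")
    return best


print(longest_common_substring("sanddollar", "sandlot"))
```

When several common substrings share the maximum length, this sketch returns whichever the traversal reaches first.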

Common Substrings

• Common substrings of length k problem
   – Input
      • Strings S and T
      • Integer k
   – Output
      • All common substrings of S and T (and their positions in S and T) of length at least k
• Solution method
   – Same as previous problem
   – Look for all nodes marked with both S and T whose string depth is at least k

Longest Common Substrings of >2 Strings

• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
• Example: {sanddollar, sandlot, handler, grand, pantry}

   j   l(j)   one string
   2   4      sand
   3   3      and
   4   3      and
   5   2      an

Problem definition and solution

• Longest common substrings of >2 strings
   – Input
      • Strings S1, …, SK (total length n)
   – Output
      • l(j) (and pointers to substrings) for 2 <= j <= K
• Solution
   – Build a generalized suffix tree for the K strings
      • each string has a unique end character, so each leaf shows up only once

Solution continued

   – Build a generalized suffix tree for the K strings
      • each string has a unique end character, so each leaf shows up only once
   – C(v): number of distinct leaf labels in subtree rooted at node v
   – Given C(v) values and string-depth values, do a simple traversal of the tree to find these K-1 values and pointers to locations in substrings
   – Computing C(v) efficiently
      • The number of leaves is not correct, as some leaves may have the same label
      • Use a length-K bit vector, 1 bit per string in the set
      • OR your way up the tree
   – Each OR operation takes O(K) time, which gives O(Kn) running time
      • Can be improved to O(n) later
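The bit-vector computation of C(v) can be sketched as follows, again on a naive suffix trie rather than a true generalized suffix tree. Python integers serve as the length-K bit vectors; `Node` and `common_substring_lengths` are hypothetical names, and the recursion assumes short strings.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.mask = 0                # bit idx set => string idx has a suffix below


def common_substring_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K using OR-ed bit vectors."""
    K = len(strings)
    root = Node()
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
            node.mask |= 1 << idx

    # deepest[c] = greatest string depth of a node with exactly c distinct labels
    deepest = [0] * (K + 1)

    def visit(node, depth):
        for child in node.children.values():
            visit(child, depth + 1)
            node.mask |= child.mask          # OR the way up the tree
        c = bin(node.mask).count("1")        # C(v)
        deepest[c] = max(deepest[c], depth)

    visit(root, 0)

    result, running = {}, 0
    for j in range(K, 1, -1):                # l(j) = max of deepest[c] over c >= j
        running = max(running, deepest[j])
        result[j] = running
    return result


print(common_substring_lengths(["sanddollar", "sandlot", "handler", "grand", "pantry"]))
```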

Repeated substrings

• Given a single string S

• Definitions

   – A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b
      • Add unique characters to the front and end of S so that prefixes and suffixes are included
   – Representation: (p1, p2, n')
      • starting positions and length of the maximal pair
   – R(S) is the set of all triples representing maximal pairs in S

Example

• S = xabcyiiizabcqabcyrxar
      123456789012345678901   (last digit of each position, 1 to 21)
   – (2, 10, 3) is a maximal pair
   – (10, 14, 3) is a maximal pair
   – (2, 14, 3) is not a maximal pair
      • (2, 14, 4) is a maximal pair
   – Note that positions 2 and 14 are the start positions of two distinct maximal pairs
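The triples in this example can be checked directly against the definition. The helper below is a hypothetical brute-force test, not part of any suffix tree algorithm; it pads S with two distinct sentinel characters (assumed absent from S) so that prefixes and suffixes are handled.

```python
def is_maximal_pair(s, p1, p2, n):
    """True if the length-n substrings at 1-based positions p1 and p2 form a maximal pair."""
    padded = "\x00" + s + "\x01"     # padded[i] is the character at 1-based position i of s
    if p1 == p2 or padded[p1:p1 + n] != padded[p2:p2 + n]:
        return False
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + n] != padded[p2 + n]
    return left_differs and right_differs


S = "xabcyiiizabcqabcyrxar"
for triple in [(2, 10, 3), (10, 14, 3), (2, 14, 3), (2, 14, 4)]:
    print(triple, is_maximal_pair(S, *triple))
```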

More definitions

• A maximal repeat a is a substring of S that is the substring defined by a maximal pair of S
• R'(S) is the set of maximal repeats
• Previous example
   – abc and abcy are maximal repeats of S
   – abc is represented only once in R'(S)
   – |R'(S)| is smaller than |R(S)|, as abc shows up twice in R(S) but only once in R'(S)

Even more definitions

• A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
• Previous example
   – abcy is a supermaximal repeat of S
   – abc is NOT a supermaximal repeat of S

Problem definition

• Maximal repeats

 – Input

• String S (length n)

 – Output

• R’(S) 

Properties of maximal repeats

• Construct suffix tree for S

• Lemma

   – If a is a maximal repeat in S, then a is the path-label of an internal node v in the suffix tree T of S
      • a does not end in the middle of an edge
      • (this captures that the next character after a differs between occurrences)
• Corollary
   – There are at most n maximal repeats
      • n leaves
      • all internal nodes except the root have at least two children
      • therefore, at most n internal nodes

More properties of maximal repeats

• Definitions
   – Character S(i-1) is the left character of position i
   – The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
   – A node v of T is called left diverse if at least 2 leaves in v's subtree have different left characters
• Theorem
   – String a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
      • This captures that the character before a differs between occurrences

Identifying left diverse nodes

• Bottom-up procedure
   – All nodes will have a left character label
   – Leaf node:
      • Label leaves with their left character
   – Internal node v:
      • If any child is left diverse, so is v
      • If two children have different left character labels, v is left diverse
      • Otherwise, take on the left character value of the children
• Compact representation
   – There is a compact tree that consists only of the left diverse nodes and represents all maximal repeats
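The left-diversity criterion can be cross-checked by brute force. The sketch below does not build a suffix tree. It assumes that a substring labels an internal node exactly when its occurrences are followed by at least two different characters, and that the node is left diverse exactly when they are preceded by at least two different characters. `maximal_repeats` is a hypothetical name and the enumeration is quadratic in the number of substrings.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Substrings of s with at least two distinct left and two distinct right characters."""
    padded = "\x00" + s + "\x01"     # distinct sentinels stand in for the unique end characters
    lefts, rights = defaultdict(set), defaultdict(set)
    n = len(s)
    for i in range(1, n + 1):        # 1-based start position in s
        for j in range(i, n + 1):    # 1-based end position in s
            sub = padded[i:j + 1]
            lefts[sub].add(padded[i - 1])
            rights[sub].add(padded[j + 1])
    return {sub for sub in lefts
            if len(lefts[sub]) > 1 and len(rights[sub]) > 1}


repeats = maximal_repeats("xabcyiiizabcqabcyrxar")
print(sorted(repeats, key=len))
```

On the running example the result contains abc and abcy but not ab, whose occurrences are always followed by c.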

Problem definition

• Supermaximal repeats
   – Input
      • String S (length n)
   – Output
      • The set of supermaximal repeats of S
• Key property
   – A left diverse node v represents a supermaximal repeat if and only if all of v's children are leaves, and each has a distinct left character
   – Exercise: prove this

Matching Statistics

• Setting
   – Text T of length m
   – Pattern P of length n
• Definition
   – For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P
• With matching statistics, one can solve several problems with less space than a suffix tree
   – Exact matching example: P occurs at i in T if and only if ms(i) = |P|
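To make the definition concrete, the values ms(i) can be computed by brute force. This sketch is deliberately naive (it stores every substring of P in a set) and is not the linear-time suffix-link method; `matching_statistics` is a hypothetical name.

```python
def matching_statistics(text, pattern):
    """Return [ms(1), ..., ms(m)] by direct application of the definition."""
    substrings = {pattern[a:b]
                  for a in range(len(pattern))
                  for b in range(a + 1, len(pattern) + 1)}
    ms = []
    for i in range(len(text)):
        length = 0
        # The set of substrings of P is prefix-closed, so extend one character at a time.
        while i + length < len(text) and text[i:i + length + 1] in substrings:
            length += 1
        ms.append(length)
    return ms


T, P = "xabcyabc", "abc"
ms = matching_statistics(T, P)
# Exact matching: P occurs at position i exactly when ms(i) = |P|
print([i + 1 for i, value in enumerate(ms) if value == len(P)])
```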

Why study matching statistics

• With matching statistics, one can solve several problems with less space than a suffix tree
   – Exact matching example
      • We'll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
      • Key observation: P matches the substring beginning at i in T if and only if ms(i) = |P|

Construction Problem

• Input

 – Text T of length m

 – Pattern P

• Output

 – Compute ms(i) for 1 <= i <= m

Solution

• Compute suffix tree of P retaining suffix links

• ms(1): match T against tree

• ms(i+1) given ms(i)
   – We are at some node v in the tree
      • If it is internal, follow the suffix link to s(v)
      • Else if it is a leaf, go up one level to its parent w
         – If w is an internal node, follow the suffix link to s(w)
         – Traverse downwards using the skip/count trick until we have matched all the characters in edge label (w,v)
      • Now match against T character by character until we have a mismatch and can output ms(i+1)

Adding location of substring in P

• p(i): a location in P such that the substring at p(i) matches the substring starting at T(i) for exactly ms(i) positions
• Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves
• Simply output this value when outputting ms(i) values

Applying matching statistics to LCS problem

• Input
   – Strings S and T
• Output
   – Longest common substring of S and T
• Solution method
   – Compute suffix tree for the shorter string, say S
   – Compute ms(i) values for T
   – Maximal ms(i) value identifies LCS

Suffix Arrays

• Setting
   – Text T of length m
• Definition
   – A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
      • Add terminating character $ which is lexically smallest
• Example

   T      = m  i  s  s  i  s  s  i  p  p  i
   i      = 1  2  3  4  5  6  7  8  9  10 11
   Pos(i) = 11 8  5  2  1  10 9  7  4  6  3

Computing Suffix Arrays

• Input
   – Text T of length m
• Output
   – Pos array
• Solution
   – Compute suffix tree of T
   – Do a lexical depth-first traversal of the suffix tree, filling Pos(i) with the leaves in the order they are encountered
   – Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
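The lexical depth-first traversal can be sketched on a naive suffix trie, where every edge carries a single character, so edge order is simply character order. Assumptions: `$` does not occur in the text and sorts below every text character, construction is quadratic rather than linear, and the names `Node` and `suffix_array_by_traversal` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None            # 1-based suffix start for leaves


def suffix_array_by_traversal(text):
    padded = text + "$"
    root = Node()
    for i in range(len(text)):       # the suffix consisting of "$" alone is skipped
        node = root
        for ch in padded[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1

    pos, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.start is not None:
            pos.append(node.start)   # leaves are met in lexicographic order
        # Push larger characters first so the smallest edge is explored first.
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return pos


print(suffix_array_by_traversal("mississippi"))
```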

Using Suffix Arrays

• Input

 – Text T of length m

 – Pattern P of length n

• Output

 – All occurrences of P in T

• Solution

 – Compute suffix array Pos for T

Properties of Suffix Arrays

• If P occurs in T, then all of its occurrence locations are grouped consecutively in Pos
• O(n log m) solution to matching problem
   – Using binary search, find the smallest index i' such that P exactly matches the first n characters of suffix Pos(i')
   – Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
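The two binary searches can be sketched as below. The suffix array is built here by plain sorting, which is a simple stand-in and not the suffix tree construction; `find_occurrences` is a hypothetical name. Each probe compares at most n characters, giving O(n log m) per search.

```python
def find_occurrences(text, pattern):
    """Return the 1-based positions of pattern in text using a suffix array."""
    pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    n = len(pattern)

    def prefix(k):
        """First n characters of the suffix at 0-based rank k."""
        start = pos[k] - 1
        return text[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                   # smallest rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(pos)
    while lo < hi:                   # smallest rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(pos[first:lo])     # occurrences occupy the consecutive ranks first..lo-1


print(find_occurrences("mississippi", "issi"))
```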

Speeding up binary search

• Let L and R denote the current left and right boundaries of the current search interval
   – Initialization: L = 1, R = m
• Let l and r denote the lengths of the longest prefixes of suffixes Pos(L) and Pos(R), respectively, that match a prefix of P
• Define M = ceiling((L+R)/2)
   – Define mlr = min(l,r)
   – Can begin the comparison of P with suffix Pos(M) at position mlr+1
• In practice, this is sufficient to achieve O(n + log m) search time, but the worst case is Ω(n log m)

Longest common prefixes

• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j)
• Mississippi example
   – Pos(3) = 5 (issippi)
   – Pos(4) = 2 (ississippi)
   – Lcp(3,4) = 4

Getting to max(l,r) with Lcp’s 

• L, R, M, l, r defined as before

• If l = r, compare P against suffix Pos(M) starting at position l+1 = r+1
• Suppose l > r
   – If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L)
   – Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1
   – Furthermore, suffix Pos(M) is lexically smaller than P
   – Update: L = M, l and r unchanged

Getting to max(l,r) with Lcp’s 

• Suppose l > r
   – If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and Pos(L)
   – Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M)
   – Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M), so P is lexically smaller than suffix Pos(M)
   – Update: R = M, r = Lcp(L,M)

Getting to max(l,r) with Lcp’s 

• Suppose l > r
   – If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) is equal to the common prefix of P and Pos(L)
   – Therefore, P agrees with suffix Pos(M) up through position l and maybe even further
   – Need to compare P starting at position l+1 with the corresponding positions in suffix Pos(M)
   – Update: R or L is updated according to the outcome of these comparisons

O(n + log m) bound

• Since each comparison begins at position max(l,r)+1, we never compare a matched position in P more than once
• Redundant comparisons of characters of P are limited to at most one per iteration of the binary search, giving us O(n + log m)

Computing Lcp values quickly

• We want to get them in O(m) time
• However, there are potentially O(m^2) different pairs (i,j) for which Lcp(i,j) could be asked
• Crucial point
   – Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
   – See Figure 7.7 for an example

Process for needed Lcp values

• Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree when moving from the Pos(i) leaf to the Pos(i+1) leaf
• Other Lcp values
   – Lcp(i,j) = min of Lcp(k,k+1) over k = i, ..., j-1
   – Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
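The relation between adjacent and general Lcp values can be checked with a small sketch. The adjacent values are computed here by direct character comparison, not from the suffix tree traversal, and `lcp_adjacent` and `lcp` are hypothetical names. Ranks i and j are 1-based with i < j.

```python
def lcp_adjacent(text, pos):
    """adj[k] = Lcp(k+1, k+2) for the 1-based suffix array pos."""
    def common(a, b):                # a, b are 0-based start indices
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [common(pos[k] - 1, pos[k + 1] - 1) for k in range(len(pos) - 1)]


def lcp(adj, i, j):
    """Lcp(i, j) as the minimum of Lcp(k, k+1) for k = i, ..., j-1."""
    return min(adj[i - 1:j - 1])


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
adj = lcp_adjacent(text, pos)
print(lcp(adj, 3, 4), lcp(adj, 2, 4))
```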

Lowest common ancestor

• 1-time input

 –  Tree T (not necessarily a suffix tree)

• Later input

 –  2 nodes, v and w, of T

• Output

 –  lowest common ancestor of v,w in T

• Goal
 –  linear preprocess time

 –  O(1) query time

Longest Common Extension

• 1-time input

 –  Strings S1 and S2 

• Later input

 –  index positions i and j

• Output

 –  length of longest substring of S1 beginning at i that matches a substring of S2 beginning at j

• Goal

 –  linear preprocess time

 –  O(1) query time

Illustration

• Relationship to longest common substring

 – Similar, but now start positions are fixed

[Figure: strings S1 and S2, with the extension starting at position i in S1 and position j in S2]

Solution

• Linear preprocessing
   – Create generalized suffix tree for S1 and S2
   – Compute string depth at each node
   – Process tree to allow for constant time LCA queries
   – Establish pointers to all leaf nodes in tree
• Constant time query processing
   – Find u = lca(v,w), where v is the leaf for suffix i of S1 and w is the leaf for suffix j of S2
   – Output string depth of u

More space-efficient solution

• Linear preprocessing (assume |S2| < |S1|)
   – Create suffix tree for S2
   – Compute matching statistics ms(i) and p(i) for S1
      • ms(i): length of longest match of the substring starting at i in S1 with some substring in S2
      • p(i): starting position of a location in S2 where that match occurs
   – Process tree to allow for constant time LCA queries
   – Establish pointers to all leaf nodes in tree
• Constant time query processing
   – Find u = lca(v,w) in the suffix tree for S2, where v is the leaf for suffix p(i) and w is the leaf for suffix j
   – Output min(ms(i), string depth of u)
      • why is this correct?

Related Problem

• Maximal Palindromes

• Input
   – String S
• Output
   – Location of all maximal palindromes in S
• Solution
   – Longest common extensions of specific pairs of positions in S and Sr (the reverse of S)
   – O(|S|) solution
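Which pairs of positions are queried can be shown with a short sketch. The `lce` helper here is a naive character-by-character scan, so the whole routine is quadratic in the worst case; substituting a constant-time longest common extension query would give the linear bound. Names and the 0-based indexing are assumptions of this sketch.

```python
def lce(a, i, b, j):
    """Naive longest common extension of a from index i and b from index j (0-based)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Return (start, length) of the maximal palindrome around every center, 0-based start."""
    n, sr = len(s), s[::-1]
    found = []
    for c in range(n):
        # Odd length, centered on s[c]: extend s rightwards from c+1 against
        # s leftwards from c-1, which is sr read forwards from n-c.
        radius = lce(s, c + 1, sr, n - c)
        found.append((c - radius, 2 * radius + 1))
        # Even length, centered between s[c-1] and s[c].
        if c > 0:
            radius = lce(s, c, sr, n - c)
            if radius:
                found.append((c - radius, 2 * radius))
    return found


print(max(maximal_palindromes("xabay"), key=lambda item: item[1]))
```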

Common substrings revisited

• Longest common substrings of >2 strings
   – Input
      • Strings S1, …, SK (total length n)
   – Output
      • l(j) (and pointers to substrings) for 2 <= j <= K
• Problem with previous solution
   – O(Kn) time to compute C(v) values
   – C(v): number of distinct leaf labels in subtree rooted at node v

Definitions

• S(v): total number of leaves in v's subtree
• U(v): number of "duplicate suffixes" from the same string that occur in v's subtree
• C(v) = S(v) - U(v)
• ni(v) = number of leaves with identifier i in the subtree rooted at node v
• ni = total number of leaves with identifier i

Key Concepts

• Definitions
   – S(v): total number of leaves in v's subtree
   – U(v): number of "duplicate suffixes" from the same string that occur in v's subtree
   – ni(v) = number of leaves with identifier i in the subtree rooted at node v
   – ni = total number of leaves with identifier i
• Observations
   – U(v) = Σ max(ni(v) - 1, 0), summed over the K string identifiers i
   – C(v) = S(v) - U(v)

Solution

• Computing U(v) values
   – Depth-first traversal of the tree, numbering leaves in the order that they are encountered
   – For each string label i
      • Let Li be the list of leaves with identifier i, in increasing order of their dfs numbers
      • Compute the lca of each pair of consecutive leaves in Li
   – For each node v, let h(v) denote the number of times it is the lca computed in the step above
   – Key property
      • For each string i with ni(v) > 0, exactly ni(v) - 1 of the lcas computed from Li lie in v's subtree
      • Hence U(v) = Σ h(w), summed over all nodes w in v's subtree

Solution

• Computing U(v) values
   – Depth-first traversal of the tree, numbering leaves in the order that they are encountered
   – Set h(v) to 0 for all nodes v
   – For each string label i
      • Compute the lca v of each pair of consecutive leaves in Li
      • Increment h(v) by 1
   – Propagate h(v) values up the tree by addition to set U(v)
   – Set C(v) = S(v) - U(v)
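These steps can be sketched on an arbitrary rooted tree whose leaves carry string identifiers. Assumptions of this sketch: the tree is given as a hypothetical `children` dictionary with a `leaf_label` dictionary, and the lca is found by a naive walk up parent pointers. A constant-time lca structure would be needed for the overall O(n) bound.

```python
def distinct_label_counts(children, leaf_label, root=0):
    """Return {v: C(v)}, the number of distinct leaf labels below each node v."""
    parent, depth, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:                         # preorder DFS; leaves appear left to right
        v = stack.pop()
        order.append(v)
        for c in reversed(children.get(v, [])):
            parent[c], depth[c] = v, depth[v] + 1
            stack.append(c)

    def lca(a, b):                       # naive: repeatedly lift the deeper node
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = parent[a]
        return a

    h = {v: 0 for v in order}
    previous = {}                        # label -> last leaf seen with that label
    for v in order:
        if v in leaf_label:
            label = leaf_label[v]
            if label in previous:        # consecutive pair in the list Li
                h[lca(previous[label], v)] += 1
            previous[label] = v

    S = {v: 1 if v in leaf_label else 0 for v in order}
    U = dict(h)
    for v in reversed(order):            # descendants are processed before ancestors
        if parent[v] is not None:
            S[parent[v]] += S[v]
            U[parent[v]] += U[v]
    return {v: S[v] - U[v] for v in order}


tree = {0: [1, 2], 1: [3, 4, 5], 2: [6, 7]}
labels = {3: "A", 4: "A", 5: "B", 6: "A", 7: "C"}
print(distinct_label_counts(tree, labels))
```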
