suffix tree applications

8/3/2019 Suffix Tree Applications


Applications

• Exact string and substring matching

• Longest common substrings

• Finding and representing repeated substrings efficiently

• Applications that lead to alternative, space-efficient implementations

– Matching statistics

– Suffix Arrays

String and substrings

• Exact string matching:

– Input

• Pattern P of length n

• Text T of length m

– Output

• Positions of all occurrences of P in T

• Solution method

– Preprocess to create suffix tree for T

• O(m) time, O(m) space

– Maximally match P in suffix tree

– Output all leaf positions below match point

• O(n+k) time where k is number of matches
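The match-then-collect-leaves idea can be sketched in a few lines. This is only an illustration under stated assumptions: it builds a naive, uncompressed suffix trie in O(m^2) time and space instead of a linear-time suffix tree, and the names `build_suffix_trie` and `find_occurrences` are made up for this sketch.

```python
def build_suffix_trie(text):
    """Naive suffix trie (NOT a linear-time suffix tree): nested dicts keyed by
    character; the key None marks a leaf and stores the 1-based suffix start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root, then report every leaf below the match point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # pattern falls off the trie: no occurrence
        node = node[ch]
    found, stack = [], [node]
    while stack:               # iterative traversal of the subtree
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)   # leaf label = start position in text
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("mississippi")
print(find_occurrences(trie, "ssi"))
```

Matching costs O(n) character steps; a real suffix tree bounds the collection step by O(k), while this uncompressed trie may walk long unary paths.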

• Exact set matching:

– Input

• Set of patterns {Pi} of total length n

• Text T of length m

– Output

• Positions of all occurrences of each pattern Pi in T

• Solution method

– Preprocess to create suffix tree for T

• O(m) time, O(m) space

– Maximally match each Pi in suffix tree

• O(n+k) time where k is number of total matches

Comparison with Aho-Corasick

• Aho-Corasick

– O(n) preprocess time and space

• to build keyword tree of set of patterns P

– O(m+k) search time

• Suffix tree approach

– O(m) preprocess time and space

• to build suffix tree of T

– O(n+k) search time

– Using matching statistics (defined later), this tradeoff can be made similar to that of Aho-Corasick

• Substring problem:

– Input

• Set of patterns {Pi} of total length n

• Text T of length m (m < n now)

– Output

• Positions of all occurrences of T in each pattern Pi

• Solution method

– Preprocess to create generalized suffix tree for {Pi}

• O(n) time, O(n) space

– Maximally match T in generalized suffix tree

• O(m+k) time where k is number of total matches

Common Substrings

• Longest common substring problem:

– Input

• Strings S and T

– Output

• Longest common substring of S and T (and its position in S and T)

• Solution method

– Preprocess to create generalized suffix tree for {S,T}

– Mark each node by whether its subtree contains a leaf of S, of T, or of both

• A simple postorder traversal does this

– Among the nodes marked with both S and T, the path label of the node with greatest string depth is the longest common substring of S and T

Common Substrings

• Common substrings of length k problem:

– Input

• Strings S and T

• Integer k

– Output

• All common substrings of S and T (and their positions in S and T) of length at least k

• Solution method

– Same as previous problem

– Look for all nodes marked with both leaf labels (S and T) whose string depth is at least k

Longest Common Substrings of >2 Strings

• Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings

• Example: {sanddollar, sandlot, handler, grand, pantry}

– j   l(j)   one string

– 2   4      sand

– 3   3      and

– 4   3      and

– 5   2      an

Problem definition and solution

• Longest common substrings of >2 strings:

– Input

• Strings S1, …, SK (total length n)

– Output

• l(j) (and pointers to substrings) for 2 <= j <= K

• Solution

– Build a generalized suffix tree for the K strings

• each string has a unique end character, so each leaf shows up only once

Solution continued


– C(v): number of distinct leaf labels (string identifiers) in the subtree rooted at node v

– Given C(v) values and string-depth values, do a simple traversal of the tree to find these K-1 values and pointers to locations in substrings

– Computing C(v) efficiently

• The number of leaves is not correct, as some leaves may have the same label

• Keep a length-K bit vector at each node, 1 bit per string in the set

• OR your way up the tree

– Each OR operation takes O(K) time, which gives O(Kn) running time

• Can be improved to O(n) later
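The bit-vector computation of C(v) can be sketched as follows. This is a rough illustration, not the linear-time construction: it assumes a naive generalized suffix trie (quadratic in the total length), uses a Python integer as the K-bit vector, and the function name `common_substring_lengths` is invented for this example.

```python
def common_substring_lengths(strings):
    """Return {j: (l(j), one witness substring)} for 2 <= j <= K."""
    K = len(strings)
    root = {"kids": {}, "mask": 0}
    # Insert every suffix of every string; mark the node where it ends
    # with the bit of the string it came from (the "leaf label").
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "mask": 0})
            node["mask"] |= 1 << idx

    best = {j: (0, "") for j in range(2, K + 1)}

    def visit(node, label):
        # Postorder: OR the children's bit vectors into this node's vector.
        mask = node["mask"]
        for ch, child in node["kids"].items():
            mask |= visit(child, label + ch)
        count = bin(mask).count("1")          # C(v): distinct strings below v
        for j in range(2, count + 1):
            if len(label) > best[j][0]:       # len(label) is the string depth
                best[j] = (len(label), label)
        return mask

    visit(root, "")   # recursion depth is bounded by the longest string
    return best


result = common_substring_lengths(
    ["sanddollar", "sandlot", "handler", "grand", "pantry"])
for j in sorted(result):
    print(j, result[j])
```

On the slide's example the lengths come out as 4, 3, 3, 2; the witness for j = 2 may be "sand" or "andl", since both have length 4 and occur in two strings.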

Repeated substrings

• Given a single string S

• Definitions

– A maximal pair in S is a pair of identical substrings a and b in S such that the character to the immediate left of a differs from the character to the immediate left of b, and the character to the immediate right of a differs from the character to the immediate right of b

• Add unique characters to the front and end of S so that prefixes and suffixes are included

– Representation: (p1, p2, n’)

• the two starting positions and the length of the maximal pair

– R(S) is the set of all triples representing maximal pairs in S

Example

• S = xabcyiiizabcqabcyrxar

• 123456789012345678901

– (2, 10, 3) is a maximal pair

– (10, 14, 3) is a maximal pair

– (2, 14, 3) is not a maximal pair

• (2, 14, 4) is a maximal pair

– Note that positions 2 and 14 each occur as a start position in two distinct maximal pairs
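The four claims in this example can be checked directly against the definition. The helper below is a brute-force sketch (the name `is_maximal_pair` is made up here); it treats the positions just outside S as unique sentinel characters, as the definition suggests.

```python
def is_maximal_pair(s, p1, p2, length):
    """Check the triple (p1, p2, length) with 1-based start positions."""
    a, b = p1 - 1, p2 - 1
    if a == b or length <= 0:
        return False
    if a + length > len(s) or b + length > len(s):
        return False
    if s[a:a + length] != s[b:b + length]:
        return False

    def char_at(i):
        # Outside the string, return a sentinel unique to that index,
        # mimicking the unique characters added at the front and end of S.
        return s[i] if 0 <= i < len(s) else ("sentinel", i)

    left_differs = char_at(a - 1) != char_at(b - 1)
    right_differs = char_at(a + length) != char_at(b + length)
    return left_differs and right_differs


S = "xabcyiiizabcqabcyrxar"
for triple in [(2, 10, 3), (10, 14, 3), (2, 14, 3), (2, 14, 4)]:
    print(triple, is_maximal_pair(S, *triple))
```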

More definitions

• A maximal repeat a is a substring of S that is defined by a maximal pair of S

• R’(S) is the set of maximal repeats

• Previous example

– abc and abcy are maximal repeats of S

– abc is represented only once in R’(S)

– |R’(S)| is smaller than |R(S)|, since abc accounts for two triples in R(S) but only one entry in R’(S)

Even more definitions

• A supermaximal repeat a is a maximal repeat of S that never occurs as a substring of another maximal repeat of S

• Previous example

– abcy is a supermaximal repeat of S

– abc is NOT a supermaximal repeat of S

Problem definition

• Maximal repeats

– Input

• String S (length n)

– Output

• R’(S)

Properties of maximal repeats

• Construct suffix tree for S

• Lemma

– If a is a maximal repeat in S, then a is the path-label of an internal node v in the suffix tree T

• a does not end in the middle of an edge

• (this captures that the characters following the occurrences of a differ)

• Corollary

– There are at most n maximal repeats

• n leaves

• all internal nodes except the root have at least two children

• therefore, at most n internal nodes

More properties of maximal repeats

• Definitions

– Character S(i-1) is the left character of position i

– The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf

– A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters

• Theorem

– The string a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse

• (this captures that the characters preceding the occurrences of a differ)

Identifying left diverse nodes

• Bottom-up procedure

– Every node is either marked left diverse or given a left character label

– Leaf node:

• Label each leaf with its left character

– Internal node v:

• If any child is left diverse, so is v

• If two children have different left character labels, v is left diverse

• Otherwise, v takes on the common left character label of its children

• Compact representation

– There is a compact tree that consists only of the left diverse nodes and represents all maximal repeats
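The theorem can be tried out without building a suffix tree. The sketch below is a brute-force stand-in (roughly cubic time, suitable only for short strings), with the invented name `maximal_repeats`: a substring with at least two distinct right characters corresponds to the path-label of an internal node, and at least two distinct left characters corresponds to that node being left diverse.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Brute-force maximal repeats of s via left/right character diversity."""
    left = defaultdict(set)    # substring -> set of characters seen to its left
    right = defaultdict(set)   # substring -> set of characters seen to its right
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            # None stands for the unique sentinel before / after the string.
            left[sub].add(s[i - 1] if i > 0 else None)
            right[sub].add(s[j] if j < n else None)
    return {sub for sub in left
            if len(right[sub]) >= 2      # ends at a branching (internal) node
            and len(left[sub]) >= 2}     # that node is left diverse


print(sorted(maximal_repeats("xabcyiiizabcqabcyrxar")))
```

On the running example this reports abc and abcy, plus the shorter repeats a, xa, i, ii and r, which also satisfy the definition.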

Problem definition

• Supermaximal repeats

– Input

• String S (length n)

– Output

• The set of supermaximal repeats of S

• Key property

– A left diverse node v represents a supermaximal repeat if and only if all of v’s children are leaves, and each has a distinct left character

– Prove this

Matching Statistics

• Setting

– Text T of length m

– Pattern P of length n

• Definition

– For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring beginning at T(i) that matches a substring somewhere in P

• With matching statistics, one can solve several problems with less space than a suffix tree

– Exact matching example: P occurs at i in T if and only if ms(i) = |P|

Why study matching statistics

• With matching statistics, one can solve several problems with less space than a suffix tree

– Exact matching example

• We’ll show an O(n) preprocessing time and O(m) search time solution, matching the traditional methods

• Key observation: P matches the substring beginning at i in T if and only if ms(i) = |P|

Construction Problem

• Input

– Text T of length m

– Pattern P

• Output

– Compute ms(i) for 1 <=i <= m

Solution

• Compute suffix tree of P retaining suffix links

• ms(1): match T against tree

• ms(i+1) given ms(i)

– we are at some node v in the tree

• If v is internal, follow its suffix link to s(v)

• Else if v is a leaf, go up one level to its parent w

– If w is an internal node, follow its suffix link to s(w)

– Traverse downwards using the skip/count trick until we have matched all the characters in the edge label (w,v)

• Now match against T character by character until we have a mismatch and can output ms(i+1)
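The values themselves are easy to compute naively, which is handy for checking a suffix-tree implementation. The sketch below is not the linear-time procedure above: it assumes Python's substring test `in` in place of the suffix tree of P, so each step is far from constant time. It does keep the one idea the suffix links exploit, namely ms(i+1) >= ms(i) - 1. The name `matching_statistics` is made up for this example.

```python
def matching_statistics(text, pattern):
    """ms[i] (0-based i) = length of the longest substring of text starting
    at i that occurs somewhere in pattern."""
    ms = []
    length = 0
    for i in range(len(text)):
        # Dropping the first matched character keeps the rest matched,
        # so restart from ms(i) - 1 instead of from zero.
        length = max(length - 1, 0)
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


T, P = "abcxabcdex", "wyabcwzqabcdw"
print(matching_statistics(T, P))

# Exact matching: P occurs at position i of T exactly when ms(i) = |P|.
text, pat = "mississippi", "ssi"
hits = [i + 1 for i, v in enumerate(matching_statistics(text, pat)) if v == len(pat)]
print(hits)
```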

Adding location of substring in P

• p(i): a location in P such that the substring at p(i) matches the substring starting at T(i) for exactly ms(i) positions

• Before computing ms(i) values, mark each node in the suffix tree of P with the leaf number of one of its leaves

• Simply output this value when outputting ms(i) values

Applying matching statistics to LCS problem

• Input

– strings S and T

• Output

– longest common substring of S and T

• Solution method

– Compute suffix tree for the shorter string, say S

– Compute ms(i) values for T

– Maximal ms(i) value identifies LCS

Suffix Arrays

• Setting

– Text T of length m

• Definition

– A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T

• Add terminating character $ which is lexically smallest

• Example

– T = m i s s i s s i p p i

– position: 1 2 3 4 5 6 7 8 9 10 11

– rank of the suffix starting at each position: 5 4 11 9 3 10 8 2 7 6 1

– Pos(1), …, Pos(11), where Pos(i) is the start of the i-th smallest suffix: 11 8 5 2 1 10 9 7 4 6 3

Computing Suffix Arrays

• Input

– Text T of length m

• Output

– Pos array

• Solution

– Compute suffix tree of T

– Do a lexical depth-first traversal of the tree, filling in Pos(i) with the leaves in the order they are encountered

– Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
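For small inputs the same array can be obtained without any tree by sorting the suffixes directly. This is a different, slower technique than the traversal above (sorting costs O(m^2 log m) character comparisons in the worst case), shown only as a reference for checking results; the function name `suffix_array` is an assumption of this sketch.

```python
def suffix_array(text):
    """Pos as a list of 1-based start positions in lexicographic suffix order.
    Python orders a proper prefix before any longer string, which has the same
    effect as appending a smallest terminator $."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("mississippi"))
```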

Using Suffix Arrays

• Input

– Text T of length m

– Pattern P of length n

• Output

– All occurrences of P in T

• Solution

– Compute suffix array Pos for T

Properties of Suffix Arrays

• If P is in T, then all these locations will be grouped consecutively in Pos

• O(n log m) solution to matching problem

– Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)

– Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
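A minimal sketch of the O(n log m) search, assuming Pos holds 1-based positions and was built by naive sorting. The function name `find_with_suffix_array` is hypothetical; each comparison looks only at the first n characters of a suffix.

```python
def find_with_suffix_array(text, pos, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    n = len(pattern)

    def prefix(k):
        # First n characters of the suffix at rank k (0-based rank).
        start = pos[k] - 1
        return text[start:start + n]

    # Smallest rank whose n-character prefix is >= pattern.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Smallest rank whose n-character prefix is > pattern.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All ranks in [first, lo) are occurrences: they are consecutive in Pos.
    return sorted(pos[first:lo])


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(find_with_suffix_array(text, pos, "ssi"))
```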

Speeding up binary search

• Let L and R denote the current left and right boundaries of the search interval

– Initialization: L = 1, R = m

• Let l and r denote the length of the longest prefix of suffix Pos(L) and suffix Pos(R), respectively, that matches a prefix of P

• Define M = ceiling((L+R)/2)

– Define mlr = min(l,r)

– Can begin the comparison of P against suffix Pos(M) at position mlr+1

• In practice, this is sufficient to achieve O(n + log m) search time, but the worst case is Ω(n log m)
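The mlr trick can be sketched as a lower-bound search. This is one possible arrangement under stated assumptions, not the slides' exact procedure: ranks are 0-based, Pos holds 1-based positions, the text is non-empty, and the helper names are invented. Skipping the first min(l, r) characters is safe because every suffix ranked between L and R shares that prefix with P.

```python
def extend_match(text, start, pattern, k):
    """Number of leading pattern characters matching text[start:], given that
    the first k are already known to match."""
    while (k < len(pattern) and start + k < len(text)
           and pattern[k] == text[start + k]):
        k += 1
    return k


def pattern_le_suffix(text, start, pattern, k):
    """True if pattern <= text[start:] on its first len(pattern) characters,
    where k is the length of their common prefix."""
    if k == len(pattern):
        return True
    if start + k == len(text):
        return False               # the suffix ran out first, so it is smaller
    return pattern[k] < text[start + k]


def first_suffix_not_below(text, pos, pattern):
    """Smallest 0-based rank whose suffix is >= pattern (len(pos) if none)."""
    L, R = 0, len(pos) - 1
    l = extend_match(text, pos[L] - 1, pattern, 0)
    if pattern_le_suffix(text, pos[L] - 1, pattern, l):
        return 0
    r = extend_match(text, pos[R] - 1, pattern, 0)
    if not pattern_le_suffix(text, pos[R] - 1, pattern, r):
        return len(pos)
    # Invariant: suffix at rank L < pattern <= suffix at rank R.
    while R - L > 1:
        M = (L + R + 1) // 2                     # ceiling((L+R)/2)
        k = extend_match(text, pos[M] - 1, pattern, min(l, r))
        if pattern_le_suffix(text, pos[M] - 1, pattern, k):
            R, r = M, k
        else:
            L, l = M, k
    return R


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(first_suffix_not_below(text, pos, "ssi"))
```

If the suffix at the returned rank starts with the pattern, the occurrences are that rank and the consecutive ranks after it that share the prefix.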

Longest common prefixes

• Definition: Lcp(i,j) is the length of the longest common prefix of the suffixes beginning at Pos(i) and Pos(j).

• Mississippi example

– Pos(3) = 5 (issippi)

– Pos(4) = 2 (ississippi)

– Lcp(3,4) = 4

Getting to max(l,r) with Lcp’s

• L, R, M, l, r defined as before

• If l = r, compare P against suffix Pos(M) starting at position l+1 = r+1

• Suppose l > r

– If Lcp(L,M) > l, the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and suffix Pos(L)

– Therefore, P agrees with suffix Pos(M) up through position l but disagrees in position l+1

– Furthermore, suffix Pos(M) is lexically smaller than P

– Update: L = M, l and r unchanged

• Suppose l > r

– If Lcp(L,M) < l, the common prefix of suffix Pos(L) and suffix Pos(M) is shorter than the common prefix of P and suffix Pos(L)

– Therefore, P agrees with suffix Pos(M) up through position Lcp(L,M)

– Character Lcp(L,M)+1 of P and of suffix Pos(L) is lexically smaller than the corresponding character of suffix Pos(M)

– Update: R = M, r = Lcp(L,M)

• Suppose l > r

– If Lcp(L,M) = l, the common prefix of suffix Pos(L) and suffix Pos(M) has the same length as the common prefix of P and suffix Pos(L)

– Therefore, P agrees with suffix Pos(M) up through position l and maybe even further

– Need to compare P(l+1) to the corresponding position in suffix Pos(M)

– Update: R or L is updated according to the outcome of the comparisons

O(n + log m) bound

• Since we begin at max(l,r), we never compare a matched position in P more than once

• Redundant comparisons of P are reduced to at most one per binary search phase, giving us O(n + log m)

Computing Lcp values quickly

• We want to get them in O(m) time

• However, there are potentially O(m^2) different pairs (i,j) for which Lcp(i,j) could be asked

• Crucial point

– Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure

– See Figure 7.7 for an example

Process for needed Lcp values

• Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf

• Other Lcp values

– Lcp(i,j): the minimum of Lcp(k,k+1) over k from i to j-1

– Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
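The identity Lcp(i,j) = min of the adjacent values can be checked on the mississippi example. The sketch below assumes the adjacent values are computed by direct character comparison rather than from a suffix tree traversal, and the helper names are made up.

```python
def common_prefix_length(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def adjacent_lcps(text, pos):
    """adj[k] holds Lcp(k+1, k+2) in the slides' 1-based rank notation."""
    return [common_prefix_length(text[pos[k] - 1:], text[pos[k + 1] - 1:])
            for k in range(len(pos) - 1)]


def lcp_from_adjacent(adj, i, j):
    """Lcp(i, j) for 1-based ranks i < j as a minimum over adjacent values."""
    return min(adj[i - 1:j - 1])


text = "mississippi"
pos = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
adj = adjacent_lcps(text, pos)
print(adj)
print(lcp_from_adjacent(adj, 3, 4))
```

Taking a fresh minimum per query costs O(j - i); the binary tree of needed values mentioned above precomputes exactly the O(m) minima that the search will ask for.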

Lowest common ancestor

• 1-time input

– Tree T (not necessarily a suffix tree)

• Later input

– 2 nodes, v and w, of T

• Output

– lowest common ancestor of v,w in T

• Goal

– linear preprocess time

– O(1) query time

Longest Common Extension

• 1-time input

– Strings S1 and S2

• Later input

– index positions i and j

• Output

– length of longest substring of S1 beginning at i that matches substring of S2 beginning at j

• Goal

– linear preprocess time

– O(1) query time

Illustration

• Relationship to longest common substring

– Similar, but now start positions are fixed

Solution

• Linear preprocessing

– Create generalized suffix tree for S1 and S2

– Compute string depth at each node

– Process tree to allow for constant-time LCA queries

– Establish pointers to all leaf nodes in tree

• Constant-time query processing

– Find u = lca(v,w), where v is the leaf for suffix i of S1 and w is the leaf for suffix j of S2

– Output string depth of u
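Constant-time queries can also be sketched without a suffix tree. The code below swaps in a different but related technique: a suffix array of the concatenated strings, adjacent Lcp values, and a sparse table for range minima, which plays the role of the LCA structure. Assumptions: the separators "\x00" and "\x01" do not occur in either string, preprocessing here is not linear (naive sorting and an O(n log n) table), and `build_lce` is an invented name.

```python
def build_lce(s1, s2):
    """Return lce(i, j) for 1-based i in s1 and j in s2."""
    text = s1 + "\x00" + s2 + "\x01"      # assumed-unused separator characters
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    rank = [0] * n
    for r, start in enumerate(order):
        rank[start] = r

    # adj[r] = common prefix length of the suffixes ranked r-1 and r.
    adj = [0] * n
    for r in range(1, n):
        a, b = order[r - 1], order[r]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        adj[r] = k

    # Sparse table: table[t][i] = min(adj[i : i + 2**t]).
    table = [adj]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span])
                      for i in range(n - 2 * span + 1)])
        span *= 2

    def lce(i, j):
        a, b = rank[i - 1], rank[len(s1) + j]   # s2 starts after the separator
        lo, hi = min(a, b) + 1, max(a, b)       # need min(adj[lo..hi])
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


lce = build_lce("xabcabd", "zabcabe")
print(lce(2, 2))
```

Each query reads two table entries, so it runs in constant time once the table is built.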
