
1

Suffix Trees

© Jeff Parker, 2009

2

Outline

An introduction to the Suffix Tree

Some sample applications

How to build a Suffix Tree efficiently

3

Problems

We have a corpus of information: genes, proteins

We want to see what two sequences have in common

We want to be able to find matches for a gene or protein

Model this as a search for a pattern in a text

The problem is hard because

Strings are very long

The set of possible matches is large

Today we will focus on exact matches

4

Pattern Matching

The basis for the simplest (exact) pattern match follows

Algorithm

Line up text and pattern

Compare the two

If they match

Report the position of match

Else

Slide pattern to right and try again

[Figure: the pattern lined up beneath the text]

5

Compare pattern at this position

// Does the pattern match the text at this position?
boolean compare(String text, int pos, String pattern) {
    for (int i = 0; i < pattern.length(); i++)
        if (text.charAt(pos + i) != pattern.charAt(i))
            return false; // mismatch at offset i
    return true;
}

6

Simple Pattern Match

// Where is pattern pat in string text? Returns -1 if there is no match.
int findMatch(String text, String pat) {
    int pos = 0;
    while (pos <= text.length() - pat.length()) {
        if (compare(text, pos, pat))
            return pos;
        pos++; // Slide pattern right one space
    }
    return -1;
}

7

Analysis

For a pattern of length N and a text of length M

This algorithm behaves well in practice: O(N + M)

The worst case is bad: O(NM)

We can do better if we preprocess

Preprocess pattern: Boyer-Moore; Knuth, Morris, Pratt

Preprocess text: Suffix Tree
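The gap between the typical case and the worst case can be made concrete by counting character comparisons. The sketch below is an illustration added here, not code from the slides: NaiveMatchCost and comparisons are hypothetical names, and the loop is the slide-and-compare search with a counter added.

```java
final class NaiveMatchCost {
    // Counts the character comparisons made by the slide-and-compare search
    // until the first match is found or the text is exhausted.
    static long comparisons(String text, String pat) {
        long count = 0;
        for (int pos = 0; pos <= text.length() - pat.length(); pos++) {
            int i = 0;
            while (i < pat.length()) {
                count++; // one comparison of text[pos + i] with pat[i]
                if (text.charAt(pos + i) != pat.charAt(i)) break;
                i++;
            }
            if (i == pat.length()) return count; // match found at pos
        }
        return count;
    }

    public static void main(String[] args) {
        // Friendly input: almost every alignment fails on its first character.
        System.out.println(comparisons("xxxxxxxxxxab", "ab"));
        // Hostile input: every alignment matches N - 1 characters before failing.
        System.out.println(comparisons("a".repeat(20), "aaaab"));
    }
}
```

The first call needs 12 comparisons for a text of length 12. The second needs 80: each of the 16 alignments costs 5 comparisons, which is the O(NM) pattern.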

8

O(|pattern|) Pattern Matching

Rather than view the problem as moving the pattern, rephrase it

9

Faster Pattern Matching

Is our pattern the prefix of a suffix of the text string S? Take all suffixes…

10

Faster Pattern Matching

Take all suffixes and slide them left

11

Faster Pattern Matching

We want to find a suffix that has the pattern as a prefix

12

Sort suffixes
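Once the suffixes are sorted, those that start with the pattern sit next to each other, so even a plain binary search can find one. The sketch below uses that swapped-in technique (binary search over sorted suffix start positions) rather than the trie the slides build next. SortedSuffixes, sortedSuffixStarts and find are hypothetical names, and positions are 0-based here.

```java
import java.util.Arrays;

final class SortedSuffixes {
    // Start positions (0-based) of all suffixes, ordered by the suffixes themselves.
    static Integer[] sortedSuffixStarts(String text) {
        Integer[] starts = new Integer[text.length()];
        for (int i = 0; i < starts.length; i++) starts[i] = i;
        // Simple but slow: every comparison materialises two suffixes.
        Arrays.sort(starts, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        return starts;
    }

    // Returns the start of some occurrence of pat in text, or -1 if there is none.
    static int find(String text, Integer[] starts, String pat) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(starts[mid]);
            if (suffix.startsWith(pat)) return starts[mid]; // pat is a prefix of this suffix
            if (suffix.compareTo(pat) < 0) lo = mid + 1;    // matches, if any, sort later
            else hi = mid - 1;                              // matches, if any, sort earlier
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "ababc";
        Integer[] starts = sortedSuffixStarts(text);
        System.out.println(Arrays.toString(starts));
        System.out.println(find(text, starts, "bc"));
    }
}
```

For ababc the sorted order is ababc, abc, babc, bc, c, that is, starts 0, 2, 1, 3, 4. Each probe costs up to |pattern| character comparisons, which is why the slides move on to a trie.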

13

Build Trie

Allows O(N) search for pattern

14

Suffix Trie

Multi-way tree

Each branch is labeled with char

If the trie is ready, match takes O(|pattern|) time

Example: text S is ababc

s1 = ababc

s2 = babc

s3 = abc

s4 = bc

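s5 = c

[Figure: suffix trie for ababc; each branch is labeled with one character and the leaves are numbered 1 to 5 after the suffixes s1 to s5]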
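A suffix trie like the one described on this slide can be sketched in a few lines. This is an illustration under assumptions, not the author's code: SuffixTrie, Node and occurrences are hypothetical names, children are kept in a HashMap, and suffixes are numbered from 1 as on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int suffix = 0; // 1-based number of the suffix that ends here, 0 if none
    }

    private final Node root = new Node();

    // Inserts every suffix character by character: O(|S|^2) time and space.
    SuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++)
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            node.suffix = start + 1;
        }
    }

    // Walks the pattern from the root in O(|pattern|) steps, then gathers
    // the suffix numbers below the node reached (1-based match positions).
    List<Integer> occurrences(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.children.get(pattern.charAt(i));
            if (node == null) return new ArrayList<>(); // no branch matches: fail
        }
        List<Integer> found = new ArrayList<>();
        collect(node, found);
        Collections.sort(found);
        return found;
    }

    private static void collect(Node node, List<Integer> out) {
        if (node.suffix != 0) out.add(node.suffix);
        for (Node child : node.children.values()) collect(child, out);
    }

    public static void main(String[] args) {
        System.out.println(new SuffixTrie("ababc").occurrences("ab"));
    }
}
```

For the text ababc, the pattern ab is found at positions 1 and 3, matching the observation that ab is a prefix of both s1 and s3.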

15

Suffix Trie

Suffix trie takes O(|S|^2) space

Each step of the search for a match takes constant time

If no branch matches the char, we fail

Leaf holds the name of the suffix

We may have multiple matches

The string ab occurs twice: it is a prefix of s1 and s3

[Figure: the suffix trie for ababc from the previous slide, with the suffixes s1 = ababc through s5 = c listed beside it]

16

Suffix Tree

Nodes that mark a split are called essential

Remove non-essential nodes, and label edges with strings

[Figure: the suffix trie for ababc collapsed into a suffix tree; the edges carry the string labels ab, b, c and abc, and the leaves are numbered 1 to 5]

17

Properties

Tree has

|S| leaves and at most |S|-1 interior nodes, so at most 2|S|-1 nodes and 2|S|-2 edges

The algorithm for search is the same: walk the tree, matching edges

While this has fewer nodes, it is not clear that we need

Any less storage? The sum of the lengths of the edge strings can still be O(N^2)

Any less time to build the tree?

Storage is easier to address

[Figure: suffix tree for ababc]

18

Worst Case Storage

Here are some strings that need O(N^2) storage when stored as tries

abcdefg

We can get a trie that needs O(N^2) storage with a limited alphabet:

a^n b^n a^n b^n c

[Figure: trie for abcdefg with one unshared branch per suffix: abcdefg, bcdefg, cdefg, defg, efg and so on]

19

Efficient Storage

We store the whole string once, and keep pointers into that string in the nodes

We have constant space per node and O(|S|) nodes, thus linear space

[Figure: suffix tree for ababc with each edge label replaced by a pair of indices into S (1, 2 for ab; 2, 2 for b; 3, 5 for abc; 5, 5 for c), and with the nodes linked by child and sibling pointers]
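The index-pair idea can be sketched as follows. This is a minimal illustration with several assumptions: CompactSuffixTree and its members are hypothetical names, children live in a HashMap instead of child and sibling pointers, indices are 0-based half-open ranges, a terminator $ is appended, and the tree is built by the naive O(N^2) method of inserting suffixes from longest to shortest. Linear-time construction is not attempted here.

```java
import java.util.HashMap;
import java.util.Map;

final class CompactSuffixTree {
    static final class Node {
        int start, end; // the edge into this node is text[start, end): two ints, not a string
        final Map<Character, Node> children = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);

    CompactSuffixTree(String s) {
        text = s + "$"; // terminator, assumed not to occur in s
        for (int i = 0; i < text.length(); i++) insert(i);
    }

    // Walks down from the root and splits an edge where the new suffix diverges.
    private void insert(int suffixStart) {
        Node node = root;
        int pos = suffixStart;
        while (true) {
            Node child = node.children.get(text.charAt(pos));
            if (child == null) { // no edge starts with this character: hang a new leaf
                node.children.put(text.charAt(pos), new Node(pos, text.length()));
                return;
            }
            int k = child.start;
            while (k < child.end && text.charAt(k) == text.charAt(pos)) { k++; pos++; }
            if (k == child.end) { node = child; continue; } // whole edge matched
            Node mid = new Node(child.start, k); // split: one new node, two edges
            node.children.put(text.charAt(child.start), mid);
            child.start = k;
            mid.children.put(text.charAt(k), child);
            mid.children.put(text.charAt(pos), new Node(pos, text.length()));
            return;
        }
    }

    boolean contains(String pattern) {
        Node node = root;
        int i = 0;
        while (i < pattern.length()) {
            Node child = node.children.get(pattern.charAt(i));
            if (child == null) return false;
            for (int k = child.start; k < child.end && i < pattern.length(); k++, i++)
                if (text.charAt(k) != pattern.charAt(i)) return false;
            node = child;
        }
        return true;
    }

    int countNodes() { return count(root); }

    private static int count(Node n) {
        int total = 1;
        for (Node child : n.children.values()) total += count(child);
        return total;
    }
}
```

For ababc$ this yields 9 nodes (the root, the two interior nodes ab and b, and 6 leaves), each holding two integers regardless of how long its edge label is.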

20

Applications: Longest Repeat

As well as searching for a string, we can answer questions such as

What is the longest string that is duplicated?

What is the longest string that occurs k times?

Internal nodes mark repeating substrings

Keep track of the splits, and remember the deepest.

In our example, s1 and s3 share ab

[Figure: suffix tree for ababc]
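Both questions can be answered by counting how many suffixes pass through each node and remembering the deepest node whose count reaches k. The sketch below does this on an uncompressed suffix trie for brevity, so it costs O(N^2) rather than the linear cost a suffix tree would allow. LongestRepeat and longestRepeated are hypothetical names, and overlapping occurrences are counted.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestRepeat {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // number of suffixes passing through = occurrences of this string
    }

    // Longest substring that occurs at least k times; "" if there is none.
    static String longestRepeated(String text, int k) {
        Node root = new Node();
        int bestLen = 0, bestStart = 0;
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.count++;
                int len = i - start + 1; // depth of this node in the trie
                if (node.count >= k && len > bestLen) {
                    bestLen = len;
                    bestStart = start;
                }
            }
        }
        return text.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(longestRepeated("ababc", 2));
        System.out.println(longestRepeated("mississippi", 2));
    }
}
```

For ababc with k = 2 the answer is ab, as on the slide. For mississippi it is issi, whose two occurrences overlap.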

21

Longest Common Substring

Given two strings S and T, find the longest common substring

Build the suffix tree for the string S$T

Mark leaves of suffixes that begin in S red

Mark leaves of suffixes that begin in T black

Make a bottom-up traversal, looking for the lowest (deepest) split that has leaves in both sets

[Figure: suffix tree for ababc]
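The red and black marking can be sketched on a plain trie. This is a simplified stand-in for the S$T construction, not the slide's method verbatim: the suffixes of S and T are inserted into one shared trie, each node records which string reached it, and the deepest node reached by both spells the answer. LongestCommonSubstring and its members are hypothetical names, and the cost is quadratic.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestCommonSubstring {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean inS, inT; // "red" and "black" marks
    }

    static String longestCommon(String s, String t) {
        Node root = new Node();
        addSuffixes(root, s, true);
        addSuffixes(root, t, false);
        return deepestShared(root, new StringBuilder(), "");
    }

    private static void addSuffixes(Node root, String text, boolean fromS) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                if (fromS) node.inS = true; else node.inT = true;
            }
        }
    }

    // Depth-first walk that only descends through nodes carrying both marks.
    private static String deepestShared(Node node, StringBuilder path, String best) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node child = e.getValue();
            if (!(child.inS && child.inT)) continue;
            path.append(e.getKey());
            if (path.length() > best.length()) best = path.toString();
            best = deepestShared(child, path, best);
            path.setLength(path.length() - 1);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestCommon("ababc", "babca"));
    }
}
```

For ababc and babca the result is babc. When several common substrings share the maximum length, which one is returned depends on map iteration order.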

22

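Applications: Longest Palindrome

Given a string S, find the longest palindrome in S

Build the suffix tree for the string S$S^-1, where S^-1 is S reversed

Mark leaves of suffixes that begin in S red

Mark leaves of suffixes that begin in S^-1 black

Look for the lowest split that has leaves in both sets

A common substring is a palindrome only if its occurrence in S^-1 mirrors its occurrence in S, so check the positions of the leaves

[Figure: suffix tree for ababc]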

23

Linear Time Construction

There is a long history of work

Weiner 1973

McCreight 1976

Ukkonen 1992

[Figure: the suffixes of mississippi: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i]

26

McCreight

Add the suffixes from longest to shortest

We add a termination symbol, such as $, that does not appear in text

This forces each addition to split the existing tree

We can split (add a node and two edges) in constant time

Can we find the place to do the splitting in constant time?

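Suffix links give amortized linear time. But first, understand the algorithm.

[Figure: inserting the suffixes ababc, babc and abc of ababc in turn; the third insertion splits the edge ababc into ab followed by abc and c, creating leaf 3]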

27

Ukkonen

Online algorithm: we don’t need to know all of the string

Grow all suffixes together. In step k, add S[k] to the end of each suffix

At some point, string sk will split from the tree (s2 breaks loose in step 2)

After that, sk will never split again (though something may split from it)

A split for sk may mean a similar split for sk+1

3 splits when adding c: s3 splits from s1, s4 from s2 and s5 from root

[Figure: the tree for ababc after each step (a, ab, aba, abab) and the final tree after adding c, with leaves 1 to 5]

28

Review

Introduce graphical notation for implicit nodes

aba means both suffixes “a” and “aba” are on the edge

[Figure: the step-by-step trees for ababc redrawn with implicit nodes marked]

ababc$ = s1

babc$ = s2

abc$ = s3

bc$ = s4

c$ = s5

$ = s6

29

Mississippi

mississippi$ = s1

ississippi$ = s2

ssissippi$ = s3

sissippi$ = s4

issippi$ = s5

ssippi$ = s6

sippi$ = s7

ippi$ = s8

ppi$ = s9

pi$ = s10

i$ = s11

$ = s12

[Figure: the tree after reading m, mi, mis and miss; leaves 1, 2 and 3 carry the edges miss..., iss... and ss...]

s4 is an implicit node

s4 is the active path

Def: the active path is the first non-leaf suffix remaining

30

Mississippi

s4 is an implicit node (the red s on the s3 edge)

s4 is the active path

Def: the active path is the first non-leaf suffix remaining

When we add S[5] = i, the active path s4 splits

s5 becomes the active path.

[Figure: the tree for miss... and then for missi..., where s4 has split from s3 and become leaf 4]

31

Mississippi

s5 is the active path

(First non-leaf suffix remaining)

At the end there are 3 non-leaf suffixes (s5, s6, s7)

[Figure: the tree after missi..., missis... and mississ...; leaves 1 to 4 keep growing while s5, s6 and s7 stay inside the tree]

32

Mississippi

Add i

Add p. We have never seen p, so all 4 (now 5) trailing suffixes split

s10, at the root, becomes the active path

[Figure: the tree for mississi... and, after the splits, the tree for mississip... with leaves 1 to 9]

33

Mississippi

Redraw the last diagram. We are about to add a second p.

s10 is the active path, and it is at the root

[Figure: the tree for mississip redrawn, with leaves 1 to 9]

34

Mississippi

The active path is still s10. It is trailing s9

[Figure: the tree for mississipp, with leaves 1 to 9]

35

Mississippi

Add i. This forces the split of s10 from s9. The active path is now s11

[Figure: the tree for mississippi, with the new leaf 10 split off from the edge to leaf 9]

36

Algorithm

We are building a tree, adding character S[k] to every suffix

We traverse the boundary path - the growing edge of the tree

The boundary path includes

Suffixes that have already become leaves

Suffixes that currently end in implicit interior nodes

We add character S[k] to the end of each suffix

In general we have O(N) suffixes on the boundary path, we add each of N characters to each suffix on the boundary path, and we must navigate from suffix to suffix, which may be O(N) steps apart.

How can we do this in O(N) time?

37

Algorithm

We have O(N) suffixes on boundary path,

We add each of N characters to each suffix on the boundary path,

We navigate from suffix to suffix, which may be O(N) steps apart.

How can we do this in O(N) time?

Ans: We cheat. Here are three big ideas (will explain each in detail)

1) Once a path has split off, updating it is free, so we ignore it

2) Rather than “walk” the boundary edge as we add a new character, we only need to watch one representative: the active path - the longest suffix that is not yet a leaf

3) When we do need to walk the boundary path there is a cheap way to walk from suffix to suffix, by creating suffix links

38

Leaves are Cheap

1) Once a path has split off, “updating” it is free

We represent a leaf that splits at character S[k] as the string S[k..whatever]

If some later suffix is following our path, it is up to that suffix to find the point of difference

S5 is following S2, but S2 is a leaf and does not care

We don’t even need to know the length of the string (whatever)

[Figure: the tree for mississi... with leaves 1 to 4; s5 runs along the edge that leads to leaf 2]

39

Active Path

2) We can focus our attention on the longest suffix that has not yet broken free, called the active path. This represents the rest of the boundary path

Assume the active path is the suffix Si and we have just added char S[k]

Assume that Si is a prefix of suffix Sj up to this point

Then Si+1 is a prefix of suffix Sj+1 and so on

Proof: Si+1 is just Si without character S[i]

The converse is not true.

Si may leave the tree while Si+1 remains in the tree

This means that we only need to watch S5

[Figure: the strings S[i..k] (Si) and S[j..k] (Sj) aligned, and below them S[i+1..k] (Si+1) and S[j+1..k] (Sj+1); beside them the tree for mississi... with leaves 1 to 4]
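The claim that one representative is enough can be checked by brute force. The sketch below is an illustration, not part of the construction algorithm: for every prefix S[1..k] it finds the longest suffix that also occurs earlier in that prefix, which is the suffix that has not yet become a leaf. ActivePath and activePathIndices are hypothetical names, indices are 1-based as on the slides, and the value k + 1 stands for the empty suffix at the root.

```java
import java.util.ArrayList;
import java.util.List;

final class ActivePath {
    // For each k = 1..n, the index i of the active path s_i after reading S[1..k].
    static List<Integer> activePathIndices(String s) {
        List<Integer> result = new ArrayList<>();
        for (int k = 1; k <= s.length(); k++) {
            String prefix = s.substring(0, k);
            int i = 1;
            while (i <= k) {
                String suffix = prefix.substring(i - 1); // S[i..k]
                // An occurrence that starts before position i means S[i..k]
                // lies inside the tree and is not a leaf yet.
                if (prefix.indexOf(suffix) < i - 1) break;
                i++;
            }
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(activePathIndices("mississippi"));
    }
}
```

For mississippi this gives 2, 3, 4, 4, 5, 5, 5, 5, 10, 10, 11: the active path is s4 after miss, stays at s5 from missi to mississi, jumps to s10 when the first p arrives, and moves to s11 with the final i, in agreement with the walkthrough. The index never decreases, which is what makes the amortized argument possible.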

40

Review example

Add p. We have never seen p, so s5, s6, s7, s8 and s9 all split.

s10, which is currently at the root, becomes the new active path

S5 is a prefix of S2

S6 is a prefix of S3

S7 is a prefix of S4

S8 is a prefix of S5

[Figure: the tree for mississi... and, after the splits, the tree for mississip... with leaves 1 to 9]

41

Suffix Links

3) There is a cheap way to walk the boundary path

Once the active path splits, we need to walk the boundary path until splitting stops

To explain the suffix link, return to our view as a trie for ababc

We have inserted S[1] through S[4], and are about to insert S[5] = c

s1 points to s2, which points to s3, which points to s4, which points to the root

I know I will have no problems with leaves s1 and s2: the active path is s3

When I find that s3 needs to split from s1, I need to check s4 as well, and perhaps s5

I follow the suffix pointers from s3

[Figure: the suffix trie for abab, with a suffix link drawn from the end of each suffix to the end of the next shorter one]
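The walk along suffix pointers is easiest to see on the trie itself. The sketch below builds a suffix trie online, one character at a time, following suffix links from the longest suffix until it reaches a suffix that already has the new character. This is the quadratic trie version only, not the linear-time tree algorithm, and OnlineSuffixTrie and its members are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class OnlineSuffixTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node suffixLink; // node spelling the same string minus its first character
    }

    final Node root = new Node();  // root.suffixLink stays null
    private Node longest = root;   // node spelling the whole text read so far
    int nodeCount = 1;

    static OnlineSuffixTrie of(String text) {
        OnlineSuffixTrie trie = new OnlineSuffixTrie();
        for (int i = 0; i < text.length(); i++) trie.append(text.charAt(i));
        return trie;
    }

    void append(char c) {
        Node current = longest;
        Node previousNew = null;
        // Walk the boundary path from the longest suffix toward the root and
        // stop at the first suffix that already continues with c.
        while (current != null && !current.children.containsKey(c)) {
            Node created = new Node();
            nodeCount++;
            current.children.put(c, created);
            if (previousNew != null) previousNew.suffixLink = created;
            previousNew = created;
            current = current.suffixLink;
        }
        // The last node created links to the existing node, or to the root.
        previousNew.suffixLink = (current == null) ? root : current.children.get(c);
        longest = longest.children.get(c);
    }

    // Number of suffix links followed from the longest suffix to the root.
    int suffixChainLength() {
        int steps = 0;
        for (Node n = longest; n != root; n = n.suffixLink) steps++;
        return steps;
    }
}
```

For ababc the trie has 13 nodes (the root plus one node per distinct substring), and the chain of suffix links from s1 passes through s2, s3, s4 and s5 before reaching the root, 5 steps in all.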

42

Accounting

I add one character at a time to one suffix - the active path

This is clearly linear

When the active path splits, I need to start walking the boundary path from the old active path to the new end point (the point where the splitting stops)

Any individual character may cause lots of splitting, but each suffix only splits once. Amortized cost is linear

To walk the boundary path, I update the suffix links. This can also be amortized.

[Figure: the suffix trie for abab with its suffix links, before and after adding c]
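The three ideas (open-ended leaves, a single active path, and suffix links) can be combined in a compact construction. The sketch below follows a widely circulated formulation of Ukkonen's algorithm and is not the code behind these slides. All names are hypothetical, children are stored in a HashMap, indices are 0-based, and a terminator $ is assumed not to occur in the input.

```java
import java.util.HashMap;
import java.util.Map;

final class UkkonenSuffixTree {
    private static final int OPEN = Integer.MAX_VALUE; // leaf edges run to "whatever"

    private final class Node {
        int start;      // the edge into this node is text[start, end)
        final int end;  // OPEN for a leaf, so the edge grows for free
        Node link;      // suffix link; null means "go to the root"
        final Map<Character, Node> next = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
        int length() { return Math.min(end, pos + 1) - start; }
    }

    private final String text;
    private final Node root = new Node(-1, -1);
    private int pos = -1;            // index of the last character added
    private Node activeNode = root;  // explicit node above the end of the active path
    private int activeEdge = 0;      // index in text of the first char of the active edge
    private int activeLength = 0;    // characters matched along that edge
    private int remainder = 0;       // suffixes that are not yet leaves
    private Node needLink;           // interior node still waiting for its suffix link
    int nodeCount = 1;

    UkkonenSuffixTree(String s) {
        text = s + "$";
        for (int i = 0; i < text.length(); i++) extend(text.charAt(i));
    }

    private Node newNode(int start, int end) { nodeCount++; return new Node(start, end); }

    private void addLink(Node node) {
        if (needLink != null) needLink.link = node;
        needLink = node;
    }

    private void extend(char c) {
        pos++;
        needLink = null;
        remainder++;
        while (remainder > 0) {
            if (activeLength == 0) activeEdge = pos;
            char edgeChar = text.charAt(activeEdge);
            Node next = activeNode.next.get(edgeChar);
            if (next == null) {                        // nothing starts with c here: new leaf
                activeNode.next.put(edgeChar, newNode(pos, OPEN));
                addLink(activeNode);
            } else {
                if (activeLength >= next.length()) {   // canonize: step down to the next node
                    activeEdge += next.length();
                    activeLength -= next.length();
                    activeNode = next;
                    continue;
                }
                if (text.charAt(next.start + activeLength) == c) { // c already present: stop
                    activeLength++;
                    addLink(activeNode);
                    break;
                }
                Node split = newNode(next.start, next.start + activeLength); // split the edge
                activeNode.next.put(edgeChar, split);
                split.next.put(c, newNode(pos, OPEN));
                next.start += activeLength;
                split.next.put(text.charAt(next.start), next);
                addLink(split);
            }
            remainder--;
            if (activeNode == root && activeLength > 0) { // next shorter suffix, from the root
                activeLength--;
                activeEdge = pos - remainder + 1;
            } else {                                      // next shorter suffix, via suffix link
                activeNode = activeNode.link != null ? activeNode.link : root;
            }
        }
    }

    boolean contains(String pattern) {
        Node node = root;
        int i = 0;
        while (i < pattern.length()) {
            Node next = node.next.get(pattern.charAt(i));
            if (next == null) return false;
            int stop = Math.min(next.end, text.length());
            for (int k = next.start; k < stop && i < pattern.length(); k++, i++)
                if (text.charAt(k) != pattern.charAt(i)) return false;
            node = next;
        }
        return true;
    }

    int countLeaves() { return countLeaves(root); }

    private int countLeaves(Node n) {
        if (n.next.isEmpty()) return 1;
        int total = 0;
        for (Node child : n.next.values()) total += countLeaves(child);
        return total;
    }
}
```

For mississippi$ the finished tree should have 12 leaves, one per suffix s1 to s12, and 19 nodes in total. The remainder counter corresponds to the number of non-leaf suffixes trailing behind the active path, and the OPEN end value is the "whatever" of the leaf edges.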

43

Building Suffix Links

When we split, we need to add new nodes

These nodes will need new suffix links

We are showing a chain of suffix links


51

Canonize

We represent a suffix as an explicit node and a (growing) string of characters

Start with (n1 (a))

Add characters bbac to get (n1 (abbac))

We canonize this in a sequence of steps to get a better representation

(n2 (bac))

(n3 (c))

This allows us to use the suffix link at n3 rather than the suffix link at n1

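[Figure: a path through the explicit nodes n1, n2, n3 and n4; the edges from n1 to n2 and from n2 to n3 together spell abba, and the edge below n3 begins with c]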
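The canonize step can be sketched on a hand-built path. The tree below is hypothetical and chosen only to reproduce the sequence above: the edge labels ab, ba and ca are inferred from the steps (n1 (abbac)), (n2 (bac)), (n3 (c)), and Canonize, Node, Edge, Reference and link are illustrative names. The routine assumes the string really does lie along the path, so it skips whole edges by length without comparing characters.

```java
import java.util.HashMap;
import java.util.Map;

final class Canonize {
    static final class Node {
        final String name;
        final Map<Character, Edge> edges = new HashMap<>(); // keyed by first label character
        Node(String name) { this.name = name; }
    }

    static final class Edge {
        final String label;
        final Node target;
        Edge(String label, Node target) { this.label = label; this.target = target; }
    }

    // A suffix represented as an explicit node plus the characters below it.
    static final class Reference {
        final Node node;
        final String rest;
        Reference(Node node, String rest) { this.node = node; this.rest = rest; }
    }

    static void link(Node from, String label, Node to) {
        from.edges.put(label.charAt(0), new Edge(label, to));
    }

    // Moves the reference down to the closest explicit node above its end point.
    static Reference canonize(Node node, String rest) {
        while (!rest.isEmpty()) {
            Edge edge = node.edges.get(rest.charAt(0));
            // Stop when the remaining string ends inside this edge.
            if (edge == null || edge.label.length() > rest.length()) break;
            node = edge.target;
            rest = rest.substring(edge.label.length());
        }
        return new Reference(node, rest);
    }

    public static void main(String[] args) {
        Node n1 = new Node("n1"), n2 = new Node("n2"), n3 = new Node("n3"), n4 = new Node("n4");
        link(n1, "ab", n2);
        link(n2, "ba", n3);
        link(n3, "ca", n4);
        Reference r = canonize(n1, "abbac");
        System.out.println("(" + r.node.name + " (" + r.rest + "))");
    }
}
```

Starting from (n1 (abbac)), the loop passes through (n2 (bac)) and stops at (n3 (c)), because the edge below n3 is longer than the one remaining character.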

52

Post mortem

Algorithm to build Suffix Tree is linear in time and space.

We haven’t proved this, but perhaps it is now plausible

But is the algorithm practical?

There are real issues when dealing with long strings

The human genome has about 3 billion base pairs

Keeping the suffix links updated can cause thrashing as we walk all over the suffix tree representing this

The suffix tree is important enough that people are working on the issue

One idea that is easy to describe: merging suffix trees

53

References

A great reference to the field is Dan Gusfield’s Algorithms on Strings, Trees, and Sequences

P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory, 1-11.

E. M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.

E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.

R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.

D. Gusfield (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.