bundled suffix trees - units.it

34
Introduction Bundled Suffix Trees An application BUNDLED SUFFIX TREES Luca Bortolussi 1 Francesco Fabris 2 Alberto Policriti 1 1 Department of Mathematics and Computer Science University of Udine 2 Department of Mathematics and Computer Science University of Trieste IFIP TCS 2006, Santiago, Chile, 23 rd –24 th August 2006

Upload: others

Post on 14-May-2022

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

BUNDLED SUFFIX TREES

Luca Bortolussi1 Francesco Fabris2 Alberto Policriti1

1Department of Mathematics and Computer ScienceUniversity of Udine

2Department of Mathematics and Computer ScienceUniversity of Trieste

IFIP TCS 2006, Santiago, Chile, 23rd–24th August 2006

Page 2: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Outline

1 IntroductionSuffix Trees

2 Bundled Suffix TreesEncoding Approximate InformationDefinitionSize and Construction

3 An applicationComputing Surprise MeasuresSummary

Page 3: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Suffix Trees

bcabbabc#

Gusfield D., Algorithms on strings, trees andsequences, Cambridge University Press, 1997.

E. Ukkonen. On-line construction of suffix-trees.Algorithmica, 14:249-260, 1995.

A Suffix Tree is a data structurerevealing the internal structureof a string.They occupy O(n) space andcan be built in O(n) time.

They are efficient for:

Exact String Matching

Longest Exact CommonSubstring Problem

Identifying ExactlyRepeated Patterns

Page 4: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Limitations of Suffix Trees

bcabbabc#

Gusfield D., Algorithms on strings, trees andsequences, Cambridge University Press, 1997.

Landau G.M., Vishkin U., Efficient StringMatching with k Mismatches, TheoreticalComputer Science, 43, 239-249, 1986.

Suffix Trees cannot dealnaturally with approximatestring matching problems.(Hamming or Edit distance)

Two difficult problems:

Longest CommonApproximate SubstringProblem

Extraction of approximatelyrepeated patterns

Page 5: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Extending Suffix Trees

THE TARGETExtending Suffix Trees in order to solve in a simple way someclasses of approximate string matching problems.

Bundled Suffix TreesBundled Suffix Trees extend suffix Trees.

They incorporate approximate information;They can be used like Suffix Trees for:

Longest Common Approximate Substring ProblemExtraction of approximately repeated patterns

Page 6: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Approximate Matching

Character matching is a relation among letters(in fact, it is the equality relation)

We model approximate matching as a non-transitive relationamong letters:

two strings “match” if all their letters are in relation.

Page 7: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Approximate Matching

Character matching is a relation among letters(in fact, it is the equality relation)

We model approximate matching as a non-transitive relationamong letters:

two strings “match” if all their letters are in relation.

Page 8: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Non-Transitive Relation: An Example

Modeling a relation based on Hamming Distance

Start from a basic alphabet (e.g. binary: A = {0, 1})Construct an alphabet composed of macrocharacters(e.g. A = {00, 01, 10, 11})Two letters x , y ∈ A are in relation if and only ifdH(x , y) ≤ D (e.g. D = 1).

The Relation Graph

00 ↔ 01l l

10 ↔ 11

Relation is non-transitive

It encapsulates a(restricted) form ofdistance.

Page 9: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabca ↔ b ↔ c We start from the suffix

tree for the string.

Let’s compare suffix 3 andsuffix 1:

b c a b b a b cm m m m m 6ma b b a c c

After bcabb in the tree, weput a red node with label 3.

Due to symmetry, there isalso a red node with label1 after abbab .

Page 10: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabca ↔ b ↔ c We start from the suffix

tree for the string.

Let’s compare suffix 3 andsuffix 1:

b c a b b a b cm m m m m 6ma b b a c c

After bcabb in the tree, weput a red node with label 3.

Due to symmetry, there isalso a red node with label1 after abbab .

Page 11: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabca ↔ b ↔ c We start from the suffix

tree for the string.

Let’s compare suffix 3 andsuffix 1:

b c a b b a b cm m m m m 6ma b b a c c

After bcabb in the tree, weput a red node with label 3.

Due to symmetry, there isalso a red node with label1 after abbab .

Page 12: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabca ↔ b ↔ c We start from the suffix

tree for the string.

Let’s compare suffix 3 andsuffix 1:

b c a b b a b cm m m m m 6ma b b a c c

After bcabb in the tree, weput a red node with label 3.

Due to symmetry, there isalso a red node with label1 after abbab .

Page 13: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabc ;a ↔ b ↔ c

If we do this process for everycouple of suffixes, we build aBundled Suffix Tree!

Note that this data structure isin the middle between a suffixtree and a suffix trie.

Page 14: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Bundled Suffix Tree: An Example

bcabbabc ;a ↔ b ↔ c

Bundled Suffix Trees can beused to:

solve the LongestCommon ApproximateSubstring Problem withrespect to a given relation(just find the lowest rednode).

extract information aboutapproximately repeatedpatterns.

Page 15: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

How Big?

The number of red nodesinserted depends on:

the relation

the structure of the text.

In the worst case, the numberof red nodes is quadratic in thelength of the text S. Example

On average, the number of red nodes is limited by

m1+δ, δ = log1/p+ C.

( m is the length of the text, p+ is the normalized frequency of themost common letter in S, C depends on the relation)

1 + δ is slightly greater than one! Example

Page 16: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

How Fast?

Naive AlgorithmThe naive algorithm for building a BuST tries to “match”every suffix of the textalong every branch of the suffix tree,until a “mismatch” is found.

It can be quadratic in the worst case.

An analysis based on the average shape of a suffix treeshows that its average complexity is bounded by m1+δ′

(δ′

just slightly greater that δ).

W. Szpankowski. A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors. SIAM J. Comput.22(6): 1176-1198 (1993)

P. Jacquet, B. McVey, W. Szpankowski. Compact Suffix Trees Resemble PATRICIA Tries: LimitingDistribution of Depth, Journal of the Iranian Statistical Society, 3, 139-148, 2004.

Page 17: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Faster

Efficient AlgorithmWe found an “McCreight-like” algorithm that is linear in the sizeof the output.

IntuitionsIt processes the suffixes backwards.

It is based on the concept of inverse suffix links. Show Details

It identifies the red nodes for suffix i by processing the rednodes for suffix i + 1. Show Details

Page 18: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Experimental Results

We have implemented the naive algorithm for theconstruction of BuST.

We have tested it with relations induced by hammingdistance, defined over DNA-macrocharacters.

With macrocharacters of size 4 (X ↔ Y ⇔ dH(X , Y ) ≤ 1)the algorithm can process texts of length 100K in fewseconds.

The number of red nodes grows tamely. Show Details

Page 19: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Measures of surprise: exact case

z-score

δ(α) =f (α)− E(α)

N(α)

f (α) is the observed frequency of α

E(α) is the expected frequency of α

N(α) is a normalization factor (e.g. the variance or itsfirst-order approximation).

MonotonicityIf f (α) = f (αβ) then δ(α) ≤ δ(αβ).

δ needs to be computed only for maximal strings at a fixedfrequency. These are exactly the strings ending at nodesof the Suffix Tree.

Page 20: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Computing the z-score

bcabbabc# Using a Suffix Tree, we cancompute and store thez-score for all “interesting”substrings of a given text inlinear time and space(given that we cancompute E and N in lineartime and space).

A. Apostolico, M.E. Block, S. Lonardi. Monotonyof surprise and the large-scale quest for unusualwords. Journal of Computational Biology, 7(3-4),2003.

Page 21: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Measures of Surprise in the Approximate World

bcabbabc ;a ↔ b ↔ c

Let’s consider asoccurrences of β in α allthe substrings β′ that arein relation with β.

Reasoning as in the exactcase, we can use a BuSTto compute the z-score forall interesting substrings ofα in time and spaceproportional to the BuST’ssize.

Page 22: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Measures of Surprise in the Approximate World

If we use an Hamming-like relation built on macrocharacters,we are counting all the occurrences of a string with distancebounded by a threshold proportional to the string’s length.

Pros and ConsPros :

the algorithm runs in time proportional to the number ofmaximal substrings (w.r.t. δ).BuST provides a compact way to store and retrieve thisinformation.

Cons :the macrocharacters introduce rigidity (we can countcompute the z-score only for strings of length multiple of themacrocharacter’s size).the distance must be distributed evenly amongmacrocharacters.

Page 23: BUNDLED SUFFIX TREES - units.it

Introduction Bundled Suffix Trees An application

Conclusions

We have introduced Bundled Suffix Trees, a new datastructure extending suffix trees.

Given a relation among characters encoding some sort ofapproximate information, a BuST reveals the innerstructure of the strings w.r.t. this relation (all thisinformation is internal w.r.t. the processed string).

BuST can be used for all the problems related to the innerstructure of the string, like computation of approximatedfrequency.

The structure is based on a very general concept ofnon-transitive relation among characters. The use ofHamming-like relation on tuples is just a possible example.

Its size is slightly more than linear on average, and there’sa fast (McCreight-like) algorithm to build it.

Page 24: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Quadratic BuST

Let’s consider the text

a . . . a︸ ︷︷ ︸m

c . . . c︸ ︷︷ ︸m

b . . . b︸ ︷︷ ︸2m

,

over {a, b, c, d}, with

a ↔ bl ld ↔ c

The number of nodessurrounded by the red boxis quadratic in m!

Return

Page 25: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Quadratic BuST

Let’s consider the text

a . . . a︸ ︷︷ ︸m

c . . . c︸ ︷︷ ︸m

b . . . b︸ ︷︷ ︸2m

,

over {a, b, c, d}, with

a ↔ bl ld ↔ c

The number of nodessurrounded by the red boxis quadratic in m!

Return

Page 26: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Quadratic BuST

Let’s consider the text

a . . . a︸ ︷︷ ︸m

c . . . c︸ ︷︷ ︸m

b . . . b︸ ︷︷ ︸2m

,

over {a, b, c, d}, with

a ↔ bl ld ↔ c

The number of nodessurrounded by the red boxis quadratic in m!

Return

Page 27: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

The exponent δ

Return

Page 28: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

The exponent δ

Return

Page 29: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Test

Number of macrocharacters of length 4 over DNA alphabet.Test strings are generated according to a uniform p.d.

Return

Page 30: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Inverse Suffix Links

A crucial role in the fastconstruction of suffix trees isplayed by suffix links.

Suffix links are pointers fromnodes with path label xα tonodes with path label α.

Whenever there is a node withpath label xα, there’s also a nodewith path label α.

Return

Page 31: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

Inverse Suffix Links

Inverse suffix links are pointersfrom nodes with path label α topositions in the tree labeled xα,for each x in the alphabet suchthat xα is a substring of S.

They can point in the middle ofan arc.

If a ISL takes from α to xα, it islabeled with x .

Return

Page 32: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

The Algorithm

Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.

Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.

From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).

With a skip and count trick, wecan identify the positions of rednodes for S[i].

Return

Page 33: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

The Algorithm

Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.

Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.

From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).

With a skip and count trick, wecan identify the positions of rednodes for S[i].

Return

Page 34: BUNDLED SUFFIX TREES - units.it

Dimension of BuST Efficient Algorithm

The Algorithm

Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.

Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.

From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).

With a skip and count trick, wecan identify the positions of rednodes for S[i].

Return