on-line linear-time construction of word suffix trees

29
On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & JST)

Upload: geoff

Post on 21-Jan-2016

40 views

Category:

Documents


0 download

DESCRIPTION

On-line Linear-time Construction of Word Suffix Trees. Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & JST). Pattern Searching Problem. Given : text T in S * and pattern P in S * - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: On-line Linear-time Construction of  Word Suffix Trees

On-line Linear-time Construction of

Word Suffix Trees

Shunsuke Inenaga(Japan Society for the Promotion of Science

& Kyushu University)

Masayuki Takeda(Kyushu University & JST)

Page 2: On-line Linear-time Construction of  Word Suffix Trees

Pattern Searching Problem

Given: text T in and pattern P in Find: an occurrence of P in T

: alphabet : set of strings

Using an indexing structure for T, we can solve the above problem in O(|P|) time.

Page 3: On-line Linear-time Construction of  Word Suffix Trees

Suffix Trie

A trie representing all suffixes of T

a

a

c

b

$

c

b

$

c

b

$

b

$

$T = aacb$

aacb$acb$cb$b$$

Page 4: On-line Linear-time Construction of  Word Suffix Trees

Unwanted Matching

pace runner

pattern

s

text

Page 5: On-line Linear-time Construction of  Word Suffix Trees

Unwanted Matching

pace runner

pattern

s

text

Page 6: On-line Linear-time Construction of  Word Suffix Trees

Unwanted Matching

pace runner

pattern

s

text

Page 7: On-line Linear-time Construction of  Word Suffix Trees

Introducing Word Separator #

# : word separator - special symbol not in D = # : dictionary of words

Text T : an element of D+

(T is a sequence T1T2…Tk of k words in D)

e.g., T = This#is#a#pen#

= {A,…,z} D = {...,This#,...,a#,...is#,...pen#,...}

Page 8: On-line Linear-time Construction of  Word Suffix Trees

Word-level Pattern Searching Problem

Given: text T in D+ and pattern P in D+

Find: an occurrence of P in T which immediately follows #

e.g.

The#space#runner#is#not#your#good#pace#runner#

Page 9: On-line Linear-time Construction of  Word Suffix Trees

Word-level Pattern Searching Problem

Given: text T in D+ and pattern P in D+

Find: an occurrence of P in T which immediately follows #

e.g.

The#space#runner#is#not#your#good#pace#runner#

Page 10: On-line Linear-time Construction of  Word Suffix Trees

Word Suffix Trie

A trie representing the suffixes of T which immediately follows # (and T itself).

T = aa#b#

aa#b#a#b##b#b##

a

a

#

b

#

b

#

Page 11: On-line Linear-time Construction of  Word Suffix Trees

Comparison

a

a

#

b

#

b

#

a

a

#

b

#

#

b

#

#

b

#

b

#

T = aa#b#

Suffix Trie Word Suffix Trie

Page 12: On-line Linear-time Construction of  Word Suffix Trees

Construction

Suffix Trie : Ukkonen’s on-line algorithm ( 1995 )

Word Suffix Trie : We modify Ukkonen’s algorithm by:

Using minimum DFA accepting dictionary D Redefining suffix links

Page 13: On-line Linear-time Construction of  Word Suffix Trees

Minimum DFA

The minimum DFA accepting D = # clearly requires constant space (for fixed ).

We replace the root node of the suffix trie with the final state of the DFA.

#

Page 14: On-line Linear-time Construction of  Word Suffix Trees

Suffix Links

T = aa#b#

#

a,b

a

a

#

b

b

#

#

Word Suffix Trie

Page 15: On-line Linear-time Construction of  Word Suffix Trees

Suffix Links [cont.]

a

a

#

b

#

#

b

#

#

b

#

b

#

T = aa#b# a,b

a

a

#

b

b

#

#

Suffix Trie Word Suffix Trie

#

Page 16: On-line Linear-time Construction of  Word Suffix Trees

On-line Construction

#

a

T = aa#b# a,b

a

Suffix Trie Word Suffix Trie

#

Page 17: On-line Linear-time Construction of  Word Suffix Trees

On-line Construction

a

a

T = aa#b# a,b

a

a

Suffix Trie Word Suffix Trie

##

Page 18: On-line Linear-time Construction of  Word Suffix Trees

# #

On-line Construction

a

a

##

#

T = aa#b# a,b

a

a

#

Suffix Trie Word Suffix Trie

Page 19: On-line Linear-time Construction of  Word Suffix Trees

# #

On-line Construction

a

a

##

#

T = aa#b#

bb

b

b

a,b

a

a

#

b

b

Suffix Trie Word Suffix Trie

Page 20: On-line Linear-time Construction of  Word Suffix Trees

# #

On-line Construction

a

a

##

#

T = aa#b#

bb

b

b

# #

#

#

a,b

a

a

#

b

b

#

#

Suffix Trie Word Suffix Trie

Page 21: On-line Linear-time Construction of  Word Suffix Trees

Pseudo CodeJust change here!!

a

a

#

b

#

b

#

a

a

#

b

#

#

b

#

#

b

#

b

#

#

a,b#

#

Page 22: On-line Linear-time Construction of  Word Suffix Trees

Like Dress-up Doll

Page 23: On-line Linear-time Construction of  Word Suffix Trees

Like Dress-up Doll [cont.]Hair is different

Body is the same!!

Different looking!!!!

a

a

#

b

#

b

#

a

a

#

b

#

#

b

#

#

b

#

b

#

#

a,b#

#

Page 24: On-line Linear-time Construction of  Word Suffix Trees

Drawback of Word Suffix Trie

Word suffix tries require O(k|T|) space.

Andersson et al. introduced word suffix trees which can be implemented in O(k) space.

Page 25: On-line Linear-time Construction of  Word Suffix Trees

Construction of Word Suffix Trees

Algorithm by Andersson et al. ( 1996 )

for text T = T1T2…Tk, constructs word suffix trees in O(|T|) expected time with O(k) space.

Our algorithm

simulates the on-line word suffix trie algorithm on word suffix trees.

runs in O(|T|) time in the worst cases, with O(k) space.

Page 26: On-line Linear-time Construction of  Word Suffix Trees

Normal and Word Suffix Trees

aa

#b

#

b#

a

a#

b#

#b#

#b#

b#

#

T = aa#b#

Suffix Tree Word Suffix Tree

Page 27: On-line Linear-time Construction of  Word Suffix Trees

Construction Algorithm

Just change here!!

Page 28: On-line Linear-time Construction of  Word Suffix Trees

Conclusions

We first proposed an on-line word suffix trie construction algorithm.

The keys to the algorithm are the minimal DFA accepting D and the re-defined suffix links.

Further, we introduced an on-line algorithm to build word suffix trees that works with O(k) space and in O(|T|) time in the worst cases.

Page 29: On-line Linear-time Construction of  Word Suffix Trees

Further Work

“Sparse Directed Acyclic Word Graphs”by Shunsuke Inenaga and Masayuki TakedaAccepted to SPIRE’06

“Sparse Compact Directed Acyclic Word Graphs”by Shunsuke Inenaga and Masayuki TakedaAccepted to PSC’06