suffix trees and suffix arrays


Post on 17-Jan-2016


1

Suffix Trees and Suffix Arrays

Modern Information Retrieval

by R. Baeza-Yates and B. Ribeiro-Neto

Addison-Wesley, 1999.

(Chapter 8)

2

Introduction

Word-based indexing

» Inverted indices are good for searching words

» Queries such as phrases are expensive to solve using inverted files

» For word-based applications, inverted files perform better

Suffix trees and suffix arrays

» complex queries

3

Text Suffixes

text. A text has many words. Words are made from letters.

text has many words. Words are made from letters.

many words. Words are made from letters.

words. Words are made from letters.

Words are made from letters.

made from letters.

letters.

This is a text. A text has many words. Words are made from letters.
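The suffixes listed above start at text positions 11, 19, 28, 33, 40, 50, and 60 (counting from 1). A minimal sketch of how such index points could be derived is given here. The stopword set is an assumption inferred from which words the slide leaves out, and `index_points` is an illustrative name, not something from the book.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."

# Assumption: words missing from the slide's suffix list are treated as stopwords.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}

def index_points(text):
    """Return the 1-based start positions of all non-stopword words."""
    return [m.start() + 1
            for m in re.finditer(r"[A-Za-z]+", text)
            if m.group().lower() not in STOPWORDS]

points = index_points(TEXT)
for p in points:
    # Each suffix runs from its index point to the end of the text.
    print(p, TEXT[p - 1:])
```

Each printed line pairs an index point with the suffix that starts there, in the same order as the suffixes on the slide.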

4

The Suffix Trie and Suffix Tree

This is a text. A text has many words. Words are made from letters.

[Figure: the suffix trie and the suffix tree built over the suffixes that start at text positions 11, 19, 28, 33, 40, 50, and 60]

5

PAT Trees and PAT Arrays

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapter 5)

6

PAT Trees and PAT Arrays

Problems of traditional IR models

» Documents and words are assumed.

» Keywords must be extracted from the text (indexing).

» Queries are restricted to keywords.

New indices for text

» A text is regarded as a long string.

» Each position corresponds to a semi-infinite string (sistring).

» No structures and no keywords

7

Semi-infinite Strings

Example

Text: Once upon a time, in a far away land …

sistring 1: Once upon a time …

sistring 2: nce upon a time …

sistring 8: on a time, in a …

sistring 11: a time, in a far …

sistring 22: a far away land …

Compare sistrings

22 < 11 < 2 < 8 < 1
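The ordering of these sistrings can be checked with a short sketch. It assumes a case-insensitive lexicographic comparison, which is what makes "Once …" sort after "on a …"; the slide does not state the collation, so treat that as an assumption. The helper name `sistring` is illustrative.

```python
TEXT = "Once upon a time, in a far away land"

def sistring(pos):
    """Sistring starting at 1-based position pos (cut off at the end of TEXT)."""
    return TEXT[pos - 1:]

positions = [1, 2, 8, 11, 22]

# Assumption: compare case-insensitively, so "Once" is compared as "once".
ordered = sorted(positions, key=lambda p: sistring(p).lower())
print(ordered)  # [22, 11, 2, 8, 1]
```

With a plain ASCII comparison the capital "O" would sort first, so the choice of collation matters for the result.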

8

PAT Tree

PAT tree

» A Patricia tree constructed over all the possible sistrings of a text

Patricia tree

» a binary digital tree where the individual bits of the keys are used to decide on the branching

– A zero bit will cause a branch to the left subtree

– A one bit will cause a branch to the right subtree

» each internal node indicates which bit of the query is used for branching, given as either

– an absolute bit position, or

– a count of the number of bits to skip

» each external node points to a sistring

– the integer displacement into the original text

Example

Text: 01100100010111 …

sistring 1: 01100100010111 …

sistring 2: 1100100010111 …

sistring 3: 100100010111 …

sistring 4: 00100010111 …

sistring 5: 0100010111 …

sistring 6: 100010111 …

sistring 7: 00010111 …

sistring 8: 0010111 ...

[Figure: the PAT tree as sistrings 1 through 8 are inserted one at a time]

External node: sistring (integer displacement)

Internal node: total displacement of the bit to be inspected (or skip counter) and pointers

Note: sistrings 3 and 6 need 4 bits to be told apart.

Search 00101

11

Indexing Points

The above example assumes every position in the text is indexed, i.e., n external nodes, one for each indexed position in the text

Word and phrase searches

» only the sistrings that are at the beginning of words are necessary

Trade-off between size of the index and search requirements

12

Prefix searching

Idea

» every subtree of the PAT tree has all the sistrings with a given prefix.

Search: proportional to the query length

» exhaust the prefix or go down to an external node.

[Figure: search for the prefix "10100" and its answer]
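Prefix searching can be sketched on the 14-bit example text used earlier in the deck. The code below is a simplified illustration, not the construction from the book: it builds the tree recursively from all eight sistrings at once rather than by insertion, stores absolute bit positions in internal nodes, and assumes the sistrings can be told apart within the bits shown. The names `build`, `leaves`, and `prefix_search` are made up for this sketch.

```python
TEXT = "01100100010111"  # the bits shown on the slide; the real text continues

def bit(pos, k):
    """k-th bit (1-based) of the sistring that starts at 1-based position pos."""
    return TEXT[pos + k - 2]

def build(positions, k=1):
    """External node: an int. Internal node: (bit_position, left, right)."""
    if len(positions) == 1:
        return positions[0]
    while True:
        zeros = [p for p in positions if bit(p, k) == "0"]
        ones = [p for p in positions if bit(p, k) == "1"]
        if zeros and ones:
            return (k, build(zeros, k + 1), build(ones, k + 1))
        k += 1  # every sistring agrees on this bit, so it is skipped

def leaves(node):
    """All sistring positions below a node, left to right."""
    if isinstance(node, int):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def prefix_search(node, prefix):
    # Descend while the node inspects a bit that lies inside the prefix.
    while not isinstance(node, int) and node[0] <= len(prefix):
        k, left, right = node
        node = left if prefix[k - 1] == "0" else right
    candidates = leaves(node)
    # Skipped bits were never compared, so check one candidate against the text.
    p = candidates[0]
    return candidates if TEXT[p - 1:p - 1 + len(prefix)] == prefix else []

tree = build(list(range(1, 9)))
print(prefix_search(tree, "00101"))  # [8]
print(prefix_search(tree, "100"))    # [6, 3]
```

The descent inspects at most one bit per query bit, which is where the "proportional to the query length" bound comes from; the final comparison against the text is needed because skipped bits are not examined on the way down.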

13

Proximity Searching

Find all places where s1 is at most a fixed number of characters (given by the user) away from s2.

Example: in 4 ation ==> insulation, international, information

Algorithm

1. Search for s1 and s2.

2. Select the smaller answer set from these two sets and sort it by position.

3. Traverse the unsorted answer set, searching for every position in the sorted set and checking whether the distance between the positions satisfies the proximity condition.

Sort + traverse time: m1 log m1 + m2 log m1 (assuming m1 < m2)
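The three steps can be sketched as follows, assuming the two answer sets are already available as lists of text positions. The function name `proximity` and the sample positions are hypothetical.

```python
from bisect import bisect_left

def proximity(small, large, max_dist):
    """Pairs (a, b), a from `small`, b from `large`, with |a - b| <= max_dist."""
    sorted_small = sorted(small)              # m1 log m1
    hits = []
    for q in large:                           # m2 binary searches, log m1 each
        i = bisect_left(sorted_small, q - max_dist)
        # Report every sorted position inside the window [q - d, q + d].
        while i < len(sorted_small) and sorted_small[i] <= q + max_dist:
            hits.append((sorted_small[i], q))
            i += 1
    return hits

# Hypothetical answer sets for s1 (smaller) and s2 (larger).
s1_positions = [40, 5, 90]
s2_positions = [100, 8, 43, 60]
print(proximity(s1_positions, s2_positions, 4))  # [(5, 8), (40, 43)]
```

Only the smaller set is sorted, so the sort costs m1 log m1, and each of the m2 lookups is a binary search over m1 entries.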

14

Range Searching

Search for all the strings within a certain lexicographical range.

» Ex: the range "abc" .. "acc":

– "abracadabra", "acacia" ○

– "abacus", "acrimonious" X

Algorithm

» Search each end of the defining interval.

» Collect all the sub-trees between (and including) them.
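A rough sketch of range searching over an already sorted list of strings, standing in for the left-to-right order of the tree's external nodes. It assumes that "including" the upper end means every string that has the upper bound as a prefix is kept; the sentinel character used for that is an implementation choice, not part of the slides.

```python
from bisect import bisect_left, bisect_right

def range_search(sorted_strings, low, high):
    """Strings s with low <= s, up to and including those prefixed by high."""
    start = bisect_left(sorted_strings, low)
    # Assumption: appending the largest code point keeps strings that start with `high`.
    end = bisect_right(sorted_strings, high + "\U0010FFFF")
    return sorted_strings[start:end]

words = sorted(["abracadabra", "acacia", "abacus", "acrimonious"])
print(range_search(words, "abc", "acc"))  # ['abracadabra', 'acacia']
```

Both ends are located with one binary search each, and the answer is the contiguous block between them.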

15

Longest Repetition Searching

The match between two different positions of a text where this match is the longest in the entire text, e.g.,

Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1

[Figure: the PAT tree over sistrings 1 through 8 of this text]

The tallest internal node gives a pair of sistrings that match for the greatest number of characters.
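As a sketch of the same idea without building the tree: when the sistrings are sorted, the pair with the longest common prefix is always adjacent, so one pass over neighbours finds the longest repetition. The code restricts itself to sistrings 1 through 8 and to the 14 bits shown, which is an assumption; the function names are illustrative.

```python
TEXT = "01100100010111"

def sistring(pos):
    """Sistring starting at 1-based position pos, cut off at the bits shown."""
    return TEXT[pos - 1:]

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_repetition(positions):
    """Return (match_length, pos_a, pos_b) for the best adjacent pair."""
    order = sorted(positions, key=sistring)
    best = (0, None, None)
    for p, q in zip(order, order[1:]):
        n = common_prefix_len(sistring(p), sistring(q))
        if n > best[0]:
            best = (n, p, q)
    return best

print(longest_repetition(range(1, 9)))  # (4, 4, 8)
```

For these eight sistrings the result is the pair 4 and 8, which share the four bits 0010.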

16

“Most Significant” or “Most Frequent” Matching

The most frequently occurring strings within the text database

» e.g., the most frequent trigram

Find the most frequent trigram

» find the largest subtree at a distance of 3 characters from the root
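A minimal sketch of the trigram case, using sorted suffix positions in place of the tree: suffixes that share their first three characters are contiguous in sorted order, so the largest such run plays the role of the largest subtree at depth 3. The sample text and the name `most_frequent_ngram` are made up for illustration.

```python
from itertools import groupby

def most_frequent_ngram(text, n=3):
    """Return (ngram, count) for the most frequent n-gram in text."""
    starts = range(len(text) - n + 1)
    # Suffix-array order: sort start positions by the suffix they begin.
    order = sorted(starts, key=lambda i: text[i:])
    best_gram, best_count = None, 0
    # Runs of suffixes with the same first n characters are contiguous.
    for gram, run in groupby(order, key=lambda i: text[i:i + n]):
        count = sum(1 for _ in run)
        if count > best_count:
            best_gram, best_count = gram, count
    return best_gram, best_count

print(most_frequent_ngram("banana bandana"))  # ('ana', 3)
```

On ties the sketch keeps the n-gram that comes first in lexicographic order.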

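[Figure: the same PAT tree over sistrings 1 through 8]

i.e., bits 1, 2, 3 are the same for sistrings 3 (100100010111) and 6 (100010111), which sit below the internal node that inspects bit 4; sistrings 4 (00100010111) and 8 (0010111), below the node that inspects bit 5, share the four bits 0010 and form the longest repetition among sistrings 1 through 8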

17

Building PAT Trees as Patricia Trees (1)

Bucketing of external nodes

» collect more than one external node

» a bucket replaces any subtree with size less than a certain constraint (b)

– saves a significant number of internal nodes

» the external nodes inside a bucket do not have any structure associated with them

– increases the number of comparisons for each search

18

Building PAT Trees as Patricia Trees (2)

Mapping the tree onto the disk using super-nodes

» Advantage: reduces the number of disk accesses and the space used

» Every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages

– The pointers in internal nodes address either a disk page or another node inside the same page, which reduces the storage cost of internal nodes

» Example

– Assume a disk page contains on the order of 1,000 internal/external nodes

– On average, each disk page then contains about 10 steps of a root-to-leaf path (log2 1,000 ≈ 10 for a roughly balanced subtree)

19

PAT Trees Represented as Arrays

External node bucket size, b

» If we keep the external nodes in the bucket in the same relative order as they would be in the tree, the bucket can be searched with an indirect binary search

» Indirect binary search vs. sequential search

[Figure: the PAT tree over sistrings 1 through 8, with its external nodes read left to right to form the PAT array]

PAT array: 7 4 8 5 1 6 3 2

Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
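The PAT array above is the list of sistring positions in lexicographic order of their sistrings. A minimal sketch that reproduces it by direct sorting is shown here; this is the simplest way to get the order, not necessarily how a large index would be built, and `pat_array` is an illustrative name.

```python
TEXT = "01100100010111"  # the bits shown on the slide

def pat_array(text, positions):
    """Sort 1-based positions by the sistring that starts at each one."""
    return sorted(positions, key=lambda p: text[p - 1:])

print(pat_array(TEXT, range(1, 9)))  # [7, 4, 8, 5, 1, 6, 3, 2]
```

The array stores only positions; the sistrings themselves stay in the text, which is what makes later searches "indirect".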

20

Searching PAT Trees as Arrays

Prefix searching and range searching

» do an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than.

Example

» Search for the prefix 100 and its answer

Most frequent, longest repetition

» Manber and Baeza-Yates (1991)

PAT array: 7 4 8 5 1 6 3 2

Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
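The prefix search for 100 can be sketched as two indirect binary searches over the PAT array: each probe follows the array entry into the text and compares only the first few bits. The array is copied from the slide; the helper names are illustrative.

```python
TEXT = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]  # PAT array from the slide (1-based positions)

def head(pos, n):
    """First n bits of the sistring at 1-based position pos."""
    return TEXT[pos - 1:pos - 1 + n]

def prefix_range(prefix):
    """All PAT array entries whose sistring starts with prefix."""
    n = len(prefix)
    lo, hi = 0, len(PAT)
    while lo < hi:                      # first entry with head >= prefix
        mid = (lo + hi) // 2
        if head(PAT[mid], n) < prefix:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(PAT)
    while lo < hi:                      # first entry with head > prefix
        mid = (lo + hi) // 2
        if head(PAT[mid], n) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return PAT[first:lo]

print(prefix_range("100"))  # [6, 3]
```

The answer is a contiguous slice of the array, so its size is known from the two boundaries without visiting the entries in between.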

21

Comparisons

Signature files

» Use hashing techniques to produce an index

» Advantage

– storage overhead is small (10%-20%)

» Disadvantages

– the search time on the index is linear

– some answers may not match the query, thus filtering must be done

22

Comparisons (Continued)

Inverted files

» storage overhead (30% ~ 100%)

» search time for word searches is logarithmic

PAT arrays

» potential use in other kinds of searches

– phrases

– regular expression searching

– approximate string searching

– longest repetitions

– most frequent searching