applied databases - the university of edinburgh · sebastian maneth lecture 16 suffix array,...

79
Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10 th , 2016 Applied Databases

Upload: others

Post on 20-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

Sebastian Maneth

Lecture 16Suffix Array, Burrows-Wheeler Transform

University of Edinburgh - March 10th, 2016

Applied Databases

Page 2: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

2

Outline

1. Suffix Array

2. Burrows-Wheeler Transform

Page 3: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

3

Online String-Matching

→ how can we do Horspool on Unicode Texts?

Page 4: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

4

Suffix Tree 123456789T = abaababa$

→ search time still O(|P|), as for suffix trie!→ perfect data structure for our task!

→ add end marker “$”

→ one-to-one correspondence of leaves to suffixes

→ a tree with m+1 leaves has <= 2m+1 nodes!

LemmaSize of suffix tree for “T$” islinear in n=|T|, i.e., in O(n).

Page 5: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

5

Suffix Tree Construction

Good news:Suffix tree can be constructed in linear time!

Complex construction algorithms

→ Weiner 1973

→ McCreight 1976

→ Ukkonen 1995

→ Farach 1997

Linear time for any integer alphabet, drawn from a polynomial range

→ again a big breakthrough

Linear time only for constant-size alphabets!Otherwise, O(n log n)

Page 6: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

6

Suffix TreeMany applications→ E.g., infinite-window Lempel-Ziv like compression

a b a abaa aba baba ab b → a b a (1,4) (1,3) (9,4) (1,2) b

M. C. Escher (1948)

(position, length)

Page 7: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

Questions

→ is the size of this tree really in O(n)?→ in terms of #nodes/edges: OK

→ how about the sizes of labels ??

Page 8: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

Questions

→ is the size of this tree really in O(n)?→ in terms of #nodes/edges: OK

→ how about the sizes of labels ??each requires log(n) bits!?

Page 9: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

Yes, but:

→ log(n) is small, e.g., 64 bits

→ can be considered constant! each requires log(n) bits!?

Page 10: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

Lesson to learn:

→ log(n) in terms of a run-time factor, can be fatal

→ in terms of a space-factor, it is fine! how long will a text be? 2^64???

4 / 8 Bytes each is enough!

Page 11: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

“For any text with fewer characters than #atoms in the universe, the label size for the suffix tree is a constant of x bits.. “ x Bits each is enough!

Page 12: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

→ label size is not an issue

→ but, size of edge-pointers?

→ imagine each edge requires a 32-bit pointer!!

5*4 Bytes = 20 Bytes for edge pointers

Page 13: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

13

Actual Space of Suffix Trees

Space for edge-pointers is problematic:

→ actual space of suffix tree, ca. 20|T|

→ on commodity hardware, texts of more than 1GB are not doable

Page 14: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

→ how to avoid the huge space needed for edges?

Page 15: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

15

1. Suffix Array

Page 16: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

16

Suffix Array Construction

p

Page 17: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

17

1. Suffix Array

→ read leaves from left-to-right!

SA(T) = [ 12,11,8,5,2,10,9,7,4,6,3 ]

Page 18: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

18

1. Suffix Array

→ read leaves from left-to-right!

SA(T) = [ 12,11,8,5,2,10,9,7,4,6,3 ]

TheoremThe suffix array of T can be constructed in time O(|T|).

Page 19: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

19

Search

Page 20: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

20

Search

Page 21: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

21

Search

Page 22: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

22

Search

Page 23: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

23

Search

Page 24: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

24

Search

Page 25: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

25

Search

Page 26: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

26

Search

Page 27: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

27

Search

Page 28: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

28

Search

Page 29: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

29

Search

Page 30: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

30

Search

Page 31: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

31

Search

→ O(|P| + log|T|) in practise, using a simple trick

→ O(|P| + log|T|) guaranteed, using LCP-array

LCP(k,j) = longest common prefix of T[SA[k]...] and T[SA[j]...]

Page 32: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

32

1. Suffix Arrays

→ much more space efficient than Suffix Tree→ used in pratise (suffix tree more used in theory)

→ Suffix Array Construction, without Suffix Trees?

[ Linear Work Suffix Array Construction, J. Kärkkäinen, Sanders, Burkhardt, Journal of the ACM, 2006 ]

→ See also:

[ A taxonomy of suffix array construction algorithms, S. J. Puglisi, W. F. Smyth, A. Turpin, ACM Computing Surveys 39, 2007 ]

Page 33: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

33

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

generate all cyclic shifts of T

Page 34: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

34

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

Page 35: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

35

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the first column?

Page 36: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

36

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the first column?

→ sorted #occ of each letter:

– one time “$”– three times “a”– one time “b”– two times “n”

Page 37: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

37

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the first column?

→ sorted #occ of each letter:

– one time “$”– three times “a”– one time “b”– two times “n”

Can you retrieve the original text T,given only the first column?

Page 38: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

38

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the first column?

→ sorted #occ of each letter:

– one time “$”– three times “a”– one time “b”– two times “n”

Can you retrieve the original text T,given only the first column?

Of course not!!

Page 39: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

39

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the first column?

→ sorted #occ of each letter:

– one time “$”– three times “a”– one time “b”– two times “n”

Can you retrieve the original text T,given only the first column?

Note→ each column contains the same letters ($, 3*a, b, 2*n)

Page 40: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

40

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the second column?

→ “sorting with respect to considering one previous letter”

Page 41: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

41

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the second column?

→ “sorting with respect to considering one previous letter”

Can you retrieve the original T fromthe second column only?

Page 42: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

42

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the second column?

→ “sorting with respect to considering one previous letter”

Can you retrieve the original T fromthe second column only?

Can you retrieve the first column,given the second one?

Page 43: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

43

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the third column?

→ “sorting with respect to considering two previous letters”

Page 44: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

44

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the last column?

→ “sorting with respect to considering all previous letters”

Page 45: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

45

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the last column?

→ “sorting with respect to considering all previous letters”

Can you retrieve the original text T,given only the last column?

Page 46: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

46

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

What information is capturedby the last column?

→ “sorting with respect to considering all previous letters”

Can you retrieve the original text T,given only the last column?

YES, you can!!

Page 47: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

47

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

Burrows-Wheeler Transform Lof text T

Page 48: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

48

2. Burrows-Wheeler Transform

T = banana$

banana$$bananaa$bananna$banaana$bannana$baanana$b

$bananaa$bananana$bananana$bbanana$na$bananana$ba

sort them lexicographically

$ < a < b < c < . . .

Burrows-Wheeler Transform Lof text T

Given last column L, how can we reconstruct the original text T?

Page 49: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

49

2. Burrows-Wheeler Transform

Naively: annb$aa

Page 50: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

50

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

Page 51: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

51

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

$bananaa$bananana$bananana$bbanana$na$bananana$ba

previous letter!

Page 52: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

52

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

$banana a$a$bananana$bananana$bbanana$na$bananana$ba

must be a substring of T

previous letter!

Page 53: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

53

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

$banana a$a$banan naana$bananana$bbanana$na$bananana$ba

must be a substring of T

previous letter!

Page 54: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

54

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

$banana a$a$banan naana$ban naanana$bbanana$na$bananana$ba

must be a substring of T

previous letter!

Page 55: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

55

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

$banana a$a$banan naana$ban naanana$b babanana$ $bna$bana annana$ba an

All 2-letter substrings of T!

previous letter!

Page 56: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

56

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

a$nanaba$banan

sort

$ba$ananbanana

columns 1 and 2.

Page 57: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

57

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

first column!

a$nanaba$banan

sort

$ba$ananbanana

previous letter

a$bna$nanban$baanaana

Page 58: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

58

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

third column!

a$nanaba$banan

sort

$ba$ananbanana

previous letter

a$bna$nanban$baanaana

sort

$baa$banaanabanna$nan

Page 59: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

59

2. Burrows-Wheeler Transform

Naively: annb$aa

sort

$aaabnn

third column!

a$nanaba$banan

sort

$ba$ananbanana

previous letter

a$bna$nanban$baanaana

sort

$baa$banaanabanna$nan

Et cetera

Page 60: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

60

2. Burrows-Wheeler Transform

Naive method: very expensive! (many sortings)

rankb( L, k ) = #occ of b in L, up to position k (excluding k)

1 2 3 4 5 6 7L = a n n b $ a a

rankn( L, 4 ) = 2

Page 61: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

61

2. Burrows-Wheeler Transform

Naive method: very expensive! (many sortings)

rankb( L, k ) = #occ of b in L, up to position k (excluding k)

1 2 3 4 5 6 7L = a n n b $ a a

rankn( L, 4 ) = 2

rankn( L, 3 ) = 1

Page 62: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

62

2. Burrows-Wheeler Transform

Naive method: very expensive! (many sortings)

rankb( L, k ) = #occ of b in L, up to position k (excluding k)

1 2 3 4 5 6 7L = a n n b $ a a

rankn( L, 4 ) = 2

rankn( L, 3 ) = 1

ranka( L, 7 ) = 2

O(log |S|) time(after linear time preprocessing of L)

Page 63: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

63

2. Burrows-Wheeler Transform

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

1$2a a a5b6n n

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

second “a”, coming from the topwhere is the second “a” from top, in the first column?

Page 64: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

64

2. Burrows-Wheeler Transform1$2a a a5b6n n

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

second “a”, coming from the topwhere is the second “a” from top, in the first column?

LF(6) = C[L[6]] + ranka( L, 6 ) = C[“a”] + 1 = 2 + 1 = 3

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 65: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

65

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

previous letter!

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 66: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

66

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

→ LF(1) = 2→ L[2] = n [ “ba$” ]

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 67: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

67

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

→ LF(1) = 2→ L[2] = n [ “ba$” ]

→ LF(2) = 6→ L(6) = “a” [ “ana$” ]

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 68: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

68

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

→ LF(1) = 2, L[2] = n [ “ba$” ]→ LF(2) = 6, L(6) = “a” [ “ana$” ] → LF(6) = 3, L(3) = “n” [ “nana$” ]

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 69: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

69

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

→ LF(1) = 2, L[2] = n [ “ba$” ]→ LF(2) = 6, L(6) = “a” [ “ana$” ] → LF(6) = 3, L(3) = “n” [ “nana$” ]→ LF(3) = 7, L(7) = “a” [ “anana$” ]

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 70: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

70

Decoding

first line starting with the letter

$bananaa$bananana$bananana$bbanana$na$bananana$ba

→ start with $ (position 5)→ LF(5) = 1→ L[1] = a [current decoding: “a$”]

→ LF(1) = 2, L[2] = n [ “ba$” ]→ LF(2) = 6, L(6) = “a” [ “ana$” ] → LF(6) = 3, L(3) = “n” [ “nana$” ]→ LF(3) = 7, L(7) = “a” [ “anana$” ] → LF(7) = 4, L(4) = “”b” [ “banana$” ]

rankb( L, k ) = #occ of b in L, up to position k-1

1 2 3 4 5 6 7 $ a b mL = a n n b $ a a C 1 2 5 6

Last-to-Front MappingLF(k)=C[L[k]] + rankL[k]( L, k )

Page 71: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

71

2. Burrows-Wheeler Transform

$bananaa$bananana$bananana$bbanana$na$bananana$ba

What is special about the BTW?

→ has many repeating characters (WHY?)→ can be run-length compressed!

Imagine the word “the” appears many times in a text.

he...the...the...the...the...the...the...t

(“t”,18733)

→ main motivation→ used in “bzip2” compressor

Page 72: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

72

2. Burrows-Wheeler Transform

$bananaa$bananana$bananana$bbanana$na$bananana$ba

What is special about the BTW?

→ efficient backward search!

→ counting #occ’s of pattern P in O(|P| log |S|) time!

Page 73: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

73

Backward Search on BWT

Page 74: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

74

Backward Search on BWT

Page 75: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

75

Backward Search on BWT

Page 76: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

76

BWT Construction

→ use the suffix array SA(T)!

→ BWT[ k ] = T[ SA[ k ] – 1 ] (assuming T[0]=$)

Page 77: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

77

BWT Construction

→ use the suffix array SA(T)!

→ BWT[ k ] = T[ SA[ k ] – 1 ] (assuming T[0]=$)

e.g.

1234567SA[banana$] = [7,6,4,2,1,5,3]

T[6] T[5] T[3] T[1] T[0] T[4] T[2] a n n b $ a a

$bananaa$bananana$bananana$bbanana$na$bananana$ba

Page 78: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

78

BWT Construction

→ use the suffix array SA(T)!

→ BWT[ k ] = T[ SA[ k ] – 1 ] (assuming T[0]=$)

→ explain why this equation is correct!

$bananaa$bananana$bananana$bbanana$na$bananana$ba

Page 79: Applied Databases - The University of Edinburgh · Sebastian Maneth Lecture 16 Suffix Array, Burrows-Wheeler Transform University of Edinburgh - March 10th, 2016 Applied Databases

79

ENDLecture 16