
Joint Advanced Student School 2004

Compressed Suffix Arrays

Compression of Suffix Arrays to linear size

Fabian Pache


Motivation

• Problem: Suffix Arrays are data structures that support fast searching of patterns in long texts, but they require large amounts of memory.

• Goal: Find a compression that reduces the size of the Suffix Array while still allowing fast access.


Outline

1. Trivial Compression

2. Grossi & Vitter
   – Outline
   – Algorithms and Analysis

3. Sadakane
   – Outline
   – Algorithms and Analysis (Sketch)


Conventions used

• Text T [1…n]
   – Binary text $\{a,b\}^n$, terminated with #

• Pattern P [1…m]
   – Binary text $\{a,b\}^m$

• Suffix Array SA [1…n]
   – Each entry points to a position T [ i ], the start of a suffix
   – Uses $n \log n$ bits

Trivial Compression

• Construct and store SA

T = baab#

Suffixes by start position:

1. baab#
2. aab#
3. ab#
4. b#
5. #

Suffixes sorted with a < # < b:

2. aab#
3. ab#
5. #
1. baab#
4. b#

SA = [ 2,3,5,1,4 ]
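
The construction on this slide can be sketched in a few lines of Python. This is only a naive illustration that sorts the suffixes directly; the function name `suffix_array` and the `order` parameter are assumptions made for this example and do not come from the slides.

```python
def suffix_array(text, order="a#b"):
    """Return the 1-based suffix array of `text` under the character order `order`.

    Naive construction: sort the start positions by the suffix they begin.
    """
    key = {ch: r for r, ch in enumerate(order)}
    starts = range(len(text))
    ranked = sorted(starts, key=lambda s: [key[c] for c in text[s:]])
    return [s + 1 for s in ranked]  # convert 0-based starts to 1-based


sa = suffix_array("baab#")
print(sa)  # [2, 3, 5, 1, 4]
```

Sorting whole suffixes is far from the fastest construction, but it makes the role of the ordering a < # < b explicit.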

Trivial Compression

• Recover T from SA

SA = [ 2,3,5,1,4 ]   T = _ _ _ _ #   (# is always the last literal)

SA = [ 2,3,5,1,4 ]   T = b _ _ b #   (# < b: suffixes listed after suffix 5 start with b)

SA = [ 2,3,5,1,4 ]   T = b a a b #   (a < #: suffixes listed before suffix 5 start with a)
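
The recovery step can be written down directly. The sketch below assumes the conventions of these slides (binary alphabet, terminator #, order a < # < b, 1-based SA); the name `recover_text` is hypothetical.

```python
def recover_text(sa):
    """Rebuild a text over {a, b} terminated by '#' from its 1-based suffix array.

    Assumes the order a < # < b: every suffix listed before the suffix "#"
    starts with 'a', every suffix listed after it starts with 'b'.
    """
    n = len(sa)
    text = [""] * n
    pivot = sa.index(n)  # 0-based SA index of the suffix "#"
    for idx, start in enumerate(sa):
        if idx < pivot:
            text[start - 1] = "a"
        elif idx > pivot:
            text[start - 1] = "b"
        else:
            text[start - 1] = "#"
    return "".join(text)


print(recover_text([2, 3, 5, 1, 4]))  # baab#
```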

Trivial Compression

• Therefore each Suffix Array can be compressed to $O(n)$ bits

• Drawback: decompression takes $\Omega(n)$ time

Grossi & Vitter

• Outline:
   – Recursive "Divide and Conquer"-type algorithm
   – Stores SA implicitly (for all but the last level)

• Supported operations
   – lookup( i )
   – compress

G&V – compress

• Structural Outline

SA_0:  $|SA_0| = O(n \log n)$
SA_1:  $|SA_1| = \tfrac{1}{2} |SA_0|$
SA_2:  $|SA_2| = \tfrac{1}{2} |SA_1|$
...
SA_l:  $|SA_l| = O(n)$, with $l = \log \log n$

G&V – compress

Given a Text T; create SA

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25

G&V – compress

Create array B [1...n] with n = |SA|
   – B [ i ] = 1, if SA [ i ] is even

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B        _  1  _  _  _  _  1  1  _  1  _  _  1  1  1  1

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B        1  1  _  _  1  _  1  _  _  _  1  1  _  1  1  _

G&V – compress

Create array B [1...n] with n = |SA|
   – B [ i ] = 1, if SA [ i ] is even
   – B [ i ] = 0, if SA [ i ] is odd

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B        0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B        1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0

G&V – compress

• Create an array rank [ 1...n ] where rank [ i ] contains the number of 1s in B[ 1...i ]

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B        0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank     0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B        1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank     9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16

G&V – compress

Define a mapping ψ [ 1..n ] so that
   – If B[ i ] = 0: ψ[ i ] = j such that SA [ j ] = SA [ i ] + 1

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B        0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank     0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
ψ        2  _ 14 15 18 23  _  _ 28  _ 30 31  _  _  _  _

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B        1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank     9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
ψ        _  _  7  8  _ 10  _ 13 16 17  _  _ 21  _  _ 27

G&V – compress

Define a mapping ψ [ 1..n ] so that
   – If B[ i ] = 0: ψ[ i ] = j such that SA [ j ] = SA[ i ] + 1
   – If B[ i ] = 1: ψ[ i ] = i

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA      15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B        0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank     0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
ψ        2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA      12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B        1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank     9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
ψ       17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

G&V – compress

Compressing from SA_k to SA_k+1

• Store only even values of SA_k in SA_k+1

• Divide each entry in SA_k+1 by 2

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA_k    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_k      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_k   0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
ψ_k      2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA_k    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_k      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_k   9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
ψ_k     17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_k+1   8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
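
One compression step (B, rank, ψ and the next level) can be sketched as follows. The function name `compress_level` and the plain Python lists are assumptions made for illustration; all arrays are stored explicitly here, which is exactly what the compressed structure avoids.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def compress_level(sa_k):
    """Derive B_k, rank_k, psi_k and SA_{k+1} from a 1-based suffix array SA_k.

    Assumes len(sa_k) is even, so every odd value has an even successor.
    """
    n = len(sa_k)
    position = {value: i for i, value in enumerate(sa_k, start=1)}  # inverse of SA_k
    b = [1 if value % 2 == 0 else 0 for value in sa_k]
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)  # number of 1s in B[1..i]
    # psi(i) = i for even entries, otherwise the index holding SA_k[i] + 1
    psi = [i if b[i - 1] else position[sa_k[i - 1] + 1] for i in range(1, n + 1)]
    sa_next = [value // 2 for value in sa_k if value % 2 == 0]
    return b, rank, psi, sa_next


b0, rank0, psi0, sa1 = compress_level(SA0)
print(sa1)  # [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
```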

G&V – lookup

Reconstruction of SA_k [ i ] using B_k, rank_k, ψ_k and SA_k+1

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(\psi_k(i)) \,] + (B_k[i] - 1)$

G&V – lookup

• Proof / Example part 1: B_k [ i ] = 1

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(\psi_k(i)) \,] + (B_k[i] - 1)$

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(\psi_k(i)) \,]$

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(i) \,]$

Examples from the table of the running text:

i = 8:   rank_k(8) = 3,    SA_k+1[3] = 5,   SA_k[8] = 2 · 5 = 10
i = 18:  rank_k(18) = 10,  SA_k+1[10] = 9,  SA_k[18] = 2 · 9 = 18

G&V – lookup

• Proof / Example part 2: B_k [ i ] = 0

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(\psi_k(i)) \,] + (B_k[i] - 1)$

$SA_k[i] = 2 \cdot SA_{k+1}[\, rank_k(\psi_k(i)) \,] - 1$

Examples from the table of the running text:

i = 5:   ψ_k(5) = 18,   rank_k(18) = 10,  SA_k+1[10] = 9,  SA_k[5] = 2 · 9 - 1 = 17
i = 22:  ψ_k(22) = 10,  rank_k(10) = 4,   SA_k+1[4] = 2,   SA_k[22] = 2 · 2 - 1 = 3

G&V – lookup

Stored information

• For each level k = 0...l-1,
   – explicitly store B_k
   – rank_k and ψ_k are stored implicitly
   – SA_k is reconstructible by recursion

• For level l, store SA_l explicitly
   – No further information needed

G&V – lookup

• Pseudocode for the lookup function

lookup( i ) = rlookup( i, 0 )

rlookup( i, k )
    if (k == l)
        return SA_l[ i ];
    else
        return 2 * rlookup( rank_k[ psi_k[ i ] ], k+1 ) + (B_k[ i ] - 1);
    end
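
The pseudocode translates almost literally into Python. The following sketch rebuilds the levels with a naive helper and then checks that the recursion reproduces SA_0; the names `compress_level`, `build`, `LEVELS` and `SA_LAST` are assumptions for this example, and rank and ψ are kept as explicit lists rather than in compressed form.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def compress_level(sa_k):
    """Return (B_k, rank_k, psi_k, SA_{k+1}) for a 1-based array of even length."""
    pos = {v: i for i, v in enumerate(sa_k, start=1)}
    b = [1 - v % 2 for v in sa_k]
    rank = []
    for bit in b:
        rank.append((rank[-1] if rank else 0) + bit)
    psi = [i if b[i - 1] else pos[sa_k[i - 1] + 1]
           for i in range(1, len(sa_k) + 1)]
    return b, rank, psi, [v // 2 for v in sa_k if v % 2 == 0]


def build(sa0, num_levels):
    """Keep (B, rank, psi) for levels 0..l-1 and the explicit array SA_l."""
    levels, sa = [], sa0
    for _ in range(num_levels):
        b, rank, psi, sa = compress_level(sa)
        levels.append((b, rank, psi))
    return levels, sa


def rlookup(i, k, levels, sa_last):
    if k == len(levels):          # last level: stored explicitly
        return sa_last[i - 1]
    b, rank, psi = levels[k]
    return 2 * rlookup(rank[psi[i - 1] - 1], k + 1, levels, sa_last) + (b[i - 1] - 1)


def lookup(i, levels, sa_last):
    return rlookup(i, 0, levels, sa_last)


LEVELS, SA_LAST = build(SA0, 3)   # 3 levels for n = 32
print([lookup(i, LEVELS, SA_LAST) for i in range(1, 9)])
# [15, 16, 31, 13, 17, 19, 28, 10]
```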

G&V - details

Space versus Time

                   Quick and Large      Small and Slow
Space (in bits)    O (n log log n)      O (n)
Time               O (log log n)        O (log^ε n), ε > 0

G&V – Quick and Large

Storing rank space-efficiently and quickly accessible:

• Explicit storage of rank takes n log n bits

• Jacobson's method uses o( n ) additional bits

Both allow for constant time access
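
To give an idea of how rank can be answered without storing it for every index, here is a deliberately simplified, single-level directory. It is not Jacobson's method itself (which uses two block levels plus a lookup table to reach constant time with o(n) extra bits); the class name `RankDirectory` and the block size are assumptions for this sketch.

```python
B = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
     1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]


class RankDirectory:
    """Cumulative 1-counts at block boundaries plus a scan inside one block."""

    def __init__(self, bits, block=4):
        self.bits = bits
        self.block = block
        self.prefix = [0]  # prefix[q] = number of 1s in the first q blocks
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def rank(self, i):
        """Number of 1s in bits[1..i] (1-based, inclusive)."""
        full, rest = divmod(i, self.block)
        begin = full * self.block
        return self.prefix[full] + sum(self.bits[begin:begin + rest])


rd = RankDirectory(B)
print(rd.rank(8), rd.rank(13), rd.rank(32))  # 3 5 16
```

Only one counter per block is stored; the remaining work is a scan over at most one block.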

G&V – Quick and Large

Storing ψ_k efficiently (outline):

• Create $2^{2^k}$ arrays, one for each possible string over $\{a,b\}^{2^k}$, using the string as label

aa:
ab:
ba:
bb:

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

Example using k = 1 (T as in the running example)

G&V – Quick and Large

• For each B_k [ j ] = 1 find the $2^k$ literals preceding the suffix referenced by SA_k [ j ] in T

• Store j in the array whose label equals these literals

aa:
ab:
ba: 1
bb:

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

Example using k = 1 (T as in the running example)

G&V – Quick and Large

In other words:

For each i with B_k [ i ] = 0 and t the first $2^k$ literals of the suffix referenced by SA_k [ i ], insert ψ_k [ i ] in array t

aa:
ab: 9
ba: 1
bb:

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

Example using k = 1 (T as in the running example)

G&V – Quick and Large

For each j with B_k [ j ] = 1

$t = T[\, 2^k \cdot SA_k[j] - 2^k, \; \dots, \; 2^k \cdot SA_k[j] - 1 \,]$

add j to the array with label t

aa:
ab: 9
ba: 1, 6, 12, 14
bb: 2, 4, 5

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

Example using k = 1 (T as in the running example)

G&V – Quick and Large

To calculate ψ_k ( i ), use i - rank_k ( i ) as index to the concatenated arrays

aa:
ab: 9
ba: 1, 6, 12, 14
bb: 2, 4, 5

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

Example using k = 1: ψ_1(8) = 6, because 8 - rank_1(8) = 8 - 5 = 3 and the third entry of the concatenation 9, 1, 6, 12, 14, 2, 4, 5 is 6
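
The labelled arrays and the index computation can be reproduced with a short sketch. It treats the 16-entry array as level 1, so that the labels have 2^1 = 2 literals; the function names `build_lists` and `psi` and the constant `K` are assumptions for this example, and rank is recomputed naively instead of being read from a rank structure.

```python
from itertools import product

T = "ABBABBABBABBABAAABABABBABBBABBA#"
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
K = 1  # each entry of SA1 stands for 2**K literals of T


def build_lists(text, sa_k, k):
    """For every j with an even SA_k[j], file j under the 2**k literals preceding it."""
    width = 2 ** k
    lists = {"".join(p): [] for p in product("AB", repeat=width)}
    for j, value in enumerate(sa_k, start=1):
        if value % 2 == 0:                   # B_k[j] = 1
            end = width * value - 1          # 1-based index of the last preceding literal
            label = text[end - width:end]    # T[width*value - width .. width*value - 1]
            lists[label].append(j)
    return lists


def psi(i, sa_k, lists):
    """psi_k(i): identity on even entries, otherwise a read from the concatenated lists."""
    b = [1 - v % 2 for v in sa_k]
    if b[i - 1]:
        return i
    rank_i = sum(b[:i])
    concatenated = [j for label in sorted(lists) for j in lists[label]]
    return concatenated[i - rank_i - 1]


LISTS = build_lists(T, SA1, K)
print(LISTS)            # {'AA': [], 'AB': [9], 'BA': [1, 6, 12, 14], 'BB': [2, 4, 5]}
print(psi(8, SA1, LISTS))  # 6
```

The read works because the odd entries appear in SA order, that is, sorted first by their leading 2^k literals (the label) and then by the rank of the remaining suffix (the stored index).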

G&V – Quick and Large

l = log log n levels of compression

• Space occupation: l levels, each occupying O ( n ) bits
   => O ( n log log n ) bits

• Time requirement for lookup ( i ): l levels, each requiring O ( 1 )
   => O ( log log n ) steps

G&V – Small and Slow

Reduction of size by allowing for higher time usage

                   Quick and Large      Small and Slow
Space (in bits)    O (n log log n)      O (n)
Time               O (log log n)        O (log^ε n), ε > 0

G&V – Small and Slow

Instead of storing all l = log log n levels, store only levels

$0, \qquad l' = \tfrac{1}{2} \log \log n, \qquad l = \log \log n$

Example n = 32: store levels 0, 2, 3

G&V – Small and Slow

Example using | T | = 32

[Figure: the levels SA_0, SA_1, SA_2, SA_3]

G&V – Small and Slow

Keep only 3 levels

[Figure: the levels SA_0, SA_1, SA_3]

G&V – Small and Slow

On levels 0 and l´, mark entries that are still present in the next level

[Figure: the levels SA_0, SA_1, SA_3 with marked entries]

G&V – Small and Slow

Before the modification:

• B_k[ i ] = 1 <=> SA_k[ i ] is stored in SA_k+1

• ψ_k used for each B_k[ i ] = 0 to find SA_k[ ψ_k[ i ] ] = SA_k[ i ] + 1

Modifications added:

• B_0´[ i ] = 1 <=> SA_0[ i ] is stored in SA_l´

• B_l´´[ i ] = 1 <=> SA_l´[ i ] is stored in SA_l

• ψ_k´ used for each B_k[ i ] = 1 to find SA_k[ ψ_k´[ i ] ] = SA_k[ i ] + 1

G&V – Small and Slow

Construction of ψ´ and B´ markings of indices

i        1  2  3  4
SA_3     2  3  4  1

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1´     1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
ψ_1      _  _  9  _  _  _  1  6  _ 12 14  _  2  _  4  5
ψ_1´    10  8  _ 11 13 15  _  _  7  _  _ 16  _  3  _  _

G&V – Small and Slow

ψ´ and ψ in combination can be used to traverse the entire SA (ascending)

i        1  2  3  4
SA_3     2  3  4  1

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1´     1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
ψ_1      _  _  9  _  _  _  1  6  _ 12 14  _  2  _  4  5
ψ_1´    10  8  _ 11 13 15  _  _  7  _  _ 16  _  3  _  _
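
The traversal idea can be sketched for the two arrays shown here: follow the combined successor mapping until a marked entry is reached, read its value from SA_3, and subtract the number of steps taken. The names `build`, `lookup`, `MARKED` and `SUCC` are assumptions for this example; the two mappings ψ and ψ´ are merged into one successor list and everything is stored explicitly.

```python
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
SA3 = [2, 3, 4, 1]


def build(sa1):
    """Return the marking B_1' and the combined successor mapping (psi and psi')."""
    n = len(sa1)
    pos = {v: i for i, v in enumerate(sa1, start=1)}
    marked = [1 if v % 4 == 0 else 0 for v in sa1]  # value survives two halvings
    succ = [pos[v % n + 1] for v in sa1]            # index of SA1[i] + 1 (n wraps to 1)
    return marked, succ


def lookup(i, marked, succ, sa3):
    """Recover SA1[i] from SA3 by walking to the next marked entry."""
    steps = 0
    while not marked[i - 1]:
        i = succ[i - 1]
        steps += 1
    rank = sum(marked[:i])            # position of the marked entry inside SA3
    return 4 * sa3[rank - 1] - steps  # undo two halvings, then the walk


MARKED, SUCC = build(SA1)
print([lookup(i, MARKED, SUCC, SA3) for i in range(1, 17)] == SA1)  # True
```

In this example no walk needs more than three steps, because every fourth value of SA_1 is marked.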

G&V – Small and Slow

Length of the traversal determines required time for lookup

• Level 0 contains n entries

• Level l´ was divided ½ log log n times

$|SA_{l'}| = \frac{n}{2^{\frac{1}{2} \log \log n}} = \frac{n}{\left( 2^{\log \log n} \right)^{1/2}} = \frac{n}{\sqrt{\log n}}$

G&V – Small and Slow

Length of the traversal determines required time for lookup

• 0s in B´ are evenly spaced

• Longest sequence of 0s:

$\frac{|SA_0|}{|SA_{l'}|} = \frac{n}{n / \sqrt{\log n}} = \sqrt{\log n}$

G&V – Small and Slow

Generalized for more than 2 additional levels (the number must be constant!):

Let L be the number of levels, $\varepsilon = \frac{1}{L-1}$

The longest sequence of 0s has length $\log^{\varepsilon} n$

G&V – Small and Slow

Reconstruction of levels < l requires

• a vector describing which entries of level k´ can be found in k´+1
   => O ( n ) bits

• a function ψ´ that combined with ψ allows for complete traversal of SA
   => O ( n ) bits

Sadakane

Improvements on the data structure and algorithms proposed by Grossi & Vitter

• More operations
   – inverse( j ): return i so that SA[ i ] = j
   – search( P ): return l, r where P matches T
   – decompress( s, e ): return T[s...e]

• Allow for alphabets |Σ| > 2

Sadakane – inverse( j )

Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes

Assuming: j = SA[ i ]

Create $SA^{-1}$ so that: i = $SA^{-1}$[ j ]

Sadakane – inverse( j )

Proposition:

inverse( j ) can be computed in O( log n ) time with explicit storage of $SA^{-1}$ at the last level and a recursion for all levels above.
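
One way such a recursion could look is sketched below. It is an assumption made for illustration, not necessarily the exact scheme of the original work: even values are resolved on the next level and mapped back with a select on B_k, odd values are resolved through their even successor and the inverse of ψ_k. The names `build_inverse_levels`, `inverse`, `select1` and `psi_inv` are hypothetical, and both helper structures are stored explicitly.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def build_inverse_levels(sa0, num_levels):
    """Per level: positions of the 1s in B_k and the inverse of psi_k on odd entries."""
    levels, sa = [], sa0
    for _ in range(num_levels):
        pos = {v: i for i, v in enumerate(sa, start=1)}
        select1 = [i for i, v in enumerate(sa, start=1) if v % 2 == 0]
        psi_inv = {pos[v + 1]: i for i, v in enumerate(sa, start=1) if v % 2 == 1}
        levels.append((select1, psi_inv))
        sa = [v // 2 for v in sa if v % 2 == 0]
    inv_last = {v: i for i, v in enumerate(sa, start=1)}  # explicit inverse of SA_l
    return levels, inv_last


def inverse(j, levels, inv_last, k=0):
    """Return i with SA_k[i] = j (1-based)."""
    if k == len(levels):
        return inv_last[j]
    select1, psi_inv = levels[k]
    if j % 2 == 0:
        # j/2 lives on the next level; its index there is a rank among the 1s of B_k
        return select1[inverse(j // 2, levels, inv_last, k + 1) - 1]
    # j is odd: locate its even successor j + 1, then step back through psi_k
    return psi_inv[inverse(j + 1, levels, inv_last, k)]


INV_LEVELS, INV_LAST = build_inverse_levels(SA0, 3)
print(inverse(15, INV_LEVELS, INV_LAST))  # 1
```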

Sadakane – search( P )

Goal: Find the interval [ l...r ] in SA so that P matches each of the suffixes pointed to by SA[ l ], ..., SA[ r ]. Do so without using T.

Sadakane – search( P )

Proposition:

By augmenting the data structure by a function $C^{-1}$ (the "inverse of the array of cumulative frequencies") it is possible to obtain the substring in O( |P| ) time
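
A minimal sketch of searching without the text: the first literal of the suffix at SA index i follows from the bucket that i falls into (the role of C^{-1}), and a successor mapping leads to the next literal. Here ψ is defined on every index (ψ(i) is the index of the suffix starting one position later), which differs from the ψ_k of the earlier slides; the names `build_index`, `prefix_keys`, `search` and `ORDER` are assumptions, the index is built naively from T, and plain binary search is used for the interval.

```python
from bisect import bisect_right

ORDER = "A#B"  # character order a < # < b
T = "ABBABBABBABBABAAABABABBABBBABBA#"


def build_index(text):
    """Build psi and the bucket starts; afterwards the text itself is not used."""
    key = {c: r for r, c in enumerate(ORDER)}
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda s: [key[c] for c in text[s - 1:]])
    inv = {s: i for i, s in enumerate(sa, start=1)}
    psi = [inv[s % n + 1] for s in sa]          # index of the suffix one position later
    bucket_start = [1]                          # first SA index of each leading literal
    for c in ORDER[:-1]:
        bucket_start.append(bucket_start[-1] + text.count(c))
    return psi, bucket_start


def prefix_keys(i, m, psi, bucket_start):
    """Ranks of the first m literals of the suffix at SA index i (stops at '#')."""
    keys = []
    for _ in range(m):
        r = bisect_right(bucket_start, i) - 1   # bucket lookup, the role of C^{-1}
        keys.append(r)
        if ORDER[r] == "#":
            break
        i = psi[i - 1]
    return keys


def search(pattern, psi, bucket_start):
    """Return (l, r) such that exactly the suffixes SA[l..r] start with pattern."""
    target = [ORDER.index(c) for c in pattern]
    n, m = len(psi), len(target)

    def first_index(strict):
        lo, hi = 1, n + 1
        while lo < hi:
            mid = (lo + hi) // 2
            keys = prefix_keys(mid, m, psi, bucket_start)
            if keys < target or (strict and keys == target):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True) - 1


PSI, BUCKETS = build_index(T)
print(search("ABBA", PSI, BUCKETS))  # (7, 12)
```

An empty interval (l > r) means that the pattern does not occur.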

Sadakane – decompress( I )

Goal: Using only SA and its functions, return the substring of T pointed to by I = [ s, e ].

Sadakane – decompress( I )

Proposition:

A substring of length l = e - s + 1 can be decompressed using only SA, $SA^{-1}$ and $C^{-1}$ in O( l + log n ) time, where n is the length of the original text.
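
The idea behind decompress can be sketched as one inverse lookup followed by l successor steps. As an assumption for this example, $SA^{-1}$ is a plain dictionary and ψ is an explicit list defined on every index, built naively from T; the names `build_index`, `decompress` and `ORDER` are hypothetical.

```python
from bisect import bisect_right

ORDER = "A#B"  # character order a < # < b
T = "ABBABBABBABBABAAABABABBABBBABBA#"


def build_index(text):
    """Return psi, the bucket starts and the inverse suffix array (all explicit)."""
    key = {c: r for r, c in enumerate(ORDER)}
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda s: [key[c] for c in text[s - 1:]])
    inv = {s: i for i, s in enumerate(sa, start=1)}
    psi = [inv[s % n + 1] for s in sa]
    bucket_start = [1]
    for c in ORDER[:-1]:
        bucket_start.append(bucket_start[-1] + text.count(c))
    return psi, bucket_start, inv


def decompress(s, e, psi, bucket_start, inv):
    """Return T[s..e] (1-based, inclusive) without reading the text."""
    i = inv[s]                                   # one inverse lookup
    out = []
    for _ in range(e - s + 1):                   # then one step per literal
        out.append(ORDER[bisect_right(bucket_start, i) - 1])
        i = psi[i - 1]
    return "".join(out)


PSI, BUCKETS, INV = build_index(T)
print(decompress(4, 9, PSI, BUCKETS, INV))  # ABBABB
```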

Sadakane – Complexity

Using inverse, search and decompress it is possible to store T implicitly. Therefore the O( n ) words for the text are no longer required.

The space complexity of the Sadakane-improved Suffix Array is only 37% of a Grossi & Vitter Suffix Array including the text.