Joint Advanced Student School 2004
Compressed Suffix Arrays
Compression of Suffix Arrays to linear size
Fabian Pache
Motivation
• Problem: Suffix arrays are data structures that support fast pattern searching in long texts, but they require large amounts of memory.
• Goal: Find a compression that reduces the size of the suffix array while still allowing fast access.
Outline
1. Trivial Compression
2. Grossi & Vitter
   – Outline
   – Algorithms and Analysis
3. Sadakane
   – Outline
   – Algorithms and Analysis (Sketch)
Conventions used
• Text T [1…n]
  – Binary text in $\{a,b\}^n$, terminated with #
• Pattern P [1…m]
  – Binary text in $\{a,b\}^m$
• Suffix Array SA [1…n]
  – Each entry SA[ i ] points to a position in T, the start of the i-th smallest suffix
  – Uses n log n bits
Trivial Compression
• Construct and store SA (character order: a < # < b)
T = baab#
Suffixes by starting position:
1. baab#
2. aab#
3. ab#
4. b#
5. #
Suffixes in sorted order:
2. aab#
3. ab#
5. #
1. baab#
4. b#
SA = [ 2,3,5,1,4 ]
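The construction on this slide can be reproduced with a naive sort. This is only a sketch for small inputs; the names ORDER and suffix_array are illustrative and not taken from the slides, and the slide convention a < # < b is assumed.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # assumed slide convention: a < # < b


def suffix_array(text):
    """Return the 1-based suffix array of text by naive sorting."""
    def key(pos):
        # Translate the suffix starting at pos into comparable ranks.
        return [ORDER[c] for c in text[pos - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


print(suffix_array("baab#"))  # [2, 3, 5, 1, 4]
```

Sorting whole suffixes takes quadratic time in the worst case, which is acceptable for the slide example but not for long texts.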
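Trivial Compression
• Recover T from SA (character order: a < # < b)
1. SA = [ 2,3,5,1,4 ]   T = _ _ _ _ #   (entry n = 5 is the suffix #)
2. SA = [ 2,3,5,1,4 ]   T = b _ _ b #   (entries after 5 in SA start with b, since # < b)
3. SA = [ 2,3,5,1,4 ]   T = b a a b #   (entries before 5 in SA start with a, since a < #)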
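The three recovery steps can be sketched as follows, assuming a binary text over {a,b} with a unique terminator # and the order a < # < b. The function name recover_text is hypothetical.

```python
def recover_text(sa):
    """Rebuild a binary text over {a, b} from its 1-based suffix array."""
    n = len(sa)
    split = sa.index(n)            # list index of the suffix "#"
    text = [""] * n
    text[n - 1] = "#"
    for pos in sa[:split]:         # sorted before "#", so they start with a
        text[pos - 1] = "a"
    for pos in sa[split + 1:]:     # sorted after "#", so they start with b
        text[pos - 1] = "b"
    return "".join(text)


print(recover_text([2, 3, 5, 1, 4]))  # baab#
```

This only works because the alphabet has two letters and # sorts between them; a larger alphabet would need the character boundaries stored as well.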
Trivial Compression
• Therefore each suffix array can be compressed to O( n ) bits (the text itself)
• Drawback: decompression takes Ω( n ) time, because the whole suffix array has to be rebuilt from T
Grossi & Vitter
• Outline:
  – Recursive "divide and conquer"-type algorithm
  – Stores SA implicitly (for all but the last level)
• Supported operations
  – lookup( i )
  – compress
G&V – compress
• Structural Outline
SA_0: $|SA_0| = O(n \log n)$
SA_1: $|SA_1| = \frac{1}{2} |SA_0|$
SA_2: $|SA_2| = \frac{1}{2} |SA_1|$
...
SA_l: $|SA_l| = O(n)$, with $l = \log \log n$
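A short derivation of why the last level fits into O( n ) bits, assuming for simplicity that log log n is an integer and that every entry is stored with at most log n bits:

\[
|SA_k| = \frac{n}{2^k}\ \text{entries}
\quad\Rightarrow\quad
|SA_l| = \frac{n}{2^{\log\log n}} = \frac{n}{\log n}\ \text{entries}
\quad\Rightarrow\quad
\frac{n}{\log n} \cdot \log n = n\ \text{bits}.
\]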
G&V – compress
Given a text T, create SA

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T      A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T      A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
G&V – compress
Create array B [1...n] with n = |SA|
– B [ i ] = 1, if SA [ i ] is even
– B [ i ] = 0, if SA [ i ] is odd

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
G&V – compress
• Create an array rank [ 1...n ] where rank [ i ] contains the number of 1s in B [ 1...i ]

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank   0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank   9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
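As a rough illustration, B and rank for the example can be computed from SA in two lines. The list SA is copied from the slide table; rank is stored explicitly here, which is exactly the n log n bit variant and not a space-efficient one.

```python
from itertools import accumulate

# Suffix array of the 32-character example text, copied from the slides.
SA = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
      12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]

B = [1 if v % 2 == 0 else 0 for v in SA]   # 1 marks an even entry
rank = list(accumulate(B))                 # rank[i-1] = number of 1s in B[1..i]

print(B[:8])     # [0, 1, 0, 0, 0, 0, 1, 1]
print(rank[:8])  # [0, 1, 1, 1, 1, 1, 2, 3]
```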
G&V – compress
Define a mapping Ψ [ 1..n ] so that
– If B [ i ] = 0: Ψ [ i ] = j such that SA [ j ] = SA [ i ] + 1
– If B [ i ] = 1: Ψ [ i ] = i

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Ψ      2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Ψ     17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27
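The definition of Ψ translates directly into a few lines. This sketch assumes an even number of entries, so that every odd entry has a successor; the helper name psi_array is illustrative.

```python
def psi_array(sa):
    """Return Psi as a list of 1-based indices for a 1-based suffix array."""
    index_of = {v: j for j, v in enumerate(sa, start=1)}  # SA value -> index
    # Even entries map to themselves, odd entries to the index of SA[i] + 1.
    return [i if v % 2 == 0 else index_of[v + 1]
            for i, v in enumerate(sa, start=1)]


SA = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
      12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
PSI = psi_array(SA)
print(PSI[:8])  # [2, 2, 14, 15, 18, 23, 7, 8]
```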
G&V – compress
Compressing from SA_k to SA_{k+1}
• Store only even values of SA_k in SA_{k+1}
• Divide each entry in SA_{k+1} by 2

i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_k  15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30

i     17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SA_k  12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25

i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_{k+1}  8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
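One compression step, sketched under the assumption that the level is given as a plain list. The function name compress_level is illustrative.

```python
def compress_level(sa_k):
    """Keep the even entries of SA_k in their order and halve them."""
    return [v // 2 for v in sa_k if v % 2 == 0]


SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
SA1 = compress_level(SA0)
SA2 = compress_level(SA1)
SA3 = compress_level(SA2)
print(SA1)  # [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
print(SA3)  # [2, 3, 4, 1]
```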
G&V – lookup
Reconstruction of SA_k [ i ] using B_k, rank_k, Ψ_k and SA_{k+1}
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
G&V – lookup
• Proof / Example part 1: B_k [ i ] = 1
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))]$   (because B_k [ i ] – 1 = 0)
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(i)]$   (because Ψ_k ( i ) = i)
Example i = 8: rank_k ( 8 ) = 3 and SA_{k+1} [ 3 ] = 5, so SA_k [ 8 ] = 2 · 5 = 10
Example i = 18: rank_k ( 18 ) = 10 and SA_{k+1} [ 10 ] = 9, so SA_k [ 18 ] = 2 · 9 = 18
SA_{k+1} = [ 8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11 ]
G&V – lookup
• Proof / Example part 2: B_k [ i ] = 0
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] - 1$
Example i = 5: Ψ_k ( 5 ) = 18, rank_k ( 18 ) = 10 and SA_{k+1} [ 10 ] = 9, so SA_k [ 5 ] = 2 · 9 – 1 = 17
Example i = 22: Ψ_k ( 22 ) = 10, rank_k ( 10 ) = 4 and SA_{k+1} [ 4 ] = 2, so SA_k [ 22 ] = 2 · 2 – 1 = 3
SA_{k+1} = [ 8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11 ]
G&V – lookup
Stored information
• For each level k = 0...l-1:
  – B_k is stored explicitly
  – rank_k and Ψ_k are stored implicitly
  – SA_k is reconstructible by recursion
• For level l, SA_l is stored explicitly
  – No further information needed
G&V – lookup
• Pseudocode for the lookup function
lookup( i ) = rlookup( i, 0 )
rlookup( i, k )
  if (k == l)
    return SA_l[ i ];
  else
    return 2 * rlookup( rank_k[ Ψ_k[ i ] ], k+1 ) + (B_k[ i ] - 1);
end
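A runnable sketch of the recursion, assuming the number of entries is a power of two. For readability it stores rank_k and Ψ_k as plain lists, so it shows the control flow of lookup but none of the space savings; build_levels and the tuple layout are illustrative choices, not part of the slides.

```python
from itertools import accumulate


def build_levels(sa, depth):
    """Return ([(B_k, rank_k, Psi_k) for k < depth], SA_depth)."""
    levels = []
    for _ in range(depth):
        index_of = {v: j for j, v in enumerate(sa, start=1)}
        bits = [1 - v % 2 for v in sa]                 # 1 for even entries
        rank = list(accumulate(bits))
        psi = [i if b else index_of[v + 1]
               for i, (v, b) in enumerate(zip(sa, bits), start=1)]
        levels.append((bits, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]        # next level
    return levels, sa


def rlookup(i, k, levels, last):
    if k == len(levels):
        return last[i - 1]                             # explicit last level
    bits, rank, psi = levels[k]
    j = rank[psi[i - 1] - 1]                           # index in SA_{k+1}
    return 2 * rlookup(j, k + 1, levels, last) + (bits[i - 1] - 1)


def lookup(i, levels, last):
    return rlookup(i, 0, levels, last)


SA = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
      12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
LEVELS, LAST = build_levels(SA, 3)
print(lookup(8, LEVELS, LAST))  # 10
```

With depth 3 the last level holds the four entries of SA_3, and every lookup performs exactly three recursive steps.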
G&V – details
Space versus Time
                   Quick and Large     Small and Slow
Space (in bits)    O( n log log n )    O( n )
Time               O( log log n )      O( log^ε n ), ε > 0
G&V – Quick and Large
Storing rank space-efficiently and quickly accessible:
• Explicit storage of rank takes n log n bits
• Jacobson's method uses o( n ) bits in addition to B
• Both allow for constant time access
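The idea behind a succinct rank can be hinted at with a single level of sampling: store the rank only at block boundaries and count the remaining bits on demand. This is a simplification and not Jacobson's actual two-level scheme with lookup tables; the class name SampledRank and the block size are assumptions made for the sketch.

```python
class SampledRank:
    """Rank over a bit list with one stored count per block (simplified)."""

    def __init__(self, bits, block=8):
        self.bits = bits
        self.block = block
        self.samples = [0]                 # samples[q] = 1s in bits[1..q*block]
        for start in range(0, len(bits), block):
            ones = sum(bits[start:start + block])
            self.samples.append(self.samples[-1] + ones)

    def rank(self, i):
        """Number of 1s in bits[1..i], with i 1-based and inclusive."""
        q, r = divmod(i, self.block)
        start = q * self.block
        return self.samples[q] + sum(self.bits[start:start + r])


B = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
     1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0]
R = SampledRank(B)
print(R.rank(18))  # 10
```

In a real implementation the in-block count would be a word-level popcount or table lookup, which is what makes the query constant time.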
G&V – Quick and Large
Storing Ψ_k efficiently (outline):
• Create $2^{2^k}$ arrays, one for each possible substring in $\{a,b\}^{2^k}$, using the substring as label
Example using k = 1:
aa
ab
ba
bb

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        A  B  B  A  B  B  A  B  B  A  B  B  A  B  A  A

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        A  B  A  B  A  B  B  A  B  B  B  A  B  B  A  #
G&V – Quick and Large
• For each B_k [ j ] = 1 find the $2^k$ literals preceding the suffix referenced by SA_k [ j ] in T
• Store j in the array labeled with these literals
Example using k = 1, j = 1: SA_1 [ 1 ] = 8, and the preceding literals are T [ 14...15 ] = ba
aa
ab
ba 1
bb
G&V – Quick and Large
In other words:
For each i with B_k [ i ] = 0, let t be the first $2^k$ literals of the suffix referenced by SA_k [ i ], and insert Ψ_k [ i ] into array t
Example using k = 1:
aa
ab 9
ba 1
bb
G&V – Quick and Large
For each j with B_k [ j ] = 1
  $t = T[\,2^k SA_k[j] - 2^k, \ldots, 2^k SA_k[j] - 1\,]$
  add j to the array with label t
Example using k = 1:
aa
ab 9
ba 1, 6, 12, 14
bb 2, 4, 5
G&V – Quick and Large
To calculate Ψ_k ( i ), use i – rank_k ( i ) as index into the concatenated arrays
aa
ab 9
ba 1, 6, 12, 14
bb 2, 4, 5
Example using k = 1: 8 – rank_1 ( 8 ) = 8 – 5 = 3, and the third entry of the concatenation 9, 1, 6, 12, 14, 2, 4, 5 is 6, so Ψ_1 ( 8 ) = 6
G&V – Quick and Large
l = log log n levels of compression
• Space occupation: l levels, each occupying O( n ) bits
  ⇒ O( n log log n ) bits
• Time requirement for lookup( i ): l levels, each requiring O( 1 )
  ⇒ O( log log n ) steps
G&V – Small and Slow
Reduction of size by allowing for higher time usage
                   Quick and Large     Small and Slow
Space (in bits)    O( n log log n )    O( n )
Time               O( log log n )      O( log^ε n ), ε > 0
G&V – Small and Slow
Instead of storing all l = log log n levels, store only levels
$0, \qquad l' = \frac{1}{2} \log \log n, \qquad l = \log \log n$
Example n = 32: store levels 0, 2, 3
G&V – Small and Slow
Example using | T | = 32
[Figure: levels SA_0, SA_1, SA_2, SA_3]
G&V – Small and Slow
Keep only 3 levels
[Figure: levels SA_0, SA_1, SA_3]
G&V – Small and Slow
On levels 0 and l´, mark entries that are still present in the next level
[Figure: levels SA_0, SA_1, SA_3 with marked entries]
G&V – Small and Slow
Before the modification:
• B_k [ i ] = 1 ⇔ SA_k [ i ] is stored in SA_{k+1}
• Ψ_k is used for each B_k [ i ] = 0 to find SA_k [ Ψ_k [ i ] ] = SA_k [ i ] + 1
Modifications added:
• B_0´ [ i ] = 1 ⇔ SA_0 [ i ] is stored in SA_l´
• B_l´´ [ i ] = 1 ⇔ SA_l´ [ i ] is stored in SA_l
• Ψ_k´ is used for each B_k [ i ] = 1 to find SA_k [ Ψ_k´ [ i ] ] = SA_k [ i ] + 1
G&V – Small and Slow
Construction of Ψ´ and B´ markings of indices

i       1  2  3  4
SA_3    2  3  4  1

i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1´    1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
Ψ_1     .  .  9  .  .  .  1  6  . 12 14  .  2  .  4  5
Ψ_1´   10  8  . 11 13 15  .  .  7  .  . 16  .  3  .  .
G&V – Small and Slow
Ψ´ and Ψ in combination can be used to traverse the entire SA (ascending)
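A sketch of how the traversal recovers an entry of SA_1 from SA_3 alone: follow the successor mapping until a marked entry is reached, read that entry from SA_3, and subtract the number of steps taken. The list succ merges Ψ_1 and Ψ_1´ into one array for brevity, and all names are illustrative.

```python
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
SA3 = [2, 3, 4, 1]
N1 = len(SA1)

index_of = {v: j for j, v in enumerate(SA1, start=1)}
# succ[i-1] = index j with SA1[j] = SA1[i] + 1 (wrapping from 16 to 1);
# it combines Psi_1 (odd entries) and Psi_1' (even entries).
succ = [index_of[v % N1 + 1] for v in SA1]
marked = [1 if v % 4 == 0 else 0 for v in SA1]   # B_1': entry survives in SA_3


def lookup_level1(i):
    """Recover SA1[i] using only succ, marked and SA3."""
    steps = 0
    while not marked[i - 1]:
        i = succ[i - 1]
        steps += 1
    r = sum(marked[:i])              # rank of i among the marked entries
    return 4 * SA3[r - 1] - steps    # two skipped levels: factor 2^2


print(lookup_level1(3))  # 5
```

For i = 3 the walk visits the entries 5, 6, 7, 8, so three steps are subtracted from 4 · SA_3 [ 1 ] = 8. No walk is longer than three steps, because every fourth value is marked.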
G&V – Small and Slow
Length of the traversal determines required time for lookup
• Level 0 contains n entries
• Level l´ was divided ½ log log n times
$|SA_{l'}| = \frac{n}{2^{l'}} = \frac{n}{2^{\frac{1}{2} \log \log n}} = \frac{n}{\sqrt{\log n}}$
G&V – Small and Slow
Length of the traversal determines required time for lookup
• 0s in B´ are evenly spaced
Longest sequence of 0s: $\frac{|SA_0|}{|SA_{l'}|} = \frac{n}{n / \sqrt{\log n}} = \sqrt{\log n}$
G&V – Small and Slow
Generalized for more than 2 additional levels (the number must be constant!):
Let L be the number of levels, 1/ε = L – 1
The longest sequence of 0s has length $\log^{\varepsilon} n$
G&V – Small and Slow
Reconstruction of levels < l requires
• a vector describing which entries of level k´ can be found in level k´+1
  ⇒ O( n ) bits
• a function Ψ´ that, combined with Ψ, allows for complete traversal of SA
  ⇒ O( n ) bits
Sadakane
Improvements on the data structure and algorithms proposed by Grossi & Vitter
• More operations
  – inverse( j ): return i so that SA[ i ] = j
  – search( P ): return l, r so that P matches the suffixes referenced by SA[ l...r ]
  – decompress( s, e ): return T[ s...e ]
• Allow for alphabets with |Σ| > 2
Sadakane – inverse( j )
Goal: For a suffix starting at position j, find its index i in the lexicographic order of all suffixes
Assuming: j = SA[ i ]
Create $SA^{-1}$ so that: i = $SA^{-1}$[ j ]
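For reference, the uncompressed inverse is a single pass over SA. This sketch stores the inverse explicitly for every position, which is precisely what the compressed variant avoids on all but the last level; the function name is illustrative.

```python
def inverse_suffix_array(sa):
    """Return inv with inv[j-1] = i whenever sa[i-1] = j (1-based values)."""
    inv = [0] * len(sa)
    for i, j in enumerate(sa, start=1):
        inv[j - 1] = i
    return inv


print(inverse_suffix_array([2, 3, 5, 1, 4]))  # [4, 1, 2, 5, 3]
```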
Sadakane – inverse( j )
Proposition:
inverse( j ) can be computed in O( log n ) time with explicit storage of $SA^{-1}$ at the last level and a recursion for all levels above.
Sadakane – search( P )
Goal: Find the interval [ l...r ] in SA so that P matches each of the suffixes pointed to by SA[ l...r ]. Do so without using T.
Sadakane – search( P )
Proposition:
By augmenting the data structure with a function $C^{-1}$ (the "inverse of the array of cumulative frequencies") it is possible to obtain the first |P| literals of a suffix in O( |P| ) time
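A sketch of searching without T, under several assumptions: Ψ is defined for every index (pointing to the next suffix and wrapping at the end), $C^{-1}$ is answered by a binary search over cumulative character counts, and the interval is found by binary search over suffix indices. All function names are illustrative, and the uncompressed lists only demonstrate the access pattern.

```python
from bisect import bisect_left
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}          # assumed order: a < # < b


def build_index(text):
    """Return (psi, chars, ends); the text is only needed for construction."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda p: [ORDER[c] for c in text[p - 1:]])
    index_of = {p: i for i, p in enumerate(sa, start=1)}
    psi = [index_of[p % n + 1] for p in sa]       # index of the next suffix
    chars = sorted(set(text), key=ORDER.get)
    ends = list(accumulate(text.count(c) for c in chars))  # cumulative counts
    return psi, chars, ends


def c_inv(i, chars, ends):
    """First character of the suffix with lexicographic index i."""
    return chars[bisect_left(ends, i)]


def prefix_code(i, m, psi, chars, ends):
    """Ranks of the first m characters of suffix i, read via psi and c_inv."""
    code = []
    for _ in range(m):
        code.append(ORDER[c_inv(i, chars, ends)])
        i = psi[i - 1]
    return code


def search(pattern, psi, chars, ends):
    """Return (l, r); the interval is empty when l > r."""
    n, m = len(psi), len(pattern)
    pat = [ORDER[c] for c in pattern]

    def first_index(accept):
        lo, hi = 1, n + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if accept(prefix_code(mid, m, psi, chars, ends)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    left = first_index(lambda code: code >= pat)
    right = first_index(lambda code: code > pat) - 1
    return left, right


PSI, CHARS, ENDS = build_index("baab#")
print(search("ab", PSI, CHARS, ENDS))  # (2, 2)
```

Each comparison reads |P| characters, so this sketch performs O( |P| log n ) character reads in total.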
Sadakane – decompress( I )
Goal: Using only SA and its functions, return the substring of T pointed to by I = [ s, e ].
Sadakane – decompress( I )
Proposition:
A substring of length l = e – s + 1 can be decompressed using only SA, $SA^{-1}$ and $C^{-1}$ in O( l + log n ) time, where n is the length of the original text.
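The access pattern behind decompress can be sketched as: one inverse lookup to find the starting index, then one step along Ψ per output character. The arrays below are built uncompressed from a small text purely for illustration, and first_char stands in for $C^{-1}$; both are assumptions of this sketch.

```python
ORDER = {"a": 0, "#": 1, "b": 2}          # assumed order: a < # < b
TEXT = "baab#"
N = len(TEXT)

SA = sorted(range(1, N + 1), key=lambda p: [ORDER[c] for c in TEXT[p - 1:]])
SA_INV = [0] * N
for i, p in enumerate(SA, start=1):
    SA_INV[p - 1] = i
PSI = [SA_INV[p % N] for p in SA]         # index of the suffix starting at p + 1
FIRST_CHAR = [TEXT[p - 1] for p in SA]    # stand-in for C^-1


def decompress(s, e):
    """Return T[s..e] (1-based, inclusive) without reading TEXT."""
    i = SA_INV[s - 1]                     # one inverse lookup
    out = []
    for _ in range(e - s + 1):            # one Psi step per character
        out.append(FIRST_CHAR[i - 1])
        i = PSI[i - 1]
    return "".join(out)


print(decompress(2, 4))  # aab
```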
Sadakane – Complexity
Using inverse, search and decompress, T is stored implicitly. Therefore the O( n ) words for T itself are no longer required.
The space required by the Sadakane-improved suffix array is only 37% of that of a Grossi & Vitter suffix array including the text.