Sparse Suffix Sorting
Tomohiro I
Suffix sorting
• Sorting the suffixes of a string T is a fundamental task with many applications, such as text indexing.
T = l i o n l y i n g o n l y o n l y o n $ (positions 1 to 20)
Sorting all suffixes of T gives the Suffix Array of T:
Rank  Position  Suffix
1     20        $
2     9         gonlyonlyon$
3     7         ingonlyonlyon$
4     2         ionlyingonlyonlyon$
5     1         lionlyingonlyonlyon$
6     5         lyingonlyonlyon$
7     16        lyon$
8     12        lyonlyon$
9     19        n$
10    8         ngonlyonlyon$
11    4         nlyingonlyonlyon$
12    15        nlyon$
13    11        nlyonlyon$
14    18        on$
15    3         onlyingonlyonlyon$
16    14        onlyon$
17    10        onlyonlyon$
18    6         yingonlyonlyon$
19    17        yon$
20    13        yonlyon$
• Pattern search: a binary search over the Suffix Array of T finds the pattern lyon in 3 steps.
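The search above can be reproduced with a small sketch. It assumes a naive construction (sorting suffixes with Python's built-in sort, which is not a linear-time suffix array algorithm) and 1-based positions as on the slides; the function names `suffix_array` and `find` are hypothetical helpers, not part of the talk.

```python
def suffix_array(text):
    # Naive construction: sort the 1-based starting positions by their suffix.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find(text, sa, pattern):
    """Binary search for a suffix that starts with `pattern`.

    Returns (position, steps); position is None if the pattern does not occur.
    """
    lo, hi, steps = 0, len(sa) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        start = sa[mid] - 1
        prefix = text[start:start + len(pattern)]
        if prefix == pattern:
            return sa[mid], steps
        if prefix < pattern:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, steps


T = "lionlyingonlyonlyon$"
SA = suffix_array(T)
print(SA)                  # the suffix array of T
print(find(T, SA, "lyon"))  # (16, 3): found at position 16 in 3 steps
```

Each step compares the pattern with one suffix, so the number of steps is logarithmic in the number of suffixes.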
Sparse suffix sorting
• Suppose we are interested in only a (small) set of suffixes of T.
T = l i o n l y i n g o n l y o n l y o n $ (positions 1 to 20)
Designated suffixes, sorted:
Position  Suffix
1         lionlyingonlyonlyon$
12        lyonlyon$
18        on$
14        onlyon$
10        onlyonlyon$
Sparse Suffix Array (SSA) = 1 12 18 14 10
• For sparse indexing, the sparse suffix array is enough, and it is much smaller than the full suffix array.
Problem
Given a string T of length n and a set of b positions, sort the designated suffixes.
• It can be solved in O(n) time by computing the full suffix array and then making it sparse, but this needs O(n) space in addition to the text.
• An alternative is to use any sorting algorithm for a set of strings (not specialized for suffixes) that uses O(b) additional space, but this takes Ω(bn) time.
Question
Using O(s) additional space for some s ∈ [b, n], how fast can we sort sparse suffixes?
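The two baselines can be sketched as follows. This is only an illustration of their space behaviour: the first function uses Python's built-in sort on suffix slices rather than a linear-time suffix array construction, and both function names are hypothetical.

```python
from functools import cmp_to_key


def ssa_via_full_suffix_array(text, positions):
    # Baseline 1: sort ALL suffixes, then keep only the designated ones.
    # Needs space proportional to n for the full suffix array.
    wanted = set(positions)
    full = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    return [i for i in full if i in wanted]


def ssa_via_string_sort(text, positions):
    # Baseline 2: comparison sort on the designated suffixes only.
    # Only the b positions are stored, but one comparison may scan many characters.
    def compare(i, j):
        if i == j:
            return 0
        # The unique terminator '$' guarantees a mismatch before the text ends.
        while text[i - 1] == text[j - 1]:
            i += 1
            j += 1
        return -1 if text[i - 1] < text[j - 1] else 1

    return sorted(positions, key=cmp_to_key(compare))


T = "lionlyingonlyonlyon$"
designated = [1, 10, 12, 14, 18]
print(ssa_via_full_suffix_array(T, designated))  # [1, 12, 18, 14, 10]
print(ssa_via_string_sort(T, designated))        # [1, 12, 18, 14, 10]
```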
Monte-Carlo Algorithm
Randomized algorithms that may return a wrong output with low probability.
We can win the gamble with high probability!!
• Monte-Carlo algorithm using Karp-Rabin fingerprints.
• Time-space tradeoffs: for any s ∈ [b, n], O(n + (bn/s) log s) time using O(s) additional space.
  • When s = b, the time complexity is O(n log b).
  • When s = b log b, the time complexity is O(n).
Karp-Rabin fingerprints (rolling hash) [Karp and Rabin, '87]
• q is a prime number.
• r is chosen from [0, q-1] uniformly at random.
The fingerprint of the substring T[i..j] is
\[ FP[i..j] = \sum_{k=i}^{j} T[k] \cdot r^{j-k} \bmod q \]
For sufficiently large q, the hash function becomes perfect with high probability, i.e., T[i..i+l] = T[i'..i'+l] iff FP[i..i+l] = FP[i'..i'+l].
• We can compute FPs incrementally:
\[ FP[i..j+1] = (FP[i..j] \cdot r + T[j+1]) \bmod q \]
• We can "subtract" the influence of a prefix's FP: for i ≤ k < j,
\[ FP[k+1..j] = (FP[i..j] - FP[i..k] \cdot r^{j-k}) \bmod q \]
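A minimal sketch of the fingerprint definition and the two identities follows. The modulus 2^61 - 1 is an assumed choice of a "sufficiently large" prime, characters are mapped to integers with `ord`, and the helper names `fp`, `extend` and `subtract` are hypothetical.

```python
import random

Q = (1 << 61) - 1        # a Mersenne prime, assumed here as the modulus q
R = random.randrange(Q)  # r chosen uniformly at random from [0, q-1]


def fp(text, i, j):
    """FP[i..j] for 1-based inclusive positions, straight from the definition."""
    return sum(ord(text[k - 1]) * pow(R, j - k, Q) for k in range(i, j + 1)) % Q


def extend(fp_ij, next_char):
    """FP[i..j+1] from FP[i..j] and the character T[j+1]."""
    return (fp_ij * R + ord(next_char)) % Q


def subtract(fp_ij, fp_ik, j, k):
    """FP[k+1..j] from FP[i..j] and FP[i..k]."""
    return (fp_ij - fp_ik * pow(R, j - k, Q)) % Q


T = "lionlyingonlyonlyon$"
# Equal substrings always have equal fingerprints: T[12..15] = T[16..19] = "lyon".
print(fp(T, 12, 15) == fp(T, 16, 19))
# Incremental computation: FP[3..7] from FP[3..6] and T[7].
print(extend(fp(T, 3, 6), T[7 - 1]) == fp(T, 3, 7))
# Subtracting a prefix: FP[12..15] from FP[1..15] and FP[1..11].
print(subtract(fp(T, 1, 15), fp(T, 1, 11), 15, 11) == fp(T, 12, 15))
```

All three comparisons hold for every choice of r; the randomness only matters for the converse direction, where different substrings collide with low probability.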
Lemma
We can preprocess T in O(n) time and O(s) space so that the fingerprint of any substring of length l can be computed in O(min{l, n/s}) time.
[Figure: T[1..n] with marks at positions n/s, 2n/s, 3n/s, 4n/s, …; the fingerprints FP-1, FP-2, FP-3, FP-4, … are stored in O(s) space. In the second build, the labels FP-short and FP-long appear together with FP-1 and FP-3.]
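One way to realize such a structure is sketched below. It is an assumption about the construction, not the talk's exact method: fingerprints of the prefixes ending at every block boundary (block length about n/s) are stored, and a long substring is answered by extending two stored prefix fingerprints and subtracting. The class name `SampledFP` is hypothetical, and the power r^l is obtained with Python's `pow`, which adds O(log l) multiplications per query.

```python
import random

Q = (1 << 61) - 1  # assumed prime modulus


class SampledFP:
    def __init__(self, text, s):
        self.t = [ord(c) for c in text]
        self.n = len(text)
        self.block = max(1, -(-self.n // s))  # ceil(n / s)
        self.r = random.randrange(Q)
        # samples[k] = fingerprint of the prefix of length k * block
        self.samples = [0]
        fp = 0
        for pos, c in enumerate(self.t, start=1):
            fp = (fp * self.r + c) % Q
            if pos % self.block == 0:
                self.samples.append(fp)

    def direct(self, i, j):
        """FP of T[i..j] (1-based, inclusive), character by character: O(l)."""
        fp = 0
        for c in self.t[i - 1:j]:
            fp = (fp * self.r + c) % Q
        return fp

    def prefix(self, length):
        """FP of T[1..length], extending the nearest stored sample: O(n/s)."""
        k = length // self.block
        fp = self.samples[k]
        for c in self.t[k * self.block:length]:
            fp = (fp * self.r + c) % Q
        return fp

    def substring(self, i, j):
        """FP of T[i..j] using whichever of the two routes is cheaper."""
        length = j - i + 1
        if length <= self.block:
            return self.direct(i, j)
        return (self.prefix(j) - self.prefix(i - 1) * pow(self.r, length, Q)) % Q


T = "lionlyingonlyonlyon$"
index = SampledFP(T, s=4)
print(index.substring(10, 19) == index.direct(10, 19))
```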
Monte-Carlo algorithm
• The algorithm constructs the sparse suffix tree.
Sparse Suffix Tree:
• Compacted trie representing all designated suffixes.
• For each node, child edges are sorted.
• Since edge labels can be encoded by pointers to T, the sparse suffix tree can be represented in O(b) space.
[Figure: sparse suffix tree of the designated suffixes of T = lionlyingonlyonlyon$; its leaves, read from left to right, give the Sparse Suffix Array 1 12 18 14 10.]
root
├─ l
│  ├─ ionlying… → 1
│  └─ yonlyon$ → 12
└─ on
   ├─ $ → 18
   └─ lyon
      ├─ $ → 14
      └─ ly… → 10
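The tree in the figure can be reproduced with a naive sketch that compares characters directly (it is not the Monte-Carlo algorithm of this talk). For readability the edge labels are stored as strings in nested dictionaries; an O(b)-space representation would store pointers into T instead. The names `build` and `leaves` are hypothetical.

```python
def build(text, positions, depth=0):
    """Compacted trie of the suffixes starting at `positions` (1-based).

    A leaf is an int (the starting position); an internal node is a dict
    mapping edge labels to children, inserted in sorted order.
    """
    if len(positions) == 1:
        return positions[0]
    groups = {}
    for p in positions:
        groups.setdefault(text[p - 1 + depth], []).append(p)
    node = {}
    for ch in sorted(groups):
        group = groups[ch]
        start = group[0] - 1 + depth
        if len(group) == 1:
            node[text[start:]] = group[0]
            continue
        # Extend the edge label while all suffixes in the group agree.
        # The unique terminator '$' guarantees that they disagree eventually.
        length = 1
        while len({text[p - 1 + depth + length] for p in group}) == 1:
            length += 1
        node[text[start:start + length]] = build(text, group, depth + length)
    return node


def leaves(node):
    """Leaf positions from left to right."""
    if isinstance(node, int):
        return [node]
    return [p for child in node.values() for p in leaves(child)]


T = "lionlyingonlyonlyon$"
tree = build(T, [1, 10, 12, 14, 18])
print(tree)
print(leaves(tree))  # [1, 12, 18, 14, 10], the sparse suffix array
```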
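Overview: gradual refinement
T = l i o n l y i n g o n l y o n l y o n $ (positions 1 to 20)
[Figure: a sequence of unsorted trees over the designated suffixes, refined step by step, followed by "sort edges" to reach the goal.]
{32, 16, 8}-strict:
root
├─ lionlying… → 1
├─ onlyonlyon$ → 10
├─ lyonlyon$ → 12
├─ on$ → 18
└─ onlyon$ → 14
4-strict:
root
├─ lionlying… → 1
├─ only
│  ├─ only… → 10
│  └─ on$ → 14
├─ on$ → 18
└─ lyonlyon$ → 12
2-strict:
root
├─ lionlying… → 1
├─ on
│  ├─ lyon
│  │  ├─ ly… → 10
│  │  └─ $ → 14
│  └─ $ → 18
└─ lyonlyon$ → 12
1-strict:
root
├─ l
│  ├─ ionlying… → 1
│  └─ yonlyon$ → 12
└─ on
   ├─ lyon
   │  ├─ ly… → 10
   │  └─ $ → 14
   └─ $ → 18
sort edges → goal:
root
├─ l
│  ├─ ionlying… → 1
│  └─ yonlyon$ → 12
└─ on
   ├─ $ → 18
   └─ lyon
      ├─ $ → 14
      └─ ly… → 10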
l-strict (unsorted) sparse suffix trees
Condition of l-strict tree
For any internal node of an l-strict tree, any pair of child edge labels has a common prefix of length less than l.
[Figure: for T = lionlyingonlyonlyon$, the {32, 16, 8}-strict, 4-strict, 2-strict and 1-strict trees, followed by "sort edges" leading to the goal tree.]
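The condition can be checked mechanically. The sketch below assumes trees are written as nested dictionaries (edge label to child, with an integer leaf for each suffix position); this representation and the name `is_l_strict` are illustrative choices, not part of the talk.

```python
from itertools import combinations


def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def is_l_strict(node, l):
    """True if every pair of sibling edge labels shares fewer than l characters."""
    if isinstance(node, int):  # leaf
        return True
    for a, b in combinations(node, 2):
        if common_prefix_len(a, b) >= l:
            return False
    return all(is_l_strict(child, l) for child in node.values())


# The initial tree and the 4-strict tree for T = lionlyingonlyonlyon$.
initial = {"lionlyingonlyonlyon$": 1, "onlyonlyon$": 10, "lyonlyon$": 12,
           "on$": 18, "onlyon$": 14}
four_strict = {"lionlyingonlyonlyon$": 1,
               "only": {"onlyon$": 10, "on$": 14},
               "on$": 18, "lyonlyon$": 12}

print(is_l_strict(initial, 8), is_l_strict(initial, 4))          # True False
print(is_l_strict(four_strict, 4), is_l_strict(four_strict, 2))  # True False
```

The initial tree fails for l = 4 because onlyonlyon$ and onlyon$ share the 6 characters onlyon.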
From 4-strict tree to 2-strict tree
• For every edge label of the 4-strict tree, we compute the FP of its prefix of length 2, and merge the edge labels that have the same value.
• Example for T = lionlyingonlyonlyon$: at the root of the 4-strict tree, the edge labels on$ and only have the same prefix on, so they are merged under a new edge on. In the same way, the two edge labels on$ and only… below the edge only are merged.
[Figure: the 4-strict tree and the resulting 2-strict tree, with the {32, 16, 8}-strict and 1-strict trees shown for context.]
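A single refinement step can be sketched as follows, under simplifying assumptions: trees are nested dictionaries with string edge labels, fingerprints are computed from the label strings directly (the algorithm in the talk would compute them from T via the fingerprint lemma), and nodes left with a single child are spliced out. The names `fp`, `attach` and `refine` are hypothetical. Being Monte-Carlo, a fingerprint collision would produce a wrong merge, with very low probability.

```python
import random

Q = (1 << 61) - 1        # assumed prime modulus
R = random.randrange(Q)  # random base


def fp(s):
    """Karp-Rabin fingerprint of the string s."""
    h = 0
    for c in s:
        h = (h * R + ord(c)) % Q
    return h


def attach(parent, label, child):
    """Add an edge, splicing out nodes that have a single child (compaction)."""
    while isinstance(child, dict) and len(child) == 1:
        (sub_label, sub_child), = child.items()
        label, child = label + sub_label, sub_child
    parent[label] = child


def refine(node, l):
    """One refinement step: turn a 2l-strict tree into an l-strict tree."""
    if isinstance(node, int):  # leaf: starting position of a suffix
        return node
    groups = {}
    for label, child in node.items():
        # Group edges by the fingerprint of their prefix of length l.
        key = (min(l, len(label)), fp(label[:l]))
        groups.setdefault(key, []).append((label, child))
    out = {}
    for members in groups.values():
        if len(members) == 1:
            label, child = members[0]
            attach(out, label, refine(child, l))
        else:
            # Merge: a new edge for the shared prefix, the rests hang below it.
            merged = {}
            for label, child in members:
                attach(merged, label[l:], refine(child, l))
            attach(out, members[0][0][:l], merged)
    return out


four_strict = {"lionlyingonlyonlyon$": 1,
               "only": {"onlyon$": 10, "on$": 14},
               "on$": 18, "lyonlyon$": 12}
two_strict = refine(four_strict, 2)
print(two_strict)
```

Applying `refine` with l = 4, 2, 1 in turn to the initial tree (one edge per designated suffix) walks through the whole sequence of trees; sorting the edges of the final tree is left out of this sketch.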
Gradual refinement
[Figure: for T = lionlyingonlyonlyon$, the {32, 16, 8}-strict, 4-strict, 2-strict and 1-strict trees.]
The total cost for gradual refinement
The total cost for FP computation is O((bn/s) log s).
Proof.
• For each refinement step, the FPs can be computed in O(b min{l, n/s}) time.
• For the first log s steps, each step costs O(bn/s), so the cost is O((bn/s) log s).
• After log s steps, since l < n / 2^(log s) = n/s, the cost is
\[ \frac{bn}{s} + \frac{bn}{2s} + \frac{bn}{4s} + \dots + b = O\!\left(\frac{bn}{s}\right) \]
The total cost for gradual refinement
The total cost for merging edge labels is O(b log n log_s n) = O(b log^2 n / log s) = O((bn/s) log s).
Proof.
• For each refinement step, we conduct radix grouping of O(b) integer keys with radix s, which can be done in O(b log_s n) time. There are O(log n) refinement steps.
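The last equality is stated without a derivation. One way to see it, offered here as a sketch and assuming base-2 logarithms and e^2 ≤ s ≤ n (so that the function x ↦ log^2 x / x is decreasing on [s, n]):

```latex
\[
b \log n \cdot \log_s n \;=\; \frac{b \log^2 n}{\log s},
\qquad
\frac{\log^2 n}{n} \le \frac{\log^2 s}{s}
\;\Longrightarrow\;
\frac{b \log^2 n}{\log s}
\;\le\; \frac{b n}{s} \cdot \frac{\log^2 s}{\log s}
\;=\; \frac{b n}{s} \log s .
\]
```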
Grouping by distribute-and-collect
• Input: k elements associated with keys (integers in [0, n]), e.g., 432, 52, 52, 978, 978, 52.
• With empty-initialized buckets of size n (one bucket per key value 0, 1, …, n): distributing the elements and collecting them groups equal keys in O(k) time.
• With empty-initialized buckets of size 10 (one bucket per digit 0, 1, …, 9): we distribute and collect once per digit, which takes O(k log_10 n) time.
  • After the first round: bucket 2 holds 432, 52, 52, 52 and bucket 8 holds 978, 978.
  • After the following rounds: 432 | 52 52 52 | 978 978.
Radix grouping with radix s
• With empty-initialized buckets of size s (buckets 0, 1, …, s-1): log_s n steps, O(k log_s n) time.
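The rounds can be sketched as below. This is one possible reading of the slides, not necessarily the exact procedure: every round splits each current group by one more base-s digit, and only the buckets touched in a round are visited when collecting, so a round costs O(k) rather than O(k + s). The function name `radix_group` is hypothetical.

```python
def radix_group(keys, s, n):
    """Group equal integer keys (each in [0, n]) with s buckets, for s >= 2."""
    buckets = [[] for _ in range(s)]   # empty-initialized, O(s) space
    groups = [list(range(len(keys)))]  # element indices; start with one group
    div = 1
    while div <= n:                    # about log_s n rounds
        refined = []
        for group in groups:
            touched = []
            for idx in group:          # distribute by the current digit
                d = (keys[idx] // div) % s
                if not buckets[d]:
                    touched.append(d)
                buckets[d].append(idx)
            for d in touched:          # collect only the non-empty buckets
                refined.append(buckets[d])
                buckets[d] = []        # leave the bucket empty for reuse
        groups = refined
        div *= s
    return [[keys[i] for i in g] for g in groups]


print(radix_group([432, 52, 52, 978, 978, 52], s=10, n=1000))
# [[432], [52, 52, 52], [978, 978]]
```

The output is grouped but not sorted, which is all the merging step needs.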
Monte-Carlo algorithm
Theorem
For any s ∈ [b, n], we can construct the sparse suffix tree correctly with high probability in O(n + (bn/s) log s) time using O(s) additional space.
• When s = b, the time complexity is O(n log b).
• When s = b log b, the time complexity is O(n).
• The total cost for FP computation is O((bn/s) log s).
• The total cost for merging edge labels is O(b log n log_s n) = O(b log^2 n / log s) = O((bn/s) log s).
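The two special cases follow by substituting s into the bound O(n + (bn/s) log s). A short worked check, assuming b ≥ 2 so that log log b ≤ log b:

```latex
\[
s = b:\quad
n + \frac{bn}{b}\log b \;=\; n + n\log b \;=\; O(n \log b),
\]
\[
s = b\log b:\quad
n + \frac{bn}{b\log b}\,\log(b\log b)
\;=\; n + \frac{n}{\log b}\,(\log b + \log\log b)
\;\le\; n + 2n \;=\; O(n).
\]
```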
Exercise
• Follow the gradual refinement steps for this instance.
T = d o o r t o d o o r t o d o r t m u n d $ (positions 1 to 21)
Designated suffixes:
Position  Suffix
1         doortodoortodortmund$
3         ortodoortodortmund$
6         odoortodortmund$
7         doortodortmund$
9         ortodortmund$
12        odortmund$
13        dortmund$
?