Post on 21-Dec-2015
A Simpler Analysis of Burrows-Wheeler Based
Compression
Haim Kaplan Shir Landau Elad Verbin
Our Results
1. Improve the bounds of one of the main BWT based compression algorithms
2. New technique for worst case analysis of BWT based compression algorithms using the Local Entropy
3. Interesting results concerning compression of integer strings
The Burrows-Wheeler Transform (1994)
Given a string S, the Burrows-Wheeler Transform creates a permutation of S that is locally homogeneous.
S → BWT → S’   (S’ is locally homogeneous)
Empirical Entropy - Intuition
The Problem – Given a string S encode each symbol in S using a fixed codeword…
Order-0 Entropy (Shannon 48)
H0(s): Maximum compression we can get using only frequencies and no context information
Example: Huffman Code
Order-k entropy
Hk(s): Lower bound for compression with order-k contexts – the codeword representing each symbol depends on the k symbols preceding it
MISSISSIPPI
Context 1 for i: “mssp”
Context 1 for s: “isis”
Traditionally, the compression ratio of compression algorithms is measured using Hk(s)
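Both statistics can be computed directly from their definitions. A minimal sketch (the names `h0` and `nh_k` are mine; grouping each symbol by its length-k preceding context, as in the MISSISSIPPI example, is one common convention):

```python
from collections import Counter
from math import log2

def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    return sum(-(c / n) * log2(c / n) for c in Counter(s).values())

def nh_k(s, k):
    """n * H_k(s): group symbols by the k symbols preceding them,
    then sum |group| * H0(group) over all contexts."""
    groups = {}
    for i in range(k, len(s)):
        groups.setdefault(s[i - k:i], []).append(s[i])
    return sum(len(g) * h0(g) for g in groups.values())
```

For example, h0("mississippi") is about 1.82 bits per symbol, and conditioning on a one-symbol context can only lower the total: nh_k(s, 1) ≤ n·H0(s).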
History
The Main Burrows-Wheeler Compression Algorithm (Burrows, Wheeler 1994):
String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → RLE (Run-Length Encoding) → Order-0 Encoding → Compressed String S’
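The MTF output is dominated by runs of zeros, which is what the RLE stage exploits. A simplified sketch of a zero-run encoder (an illustration only, not Manzini's exact RLE0 code; the function name is mine):

```python
def rle0(seq):
    """Collapse each maximal run of zeros in an MTF output sequence
    into a (0, run_length) pair; other values pass through unchanged."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == 0:
            j = i
            while j < len(seq) and seq[j] == 0:
                j += 1                 # extend the zero run
            out.append((0, j - i))     # emit one pair per run
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out
```

For instance, rle0([1, 0, 0, 0, 2, 0, 1]) yields [1, (0, 3), 2, (0, 1), 1].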
MTF
Given a string S = baacb over alphabet Σ = {a,b,c,d}:

S      = b a a c b
MTF(S) = 1 1 0 2 2

The list starts as a,b,c,d; each symbol is replaced by its position in the list (counting from 0) and then moved to the front:
after b: b,a,c,d → after a: a,b,c,d → after a: a,b,c,d → after c: c,a,b,d → after b: b,c,a,d
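The example above can be reproduced with a direct implementation of Move-to-Front (a sketch; the function name is mine):

```python
def mtf(s, alphabet):
    """Move-to-Front: replace each symbol by its index in the list,
    then move that symbol to the front of the list."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))  # move the symbol to the front
    return out

mtf("baacb", "abcd")  # → [1, 1, 0, 2, 2]
```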
Main Bounds (Manzini 1999)
|BW_MTF(s)| ≤ 8nH_k(s) + 0.08n + g_k
• g_k is a constant dependent on the context k and the size of the alphabet
• these are worst-case bounds
Now we are ready to begin…
Some Intuition…
• H0 – “measures” frequency
• Hk – “measures” frequency and context
→ We want a statistic that measures local similarity in a string and specifically in the BWT of the string
Some Intuition…
• The more the contexts are similar in the original string, the more its BWT will exhibit local similarity…
• The more local similarity found in the BWT of the string the smaller the numbers we get in MTF…
→ The solution: Local Entropy
The Local Entropy - Definition
Given a string s = “s1s2…sn”, the local entropy of s (Bentley, Sleator, Tarjan, Wei 86):
s → MTF → MTF(s)   (original string → integer sequence)
LE(s) = Σ_{i=1}^{n} log(MTF(s)_i + 1)
The Local Entropy - Definition
Note: LE(s) = number of bits needed to write the MTF sequence in binary.
Example: MTF(s) = 3 1 1
→ MTF(s) in binary = 1111
→ LE(s) = log 4 + log 2 + log 2 = 4
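The definition is a one-liner in code (a sketch; it takes the MTF sequence as input and charges log2(x+1) bits per value, as in the example):

```python
from math import log2

def local_entropy(mtf_seq):
    """LE = sum of log2(x + 1): the number of bits needed to write
    each MTF value in plain binary."""
    return sum(log2(x + 1) for x in mtf_seq)

local_entropy([3, 1, 1])  # → 4.0
```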
In Dream world… We would like to compress S to LE(S)…
The Local Entropy – Properties
We use two properties of LE:
1. The entropy hierarchy
2. Convexity
The Local Entropy – Property 1
1. The entropy hierarchy:
We prove: For each k:
LE(BWT(s)) ≤ nHk(s) + O(1)
→ Any upper bound that we get for BWT with LE holds for Hk(s) as well.
The Local Entropy – Property 2
2. Convexity: LE(s1s2) ≤ LE(s1) + LE(s2) + O(1)
→ This means that a partition of a string s does not improve the Local Entropy of s.
Convexity
• Cutting the input string into parts doesn’t influence LE much: only O(1) extra bits per part
  e.g. a a a b a b a b
Convexity – Why do we need it?
Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005:
String S → BWT (Burrows-Wheeler transform) → Partition of BWT(S) (computed by the Booster) → RHC (variation of Huffman encoding) → Compressed String S’
|FGMS(s)| ≤ nH_k*(s) + g_k
Using LE and its properties we get our bounds
Theorem: For every μ > 1, where ζ(μ) = 1/1^μ + 1/2^μ + 1/3^μ + …:
|BW_MTF(s)| ≤ μ·LE(BWT(s)) + log(ζ(μ))·n          (our LE bound)
|BW_MTF(s)| ≤ μ·nH_k(s) + log(ζ(μ))·n + g_k       (our H_k bound)
Our bounds
We get an improvement of the known bounds:
|BW_MTF(s)| ≤ 8nH_k(s) + 0.006n + g_k
|BW_MTF(s)| ≤ 4.45nH_k(s) + 0.08n + g_k
As opposed to the known bound (Manzini, 1999):
|BW_MTF(s)| ≤ 8nH_k(s) + 0.08n + g_k
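The constant pairs correspond to choices of μ in the theorem: the additive term per symbol is log2(ζ(μ)). A quick numeric check, approximating ζ by a partial sum (function name and truncation point are mine):

```python
from math import log2

def log2_zeta(mu, terms=100000):
    """Approximate log2 of the Riemann zeta function by a partial sum
    (the tail is negligible for mu well above 1)."""
    return log2(sum(k ** -mu for k in range(1, terms)))

# mu = 8    -> additive term log2(zeta(8))    ~ 0.006 bits per symbol
# mu = 4.45 -> additive term log2(zeta(4.45)) ~ 0.08  bits per symbol
```

So the 8nH_k + 0.006n and 4.45nH_k + 0.08n bounds are two points on the same μ trade-off curve.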
Our Test Results

File name      bzip2     Our bound using LE   Our H_k bound   Manzini’s bound 8nH_k(s)+0.08n+g_k
alice29.txt    345568    396813               766940          2328219
asyoulik.txt   316552    367874               683171          2141646
cp.html        61056     69858                105033          295714
fields.c       24312     25713                43379           119210
grammar.lsp    10264     10234                16054           45134
lcet10.txt     861184    1021440              1967240         5867291
plrabn12.txt   1164360   1391310              2464440         8198976
xargs.1        14096     13858                22317           64673

*The files are non-binary files from the Canterbury corpus. bzip2 results are also taken from the corpus. The sizes are indicated in bits.
How is LE related to compression of integer sequences?
• We mentioned “dream world”, but what about reality? How close can we come to LE(BWT(s))?
Problem: Compress an integer sequence s close to its sum of logs:
SL(s) = Σ_{x∈s} log(x + 1)
Notice that for any s: LE(s) = SL(MTF(s))
Compressing Integer Sequences
• Universal Encodings of Integers: prefix-free encoding for integers (e.g. Fibonacci encoding).
• Doing some math, it turns out that order-0 encoding is good.
Not only good: It is best!
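Fibonacci encoding, mentioned above, is one such universal code: a prefix-free bit string for each positive integer, about 1.44·log2(x) + O(1) bits long. A sketch (the function name and Zeckendorf-based construction follow the standard definition):

```python
def fib_encode(n):
    """Fibonacci code: write n as a sum of non-consecutive Fibonacci
    numbers (Zeckendorf), emit one bit per Fibonacci number from the
    smallest up, then append a final '1' as a terminator."""
    assert n >= 1
    fibs = [1, 2]                     # F = 1, 2, 3, 5, 8, ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):     # greedy Zeckendorf decomposition
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"
```

For example, fib_encode(1) is "11", fib_encode(2) is "011", and fib_encode(4) is "1011"; no codeword is a prefix of another because "11" occurs only at the end.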
The order-0 math
• Theorem: For any string s of length n over the integer alphabet {1, 2, …, h} and for any μ > 1:
nH_0(s) ≤ μ·SL(s) + log(ζ(μ))·n
• Strange conclusion… we get an upper bound on the order-0 algorithm with a term that depends on the values of the integers.
• This is true for all strings but is especially interesting for strings with small integers.
A lower bound for SL
Theorem: For any algorithm A, for any μ > 1, and for any C such that C < log(ζ(μ)), there exists a string S of length n for which:
|A(S)| > μ·SL(S) + C·n
Our Results - Summary
• New improved bounds for BWMTF
• Local Entropy (LE)
• New bounds for compression of integer strings
Open Issues
We question the effectiveness of nH_k(s).
Is there a better statistic?
Anybody want to guess?
Creating a Huffman encoding
• For each encoding unit (letter, in this example), associate a frequency (number of times it occurs)
• Create a binary tree whose children are the encoding units with the smallest frequencies
  – The frequency of the root is the sum of the frequencies of the leaves
• Repeat this procedure until all the encoding units are in the binary tree
Example
Assume that relative frequencies are: A: 40, B: 20, C: 10, D: 10, R: 20
• Assign 0 to left branches, 1 to right branches
• Each encoding is a path from the root
A = 0, B = 100, C = 1010, D = 1011, R = 11
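The construction above can be sketched with a binary heap (the function name is mine; tie-breaking among equal frequencies may produce codewords different from the slide's, but every Huffman tree has the same total cost, here 40·1 + 20·3 + 10·4 + 10·4 + 20·2 = 220 bits):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code: repeatedly merge the two lowest-frequency
    nodes; 0 labels left branches and 1 labels right branches."""
    tiebreak = count()                 # keeps heap entries comparable
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):    # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                          # leaf: record the codeword
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

With the slide's frequencies, sum(freq * codeword length) comes out to 220 regardless of how ties are broken.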
The Burrows-Wheeler Transform (1994)
Given a string S = banana#

All rotations:   Sort the rows:
banana#          #banana
anana#b          a#banan
nana#ba          ana#ban
ana#ban          anana#b
na#bana          banana#
a#banan          na#bana
#banana          nana#ba

BWT(S) = the last column of the sorted rows = annb#aa
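The banana# example can be reproduced with the textbook sort-all-rotations construction (fine for a small example; real implementations avoid materializing all rotations):

```python
def bwt(s):
    """Burrows-Wheeler Transform: sort all rotations of s and take the
    last column. Assumes s ends with a unique sentinel such as '#'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

bwt("banana#")  # → "annb#aa"
```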
Suffix Arrays and the BWT
So all we need to get the BWT is the suffix array!
Sorted rotations of banana#: #banana, a#banan, ana#ban, anana#b, banana#, na#bana, nana#ba
The suffix array of banana#: 7 6 4 2 1 5 3
Index of BWT (position in S of each output character): 6 5 3 1 7 4 2
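A naive suffix-array construction makes the connection concrete (illustrative only, with 1-indexed positions as on the slide; linear-time suffix-array algorithms exist and are what one would use in practice):

```python
def suffix_array(s):
    """1-indexed suffix array: starting positions of the suffixes of s
    in sorted order (naive construction, fine for an example)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

def bwt_from_sa(s):
    """BWT[i] is the character just before the i-th sorted suffix;
    Python's negative indexing wraps suffix 1 around to the last char."""
    return "".join(s[i - 2] for i in suffix_array(s))

suffix_array("banana#")  # → [7, 6, 4, 2, 1, 5, 3]
bwt_from_sa("banana#")   # → "annb#aa"
```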