Post on 21-Dec-2015
A Simpler Analysis of Burrows-Wheeler Based
Compression
Haim Kaplan Shir Landau Elad Verbin
Our Results
1. Improve the bounds of one of the main BWT based compression algorithms
2. New technique for worst case analysis of BWT based compression algorithms using the Local Entropy
3. Interesting results concerning compression of integer strings
The Burrows-Wheeler Transform (1994)
Given a string S, the Burrows-Wheeler Transform creates a permutation of S that is locally homogeneous.
S → BWT → S’   (S’ is locally homogeneous)
Empirical Entropy - Intuition
The Problem – Given a string S encode each symbol in S using a fixed codeword…
Order-0 Entropy (Shannon 48)
H0(s): Maximum compression we can get using only frequencies and no context information
Example: Huffman Code
Order-k entropy
Hk(s): Lower bound for compression with order-k contexts – the codeword representing each symbol depends on the k symbols preceding it
MISSISSIPPI
Context 1 for i: “mssp”
Context 1 for s: “isis”
Traditionally, the compression ratio of compression algorithms is measured using Hk(s)
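Both statistics can be computed directly from their definitions. A minimal sketch (the names `h0` and `nh_k` are mine; grouping each symbol by its length-k preceding context, as in the MISSISSIPPI example, is one common convention):

```python
from collections import Counter
from math import log2

def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    return sum(-(c / n) * log2(c / n) for c in Counter(s).values())

def nh_k(s, k):
    """n * H_k(s): group symbols by the k symbols preceding them,
    then sum |group| * H0(group) over all contexts."""
    groups = {}
    for i in range(k, len(s)):
        groups.setdefault(s[i - k:i], []).append(s[i])
    return sum(len(g) * h0(g) for g in groups.values())
```

For example, h0("mississippi") is about 1.82 bits per symbol, and conditioning on a one-symbol context can only lower the total: nh_k(s, 1) ≤ n·H0(s).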
History
The Main Burrows-Wheeler Compression Algorithm (Burrows, Wheeler 1994):
String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → RLE (Run-Length Encoding) → Order-0 Encoding → Compressed String S’
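The MTF output is dominated by runs of zeros, which is what the RLE stage exploits. A simplified sketch of a zero-run encoder (an illustration only, not Manzini's exact RLE0 code; the function name is mine):

```python
def rle0(seq):
    """Collapse each maximal run of zeros in an MTF output sequence
    into a (0, run_length) pair; other values pass through unchanged."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == 0:
            j = i
            while j < len(seq) and seq[j] == 0:
                j += 1                 # extend the zero run
            out.append((0, j - i))     # emit one pair per run
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out
```

For instance, rle0([1, 0, 0, 0, 2, 0, 1]) yields [1, (0, 3), 2, (0, 1), 1].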
MTF
Given a string S = baacb over alphabet Σ = {a,b,c,d}:

S      = b a a c b
MTF(S) = 1 1 0 2 2

The list starts as a,b,c,d; each symbol is replaced by its position in the list (counting from 0) and then moved to the front:
after b: b,a,c,d → after a: a,b,c,d → after a: a,b,c,d → after c: c,a,b,d → after b: b,c,a,d
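The example above can be reproduced with a direct implementation of Move-to-Front (a sketch; the function name is mine):

```python
def mtf(s, alphabet):
    """Move-to-Front: replace each symbol by its index in the list,
    then move that symbol to the front of the list."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))  # move the symbol to the front
    return out

mtf("baacb", "abcd")  # → [1, 1, 0, 2, 2]
```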
Main Bounds (Manzini 1999)
|BW_MTF(s)| ≤ 8nH_k(s) + 0.08n + g_k
• g_k is a constant dependent on the context k and the size of the alphabet
• these are worst-case bounds
Now we are ready to begin…
Some Intuition…
• H0 – “measures” frequency
• Hk – “measures” frequency and context
→ We want a statistic that measures local similarity in a string and specifically in the BWT of the string
Some Intuition…
• The more the contexts are similar in the original string, the more its BWT will exhibit local similarity…
• The more local similarity found in the BWT of the string the smaller the numbers we get in MTF…
→ The solution: Local Entropy
The Local Entropy - Definition
Given a string s = “s1s2…sn”, the local entropy of s (Bentley, Sleator, Tarjan, Wei 86):
s → MTF → MTF(s)   (original string → integer sequence)
LE(s) = Σ_{i=1}^{n} log(MTF(s)_i + 1)
The Local Entropy - Definition
Note: LE(s) = number of bits needed to write the MTF sequence in binary.
Example: MTF(s) = 3 1 1
→ MTF(s) in binary = 1111
→ LE(s) = log 4 + log 2 + log 2 = 4
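The definition is a one-liner in code (a sketch; it takes the MTF sequence as input and charges log2(x+1) bits per value, as in the example):

```python
from math import log2

def local_entropy(mtf_seq):
    """LE = sum of log2(x + 1): the number of bits needed to write
    each MTF value in plain binary."""
    return sum(log2(x + 1) for x in mtf_seq)

local_entropy([3, 1, 1])  # → 4.0
```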
In Dream world… We would like to compress S to LE(S)…
The Local Entropy – Properties
We use two properties of LE:
1. The entropy hierarchy
2. Convexity
The Local Entropy – Property 1
1. The entropy hierarchy:
We prove: For each k:
LE(BWT(s)) ≤ nHk(s) + O(1)
→ Any upper bound that we get for BWT with LE holds for Hk(s) as well.
The Local Entropy – Property 2
2. Convexity: LE(s1s2) ≤ LE(s1) + LE(s2) + O(1)
→ This means that a partition of a string s does not improve the Local Entropy of s.
Convexity
• Cutting the input string into parts doesn’t influence LE much: only O(1) extra bits per part
  e.g. a a a b a b a b
Convexity – Why do we need it?
Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005:
String S → BWT (Burrows-Wheeler transform) → Partition of BWT(S) (computed by the Booster) → RHC (variation of Huffman encoding) → Compressed String S’
|FGMS(s)| ≤ nH_k*(s) + g_k
Using LE and its properties we get our bounds
Theorem: For every μ > 1, where ζ(μ) = 1/1^μ + 1/2^μ + 1/3^μ + …:
|BW_MTF(s)| ≤ μ·LE(BWT(s)) + log(ζ(μ))·n          (our LE bound)
|BW_MTF(s)| ≤ μ·nH_k(s) + log(ζ(μ))·n + g_k       (our H_k bound)
Our bounds
We get an improvement of the known bounds:
|BW_MTF(s)| ≤ 8nH_k(s) + 0.006n + g_k
|BW_MTF(s)| ≤ 4.45nH_k(s) + 0.08n + g_k
As opposed to the known bound (Manzini, 1999):
|BW_MTF(s)| ≤ 8nH_k(s) + 0.08n + g_k
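The constant pairs correspond to choices of μ in the theorem: the additive term per symbol is log2(ζ(μ)). A quick numeric check, approximating ζ by a partial sum (function name and truncation point are mine):

```python
from math import log2

def log2_zeta(mu, terms=100000):
    """Approximate log2 of the Riemann zeta function by a partial sum
    (the tail is negligible for mu well above 1)."""
    return log2(sum(k ** -mu for k in range(1, terms)))

# mu = 8    -> additive term log2(zeta(8))    ~ 0.006 bits per symbol
# mu = 4.45 -> additive term log2(zeta(4.45)) ~ 0.08  bits per symbol
```

So the 8nH_k + 0.006n and 4.45nH_k + 0.08n bounds are two points on the same μ trade-off curve.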
Our Test Results

File name      bzip2     Our bound using LE   Our H_k bound   Manzini’s bound 8nH_k(s)+0.08n+g_k
alice29.txt    345568    396813               766940          2328219
asyoulik.txt   316552    367874               683171          2141646
cp.html        61056     69858                105033          295714
fields.c       24312     25713                43379           119210
grammar.lsp    10264     10234                16054           45134
lcet10.txt     861184    1021440              1967240         5867291
plrabn12.txt   1164360   1391310              2464440         8198976
xargs.1        14096     13858                22317           64673

*The files are non-binary files from the Canterbury corpus. bzip2 results are also taken from the corpus. The sizes are indicated in bits.
How is LE related to compression of integer sequences?
• We mentioned “dream world”, but what about reality? How close can we come to LE(BWT(s))?
Problem: Compress an integer sequence s close to its sum of logs:
SL(s) = Σ_{x∈s} log(x + 1)
Notice that for any s: LE(s) = SL(MTF(s))
Compressing Integer Sequences
• Universal Encodings of Integers: prefix-free encoding for integers (e.g. Fibonacci encoding).
• Doing some math, it turns out that order-0 encoding is good.
Not only good: It is best!
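Fibonacci encoding, mentioned above, is one such universal code: a prefix-free bit string for each positive integer, about 1.44·log2(x) + O(1) bits long. A sketch (the function name and Zeckendorf-based construction follow the standard definition):

```python
def fib_encode(n):
    """Fibonacci code: write n as a sum of non-consecutive Fibonacci
    numbers (Zeckendorf), emit one bit per Fibonacci number from the
    smallest up, then append a final '1' as a terminator."""
    assert n >= 1
    fibs = [1, 2]                     # F = 1, 2, 3, 5, 8, ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):     # greedy Zeckendorf decomposition
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"
```

For example, fib_encode(1) is "11", fib_encode(2) is "011", and fib_encode(4) is "1011"; no codeword is a prefix of another because "11" occurs only at the end.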
The order-0 math
• Theorem: For any string s of length n over the integer alphabet {1, 2, …, h} and for any μ > 1:
nH_0(s) ≤ μ·SL(s) + log(ζ(μ))·n
• Strange conclusion… we get an upper bound on the order-0 algorithm with a term that depends on the values of the integers.
• This is true for all strings but is especially interesting for strings with small integers.
A lower bound for SL
Theorem: For any algorithm A, for any μ > 1, and for any C such that C < log(ζ(μ)), there exists a string S of length n for which:
|A(S)| > μ·SL(S) + C·n
Our Results - Summary
• New improved bounds for BWMTF
• Local Entropy (LE)
• New bounds for compression of integer strings
Open Issues
We question the effectiveness of nH_k(s).
Is there a better statistic?
Anybody want to guess?
Creating a Huffman encoding
• For each encoding unit (letter, in this example), associate a frequency (number of times it occurs)
• Create a binary tree whose children are the encoding units with the smallest frequencies
  – The frequency of the root is the sum of the frequencies of the leaves
• Repeat this procedure until all the encoding units are in the binary tree
Example
Assume that relative frequencies are: A: 40, B: 20, C: 10, D: 10, R: 20
• Assign 0 to left branches, 1 to right branches
• Each encoding is a path from the root
A = 0, B = 100, C = 1010, D = 1011, R = 11
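The construction above can be sketched with a binary heap (the function name is mine; tie-breaking among equal frequencies may produce codewords different from the slide's, but every Huffman tree has the same total cost, here 40·1 + 20·3 + 10·4 + 10·4 + 20·2 = 220 bits):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code: repeatedly merge the two lowest-frequency
    nodes; 0 labels left branches and 1 labels right branches."""
    tiebreak = count()                 # keeps heap entries comparable
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):    # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                          # leaf: record the codeword
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

With the slide's frequencies, sum(freq * codeword length) comes out to 220 regardless of how ties are broken.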
The Burrows-Wheeler Transform (1994)
Given a string S = banana#

All rotations:   Sort the rows:
banana#          #banana
anana#b          a#banan
nana#ba          ana#ban
ana#ban          anana#b
na#bana          banana#
a#banan          na#bana
#banana          nana#ba

BWT(S) = the last column of the sorted rows = annb#aa
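The banana# example can be reproduced with the textbook sort-all-rotations construction (fine for a small example; real implementations avoid materializing all rotations):

```python
def bwt(s):
    """Burrows-Wheeler Transform: sort all rotations of s and take the
    last column. Assumes s ends with a unique sentinel such as '#'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

bwt("banana#")  # → "annb#aa"
```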
Suffix Arrays and the BWT
So all we need to get the BWT is the suffix array!
Sorted rotations of banana#: #banana, a#banan, ana#ban, anana#b, banana#, na#bana, nana#ba
The suffix array of banana#: 7 6 4 2 1 5 3
Index of BWT (position in S of each output character): 6 5 3 1 7 4 2
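A naive suffix-array construction makes the connection concrete (illustrative only, with 1-indexed positions as on the slide; linear-time suffix-array algorithms exist and are what one would use in practice):

```python
def suffix_array(s):
    """1-indexed suffix array: starting positions of the suffixes of s
    in sorted order (naive construction, fine for an example)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

def bwt_from_sa(s):
    """BWT[i] is the character just before the i-th sorted suffix;
    Python's negative indexing wraps suffix 1 around to the last char."""
    return "".join(s[i - 2] for i in suffix_array(s))

suffix_array("banana#")  # → [7, 6, 4, 2, 1, 5, 3]
bwt_from_sa("banana#")   # → "annb#aa"
```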