A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan Shir Landau Elad Verbin

Post on 21-Dec-2015


Page 1

A Simpler Analysis of Burrows-Wheeler Based Compression

Haim Kaplan Shir Landau Elad Verbin

Page 2

Our Results

1. Improved bounds for one of the main BWT-based compression algorithms

2. A new technique for worst-case analysis of BWT-based compression algorithms, using the Local Entropy

3. Interesting results concerning compression of integer strings

Page 3

The Burrows-Wheeler Transform (1994)

Given a string S, the Burrows-Wheeler Transform creates a permutation of S that is locally homogeneous.

S → BWT → S’ (S’ is locally homogeneous)

Page 4

Empirical Entropy - Intuition

The Problem – Given a string S encode each symbol in S using a fixed codeword…

Page 5

Order-0 Entropy (Shannon 48)

H0(s): Maximum compression we can get using only frequencies and no context information

[Figure: binary code tree with branches labeled 0 and 1]

Example: Huffman Code
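
As a rough illustration of the order-0 statistic, the sketch below computes the empirical entropy H0(s) from symbol frequencies alone. The function name `h0` and the choice of base-2 logarithms are assumptions of this sketch, not taken from the slides.

```python
from collections import Counter
from math import log2

def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # A symbol that occurs c times contributes (c/n) * log2(n/c).
    return sum(c / n * log2(n / c) for c in Counter(s).values())

# n * H0(s) is the size, in bits, that a frequency-only coder aims for.
s = "mississippi"
print(len(s) * h0(s))
```

A string with one repeated symbol gives 0 bits per symbol, and a string whose symbols are all equally frequent gives log2 of the alphabet size.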

Page 6

Order-k entropy

Hk(s): Lower bound for compression with order-k contexts – the codeword representing each symbol depends on the k symbols preceding it

Example: S = MISSISSIPPI

Context 1 for i: “mssp”

Context 1 for s: “isis”

Traditionally, the compression ratio of compression algorithms is measured using Hk(s)
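
The MISSISSIPPI example can be reproduced with a short sketch. It assumes the convention that the example appears to use: for each length-k context w, collect the symbols that immediately precede an occurrence of w, then average the order-0 entropies of those collected strings. The helper names `context_strings` and `hk`, and the normalization by n, are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values()) if n else 0.0

def context_strings(s, k):
    """Map each length-k context w to the symbols that immediately precede w in s."""
    groups = defaultdict(list)
    for i in range(len(s) - k):
        groups[s[i + 1:i + 1 + k]].append(s[i])
    return {w: "".join(chars) for w, chars in groups.items()}

def hk(s, k):
    """Order-k empirical entropy: (1/n) * sum over contexts w of |w_s| * H0(w_s)."""
    if not s:
        return 0.0
    return sum(len(t) * h0(t) for t in context_strings(s, k).values()) / len(s)

ctx = context_strings("mississippi", 1)
print(ctx["i"], ctx["s"])  # mssp isis
print(hk("mississippi", 1))
```

With k = 0 there is a single empty context holding the whole string, so `hk(s, 0)` reduces to `h0(s)`.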

Page 7

History

The Main Burrows-Wheeler Compression Algorithm (Burrows, Wheeler 1994):

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → RLE? (Run-Length encoding) → Order-0 Encoding → Compressed String S’

Page 8

MTF

Given a string S = baacb over the alphabet {a,b,c,d}

S =      b a a c b

MTF(S) = 1 1 0 2 2

List after each step (front first): abcd → bacd → abcd → abcd → cabd → bcad
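
The move-to-front step can be sketched in a few lines. The function names are hypothetical, and the initial list is assumed to be the alphabet in the order given on the slide (a, b, c, d).

```python
def mtf_encode(s, alphabet):
    """Move-to-front: emit each symbol's current list position, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

print(mtf_encode("baacb", "abcd"))  # [1, 1, 0, 2, 2]
```

Recently seen symbols sit near the front of the list, so a locally homogeneous input yields mostly small integers.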

Page 9

Main Bounds (Manzini 1999)

$BW_{MTF}(s) \le 8nH_k(s) + 0.08n + g_k$

• gk is a constant dependent on the context length k and the size of the alphabet

• these are worst-case bounds

Page 10

Now we are ready to begin…

Page 11

Some Intuition…

• H0 – “measures” frequency

• Hk – “measures” frequency and context

→ We want a statistic that measures local similarity in a string and specifically in the BWT of the string

Page 12

Some Intuition…

• The more the contexts are similar in the original string, the more its BWT will exhibit local similarity…

• The more local similarity found in the BWT of the string the smaller the numbers we get in MTF…

→ The solution: Local Entropy

Page 13

The Local Entropy - Definition

We define: given a string s = “s1s2…sn”, the local entropy of s (Bentley, Sleator, Tarjan, Wei, 86):

s (original string) → MTF → MTF(s) (integer sequence)

$LE(s) = \sum_{i=1}^{n} \log(MTF(s)_i + 1)$

where $MTF(s)_i$ is the i-th integer of the sequence MTF(s).

Page 14

The Local Entropy - Definition

Note: LE(s) = number of bits needed to write the MTF sequence in binary.

Example:

MTF(s) = 311

→ LE(s) = 4

→ MTF(s) in binary = 1111

$LE(s) = \sum_{i=1}^{n} \log(MTF(s)_i + 1)$

In Dream world… We would like to compress S to LE(S)…
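
A minimal sketch of the definition, assuming base-2 logarithms and an MTF list initialized to the given alphabet order (the slides do not state either choice). The names `mtf_encode` and `local_entropy` are hypothetical.

```python
from math import log2

def mtf_encode(s, alphabet):
    """Move-to-front encoding of s, starting from the given alphabet order."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def local_entropy(s, alphabet):
    """LE(s): sum of log2(x + 1) over the integers x of MTF(s)."""
    return sum(log2(x + 1) for x in mtf_encode(s, alphabet))

# The slide's example: MTF(s) = 3, 1, 1 gives 2 + 1 + 1 = 4 bits.
print(sum(log2(x + 1) for x in [3, 1, 1]))  # 4.0
print(local_entropy("baacb", "abcd"))
```

A zero in the MTF sequence contributes nothing to LE, which is why long runs in the BWT output are so cheap under this measure.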

Page 15

The Local Entropy – Properties

We use two properties of LE:

1. The entropy hierarchy

2. Convexity

Page 16

The Local Entropy – Property 1

1. The entropy hierarchy:

We prove: For each k:

LE(BWT(s)) ≤ nHk(s) + O(1)

→ Any upper bound that we get for BWT with LE holds for Hk(s) as well.

Page 17

The Local Entropy – Property 2

2. Convexity:

$LE(s_1 s_2) \le LE(s_1) + LE(s_2) + O(1)$

→ This means that a partition of a string s does not improve the Local Entropy of s by more than a constant.

Page 18

Convexity

• Cutting the input string into parts doesn’t influence much: Only positions per part

a a a b a b   a b
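
The effect of cutting can be checked numerically. The sketch below assumes the slide's string is split as "aaabab" and "ab", that each part restarts MTF from the same initial list, and that logarithms are base 2. The slack h·log2(h), with h the alphabet size, is an assumption of this sketch standing in for the O(1) term: only the first occurrence of each symbol in a part can be ranked differently after a restart.

```python
from math import log2

def local_entropy(s, alphabet):
    """LE(s): sum of log2(rank + 1) over the move-to-front ranks of s."""
    table, total = list(alphabet), 0.0
    for ch in s:
        i = table.index(ch)
        total += log2(i + 1)
        table.insert(0, table.pop(i))
    return total

s1, s2, alphabet = "aaabab", "ab", "ab"
whole = local_entropy(s1 + s2, alphabet)
parts = local_entropy(s1, alphabet) + local_entropy(s2, alphabet)
h = len(alphabet)

print(whole, parts)  # 5.0 4.0
print(whole <= parts + h * log2(h))  # True
```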

Page 19

Convexity – Why do we need it?

Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005:

String S → BWT (Burrows-Wheeler transform) → BWT(S) → Booster → Partition of BWT(S) → RHC (Variation of Huffman encoding) → Compressed String S’

$FGMS(s) \le nH_k^*(s) + g_k$

Page 20

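Using LE and its properties we get our bounds

Theorem: For every $\mu > 1$, where $\zeta(\mu) = \frac{1}{1^{\mu}} + \frac{1}{2^{\mu}} + \frac{1}{3^{\mu}} + \dots$:

$BW_{MTF}(s) \le \mu \cdot LE(BWT(s)) + n \log \zeta(\mu)$ (Our LE bound)

$BW_{MTF}(s) \le \mu \cdot nH_k(s) + n \log \zeta(\mu) + g_k$ (Our Hk bound)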

Page 21

Our bounds

We get an improvement of the known bounds:

$BW_{MTF}(s) \le 8nH_k(s) + 0.006n + g_k$

$BW_{MTF}(s) \le 4.45nH_k(s) + 0.08n + g_k$

As opposed to the known bounds (Manzini, 1999):

$BW_{MTF}(s) \le 8nH_k(s) + 0.08n + g_k$
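
The constants 0.006 and 0.08 are consistent with the term log ζ(μ) evaluated at μ = 8 and μ = 4.45, assuming base-2 logarithms. The sketch below checks this numerically; `zeta` is a hypothetical helper that approximates the series by a partial sum plus an integral estimate of the tail, not a library routine.

```python
from math import log2

def zeta(mu, terms=100_000):
    """Approximate zeta(mu) for mu > 1: partial sum plus an integral estimate of the tail."""
    if mu <= 1:
        raise ValueError("zeta(mu) diverges for mu <= 1")
    partial = sum(k ** -mu for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1 - mu) / (mu - 1)
    return partial + tail

# Per-symbol additive term n * log2(zeta(mu)) / n for the two values of mu used above.
for mu in (4.45, 8):
    print(mu, round(log2(zeta(mu)), 3))
```

Raising μ shrinks the additive term toward zero while inflating the multiplicative factor on nHk(s), which is the trade-off between the two improved bounds.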

Page 22

Our Test Results

File Name | bzip2 | Our bound using LE | Our Hk bound | Manzini’s bound 8nHk(s) + 0.08n + gk
alice29.txt | 345568 | 396813 | 766940 | 2328219
asyoulik.txt | 316552 | 367874 | 683171 | 2141646
cp.html | 61056 | 69858 | 105033 | 295714
fields.c | 24312 | 25713 | 43379 | 119210
grammar.lsp | 10264 | 10234 | 16054 | 45134
lcet10.txt | 861184 | 1021440 | 1967240 | 5867291
plrabn12.txt | 1164360 | 1391310 | 2464440 | 8198976
xargs.1 | 14096 | 13858 | 22317 | 64673

*The files are non-binary files from the Canterbury corpus. gzip results are also taken from the corpus. The size is indicated in bytes.

Page 23

How is LE related to compression of integer sequences?

• We mentioned “dream world” but what about reality? How close can we come to $LE(BWT(s))$?

Problem: Compress an integer sequence S close to its sum of logs:

$SL(s) = \sum_{x \in s} \log(x + 1)$

Notice for any s:

$LE(s) = SL(MTF(s))$

Page 24

Compressing Integer Sequences

• Universal Encodings of Integers: prefix-free encoding for integers (e.g. Fibonacci encoding).

• Doing some math, it turns out that order-0 encoding is good.

Not only good: It is best!
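
For reference, here is a minimal sketch of the Fibonacci encoding mentioned above as an example of a universal, prefix-free integer code. It covers integers n ≥ 1, so a sequence that contains zeros (such as MTF output) would need to be shifted by one first; that shift, and the function names, are assumptions of this sketch.

```python
def fib_encode(n):
    """Fibonacci code of n >= 1: Zeckendorf bits, smallest term first, then a closing 1."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for integers >= 1")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n
    bits = []
    for f in reversed(fibs):  # greedy: take the largest term that still fits
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"

def fib_decode_stream(bits):
    """Decode concatenated codewords; "11" can only occur at the end of a codeword."""
    fibs, out = [1, 2], []
    value, idx, prev = 0, 0, "0"
    for b in bits:
        if b == "1" and prev == "1":
            out.append(value)
            value, idx, prev = 0, 0, "0"
            continue
        while idx >= len(fibs):
            fibs.append(fibs[-1] + fibs[-2])
        if b == "1":
            value += fibs[idx]
        idx += 1
        prev = b
    return out

print([fib_encode(n) for n in (1, 2, 3, 4)])  # ['11', '011', '0011', '1011']
```

The codeword length grows logarithmically in n, so the total output is within a constant factor of the sum of logs.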

Page 25

The order-0 math

• Theorem: For any string s of length n over the integer alphabet {1,2,…h} and for any $\mu > 1$,

$nH_0(s) \le \mu \cdot SL(s) + n \log \zeta(\mu)$

• Strange conclusion… we get an upper bound on the order-0 algorithm with a term dependent on the values of the integers.

• This is true for all strings but is especially interesting for strings with smaller integers.
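
The inequality can be checked on a small example. This sketch assumes base-2 logarithms and uses SL(s) as the sum of log2(x + 1); the sample sequence and the helper names `n_h0`, `sl`, `zeta` and `order0_bound` are made up for illustration.

```python
from collections import Counter
from math import log2

def n_h0(seq):
    """n * H0(seq) in bits: each value with count c contributes c * log2(n / c)."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

def sl(seq):
    """Sum of logs: SL(seq) = sum of log2(x + 1)."""
    return sum(log2(x + 1) for x in seq)

def zeta(mu, terms=100_000):
    """Approximate zeta(mu) for mu > 1: partial sum plus an integral estimate of the tail."""
    partial = sum(k ** -mu for k in range(1, terms + 1))
    return partial + (terms + 0.5) ** (1 - mu) / (mu - 1)

def order0_bound(seq, mu):
    """Right-hand side of the theorem: mu * SL(seq) + n * log2(zeta(mu))."""
    return mu * sl(seq) + len(seq) * log2(zeta(mu))

seq = [1, 1, 2, 1, 3, 1, 1, 2, 5, 1]  # hypothetical sample dominated by small integers
for mu in (1.5, 2, 4.45):
    print(mu, round(n_h0(seq), 2), round(order0_bound(seq, mu), 2))
```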

Page 26

A lower bound for SL

Theorem: For any algorithm A, for any $\mu > 1$, and for any C such that C < log(ζ(μ)), there exists a string S of length n for which:

|A(S)| > μ∙SL(S) + C∙n

Page 27

Our Results - Summary

• New improved bounds for BWMTF

• Local Entropy (LE)

• New bounds for compression of integer strings

Page 28

Open Issues

We question the effectiveness of $nH_k(s)$.

Is there a better statistic?

Page 29

Anybody want to guess??

Page 30

Creating a Huffman encoding

• For each encoding unit (letter, in this example), associate a frequency (number of times it occurs)

• Create a binary tree whose children are the two encoding units with the smallest frequencies

– The frequency of the root is the sum of the frequencies of the leaves

• Repeat this procedure, treating each new root as an encoding unit, until all the encoding units are in the binary tree

Page 31

Example

Assume that relative frequencies are:

A: 40
B: 20
C: 10
D: 10
R: 20

Page 32

Example, cont.

Page 33

Example, cont.

A = 0
B = 100
C = 1010
D = 1011
R = 11

• Assign 0 to left branches, 1 to right branches

• Each encoding is a path from the root
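
The construction can be sketched with a priority queue. This is one possible implementation, not necessarily the one behind the slides: ties between equal frequencies may be broken differently, so individual codewords can differ from the table above while the total encoded length stays the same.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code by repeatedly merging the two lowest-frequency subtrees."""
    tie = count()  # tie-breaker so the heap never has to compare subtrees
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: 0 to the left, 1 to the right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

freqs = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
codes = huffman_codes(freqs)
cost = sum(freqs[sym] * len(code) for sym, code in codes.items())
print(codes, cost)
```

For the frequencies of the example, both this sketch and the codewords listed above give a weighted length of 220.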

Page 34

The Burrows-Wheeler Transform (1994)

Given a string S = banana#

Rotations of S:

banana#
anana#b
nana#ba
ana#ban
na#bana
a#banan
#banana

Sort the rows:

# banan a
a #bana n
a na#ba n
a nana# b
b anana #
n a#ban a
n ana#b a

The Burrows-Wheeler Transform is the last column: annb#aa
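
The rotate-and-sort construction translates directly into code. This is a naive sketch that materializes every rotation, fine for a seven-character example but not how the transform is computed in practice. It assumes the end marker # is unique and sorts before every other symbol, which holds for Python's default string ordering here.

```python
def bwt(s):
    """Burrows-Wheeler Transform via sorted rotations; s ends with a unique end marker."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, end_marker="#"):
    """Naive inversion: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(ch + row for ch, row in zip(last, table))
    return next(row for row in table if row.endswith(end_marker))

print(bwt("banana#"))          # annb#aa
print(inverse_bwt("annb#aa"))  # banana#
```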

Page 35

Suffix Arrays and the BWT

So all we need to get the BWT is the suffix array!

Sorted rows | The Suffix Array | Index of BWT
# banan a | 7 | 6
a #bana n | 6 | 5
a na#ba n | 4 | 3
a nana# b | 2 | 1
b anana # | 1 | 7
n a#ban a | 5 | 4
n ana#b a | 3 | 2
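
The table above can be reproduced from the suffix array alone. The sketch uses a naive suffix sort for clarity (real implementations use dedicated suffix-array construction algorithms) and the slide's 1-based positions; the function names are hypothetical.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order (naive construction)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

def bwt_from_suffix_array(s):
    """Return (index of BWT, BWT string); s must end with a unique smallest end marker."""
    sa = suffix_array(s)
    # The BWT symbol for the suffix starting at p sits at p - 1, wrapping to the end of s.
    idx = [p - 1 if p > 1 else len(s) for p in sa]
    return idx, "".join(s[j - 1] for j in idx)

print(suffix_array("banana#"))           # [7, 6, 4, 2, 1, 5, 3]
print(bwt_from_suffix_array("banana#"))  # ([6, 5, 3, 1, 7, 4, 2], 'annb#aa')
```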