Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Data Compression
Basics + Huffman coding
How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other string must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: there are 2^n of them, but the compressed messages, being shorter than n bits, are fewer:
Σ_{i=1,…,n−1} 2^i = 2^n − 2 < 2^n
We need to talk about stochastic sources.
Entropy (Shannon, 1948)
For a set of symbols S, where symbol s has probability p(s), the self-information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)  bits
Lower probability → higher information.
Entropy is the weighted average of i(s):
H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))
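A minimal Python sketch of these two formulas (the distribution is the one reused in the Huffman running example later):

```python
import math

def self_information(p):
    """Self-information i(s) = -log2 p(s), in bits."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy H(S) = sum of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(self_information(0.5))            # 1.0 bit
print(entropy([0.1, 0.2, 0.2, 0.5]))    # ~1.76 bits per symbol
```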
Statistical Coding
How do we use probability p(s) to encode s?
• Prefix codes and their relationship to Entropy
• Huffman codes
• Arithmetic codes
Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?
With a uniquely decodable code, any encoded sequence can be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with edges labeled 0 (left) and 1 (right); leaf a at path 0, leaf b at 100, leaf c at 101, leaf d at 11.]
Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').
A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj → L[si] ≥ L[sj].
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) < H(S) + 1.
Such a C is the Shannon code, which assigns to symbol s a codeword of ⌈log2 (1/p(s))⌉ bits.
Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms
• gzip, bzip, jpeg (as option), fax compression, …
Properties:
• Generates optimal prefix codes
• Cheap to encode and decode
• La(Huff) = H if probabilities are powers of 2
• Otherwise, at most 1 bit more per symbol!!!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Resulting code: a = 000, b = 001, c = 01, d = 1
[Figure: the Huffman tree. a(.1) and b(.2) merge into a node of weight (.3); that node and c(.2) merge into (.5); that node and d(.5) merge into the root (1). Edges are labeled 0/1.]
There are 2^(n−1) "equivalent" Huffman trees, obtained by flipping the 0/1 labels at the internal nodes.
What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at root and take branch for each bit received. When at leaf, output its symbol and return to root.
[Figure: the Huffman tree of the running example, with a(.1), b(.2) under (.3), then c(.2) under (.5), then d(.5) at the root; edges labeled 0/1.]
Encoding: abc… → 000 001 01 … = 00000101…
Decoding: 101001… → d c b … = dcb…
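A sketch of the construction in Python (heapq-based; ties make it output one of the 2^(n−1) equivalent codes, not necessarily the one on the slide):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}.
    Repeatedly merge the two least-probable trees, prepending '0'
    to the codewords of one side and '1' to the other."""
    tiebreak = count()  # breaks ties so the heap never compares dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
# e.g. {'a': '110', 'b': '111', 'c': '10', 'd': '0'} -- same lengths as the slide
```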
A property on tree contraction
The idea: substitute the two least-probable symbols x, y with one new symbol x+y of probability p(x)+p(y); by induction, optimality follows…
Optimum vs. Huffman
Model size may be large
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding.
We store, for every level L of the tree:
• firstcode[L]: the value of the first (smallest) codeword of length L
• Symbol[L,i], for each i in level L
This takes ≤ h² + |Σ| log |Σ| bits, where h is the tree height.
Canonical Huffman tree: codewords of the same length are consecutive binary numbers, and the deepest level starts at firstcode = 00…0.
Canonical Huffman: Encoding
[Figure: a canonical Huffman tree with levels numbered 1 to 5.]
Canonical Huffman: Decoding
T = …00010…
firstcode[1] = 2, firstcode[2] = 1, firstcode[3] = 1, firstcode[4] = 2, firstcode[5] = 0
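A sketch of the classical canonical-Huffman decoding loop, under the firstcode convention above (a level with no codewords holds a sentinel larger than any value of that length, as firstcode[1]=2 does here). The symbol table below is invented just for illustration; any assignment consistent with the firstcode values works:

```python
def canonical_decode(bits, firstcode, symbol):
    """Decode bits with a canonical Huffman table.
    firstcode[l] = value of the smallest codeword of length l;
    symbol[l][i] = i-th symbol, in code order, among those of length l."""
    out, v, l = [], 0, 0
    for b in bits:
        v = 2 * v + b              # extend the current code value by one bit
        l += 1
        if v >= firstcode[l]:      # a complete codeword of length l recognized
            out.append(symbol[l][v - firstcode[l]])
            v, l = 0, 0            # restart for the next codeword
    return out

firstcode = [None, 2, 1, 1, 2, 0]   # the slide's table, 1-indexed by level
symbol = {2: ["a", "b", "c"], 3: ["d"], 5: ["e", "f", "g", "h"]}  # hypothetical
print(canonical_decode([0, 0, 0, 1, 0], firstcode, symbol))       # ['g']
```

On the slide's bits 00010 the loop descends to level 5, where v = 2 ≥ firstcode[5] = 0, so the decoded symbol is Symbol[5,2].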
Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is −log2(.999) = .00144 bits.
If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
☺ 1 extra bit per macro-symbol = 1/k extra bits per symbol
☹ Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
• The model takes |Σ|^k · (k · log |Σ|) + h² bits (where h might be |Σ|^k)
• It is H_0(S_L) ≤ L · H_k(S) + O(k · log |Σ|), for each k ≤ L
Compress + Search ? [Moura et al, 98]
Compressed text derived from a word-based Huffman:
• The symbols of the Huffman tree are the words of T
• The Huffman tree has fan-out 128
• Codewords are byte-aligned and tagged
[Figure: the codeword of the word "or" is the sequence of 7-bit blocks γ α β output by the 128-ary Huffman tree; each block is stored in one byte, whose first bit is the tag: 1 for the first byte of a codeword, 0 for the following ones.]
[Figure: the word-based Huffman tree over the dictionary {bzip, space, or, not}, with 7-bit symbols α, β, γ on its edges, and the compressed text C(T) for T = "bzip or not bzip" as the sequence of tagged, byte-aligned codewords of its words.]
CGrep and other ideas...
To search for P = bzip in T, encode P with the same word-based Huffman: C(P) = 1α 0β (tag bits shown). Then scan C(T) byte by byte for C(P): the tag bits rule out false matches starting in the middle of a codeword (the yes/no flags in the figure).
[Figure: byte-search of C(P) = 1α 0β inside C(T), T = "bzip or not bzip"; candidate alignments are marked yes/no.]
Byte-search: Speed ≈ Compression ratio
GREP: you find this at … (under my Software projects).
Data Compression
Basic search algorithms: single and multiple patterns
Mismatches and edits
Problem 1
Given the compressed text C(S), with S = "bzip or not bzip", and a pattern P = bzip, find all the occurrences of P in S by scanning C(S) directly.
[Figure: the word-based Huffman tree over the dictionary {bzip, not, or, space} and the compressed text C(S); C(P) = 1α 0β; yes/no flags mark the candidate alignments.]
Speed ≈ Compression ratio
Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
[Figure: the pattern P = BA sliding along the text T = BDABCABA.]
• Naïve solution: for any position i of T, check if T[i, i+m−1] = P[1,m]. Complexity: O(nm) time.
• (Classical) optimal solutions based on comparisons: Knuth-Morris-Pratt, Boyer-Moore. ☺ Complexity: O(n + m) time.
Semi-numerical pattern matching
• We show methods in which arithmetic and bit-operations replace comparisons.
• We will survey two examples of such methods:
  • The Random Fingerprint method due to Karp and Rabin
  • The Shift-And method due to Baeza-Yates and Gonnet
Rabin-Karp Fingerprint
• We will use a class of functions from strings to integers in order to obtain:
  • An efficient randomized algorithm that makes an error with small probability.
  • A randomized algorithm that never errs, and whose running time is efficient with high probability.
• We will consider a binary alphabet (i.e., T ∈ {0,1}^n).
Arithmetic replaces Comparisons
• Strings are also numbers: H: strings → numbers.
• For a string s of length m, define H(s) = Σ_{i=1,…,m} 2^(m−i) · s[i]
  • Example: P = 0 1 0 1, H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
• s = s' if and only if H(s) = H(s')
• Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m−1]).
Arithmetic replaces Comparisons
• Strings are also numbers: H: strings → numbers.
• Exact match = scan T and compare H(T_r) with H(P): there is an occurrence of P starting at position r of T if and only if H(P) = H(T_r).
T = 1 0 1 1 0 1 0 1, P = 0 1 0 1, H(P) = 5
T_2 = 0 1 1 0: H(T_2) = 6 ≠ H(P)
T_5 = 0 1 0 1: H(T_5) = 5 = H(P) → Match!
Arithmetic replaces Comparisons
• We can compute H(T_r) from H(T_{r−1}) in constant time:
H(T_r) = 2 · H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]
Example (m = 4): T = 1 0 1 1 0 1 0 1
T_1 = 1 0 1 1: H(T_1) = H(1011) = 11
T_2 = 0 1 1 0: H(T_2) = 2·11 − 1·2⁴ + 0 = 22 − 16 + 0 = 6 = H(0110)
Arithmetic replaces Comparisons
• A simple efficient algorithm:
  • Compute H(P) and H(T_1)
  • Run over T
  • Compute H(T_r) from H(T_{r−1}) in constant time, and make the comparison H(P) = H(T_r).
• Total running time O(n+m)?
• NO! Why?
  • The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: the values of H() are m-bit numbers, in general too BIG to fit in a machine word.
• IDEA! Let's use modular arithmetic: for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)
An example
P = 1 0 1 1 1 1, H(P) = 47
q = 7: Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally:
1
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)
Intermediate values are also small! (< 2q)
We can still compute Hq(T_r) from Hq(T_{r−1}), since 2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q).
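A Python sketch of the whole scheme; this is the deterministic variant that verifies every fingerprint hit, and q (a random prime ≤ I in the full algorithm) is simply a parameter here:

```python
def karp_rabin(T, P, q):
    """Karp-Rabin matching over a 0/1 (or small-integer) alphabet.
    Rolls Hq(T_r) in O(1) per position and verifies every hit,
    so false matches are filtered out. Returns 1-based positions."""
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                       # Hq(P) and Hq(T_1), incrementally
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    pow_m1 = pow(2, m - 1, q)                # 2^(m-1) mod q, used by the roll
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify to rule out false matches
            occ.append(r + 1)
        if r + m < n:                        # roll: drop T[r], append T[r+m]
            ht = ((ht - T[r] * pow_m1) * 2 + T[r + m]) % q
    return occ

T, P = [1, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1]
print(karp_rabin(T, P, q=7))                 # [5], as in the example above
```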
Karp-Rabin Fingerprint
• How about the comparisons?
Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(T_r).
Modular arithmetic: if there is an occurrence of P starting at position r of T, then Hq(P) = Hq(T_r).
False match! There are values of q for which the converse is not true (i.e., P ≠ T_r AND Hq(P) = Hq(T_r))!
• Our goal will be to choose a modulus q such that:
  • q is small enough to keep computations efficient (i.e., the Hq() values fit in a machine word)
  • q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm
• Choose a positive integer I.
• Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P).
• For each position r in T, compute Hq(T_r) and test whether it equals Hq(P). If the numbers are equal, either
  • declare a probable match (randomized algorithm), or
  • check and declare a definite match (deterministic algorithm).
• Running time: excluding verification, O(n+m).
• The randomized algorithm is correct w.h.p.
• The deterministic algorithm has expected running time O(n+m).
Proof on the board.
Problem 1: Solution
Apply Karp-Rabin: compute the fingerprint of C(P) = 1α 0β and roll it over the compressed text C(S).
[Figure: the word-based Huffman compression of S = "bzip or not bzip" over the dictionary {bzip, not, or, space}; the occurrences of C(P) in C(S) are found by Karp-Rabin.]
Speed ≈ Compression ratio
The Shift-And method
• Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j, i.e., M(i,j) = 1 iff P[1…i] = T[j−i+1…j].
• Example: T = california and P = for.
• How does M solve the exact match problem? P occurs ending at position j iff M(m,j) = 1.
M (rows i = 1..3 = f, o, r; columns j = 1..10 = c a l i f o r n i a):
f: 0 0 0 0 1 0 0 0 0 0
o: 0 0 0 0 0 1 0 0 0 0
r: 0 0 0 0 0 0 1 0 0 0
Here M(3,7) = 1: "for" ends at position 7.
How to construct M
• We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
• Machines can perform bit and arithmetic operations between two words in constant time.
• Examples:
  • And(A,B) is the bit-wise AND between A and B.
  • BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1.
  Example: BitShift((0,1,1,0,1)) = (1,0,1,1,0)
• Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.
How to construct M
• We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
• We define the m-length binary vector U(x) for each character x in the alphabet: U(x)[i] = 1 iff P[i] = x.
• Example: P = abaac
U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1)
How to construct M
• Initialize column 0 of M to all zeros.
• For j > 0, the j-th column is obtained as follows. For i > 1, entry M(i,j) = 1 iff:
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1 ⇔ M(i−1, j−1) = 1, and
  (2) P[i] = T[j] ⇔ the i-th bit of U(T[j]) = 1.
• BitShift moves bit M(i−1, j−1) into the i-th position;
• AND this with the i-th bit of U(T[j]) to establish whether both conditions hold:
M(j) = BitShift(M(j−1)) & U(T[j])
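A bit-parallel sketch in Python: a column of M is packed into an integer whose bit i−1 represents row i, so BitShift becomes (col << 1) | 1:

```python
def shift_and(T, P):
    """Shift-And exact matching: M(j) = BitShift(M(j-1)) & U(T[j]).
    Returns the 1-based positions where an occurrence of P ends."""
    m = len(P)
    U = {}                                    # U[x]: bit i-1 set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    col, occ = 0, []
    for j, c in enumerate(T, start=1):
        col = ((col << 1) | 1) & U.get(c, 0)  # the column update rule above
        if col & (1 << (m - 1)):              # M(m,j) = 1: match ending at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [9], as in the example below
```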
An example: j = 1
T = x a b x a b a a c a (j = 1..10), P = a b a a c (i = 1..5)
U(x) = (0,0,0,0,0), since x does not occur in P.
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
An example: j = 2
T[2] = a, U(a) = (1,0,1,1,0)
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
An example: j = 3
T[3] = b, U(b) = (0,1,0,0,0)
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
An example: j = 9
T[9] = c, U(c) = (0,0,0,0,1)
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
M(5,9) = 1: an occurrence of P ends at position 9.
Shift-And method: Complexity
• If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
• If m > w, any column and any vector U() can be divided into m/w memory words: each step requires O(m/w) time.
• Overall: O(n(1 + m/w) + m) time.
• Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
Some simple extensions
• We want to allow the pattern to contain special symbols, like the class of characters [a-f].
• Example: P = [a-b]baac
U(a) = (1,0,1,1,0), U(b) = (1,1,0,0,0), U(c) = (0,0,0,0,1)
• What about '?', '[^…]' (negation)?
Problem 1: Another solution
Use Shift-And: search for C(P) = 1α 0β inside C(S) with the bit-parallel scan.
[Figure: the word-based Huffman compression of S = "bzip or not bzip" over the dictionary {bzip, not, or, space}; yes/no flags mark the alignments checked by Shift-And.]
Speed ≈ Compression ratio
Problem 2
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring. Example: P = o, S = "bzip or not bzip".
[Figure: the word-based Huffman compression of S over the dictionary {bzip, not, or, space}.]
Shift-And over the dictionary marks the terms containing P: or (code 1γ 0α 0β) and not (code 1γ 0γ 0α).
Speed ≈ Compression ratio? No! Why? Because it then takes a scan of C(S) for each term that contains P.
Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 = AC and P2 = AD sliding along the text T = BDABCABA.]
• Naïve solution: use an (optimal) exact-matching algorithm to search for each pattern of P. Complexity: O(nl + m) time, not good with many patterns.
• Optimal solution due to Aho and Corasick. Complexity: O(n + l + m) time.
A simple extension of Shift-And
• S is the concatenation of the patterns in P.
• R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
• Use a variant of the Shift-And method searching for S: for any symbol c, U'(c) = U(c) AND R, so that U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
• For any step j:
  • compute M(j),
  • then M(j) OR U'(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
• Check if there are occurrences ending in j. How? (A sketch follows.)
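One possible realization in Python; the occurrence check keeps, for each pattern, the bit index of its last symbol in S (all names here are illustrative):

```python
def multi_shift_and(T, patterns):
    """Multi-pattern Shift-And over S = concatenation of the patterns.
    R marks the first symbol of each pattern (U'(c) = U(c) & R); a set
    bit on the last symbol of a pattern means it ends at position j."""
    S = "".join(patterns)
    U = {}
    for i, x in enumerate(S):
        U[x] = U.get(x, 0) | (1 << i)
    R, last, pos = 0, [], 0
    for p in patterns:
        R |= 1 << pos                        # first bit of this pattern
        last.append(pos + len(p) - 1)        # last bit of this pattern
        pos += len(p)
    col, occ = 0, []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        col = ((col << 1) & Uc) | (Uc & R)   # M(j), then OR with U'(T[j])
        for k, b in enumerate(last):         # occurrences ending at j?
            if col & (1 << b):
                occ.append((j, patterns[k]))
    return occ

print(multi_shift_and("xabxabaaca", ["abaac", "ab"]))
# [(3, 'ab'), (6, 'ab'), (9, 'abaac')]
```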
Problem 3
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches. Example: P = bot, k = 2.
[Figure: the word-based Huffman compression of S = "bzip or not bzip" over the dictionary {bzip, not, or, space}.]
???
Agrep: Shift-And method with errors
• We extend the Shift-And method to find inexact occurrences of a pattern in a text.
• Example: T = aatatccacaa, P = atcgaa.
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
[Figure: the two alignments of P = atcgaa under T, at positions 4 and 2.]
Agrep
• Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
• We define the matrix M^l to be an m-by-n binary matrix such that:
M^l(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j with no more than l mismatches.
• What is M^0?
• How does M^k solve the k-mismatch problem?
Computing M^k
• We compute M^l for all l = 0, …, k: for each j, compute M^0(j), M^1(j), …, M^k(j).
• For all l, initialize M^l(0) to the zero vector.
• To compute M^l(j), we observe that there is a match iff one of the two following cases holds.
Computing M^l: case 1
• The first i−1 characters of P match a substring of T ending at j−1, with at most l mismatches, and the next pair of characters in P and T are equal. This case contributes:
BitShift(M^l(j−1)) ∧ U(T[j])
Computing M^l: case 2
• The first i−1 characters of P match a substring of T ending at j−1, with at most l−1 mismatches; the characters P[i] and T[j] are then allowed to mismatch. This case contributes:
BitShift(M^{l−1}(j−1))
Computing M^l
• We compute M^l for all l = 0, …, k: for each j, compute M^0(j), M^1(j), …, M^k(j).
• For all l, initialize M^l(0) to the zero vector.
• Combining the two cases:
M^l(j) = [BitShift(M^l(j−1)) ∧ U(T[j])] ∨ BitShift(M^{l−1}(j−1))
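A sketch of this recurrence in the same bit-packed style as before: cols[l] holds column M^l(j), and BitShift is (x << 1) | 1:

```python
def shift_and_k_mismatches(T, P, k):
    """Shift-And with up to k mismatches:
    M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] | BitShift(M^(l-1)(j-1)).
    Returns the 1-based end positions of k-mismatch occurrences."""
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    cols, occ = [0] * (k + 1), []
    for j, c in enumerate(T, start=1):
        Uc, prev = U.get(c, 0), cols[:]      # prev[l] is M^l(j-1)
        cols[0] = ((prev[0] << 1) | 1) & Uc  # exact matching: case 1 only
        for l in range(1, k + 1):            # case 1 OR case 2
            cols[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if cols[k] & (1 << (m - 1)):         # M^k(m,j) = 1
            occ.append(j)
    return occ

# P occurs in T with <= 2 mismatches only in the window ending at 9 (start 4):
print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [9]
```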
Example: M^1
T = x a b x a b a a c a (j = 1..10), P = a b a a d
M^0 (rows i = 1..5):
1: 0 1 0 0 1 0 1 1 0 1
2: 0 0 1 0 0 1 0 0 0 0
3: 0 0 0 0 0 0 1 0 0 0
4: 0 0 0 0 0 0 0 1 0 0
5: 0 0 0 0 0 0 0 0 0 0
M^1 (rows i = 1..5):
1: 1 1 1 1 1 1 1 1 1 1
2: 0 0 1 0 0 1 0 1 1 0
3: 0 0 0 1 0 0 1 0 0 1
4: 0 0 0 0 1 0 0 1 0 0
5: 0 0 0 0 0 0 0 0 1 0
M^1(5,9) = 1: P occurs ending at position 9 with at most 1 mismatch (T[5..9] = abaac vs abaad).
How much do we pay?
• The running time is O(k·n·(1 + m/w)).
• Again, the method is practically efficient for small m.
• Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
Problem 3: Solution
Run Agrep over the dictionary: for P = bot and k = 2, the term not matches; then search C(S) for its codeword, not = 1γ 0γ 0α.
[Figure: the word-based Huffman compression of S = "bzip or not bzip" over the dictionary {bzip, not, or, space}; "yes" marks the matching term.]
Agrep: more sophisticated operations
• The Shift-And method can solve other ops.
• The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  • Insertion: insert a symbol in p
  • Deletion: delete a symbol from p
  • Substitution: change a symbol of p into a different one
• Example: d(ananas, banane) = 3
• Search by regular expressions
  • Example: (a|b)?(abc|a)
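For reference, the classical dynamic program for the edit distance just defined (not the bit-parallel variant):

```python
def edit_distance(p, s):
    """d(p,s): minimum number of insertions, deletions, substitutions."""
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                  # delete all of p[1..i]
    for j in range(n + 1):
        D[0][j] = j                  # insert all of s[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                       # deletion
                          D[i][j - 1] + 1,                       # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # subst.
    return D[m][n]

print(edit_distance("ananas", "banane"))   # 3
```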
Data Compression
Some thoughts
Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to build the tree… This may be extremely time/space costly when you deal with GBs of textual data.
A simple algorithm: sort the p_i in decreasing order, and encode s_i via a variable-length code for the integer i.
γ-code for integer encoding
• For x > 0, Length = ⌊log2 x⌋ + 1, and γ(x) = (Length − 1) zeros, followed by x in binary.
e.g., 9 is represented as <000, 1001>.
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
• Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
It is a prefix-free encoding…
• Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
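A small Python sketch of the γ-code, checked on the exercise above:

```python
def gamma_encode(x):
    """Elias gamma code: (Length-1) zeros, then x in binary."""
    b = bin(x)[2:]                       # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords (prefix-free)."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":        # count the leading zeros
            z += 1
        out.append(int(bits[i + z:i + 2 * z + 1], 2))  # next z+1 bits
        i += 2 * z + 1
    return out

print(gamma_encode(9))                                    # 0001001
print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]
```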
Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i). Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Key fact: 1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x, hence x ≤ 1/p_x.
The cost of the encoding is (recall i ≤ 1/p_i):
Σ_{i=1,…,|Σ|} p_i · |γ(i)| ≤ Σ_{i=1,…,|Σ|} p_i · [2·log(1/p_i) + 1] = 2·H0(X) + 1
Compression ratio ≤ 2·H0(X) + 1: not much worse than Huffman, and improvable to H0(X) + 2 + …
A better encoding
• Byte-aligned and tagged Huffman:
  • 128-ary Huffman tree
  • The first bit of the first byte is tagged
  • Configurations on 7 bits: just those of Huffman
• End-tagged dense code:
  • The rank r is mapped to the r-th binary sequence on 7·k bits
  • The first bit of the last byte is tagged
Surprising changes:
• It is a prefix code
• Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
The distribution of words is skewed: 1/i^θ, where 1 < θ < 2.
• A new concept: Continuers vs. Stoppers. Previously we used s = c = 128.
• The main idea is:
  • s + c = 256 (we are playing with 8 bits)
  • Thus s items are encoded with 1 byte
  • And s·c with 2 bytes, s·c² with 3 bytes, …
• An example: 5000 distinct words.
  • ETDC encodes 128 + 128² = 16512 words within 2 bytes.
  • A (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more words on 1 byte, and thus, if skewed…
It is a prefix code. (See a sketch of the encoder below.)
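A possible encoder, with a caveat: the exact byte-value conventions (which range continues, which stops) vary in the literature; this sketch fixes one of them just to show the counting:

```python
def sc_dense_encode(r, s):
    """(s,c)-dense codeword for the word of 0-based frequency rank r.
    Continuer bytes take values 0..c-1; the final stopper byte takes
    values c..255, so the s most frequent words cost 1 byte, the next
    s*c cost 2 bytes, the next s*c^2 cost 3 bytes, and so on."""
    c = 256 - s
    out = [c + r % s]            # stopper byte closes the codeword
    r //= s
    while r > 0:
        r -= 1
        out.insert(0, r % c)     # continuer bytes, most significant first
        r //= c
    return bytes(out)

# With s = 230, c = 26: ranks 0..229 take 1 byte, ranks 230..6209 take
# 2 bytes (230 + 230*26 = 6210 words within 2 bytes, as above).
print(len(sc_dense_encode(229, 230)), len(sc_dense_encode(230, 230)),
      len(sc_dense_encode(6209, 230)), len(sc_dense_encode(6210, 230)))
# 1 2 2 3
```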
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
• Brute-force approach
• Binary search: on real distributions, there seems to be a unique minimum.
Notation: K_s = max codeword length; F_s(k) = cumulative probability of the symbols whose codeword length is ≤ k.
Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.
Streaming compression
Still, you need to determine and sort all terms…. Can we do everything in one pass?
• Move-to-Front (MTF):
  • as a freq-sorting approximator
  • as a caching strategy
  • as a compressor
• Run-Length Encoding (RLE):
  • FAX compression
Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded:
• Start with the list of symbols L = [a,b,c,d,…]
• For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L
Properties:
• Exploits temporal locality, and it is dynamic
• X = 1^n 2^n 3^n … n^n → Huff = O(n² log n), MTF = O(n log n) + n²
There is a memory
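A direct transcription of the two steps in Python (quadratic list operations, fine for a sketch):

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: turn a symbol sequence into an integer sequence."""
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)           # 1) position of s in L (0-based here)
        out.append(i + 1)        #    emitted 1-based, as in the slides
        L.insert(0, L.pop(i))    # 2) move s to the front of L
    return out

# Temporal locality turns repeats into runs of 1s:
print(mtf_encode("aaabbbbccca", "abc"))   # [1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 3]
```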
MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put Σ at the front and consider the cost of encoding (N is the text length, n_x the number of occurrences of symbol x, p_{x,i} the position of its i-th occurrence):
O(|Σ| log |Σ|) + Σ_{x∈Σ} Σ_{i=2,…,n_x} |γ(p_{x,i} − p_{x,i−1})|
≤ O(|Σ| log |Σ|) + Σ_{x∈Σ} n_x · [2·log(N/n_x) + 1]   (by Jensen's inequality)
≤ O(|Σ| log |Σ|) + N · [2·H0(X) + 1]
Hence La[mtf] ≤ 2·H0(X) + O(1): not much worse than Huffman… but it may be far better.
MTF: higher compression
Alphabet of words.
How to maintain the MTF-list efficiently:
• Search tree:
  • leaves contain the words, ordered as in the MTF-list
  • nodes contain the size of their descending subtree
• Hash table:
  • keys are the words (of the MTF-list)
  • data is a pointer to the corresponding tree leaf
• Each op takes O(log |Σ|); the total cost is O(n log |Σ|).
Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings → just the run lengths and one starting bit.
Properties:
• Exploits spatial locality, and it is a dynamic code
• X = 1^n 2^n 3^n … n^n → Huff(X) = n² log n > Rle(X) = n(1 + log n)
There is a memory
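The transformation in Python, reproducing the example above:

```python
def rle_encode(s):
    """Run-Length Encoding: collapse each run into a (symbol, length) pair."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1             # extend the current run
        else:
            runs.append([ch, 1])         # start a new run
    return [tuple(r) for r in runs]

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```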