15-211 Fundamental Data Structures and Algorithms
Margaret Reid-Miller
24 February 2005
LZW Compression
Last Time…
3
Huffman Trees
4
Huffman’s Algorithm
Huffman’s algorithm gives the optimal prefix code.
For a nice online demo, see http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html
5
Huffman Compression
Huffman trees provide a straightforward method for file compression.
1. Read the file and compute frequencies
2. Use frequencies to build Huffman codes
3. Encode file using the codes
4. Write the codes (or tree) and encoded file into the output file.
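Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference implementation: the names huffman_codes and encode are hypothetical, the codes are kept as strings of '0' and '1' rather than packed bits, and step 4 (writing the header) is left out.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol of text to its Huffman bit string."""
    freq = Counter(text)                       # step 1: compute frequencies
    if len(freq) == 1:                         # degenerate one-symbol file
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:                       # step 2: merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

def encode(text, codes):
    """Step 3: replace every symbol by its code."""
    return "".join(codes[ch] for ch in text)
```

For the input aaaabbc, the most frequent symbol a receives a 1-bit code and b and c receive 2-bit codes, so the encoded text is 10 bits long.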
6
Huffman decompression
Decompression reverses the process.
• Read the header in the compressed file, and build the code tree
• Read the rest of the file, decode using the tree
• Write to output
7
Beating Huffman
How about doing better than Huffman!
Impossible! Huffman’s algorithm gives the optimal prefix code!
Right. But who says we have to use a prefix code?
8
Example
Suppose we have a file containing
abcdabcdabcdabcdabcdabcd… abcdabcd
This could be expressed very compactly as
abcd^1000
Dictionary-Based Compression
10
Dictionary-based methods
Here is a simple idea: keep a dictionary of “words” that we have seen, and replace them with a code number when we see them again. The code is presumably shorter than the word.
If words repeat, this should produce nice compression.
We make additions to the dictionary as we read the input file.
11
Dictionary-based Methods
As we read the input file we keep adding new words to the dictionary to get more and more abbreviations:
( word, code )
Since we will always use the longest applicable abbreviation, the set of current words is prefix-closed: every prefix of a word is itself in the dictionary (so it looks like tries might be useful).
12
Fred Hacker’s Algorithm…
Fred now knows what to do…
( <the-whole-file>, 1 )
Transmit 1, done.
13
Right?
Fred’s algorithm provides excellent compression, but…
…the receiver does not know what is in the dictionary! And sending the dictionary is the same as sending the entire uncompressed file.
Thus, we can’t decompress the “1”.
14
Hence…
…we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.
15
Lempel & Ziv (1977/78)
LZW Compression: The Binary Version
LZW = variant of Lempel-Ziv compression, by Terry Welch (1984)
17
Maintaining a Dictionary
We need a way of incrementally building up a dictionary during compression in such a way that…
…someone who wants to uncompress can “rediscover” the very same dictionary
And we already know that a convenient way to build a dictionary incrementally is to use a trie.
18
Getting off the ground
Suppose we want to compress a file containing only letters a, b, c and d.
It seems reasonable to start with a dictionary
a:0 b:1 c:2 d:3
At least we can then deal with the first letter.
And the receiver knows how to start.
19
Growing pains
Now suppose the file starts like so:
a b b a b b …
We scan the a, look it up and output a 0.
After scanning the b, we have seen the word ab. So, we add it to the dictionary
a:0 b:1 c:2 d:3 ab:4
20
Growing pains
We already scanned the first b.
a b b a b b …
Then we get another b. bb is not in the dictionary.
So we output a 1 for the first b, and add bb to the dictionary
a:0 b:1 c:2 d:3 ab:4 bb:5
21
So?
Right, so far zero compression.
We already scanned the second b.
a b b a b b …
Next we get an a. As ba is not in the dictionary, we output 1 for the b and put ba in the dictionary
… d:3 ab:4 bb:5 ba:6
Still zero compression.
22
But now…
We already scanned a.
a b b a b b …
We scan the next b, and ab : 4 is in the dictionary.
We scan the next b, and don’t find abb in the dictionary. We output 4, and put abb into the dictionary.
… d:3 ab:4 bb:5 ba:6 abb:7
We got compression, because 4 is shorter than ab.
23
Suppose the input continues
a b b a b b b b a …
We scan the next b, and bb:5 is in the dictionary
We scan the next b, and don’t find bbb in the dictionary. We output 5, and put bbb into the dictionary
… ab:4 bb:5 ba:6 abb:7 bbb:8
And so on
24
More Hits
As our dictionary grows, we are able to replace longer and longer blocks by short code numbers.
a b b a b b b b a …
0 1 1 4 5 6
And we increase the dictionary at each step by adding another word.
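The walk-through above can be replayed with a short Python sketch. The function name lzw_compress and the choice of a plain dict keyed by strings are assumptions made for illustration; the slides do not prescribe an implementation, and the sketch assumes every input character belongs to the alphabet.

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary) for text over the given alphabet."""
    # Start with one entry per single letter: a:0 b:1 c:2 d:3 ...
    dictionary = {symbol: code for code, symbol in enumerate(alphabet)}
    codes = []
    if not text:
        return codes, dictionary
    word = text[0]                      # longest match found so far
    for ch in text[1:]:
        if word + ch in dictionary:
            word += ch                  # still inside the dictionary
        else:
            codes.append(dictionary[word])           # emit the code for the match
            dictionary[word + ch] = len(dictionary)  # next free code number
            word = ch                                # restart from the new symbol
    codes.append(dictionary[word])      # flush the final match
    return codes, dictionary
```

Calling lzw_compress("abbabbbba", "abcd") yields the codes 0 1 1 4 5 6 and a dictionary that has grown by ab:4, bb:5, ba:6, abb:7 and bbb:8, matching the slides.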
25
More importantly
Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.
Start with the same initialization, then read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.
26
Again: Extending
We scan a sequence of symbols
a1 a2 a3 … ak
where each prefix is in the dictionary.
We stop when we fall out of the dictionary:
a1 a2 a3 … ak b
27
Again: Extending
We output the code for a1 a2 a3 …. ak and
put a1 a2 a3 …. ak b into the dictionary.
Then we set
a1 = b
And start all over.
28
Decoding
Let's take a closer look at an example.
Assume alphabet {a,b,c}.
The code for aabbaabb is 0 0 1 1 3 5.
The decoding starts with dictionary D:
a:0, b:1, c:2
29
Moving along
The first 4 code words are already in D.
0 0 1 1 3 5
and produce output a a b b.
As we go along, we extend D:
a:0, b:1, c:2, aa:3, ab:4, bb:5
For the rest we get
a a b b a a b b
30
Done
We have also added to D:
ba:6, aab:7
But these entries are never used.
Everything is easy, since there is already an entry in D for each code number when we encounter it.
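A sketch of this easy case in Python follows. It deliberately handles only the situation described here, where every code number is already in D when it is read, and raises an error otherwise. The name lzw_decode_simple and the list-based dictionary (index = code number) are illustrative assumptions.

```python
def lzw_decode_simple(codes, alphabet):
    """Decode codes, assuming each code is already in the dictionary.

    Returns (text, words), where words[i] is the word with code number i.
    """
    words = list(alphabet)              # a:0, b:1, c:2, ...
    if not codes:
        return "", words
    previous = words[codes[0]]          # first code is a single symbol
    output = [previous]
    for code in codes[1:]:
        if code >= len(words):
            raise KeyError(f"code {code} is not in the dictionary yet")
        current = words[code]
        # Mirror the compressor: it added previous word + next symbol.
        words.append(previous + current[0])
        output.append(current)
        previous = current
    return "".join(output), words
```

Decoding 0 0 1 1 3 5 over the alphabet {a,b,c} returns aabbaabb, and the rebuilt dictionary ends with ba:6 and aab:7, the two entries that are never used.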
31
One more detail…
One detail remains: how to build the dictionary for compression (decompression is easy).
We need to be able to scan through a sequence of symbols and check if they form a prefix of a word already in the dictionary.
Could use a balanced tree, but then each new symbol would launch a new search.
32
Tries!
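Trie (figure): the root has children a:0, b:1, c:2, d:3; node 4 is the child a of b, node 5 is the child d of a, and node 6 is the child d of d.
Corresponds to dictionary
a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6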
33
Tries
Even better: in the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word.
Just add a new leaf to the last node touched.
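A minimal Python sketch of a trie-based compressor is shown here to make the "add a new leaf to the last node touched" step concrete. The class TrieNode and the function lzw_compress_trie are hypothetical names, and children are stored in a Python dict per node, which is one of several reasonable layouts.

```python
class TrieNode:
    """One dictionary word; the path from the root spells the word."""
    def __init__(self, code):
        self.code = code
        self.children = {}              # symbol -> TrieNode

def lzw_compress_trie(text, alphabet):
    """Return (codes, root) using a trie as the LZW dictionary."""
    root = TrieNode(None)
    for code, symbol in enumerate(alphabet):
        root.children[symbol] = TrieNode(code)
    next_code = len(alphabet)
    codes = []
    node = root                         # last node touched
    for ch in text:
        if ch not in root.children:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
        child = node.children.get(ch)
        if child is not None:
            node = child                # one step down per input symbol
            continue
        codes.append(node.code)                     # fell out of the dictionary
        node.children[ch] = TrieNode(next_code)     # new leaf, constant work
        next_code += 1
        node = root.children[ch]                    # restart with the new symbol
    if node is not root:
        codes.append(node.code)         # flush the final match
    return codes, root
```

Each input symbol costs one child lookup, so no search is restarted from scratch. For the input baddad over {a, b, c, d} the sketch returns 1 0 3 3 5 and adds the leaves ba:4, ad:5, dd:6 and da:7.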
Pretty Pictures
35
LZW: 4 Letter Example
Suppose our entire character set consists only of the four letters: {a, b, c, d}
Let’s consider the compression of the string baddad
36
Byte LZW: Compress example
Input: baddad
Initial dictionary (trie): a:0 b:1 c:2 d:3

Step   Output so far   Added to trie
0      (none)          (initial dictionary only)
1      1               4: a under b   (ba:4)
2      1 0             5: d under a   (ad:5)
3      1 0 3           6: d under d   (dd:6)
4      1 0 3 3         7: a under d   (da:7)
5      1 0 3 3 5       (nothing; end of input)
42
Byte LZW output
So, the input baddad
compresses to 1 0 3 3 5,
which can be written out as is or compressed again using Huffman.
43
Byte LZW: Uncompress example
The uncompress step for LZW is the most complicated part of the entire process.
44
Byte LZW: Uncompress example
Input: 1 0 3 3 5
Initial dictionary (trie): a:0 b:1 c:2 d:3

Code read   Output so far   Added to trie
(none)      (none)          (initial dictionary only)
1           b               (nothing)
0           ba              4: a under b   (ba:4)
3           bad             5: d under a   (ad:5)
3           badd            6: d under d   (dd:6)
5           baddad          7: a under d   (da:7)
50
Decoding difficulty
When we decode, is the code number always in the dictionary?
Unfortunately, no.
It may happen that we run into a code number without having an appropriate entry in D.
But, it can only happen in very special circumstances, and we can manufacture the missing entry.
51
A Bad Run
Consider input
a a b b b a a ==> 0 0 1 5 3
After reading 0 0 1, we output
a a b
and extend D with codes for aa and ab
0:a, 1:b, 2:c, 3:aa, 4:ab
52
Disaster
We have read 0 0 1 from the input
0 0 1 5 3
The dictionary is
0:a, 1:b, 2:c, 3:aa, 4:ab
The next code number to read is 5, but it’s not in D.
How could this have happened?
Can we recover?
53
… narrowly averted
This problem only arises when, on the compressor end:
• the input contains a substring
…s σ s σ s…
• the compressor read sσ, output code c for sσ, and added sσs to the dictionary under the next free code number n.
• Here s is a single symbol, but σ is a (possibly empty) word.
54
… narrowly averted (pt. 2)
On the decompressor end, D contains
c: sσ
• but does not yet contain n: sσs
• the decompressor has already output
x = sσ
and is now looking at the unknown code number n.
55
… narrowly averted (pt. 3)
But then the fix is to output
x + first(x)
where x is the last decompressed word, and first(x) is the first symbol of x.
Because x = sσ was already output, we get the required
sσ sσs
We also update the dictionary to contain the new entry n: x + first(x) = sσs.
56
Example
In our example we have read 0 0 1 from the input
0 0 1 5 3
The last decompressed word is b, and the next code number to read is 5. Thus
• s = b
• σ = empty
• The next word to output and add to D is
sσs = bb
57
Summary
Let x be the last word output.
Ordinarily, D contains a word y matching the next input code number.
We output y and extend D with
x + first(y)
But sometimes the encoder immediately uses the entry it has just added to the dictionary.
Then it must be that x = sσ (s a single symbol, σ a possibly empty word), and we output
x + first(x) = sσs
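The two cases of the summary can be put together in a short Python sketch. The function name lzw_decompress and the list-based dictionary (index = code number) are assumptions for illustration. The only unknown code the sketch accepts is the very next free code number, which is the one situation where the x + first(x) rule applies.

```python
def lzw_decompress(codes, alphabet):
    """Decode a list of LZW code numbers back into text."""
    words = list(alphabet)              # words[i] is the word with code i
    if not codes:
        return ""
    x = words[codes[0]]                 # first code is a single symbol
    output = [x]
    for code in codes[1:]:
        if code < len(words):
            y = words[code]             # ordinary case: code is in D
            words.append(x + y[0])      # extend D with x + first(y)
        elif code == len(words):
            y = x + x[0]                # hard case: manufacture x + first(x)
            words.append(y)
        else:
            raise ValueError(f"invalid code number {code}")
        output.append(y)
        x = y
    return "".join(output)
```

With the alphabet {a,b,c}, the codes 0 0 1 5 3 decode to aabbbaa, where code 5 is manufactured as bb on the fly.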
58
Example (extended)
Input codes: 0 0 1 5 3 6 7 9 5
Output:      aabbbaabbaaabaababb

Input   In D?   Output   Add to D
0       +       a
0       +       a        3:aa
1       +       b        4:ab
5       -       bb       5:bb
3       +       aa       6:bba
6       +       bba      7:aab
7       +       aab      8:bbaa
9       -       aaba     9:aaba
5       +       bb       10:aabab

For code 9: s = a, σ = ab, and the output contains sσsσs = aabaaba.
59
Pseudo Code: Compression
Initialize dictionary D to all words of length 1.
Read all input characters:
output code words from D,
extend D whenever a new word appears.
New code words: just an integer counter.
60
Less Pseudo
initialize D;
c = nextchar;          // next input character
W = c;                 // a string
while( c = nextchar ) {
    if( W+c is in D )  // dictionary
        W = W + c;
    else {
        output code(W);
        add W+c to D;
        W = c;
    }
}
output code(W);
61
Pseudo Code: Decompression
Initialize dictionary D with all words of length 1.
Read all code words and
- output corresponding words from D,
- extend D at each step.
This time the dictionary is of the form
( integer, word )
Keys are integers, values words.
62
Less Pseudo
initialize D;
c = nextcode; // first code number
x = word(c); // corresponding word
output x;
First code number is easy: codes only a single symbol.
Remember x (previous word).
63
More Less Pseudo
while ( c = nextcode ) {
    if ( c is in D ) {
        y = word(c);
        ww = x + first(y);
        insert ww in D;
    }
    else {                 // the hard case: c is not yet in D
        y = x + first(x);
        insert y in D;
    }
    output y;
    x = y;
}
65
LZW details
In reality, one usually restricts the code words to be 12 or 16 bit integers.
Hence, one may have to flush the dictionary every so often.
Thus it is important to conserve code numbers (see below).
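To illustrate what a fixed code width means in practice, here is a sketch that packs code numbers into 12-bit fields and reads them back. It shows only the packing, not the dictionary flush. The function names, the most-significant-bit-first layout, and the explicit count argument are assumptions of this sketch, not part of any particular LZW file format.

```python
def pack_codes(codes, width=12):
    """Pack code numbers into bytes, width bits each, most significant bit first."""
    limit = 1 << width
    buffer = 0                          # bits not yet written out
    nbits = 0                           # number of valid bits in buffer
    out = bytearray()
    for code in codes:
        if not 0 <= code < limit:
            raise ValueError(f"code {code} does not fit in {width} bits")
        buffer = (buffer << width) | code
        nbits += width
        while nbits >= 8:               # emit every complete byte
            nbits -= 8
            out.append((buffer >> nbits) & 0xFF)
        buffer &= (1 << nbits) - 1      # keep only the leftover bits
    if nbits:                           # pad the last byte with zero bits
        out.append((buffer << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack_codes(data, count, width=12):
    """Read back count code numbers of width bits each."""
    mask = (1 << width) - 1
    buffer = 0
    nbits = 0
    codes = []
    for byte in data:
        buffer = (buffer << 8) | byte
        nbits += 8
        while nbits >= width and len(codes) < count:
            nbits -= width
            codes.append((buffer >> nbits) & mask)
        buffer &= (1 << nbits) - 1
    return codes
```

With 12-bit codes the dictionary can hold at most 4096 entries, which is why a code number that would not fit is rejected here; a real compressor would flush or freeze the dictionary at that point.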
66
LZW details
Lastly, LZW generates as output a stream of integers.
It makes perfect sense to try to compress these further, e.g., by Huffman.
67
Summary of LZW
LZW is an adaptive, dictionary-based compression method.
Encoding is easy in LZW, but uses a special data structure (trie).
Decoding is slightly complicated, but requires no special data structures.