TRANSCRIPT

Chapter 3: Compression (Part 1)
Storage size for media types

Text
- object type: ASCII, EBCDIC, BIG5, GB
- size/bandwidth: 2 KB per page

Image
- object type: bitmapped graphics, still photos, faxes
- size/bandwidth: simple image (640x480): 1 MB per image

Audio
- object type: noncoded stream of digitized audio or voice
- size/bandwidth: voice/phone (8 kHz, 8 bits, mono): 8 KB/s; audio CD (44.1 kHz, 16 bits, stereo): 176 KB/s

Animation
- object type: synched image and music stream at 15-19 frames/s
- size/bandwidth: 6.5 MB/s for 320x640 pixels/frame (16-bit color) at 16 frames/s

Video
- object type: digital video at 25-30 frames/s
- size/bandwidth: 27.7 MB/s for 640x480 pixels/frame (24-bit color) at 30 frames/s
A typical encyclopedia

- 50,000 pages of text (2 KB/page): 0.1 GB
- 3,000 color pictures (on average 640x480x24 bits = 1 MB/picture): 3 GB
- 500 maps/drawings (on average 640x480x16 bits = 0.6 MB/map): 0.3 GB
- 60 minutes of stereo sound (176 KB/s): 0.6 GB
- 30 animations, on average 2 minutes in duration (640x320x16 pixels/frame = 6.5 MB/s): 23.4 GB
- 50 digitized movies, on average 1 min. in duration (640x480x24 bits/frame at 30 frames/s = 27.6 MB/s): 82.8 GB

TOTAL: 110 GB (or 170 CDs, a stack of CDs about 2 feet high)
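These figures can be checked with a few lines of Python; a back-of-the-envelope sketch (assuming, as the slide's rounding suggests, GB = 10^9 bytes):

    GB, MB, KB = 10**9, 10**6, 10**3

    text      = 50_000 * 2 * KB          # 0.1 GB
    pictures  = 3_000 * 1 * MB           # 3 GB
    maps      = 500 * 0.6 * MB           # 0.3 GB
    sound     = 60 * 60 * 176 * KB       # 0.6 GB
    animation = 30 * 2 * 60 * 6.5 * MB   # 23.4 GB
    movies    = 50 * 1 * 60 * 27.6 * MB  # 82.8 GB

    total = text + pictures + maps + sound + animation + movies
    print(round(total / GB, 1))          # 110.2, i.e., ~110 GB as on the slide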
Compressing an encyclopedia

Typical compression ratios:
- Text: 1:2
- Color images: 1:15
- Maps: 1:10
- Stereo sound: 1:6
- Animation: 1:50
- Motion video: 1:50

[Bar chart: uncompressed vs. compressed size for text, color images, maps, stereo sound, animation and video, in GB on a log scale from 0.01 to 100.]

Overall: 110 GB -> 2.5 GB.
Compression and encoding

We can view compression as an encoding method which transforms a sequence of bits/bytes into an artificial format such that the output (in general) takes less space than the input.

[Diagram: "fat data" goes into a compressor/encoder; the "fat" (redundancy) is squeezed out, leaving "slim data".]
Redundancy

Compression is made possible by exploiting redundancy in source data.

For example, redundancy is found in:
- character distribution
- character repetition
- high-usage patterns
- positional redundancy

(These categories overlap to some extent.)
Symbols and Encoding

We assume a piece of data (such as text, image, etc.) is represented by a sequence of symbols. Each symbol is encoded in a computer by a code (or value).

Examples:
- English text: abc (symbols), ASCII (coding)
- Chinese text: 多媒體 (symbols), BIG5 (coding)
- Image: color (symbols), RGB (coding)
Character distribution

Some symbols are used more frequently than others; in English text, 'e' and space occur most often.

Fixed-length encoding: use the same number of bits to represent each symbol. With fixed-length encoding, to represent n symbols we need ceil(log2 n) bits for each code.
Character distribution

Fixed-length encoding may not be the most space-efficient.

Example: "abacabacabacabacabacabacabac". With fixed-length encoding, we need 2 bits for each of the 3 symbols. Alternatively, we could use '1' to represent 'a', and '00' and '01' to represent 'b' and 'c', respectively. How much saving did we achieve?
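(Answer: the string is "abac" repeated 7 times, i.e., 28 symbols: 14 a's, 7 b's and 7 c's. Fixed-length encoding needs 28 x 2 = 56 bits; the variable-length code needs 14 x 1 + 7 x 2 + 7 x 2 = 42 bits, a saving of 14 bits, or 25%.)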
Character distribution

Technique: whenever symbols occur with different frequencies, we use variable-length encoding.

For your interest: in 1838, Samuel Morse and Alfred Vail designed the "Morse Code" for the telegraph. Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (3 dots), e.g., E "." and Z "--..". Morse and Vail determined the code by counting the letters in each bin of a printer's type box.
Morse code table

[Table of the Morse code for each character omitted.]

SOS = "...---..." (three dots, three dashes, three dots)
Character distribution

It can be shown that Morse Code can be improved by no more than 15%.

Principle: use fewer bits to represent frequently occurring symbols, and more bits to represent rarely occurring ones.
Character repetition

Occurs when a string of repetitions of a single symbol appears, e.g., "aaaaaaaaaaaaaabc".

Example: long strings of spaces/tabs in forms and tables.
Example: in graphics, it is common to have a line of pixels using the same color.

Technique: run-length encoding.
High-usage patterns

Certain patterns (i.e., sequences of symbols) occur very often.
Example: "qu", "th", ..., in English text.
Example: "begin", "end", ..., in Pascal source code.

Technique: use a single special symbol to represent a common sequence of symbols, e.g., one special symbol for 'th'; i.e., instead of using 2 codes for 't' and 'h', use only 1 code for 'th'.
Positional redundancy

A certain thing appears persistently at a predictable place; sometimes, symbols can be predicted.
Example: the left-uppermost pixel of my slides is always white.

Technique: tell me the "changes" only.
Example: differential pulse code modulation (DPCM), sketched below.
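To make the differential idea concrete, here is a minimal Python sketch (lossless delta coding only; real DPCM adds prediction and quantization on top of this):

    # Store the first sample, then only the change from each sample to the next.
    def dpcm_encode(samples):
        deltas = [samples[0]]
        for prev, cur in zip(samples, samples[1:]):
            deltas.append(cur - prev)
        return deltas

    def dpcm_decode(deltas):
        samples = [deltas[0]]
        for d in deltas[1:]:
            samples.append(samples[-1] + d)
        return samples

    pixels = [200, 201, 201, 202, 204, 204]
    print(dpcm_encode(pixels))  # [200, 1, 0, 1, 2, 0] -- small deltas are cheap to code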
Categories of compression techniques

Entropy encoding: regards the data stream as a simple digital sequence, ignoring its semantics. Lossless compression.

Source encoding: takes into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio. Lossy compression.
Categories of compression techniques

Hybrid encoding: a combination of the above two.

A categorization:
- Entropy encoding: Huffman coding, run-length coding, arithmetic coding
- Source coding: prediction (DPCM, DM), transformation (FFT, DCT), vector quantization
- Hybrid coding: JPEG, MPEG, H.261
Information theory

Entropy is a measure of uncertainty or randomness: the higher the entropy, the less order in the data.

According to Shannon, the entropy of an information source S is defined as

    H(S) = \sum_i p_i \log_2 (1 / p_i)

where p_i is the probability that symbol S_i in S occurs. The term \log_2(1/p_i) indicates the amount of information contained in S_i, i.e., the number of bits needed to encode it.
1
20
Information theory
The entropy is the lower bound for lossless compression: when the probability of each symbol is fixed, each symbol should be represented at least with H(S) bits on average.
A guideline: the code lengths for different symbols should vary according to the information carried by the symbol. The coding techniques based on this are called entropy encoding.
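A minimal Python sketch of the definition (the example distributions are the ones used on the following slides):

    from math import log2

    # H(S) = sum over symbols of p_i * log2(1/p_i):
    # the average number of bits needed per symbol.
    def entropy(probs):
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([1 / 256] * 256))    # 8.0 -- uniform gray-level image
    print(entropy([0.5, 0.5]))         # 1.0 -- half white, half black
    print(entropy([0.25, 0.25, 0.5]))  # 1.5 -- white 1/4, black 1/4, gray 1/2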
Information theory

For example, in an image with a uniform distribution of gray-level intensities, each p_i is 1/256. The number of bits needed to code each gray level is \log_2(1/(1/256)) = \log_2 256 = 8 bits. The entropy of this image is

    H(S) = \sum_{i=1}^{256} \frac{1}{256} \log_2 256 = 8
Information theory

Q: How about an image in which half of the pixels are white and the other half black?
A: p(white) = 0.5, p(black) = 0.5, so

    H(S) = \sum_{i=1}^{2} 0.5 \log_2 (1/0.5) = 1

Q: How about an image in which white occurs 1/4 of the time, black 1/4 of the time, and gray 1/2 of the time?
A: Entropy = 2 x (0.25 x 2) + 0.5 x 1 = 1.5.

Entropy encoding is about wisely using just enough bits to represent the occurrence of certain patterns.
Variable-length encoding

A problem with variable-length encoding is ambiguity in the decoding process. E.g., with 4 symbols a ('0'), b ('1'), c ('01'), d ('10'), the bit stream '0110' can be interpreted as abba, abd, cba, or cd.
Prefix code
The idea is to translate fixed-sized pieces of input data into variable-length encoding.
We wish to encode each symbol into a sequence of 0’s and 1’s so that no code for a symbol is the prefix of the code of any other symbol – the prefix property.
With the prefix property, we are able to decode a sequence of 0’s and 1’s into a sequence of symbols.
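To see why the prefix property makes decoding a single unambiguous left-to-right scan, here is a minimal Python sketch, using the earlier code a = '1', b = '00', c = '01':

    # Extend the current bit string until it matches a code; the prefix
    # property guarantees that at most one code can ever match.
    def prefix_decode(bits, code_to_symbol):
        out, cur = [], ''
        for b in bits:
            cur += b
            if cur in code_to_symbol:
                out.append(code_to_symbol[cur])
                cur = ''
        return ''.join(out)

    codes = {'1': 'a', '00': 'b', '01': 'c'}
    print(prefix_decode('10001', codes))  # 'abc'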
Shannon-Fano algorithm

Example: 5 characters are used in a text, with these statistics:

Symbol  A   B   C   D   E
Count   15  7   6   6   5

Algorithm (top-down approach):
1. Sort symbols according to their probabilities.
2. Divide the symbols into 2 half-sets, each with roughly equal counts.
3. All symbols in the first group are given '0' as the first bit of their codes; '1' for the second group.
4. Recursively perform steps 2-3 for each group, with the additional bits appended.
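A minimal Python sketch of the top-down split (how to pick the split point when the halves cannot be exactly equal is an assumption here; for these counts it reproduces the codes derived on the next slides):

    def shannon_fano(items):
        # items: list of (symbol, count) pairs, sorted by descending count
        if len(items) == 1:
            return {items[0][0]: ''}
        total = sum(c for _, c in items)
        # choose the split that makes the two groups' counts most equal
        split, best, running = 1, total, 0
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)
            if diff < best:
                best, split = diff, i
        codes = {s: '0' + c for s, c in shannon_fano(items[:split]).items()}
        codes.update({s: '1' + c for s, c in shannon_fano(items[split:]).items()})
        return codes

    counts = [('A', 15), ('B', 7), ('C', 6), ('D', 6), ('E', 5)]
    print(shannon_fano(counts))
    # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}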
Shannon-Fano algorithm

Example:
- Gp1: {A,B}; Gp2: {C,D,E}
  Codes so far:  A: 0..  B: 0..  C: 1..  D: 1..  E: 1..
- Subdivide Gp1: Gp11: {A}; Gp12: {B}
  Codes so far:  A: 00   B: 01   C: 1..  D: 1..  E: 1..
- Subdivide Gp2: Gp21: {C}; Gp22: {D,E}
  Codes so far:  A: 00   B: 01   C: 10   D: 11.  E: 11.
- Subdivide Gp22: Gp221: {D}; Gp222: {E}
  Final codes:   A: 00   B: 01   C: 10   D: 110  E: 111
Shannon-Fano algorithm

Result:

Symbol  Count  log2(1/p)  Code  Bits used
A       15     1.38       00    30
B       7      2.48       01    14
C       6      2.70       10    12
D       6      2.70       110   18
E       5      2.96       111   15
Shannon-Fano algorithm

The coding scheme in a tree representation:

[Binary tree: the root (weight 39) splits into {A,B} (22) and {C,D,E} (17); 22 splits into A (15) and B (7); 17 splits into C (6) and {D,E} (11); 11 splits into D (6) and E (5). Left branches are labeled 0, right branches 1.]

The total number of bits used for encoding is 89. An uncompressed code would need 3 bits for each symbol, a total of 117 bits.
Shannon-Fano algorithm

It can be shown that the number of bits used to encode a symbol on average with the Shannon-Fano algorithm is between H(S) and H(S) + 1; hence it is fairly close to the theoretical optimum.
Huffman code

Used in many compression programs, e.g., PKzip, ARJ.

Each code length in bits is approximately log2(1/p_j). Example: Pr[space] = 0.1919182 -> 2.4 bits; Pr[Z] = 0.0007836 -> 10.3 bits.
Huffman code

Algorithm (bottom-up approach), with a code sketch after the worked example below:
- Start with a forest of n trees, each having only 1 node corresponding to a symbol.
- Define the weight of a tree as the sum of the probabilities of all the symbols in the tree.
- Repeat the following steps until the forest has one tree left:
  - pick the 2 trees of lightest weight, say T1 and T2;
  - replace T1 and T2 by a new tree T such that T1 and T2 are T's children.
Huffman code

[Merging steps: start with the forest a(15/39), b(7/39), c(6/39), d(6/39), e(5/39). Merge the two lightest, d(6/39) and e(5/39), into a tree of weight 11/39; then merge b(7/39) and c(6/39) into 13/39; then merge 11/39 and 13/39 into 24/39; finally merge a(15/39) and 24/39 into the root.]
Huffman code

[Figure: the resulting tree from the previous example.]

The code of a symbol is obtained by traversing the tree from the root to the symbol, taking a '0' when a left branch is followed and a '1' when a right branch is followed.
Huffman code

Char  Code  Bits/char  # occ.  # bits used
A     0     1          15      15
B     100   3          7       21
C     101   3          6       18
D     110   3          6       18
E     111   3          5       15

Total: 87 bits

"ABAAAE" -> "0100000111"
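A minimal Python sketch of the bottom-up construction, using a min-heap as the forest (the 0/1 assignment at equal weights may differ from the table above, but the code lengths, and hence the 87-bit total, are the same):

    import heapq
    from itertools import count

    def huffman(freqs):
        tick = count()  # tie-breaker so the heap never compares two dicts
        heap = [(f, next(tick), {sym: ''}) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # the two lightest trees...
            f2, _, right = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in left.items()}
            merged.update({s: '1' + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, next(tick), merged))  # ...become one
        return heap[0][2]

    print(huffman({'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5}))
    # e.g. {'A': '0', 'E': '100', 'C': '101', 'D': '110', 'B': '111'}
    # code lengths 1, 3, 3, 3, 3 -- 87 bits in total, as in the table above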
Huffman code (advantages)

Decoding is trivial as long as the coding table is sent before the data; the overhead is relatively negligible if the data file is big.

Prefix property: no code is a prefix of any other code, so forward scanning is never ambiguous. Good for the decoder.
Huffman code (advantages)

"Optimal code size" (among coding schemes that use a fixed code for each symbol). The entropy is

    (15 * 1.38 + 7 * 2.48 + 6 * 2.70 + 6 * 2.70 + 5 * 2.96) / 39 = 85.26 / 39 ≈ 2.19 bits/symbol

while the average number of bits per symbol needed by Huffman coding is

    87 / 39 ≈ 2.23 bits/symbol
Huffman code (disadvantages)

A table is needed that lists each input symbol and its corresponding code.

We need to know the symbol frequency distribution (this might be OK for English text, but not for arbitrary data files), which requires two passes over the data.

The frequency distribution may change over the content of the source.
Arithmetic coding

Motivation: Huffman coding uses an integer number k of bits for each symbol, and k is at least 1. Sometimes, e.g., when sending bi-level images, compression does not help. In fact, Huffman coding is optimal (i.e., attains the theoretical information lower bound) when and only when the symbol probabilities are integral powers of 1/2.

Arithmetic coding works by representing a message by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to represent the interval increases.

Idea: suppose the alphabet is {X,Y}, with p(X) = 2/3 and p(Y) = 1/3. Map X and Y to subintervals of [0..1]. To encode length-2 messages, we map all possible messages to subintervals of [0..1]. Then we just send enough bits of the binary fraction that uniquely specifies the interval.
[Figure: X -> [0, 2/3), Y -> [2/3, 1). Length-2 messages: XX -> [0, 4/9), XY -> [4/9, 6/9), YX -> [6/9, 8/9), YY -> [8/9, 1). Sample binary fractions: 1/4 = .01, 2/4 = .10, 3/4 = .110, 15/16 = .11111.]
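A minimal Python sketch of the interval idea for this two-symbol alphabet (exact Fractions avoid floating-point trouble; emitting the shortest bit string whose binary interval fits inside the message's interval is one possible convention, chosen here because it reproduces the '110' of the decoding example below):

    import math
    from fractions import Fraction

    RANGES = {'X': (Fraction(0), Fraction(2, 3)),   # p(X) = 2/3
              'Y': (Fraction(2, 3), Fraction(1))}   # p(Y) = 1/3

    def message_interval(message):
        low, high = Fraction(0), Fraction(1)
        for sym in message:
            lo, hi = RANGES[sym]
            width = high - low
            low, high = low + width * lo, low + width * hi
        return low, high

    def interval_to_bits(low, high):
        # shortest bit string whose interval [k/2^n, (k+1)/2^n) fits in [low, high)
        n = 1
        while True:
            k = math.ceil(low * 2 ** n)
            if Fraction(k + 1, 2 ** n) <= high:
                return format(k, 'b').zfill(n)
            n += 1

    print(message_interval('YX'))                     # (Fraction(2, 3), Fraction(8, 9))
    print(interval_to_bits(*message_interval('YX')))  # '110'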
Arithmetic coding

Similarly, we can map all length-3 messages as follows:

XXX: [0, 8/27)
XXY: [8/27, 12/27)
XYX: [12/27, 16/27)
XYY: [16/27, 18/27)
YXX: [18/27, 22/27)
YXY: [22/27, 24/27)
YYX: [24/27, 26/27)
YYY: [26/27, 1)
Arithmetic coding

Decoding .110, using the interval table above:

- On receiving '1': the codeword is .1????, i.e., somewhere in the interval [0.5, 1]. First symbol: still undetermined.
- On receiving '11': the codeword is .11???, i.e., in [0.75, 1]. First symbol: Y.
- On receiving '110': the codeword is .110?, i.e., in [0.75, 0.875), which lies inside YX's interval. Symbols: YX.
- If no more bits are received, there is insufficient data to deduce a 3rd symbol. Hence '110' represents the symbol string "YX".
Arithmetic coding

The number of bits in a codeword is determined by the size of the interval:
- The first interval (XXX) has size 8/27 and needs 2 bits, i.e., 2/3 bits per symbol (efficient).
- The last interval (YYY) has size 1/27 and needs 5 bits.
- In general, it takes about log2(1/p) bits to represent an interval of size p.

The code length approaches optimal encoding as the message unit gets longer. Although arithmetic coding gives good compression efficiency, it is very CPU-demanding (in generating the code).
Run-length encoding (RLE)

Image, video and audio data streams often contain sequences of the same bytes, e.g., backgrounds in images and silence in sound files.

Substantial compression can be achieved by replacing repeated byte sequences with the number of occurrences.
Run-length encoding

If a byte occurs at least 4 consecutive times, the sequence is encoded as the byte followed by a special flag (a special byte) and the number of its occurrences, e.g.:

    Uncompressed: ABCCCCCCCCDEFGGGHI
    Compressed:   ABC!8DEFGGGHI

If 4 C's (CCCC) are encoded as C!0, 5 C's as C!1, and so on, then runs of 4 to 259 bytes compress into only 3 bytes.
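A minimal Python sketch of this scheme, with '!' as the flag and the count byte storing (run length - 4); escaping a literal '!' in the data, the problem raised on the next slide, is left out:

    FLAG = ord('!')

    def rle_encode(data: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 259:
                run += 1
            if run >= 4:
                out += bytes([data[i], FLAG, run - 4])  # 4..259 bytes -> 3 bytes
            else:
                out += data[i:i + run]                  # short runs pass through
            i += run
        return bytes(out)

    print(rle_encode(b'ABCCCCCCCCDEFGGGHI'))
    # b'ABC!\x04DEFGGGHI' -- count byte 4 means a run of 4 + 4 = 8 C's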
Run-length encoding

Problems:
- Ineffective with text (except, perhaps, "Ahhhhhhhhhhhhhhhhh!" or "Noooooooooo!").
- Needs an escape byte ('!'): OK with text, not OK with binary data. Solution?

RLE is used in ITU-T Group 3 fax compression, in combination with Huffman coding, giving compression ratios of 10:1 or better.
Lempel-Ziv-Welch algorithm

The method is due to Ziv and Lempel (1977, 1978), with an improvement by Welch in 1984; hence the name LZW compression.

Used in UNIX compress, GIF, and V.42bis compression modems.

Motivation: suppose we want to encode an English text. We first encode Webster's English dictionary, which contains about 159,000 entries; 18 bits therefore suffice for a word. Then why don't we just transmit each word in the text as an 18-bit number, and so obtain a compressed version of the text?
LZW

Problem: we would need to preload a copy of the dictionary before decoding, and the dictionary is huge (159,000 entries).

Solution: build the dictionary on-line, adaptively, while encoding and decoding. No bandwidth or storage is wasted on preloading, and words not used in the text are never incorporated into the dictionary.
LZW

Example: encode "^WED^WE^WEE^WEB^WET" using LZW.

Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes for the characters run from 0 to 255), and that the dictionary is loaded with these 256 codes initially.

While scanning the string, substring patterns are added to the dictionary. Patterns that appear after the first time are then substituted by their index into the dictionary.
Example 1 (compression)

Input: ^WED^WE^WEE^WEB^WET$ ($ marks the end of the input; the outputs should be interpreted as numeric codes)

Buffer grows to  New dict entry  Index  Output
^W               ^W              256    ^
WE               WE              257    W
ED               ED              258    E
D^               D^              259    D
^WE              ^WE             260    256  (^W seen at 256)
E^               E^              261    E
^WEE             ^WEE            262    260  (^WE seen at 260)
E^W              E^W             263    261  (E^ seen at 261)
WEB              WEB             264    257  (WE seen at 257)
B^               B^              265    B
^WET             ^WET            266    260  (^WE seen at 260)
T$               --              --     T
LZW (encoding rules)
Rule 1: whenever the pattern in the buffer is in the dictionary, we try to extend it.
Rule 2: whenever a pattern is added to the dictionary, we output the codeword of the pattern before the last extension.
Example 1

Original: 19 symbols, 8-bit code each: 19 x 8 = 152 bits.
LZW'ed: 12 codes, 9 bits each: 12 x 9 = 108 bits.

Some compression is achieved, and the effect is more substantial for long messages. The dictionary is not explicitly stored; in a sense, the compressor encodes the dictionary in the output stream implicitly.
LZW decompression

"^WED{256}E{260}{261}{257}B{260}T" is read in (recall that these are numeric codes).

Assume the decoder also knows the first 256 entries of the dictionary.

While scanning the code stream for decoding, the dictionary is rebuilt. When an index is read, the corresponding dictionary entry is retrieved and spliced into position. The decoder deduces what the buffer of the encoder has gone through and rebuilds the dictionary.
Example 1 (decompression)

Input: ^WED{256}E{260}{261}{257}B{260}T$

Buffer      Dict entry  Index  Output
^           --          --     ^
^W          ^W          256    W
WE          WE          257    E
ED          ED          258    D
D{256}      D^          259    ^W
{256}E      ^WE         260    E
E{260}      E^          261    ^WE
{260}{261}  ^WEE        262    E^
{261}{257}  E^W         263    WE
{257}B      WEB         264    B
B{260}      B^          265    ^WE
{260}T      ^WET        266    T
LZW (decoding rule)

Whenever there are 2 codes C1 C2 in the buffer, add a new entry to the dictionary: the symbols of C1 + the first symbol of C2.

Compression pseudocode:

    w = NIL;
    while ( read a symbol k ) {
        if ( wk exists in dictionary )
            w = wk;                      // extend; proceed to next symbol
        else {
            output code for w;
            add wk to dictionary;
            w = k;
        }
    }

Decompression pseudocode:

    read a code k;
    output symbol represented by k;
    w = k;
    while ( read a code k ) {
        ent = dictionary entry for k;
        output symbol(s) in entry ent;
        add w + ent[0] to dictionary;
        w = ent;
    }
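A runnable Python version of the pseudocode above (a sketch: the dictionary is seeded with the 256 single-byte strings, and the decompressor handles the corner case, glossed over in the pseudocode, where a code arrives one step before its dictionary entry is completed):

    def lzw_compress(text):
        dict_size = 256
        d = {chr(i): i for i in range(dict_size)}    # 256 single-char entries
        w, out = '', []
        for k in text:
            if w + k in d:
                w += k                               # extend the buffer
            else:
                out.append(d[w])                     # emit pattern before extension
                d[w + k] = dict_size                 # new dictionary entry
                dict_size += 1
                w = k
        if w:
            out.append(d[w])
        return out

    def lzw_decompress(codes):
        dict_size = 256
        d = {i: chr(i) for i in range(dict_size)}
        w = d[codes[0]]
        out = [w]
        for k in codes[1:]:
            ent = d[k] if k in d else w + w[0]       # code not in dictionary yet
            out.append(ent)
            d[dict_size] = w + ent[0]                # rebuild dictionary entry
            dict_size += 1
            w = ent
        return ''.join(out)

    codes = lzw_compress('^WED^WE^WEE^WEB^WET')
    print(codes)  # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
    assert lzw_decompress(codes) == '^WED^WE^WEE^WEB^WET'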
Example 2 (compression)

Input: ababcbababaaaaaaa$ (initial dictionary: a = 1, b = 2, c = 3)

Buffer grows to  New dict entry  Index  Output
ab               ab              4      1
ba               ba              5      2
abc              abc             6      4
cb               cb              7      3
bab              bab             8      5
baba             baba            9      8
aa               aa              10     1
aaa              aaa             11     10
aaaa             aaaa            12     11
a$               --              --     1
LZW: Pros and Cons

LZW advantages:
- effective at exploiting character-repetition redundancy and high-usage-pattern redundancy
- an adaptive algorithm (no need to learn the statistical properties of the symbols beforehand)
- a one-pass procedure
LZW: Pros and Cons

LZW disadvantages:
- Messages must be long for effective compression: it takes time for the dictionary to build up.
- If the message is not homogeneous (its redundancy characteristics shift during the message), compression efficiency drops.
LZW (Implementation)

Decompression is generally faster than compression. Why? (Compression must search the dictionary for the longest matching pattern, which is why a hash table is needed; decompression just indexes entries by code.)

Implementations usually use 12-bit codes, giving 4096 table entries.
LZW and GIF

For an image with 11 colors, create a color map of 11 entries:

color           code
[r0,g0,b0]      0
[r1,g1,b1]      1
[r2,g2,b2]      2
...             ...
[r10,g10,b10]   10

Convert the image into a sequence of codes, then apply LZW to the code stream.

Example (320 x 200 image):
- GIF: 5,280 bytes (0.66 bpp)
- JPEG: 16,327 bytes (2.04 bpp)

(bpp = bits per pixel)
[Example images: the GIF version and the JPEG version of the same picture.]
V.42bis

- Based on the LZW algorithm.
- Dictionary entries can be recycled.
- Transparent mode: compression is disabled, but the dictionary is still updated (used when transmitting already-compressed files).
- "bis" is a suffix meaning a modified standard, derived from the Latin "bis", meaning "twice".
An experiment

Hayes Microcomputer Products, Inc.'s implementation of V.42bis compression, tested on 4 different types of files:
- a 176,369-byte English essay (Essays)
- a 210,931-byte collection of USENET news articles (News)
- a 119,779-byte Pascal source (Pascal)
- a 201,746-byte LaTeX manual source (Tex)
Experiment result

Compression ratios (compressed size / original size) for various dictionary sizes:

Test files  512   1024  2048  4096
Essays      0.57  0.49  0.45  0.43
News        0.70  0.64  0.60  0.56
Pascal      0.52  0.41  0.35  0.31
Tex         0.59  0.50  0.45  0.42