Chapter 3: Compression (Part 1)

TRANSCRIPT

Page 1: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

1

Chapter 3: Compression (Part 1)

Page 2: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

2

Storage size for media types

Media      Object type                                         Size and bandwidth
Text       ASCII, EBCDIC, BIG5, GB                             2 KB per page
Image      Bitmapped graphics, still photos, faxes             Simple (640x480): 1 MB per image
Audio      Non-coded stream of digitized audio or voice        Voice/phone (8 kHz, 8 bits, mono): 8 KB/s; Audio CD (44.1 kHz, 16 bits, stereo): 176 KB/s
Animation  Synched image and music stream at 15-19 frames/s    6.5 MB/s for 320x640x16-bit pixels/frame at 16 frames/s
Video      Digital video at 25-30 frames/s                     27.7 MB/s for 640x480x24-bit pixels/frame at 30 frames/s

Page 3: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

3

A typical encyclopedia

50,000 pages of text (2 KB/page): 0.1 GB
3,000 color pictures (on average 640x480x24 bits = 1 MB/picture): 3 GB
500 maps/drawings (on average 640x480x16 bits = 0.6 MB/map): 0.3 GB
60 minutes of stereo sound (176 KB/s): 0.6 GB
30 animations, on average 2 minutes in duration (640x320x16 pixels/frame = 6.5 MB/s): 23.4 GB
50 digitized movies, on average 1 min. in duration (640x480 at 30 frames/s = 27.6 MB/s): 82.8 GB

TOTAL: 110 GB (or 170 CD’s, or about 2 feet high of CD’s)

Page 4: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

4

Compressing an encyclopedia

Text 1:2

Color Images 1:15

Maps 1:10

Stereo Sound 1:6

Animation 1:50

Motion Video 1:50

[Bar chart: uncompressed vs. compressed storage, in GB on a log scale from 0.01 to 100, for text, color images, maps, stereo sound, animation and video]

110 GB -> 2.5 GB
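A quick Python check of the two slides' numbers (sizes in GB and the 1:r ratios listed above; the variable names are just for illustration):

uncompressed = {               # GB, from the "typical encyclopedia" slide
    "text": 0.1, "color images": 3.0, "maps": 0.3,
    "stereo sound": 0.6, "animation": 23.4, "video": 82.8,
}
ratio = {                      # compressed : uncompressed = 1 : r
    "text": 2, "color images": 15, "maps": 10,
    "stereo sound": 6, "animation": 50, "video": 50,
}
total = sum(uncompressed.values())
compressed = sum(size / ratio[k] for k, size in uncompressed.items())
print(round(total, 1), "GB ->", round(compressed, 1), "GB")   # 110.2 GB -> 2.5 GB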

Page 5: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

5

Compression and encoding

We can view compression as an encoding method which transforms a sequence of bits/bytes into an artificial format such that the output (in general) takes less space than the input.

[Diagram: “fat data” passes through a compressor/encoder, which squeezes out the fat (redundancy) and leaves “slim data”]

Page 6: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

6

Redundancy

Compression is made possible by exploiting redundancy in the source data.

For example, redundancy is found in:
character distribution
character repetition
high-usage patterns
positional redundancy
(these categories overlap to some extent)

Page 7: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

7

Symbols and Encoding

We assume a piece of data (such as text, image, etc.) is represented by a sequence of symbols. Each symbol is encoded in a computer by a code (or value).

Example:
English text: abc (symbols), ASCII (coding)
Chinese text: 多媒體 (“multimedia”) (symbols), BIG5 (coding)
Image: color (symbols), RGB (coding)

Page 8: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

8

Character distribution

Some symbols are used more frequently than others.

In English text, ‘e’ and space occur most often.
Fixed-length encoding: use the same number of bits to represent each symbol.
With fixed-length encoding, to represent n symbols we need ⌈log2 n⌉ bits for each code (e.g., 5 symbols need ⌈log2 5⌉ = 3 bits).

Page 9: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

9

Character distribution

Fixed-length encoding may not be the most space-efficient.
Example: “abacabacabacabacabacabacabac”
With fixed-length encoding, we need 2 bits for each of the 3 symbols.
Alternatively, we could use ‘1’ to represent ‘a’, and ‘00’ and ‘01’ to represent ‘b’ and ‘c’, respectively.
How much saving did we achieve?
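A quick check (the string above has 28 symbols: 14 a’s, 7 b’s, 7 c’s): fixed-length encoding needs 28 × 2 = 56 bits, while the variable-length code needs 14 × 1 + 7 × 2 + 7 × 2 = 42 bits, a saving of 14 bits, or 25%.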

Page 10: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

10

Character distribution

Technique: whenever symbols occur with different frequencies, we use variable-length encoding.

For your interest: in 1838, Samuel Morse and Alfred Vail designed the “Morse Code” for the telegraph. Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (3 dots), e.g., E “.” and Z “--..”. Morse & Vail determined the code by counting the letters in each bin of a printer’s type box.

Page 11: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

11

Morse code table

[Morse code table figure]

SOS = · · · — — — · · ·

Page 12: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

12

Character distribution

It can be shown that Morse Code can be improved by no more than 15%.

Principle: use fewer bits to represent frequently occurring symbols, use more bits to represent those rarely occurring ones.

Page 13: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

13

Character repetition

When a string of repetitions of a single symbol occurs, e.g., “aaaaaaaaaaaaaabc”.

Example: long strings of spaces/tabs in forms/tables.

Example: in graphics, it is common to have a line of pixels using the same color.

Technique: run-length encoding.

Page 14: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

14

High usage patterns

Certain patterns (i.e., sequences of symbols) occur very often.
Example: “qu”, “th”, …, in English text.
Example: “begin”, “end”, …, in Pascal source code.
Technique: use a single special symbol to represent a common sequence of symbols, e.g., one new code for ‘th’; i.e., instead of using 2 codes for ‘t’ and ‘h’, use only 1 code for ‘th’.

Page 15: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

15

Positional redundancy

A certain thing appears persistently at a predictable place.
Sometimes, symbols can be predicted. Example: the left-uppermost pixel of my slides is always white.
Technique: tell me the “changes” only.
Example: differential pulse code modulation (DPCM).

Page 16: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

16

Categories of compression techniques

Entropy encoding: regard the data stream as a simple digital sequence; its semantics are ignored. Lossless compression.

Source encoding: take into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio. Lossy compression.

Page 17: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

17

Categories of compression techniques

Hybrid encoding: a combination of the above two.

Page 18: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

A categorization:

Entropy Encoding: Huffman Coding, Run-Length Coding, Arithmetic Coding
Source Coding: Prediction (DPCM, DM), Transformation (FFT, DCT), Vector Quantization
Hybrid Coding: JPEG, MPEG, H.261

Page 19: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

19

Information theory

Entropy is a measure of uncertainty or randomness: the higher the entropy, the less ordered the source.

According to Shannon, the entropy of an information source S is defined as

H(S) = Σi pi · log2(1/pi)

where pi is the probability that symbol Si in S occurs; log2(1/pi) indicates the amount of information contained in Si, i.e., the number of bits needed to encode it.

Page 20: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

20

Information theory

The entropy is the lower bound for lossless compression: when the probability of each symbol is fixed, each symbol must be represented by at least H(S) bits on average.

A guideline: the code lengths for different symbols should vary according to the information carried by the symbol. The coding techniques based on this are called entropy encoding.

Page 21: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

21

Information theory

For example, in an image with a uniform distribution of gray-level intensity, pi is 1/256. The number of bits needed to code each gray level is log2(1/(1/256)) = log2 256 = 8 bits. The entropy of this image is

H(S) = Σ(i=1..256) (1/256) · log2 256 = 8

Page 22: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

22

Information theory

Q: How about an image in which half of the pixels are white and the other half black?
Ans: p(white) = 0.5, p(black) = 0.5, so

H(S) = Σ(i=1..2) 0.5 · log2(1/0.5) = 1

Q: How about an image in which white occurs 1/4 of the time, black 1/4 of the time, and gray 1/2 of the time?
Ans: Entropy = 2 × 0.25 · log2 4 + 0.5 · log2 2 = 1.5.

Entropy encoding is about wisely using just enough bits to represent the occurrence of certain patterns.
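A small Python check of the entropy values above (a sketch; the helper name is illustrative):

import math

def entropy(probs):
    # H(S) = sum over i of p_i * log2(1/p_i); zero-probability symbols contribute 0
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1 / 256] * 256))      # uniform gray levels -> 8.0
print(entropy([0.5, 0.5]))           # half white, half black -> 1.0
print(entropy([0.25, 0.25, 0.5]))    # white 1/4, black 1/4, gray 1/2 -> 1.5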

Page 23: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

23

Variable-length encoding

A problem with variable-length encoding is ambiguity in the decoding process, e.g., with 4 symbols a (‘0’), b (‘1’), c (‘01’), d (‘10’), the bit stream ‘0110’ can be interpreted as abba, or abd, or cba, or cd.

Page 24: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

24

Prefix code

The idea is to translate fixed-sized pieces of input data into variable-length encoding.

We wish to encode each symbol into a sequence of 0’s and 1’s so that no code for a symbol is the prefix of the code of any other symbol – the prefix property.

With the prefix property, we are able to decode a sequence of 0’s and 1’s into a sequence of symbols.
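A tiny sketch of why this works: with a prefix code we can emit a symbol as soon as the bits read so far match a code (the code table here is illustrative, not from the slides):

CODES = {"a": "0", "b": "10", "c": "11"}
DECODE = {v: k for k, v in CODES.items()}

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:          # unambiguous thanks to the prefix property
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

print(decode("010110"))            # "abca"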

Page 25: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

25

Shannon-Fano algorithm

Example: 5 characters are used in a text, with these statistics:

Symbol   A    B   C   D   E
Count    15   7   6   6   5

Algorithm (top-down approach):
1. sort symbols according to their probabilities.
2. divide symbols into 2 half-sets, each with roughly equal counts.
3. all symbols in the first group are given ‘0’ as the first bit of their codes; ‘1’ for the second group.
4. recursively perform steps 2-3 for each group, with the additional bits appended.

Page 26: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

26

Shannon-Fano algorithm

Example:
Gp1: {a,b}; Gp2: {c,d,e}
subdivide Gp1: Gp11: {a}; Gp12: {b}
subdivide Gp2: Gp21: {c}; Gp22: {d,e}
subdivide Gp22: Gp221: {d}; Gp222: {e}

Codes after each step:

Symbol   A    B    C    D    E
Step 1   0..  0..  1..  1..  1..
Step 2   00   01   1..  1..  1..
Step 3   00   01   10   11.  11.
Step 4   00   01   10   110  111

Page 27: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

27

Shannon-Fano algorithm

Result:

Symbol   Count   log2(1/p)   Code   Bits used
A        15      1.38        00     30
B        7       2.48        01     14
C        6       2.70        10     12
D        6       2.70        110    18
E        5       2.96        111    15
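A minimal Python sketch of the top-down procedure (the split point here is chosen to make the two halves' counts as equal as possible; ties could be broken differently, which would change the exact codes):

def shannon_fano(counts):
    def assign(symbols, prefix, codes):
        if len(symbols) == 1:
            codes[symbols[0][0]] = prefix or "0"
            return
        total = sum(c for _, c in symbols)
        run, best, best_diff = 0, 1, float("inf")
        for i in range(1, len(symbols)):
            run += symbols[i - 1][1]
            diff = abs(2 * run - total)      # how unbalanced this split is
            if diff < best_diff:
                best, best_diff = i, diff
        assign(symbols[:best], prefix + "0", codes)
        assign(symbols[best:], prefix + "1", codes)

    ordered = sorted(counts.items(), key=lambda kv: -kv[1])   # step 1: sort by count
    codes = {}
    assign(ordered, "", codes)
    return codes

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}  -> 89 bits total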

Page 28: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

28

Shannon-Fano algorithm

The coding scheme in a tree representation:

Total # of bits used for encoding is 89. An uncompressed code would need 3 bits for each symbol, a total of 117 bits.

[Tree figure: the root (weight 39) splits into a left subtree of weight 22 (A = 15, B = 7) and a right subtree of weight 17 (C = 6 plus an inner node of weight 11 covering D = 6 and E = 5); left branches are labelled 0, right branches 1]

Page 29: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

29

Shannon-Fano algorithm

It can be shown that the number of bits used to encode a symbol on average using the Shannon-Fano algorithm is between H(S) and H(S) + 1. Hence it is fairly close to the theoretical optimum.

Page 30: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

30

Huffman code

used in many compression programs, e.g., PKzip, ARJ.

Each code length in bits is approximately log2(1/pj). Example:
Pr[space] = 0.1919182 → 2.4 bits; Pr[Z] = 0.0007836 → 10.3 bits.

Page 31: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

31

Huffman code

Algorithm (bottom-up approach):
We start with a forest of n trees. Each tree has only 1 node, corresponding to a symbol.
Let’s define the weight of a tree = the sum of the probabilities of all the symbols in the tree.
Repeat the following steps until the forest has one tree left:
pick the 2 trees of the lightest weights, let’s say T1 and T2;
replace T1 and T2 by a new tree T, such that T1 and T2 are T’s children.

Page 32: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

32

Huffman code

Step by step (weights in parentheses):
Initial forest: a(15/39), b(7/39), c(6/39), d(6/39), e(5/39).
Merge the two lightest trees, e(5/39) and d(6/39), into a tree of weight 11/39.
Then merge the two lightest remaining trees, c(6/39) and b(7/39), into a tree of weight 13/39.

Page 33: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

33

Huffman code

Next, merge the trees of weight 11/39 and 13/39 into a tree of weight 24/39. (One final merge with a(15/39) leaves a single tree of weight 39/39.)

Page 34: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

34

Huffman code

Result from the previous example: [final Huffman tree figure]

Page 35: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

35

Huffman code

The code of a symbol is obtained by traversing the tree from the root to the symbol, taking a ‘0’ when a left branch is followed; and a ‘1’ when a right branch is followed.

Page 36: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

36

Huffman code

char   code   bits/char   # occ.   # bits used
A      0      1           15       15
B      100    3           7        21
C      101    3           6        18
D      110    3           6        18
E      111    3           5        15

total: 87 bits

“ABAAAE” → “0100000111”
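A minimal Huffman sketch in Python (using the symbol counts above; heapq builds the forest bottom-up, and since ties between equal weights may be broken differently, the exact bit patterns can differ from the table while the code lengths stay the same):

import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()                       # keeps tuple comparisons well-defined
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                     # merge the two lightest trees
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
print(codes)                                 # one optimal code assignment
print("".join(codes[c] for c in "ABAAAE"))   # encodes ABAAAE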

Page 37: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

37

Huffman code (advantages)

Decoding is trivial as long as the coding table is sent before the data; the overhead is relatively negligible if the data file is big.
Prefix property: no code is a prefix of any other code, so forward scanning is never ambiguous. Good for the decoder.

Page 38: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

38

Huffman code (advantages)

“Optimal code size” (among any coding scheme that uses a fixed code for each symbol).

Entropy = (15 × 1.38 + 7 × 2.48 + 6 × 2.70 + 6 × 2.70 + 5 × 2.96) / 39 = 85.26 / 39 ≈ 2.19 bits/symbol

Average bits/symbol needed for Huffman coding = 87 / 39 ≈ 2.23

Page 39: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

39

Huffman code (disadvantages)

A table is needed that lists each input symbol and its corresponding code.
Need to know the symbol frequency distribution (this might be OK for English text, but not for arbitrary data files); it needs two passes over the data.
The frequency distribution may change over the content of the source.

Page 40: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

40

Arithmetic coding

Motivation: Huffman coding uses an integer number (k) of bits for each symbol, and k is at least 1. Sometimes, e.g., when sending bi-level images, compression does not help. In fact, Huffman is optimal (i.e., can attain the theoretical information lower bound) when and only when the symbol probabilities are integral powers of 1/2.

Arithmetic coding works by representing a message by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to represent the interval increases.

Page 41: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

Idea: suppose the alphabet is {X, Y} and p(X) = 2/3, p(Y) = 1/3. Map X and Y to the interval [0..1]:

X: [0, 2/3)   Y: [2/3, 1)

To encode length-2 messages, we map all possible messages to the interval [0..1]:

XX: [0, 4/9)   XY: [4/9, 6/9)   YX: [6/9, 8/9)   YY: [8/9, 1)

Now, just send enough bits of the binary fraction that uniquely specifies the interval, e.g.

XX → .01 (1/4)   XY → .10 (2/4)   YX → .110 (3/4)   YY → .1111 (15/16)

Page 42: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

42

Arithmetic coding

Similarly, we can map all length-3 messages as follows (each length-1 interval splits into length-2 intervals, which split into length-3 intervals):

XXX: [0, 8/27)
XXY: [8/27, 12/27)
XYX: [12/27, 16/27)
XYY: [16/27, 18/27)
YXX: [18/27, 22/27)
YXY: [22/27, 24/27)
YYX: [24/27, 26/27)
YYY: [26/27, 1)

Page 43: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

43

Arithmetic coding

Decoding .110 (against the same length-3 interval table as above):

on receiving ‘1’: marker .1???, interval [0.5, 1], symbol: ? (not yet determined)

Page 44: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

44

Arithmetic coding

Decoding .110 (continued):

on receiving ‘11’: marker .11??, interval [0.75, 1], symbol: Y

Page 45: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

45

Arithmetic coding

Decoding .110 (continued):

on receiving ‘110’: marker .110?, interval [0.75, 0.875), symbols: YX

Page 46: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

46

Arithmetic coding

Decoding .110 (continued):

on receiving ‘110’: marker .110, interval [0.75, 0.875). If no more bits are received, there is insufficient data to deduce the 3rd symbol; hence ‘110’ represents the symbol string “YX”.

Page 47: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

47

Arithmetic coding

The number of bits in the codeword is determined by the size of the interval. The first interval is 8/27, which needs 2 bits, i.e., 2/3 bits per symbol (efficient). The last interval is 1/27, which needs 5 bits. In general, it needs log2(1/p) bits to represent an interval of size p.
The code length approaches the optimal encoding as the message unit gets longer.
Although arithmetic coding gives good compression efficiency, it is very CPU-demanding (in generating the code).
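A minimal sketch of the scheme above for the alphabet {X, Y} with p(X) = 2/3, p(Y) = 1/3 (exact fractions; the helper names are illustrative, real coders emit bits incrementally with finite precision, and when several binary fractions fit an interval the one chosen here may differ from the slide's):

import math
from fractions import Fraction

PROB = {"X": Fraction(2, 3), "Y": Fraction(1, 3)}
CUM  = {"X": Fraction(0),    "Y": Fraction(2, 3)}    # cumulative lower bounds

def message_interval(message):
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * CUM[sym]            # narrow [low, low+width) to this symbol
        width *= PROB[sym]
    return low, low + width

def interval_to_bits(low, high):
    # shortest .b1..bn whose whole dyadic interval [k/2^n, (k+1)/2^n) fits in [low, high)
    n = 0
    while True:
        n += 1
        k = math.ceil(low * 2 ** n)
        if Fraction(k + 1, 2 ** n) <= high:
            return format(k, f"0{n}b")

low, high = message_interval("YX")
print(low, high, interval_to_bits(low, high))   # 2/3 8/9 110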

Page 48: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

48

Run-length encoding (RLE)

Image, video and audio data streams often contain sequences of the same bytes, e.g., background in images, and silence in sound files.

Substantial compression can be achieved by replacing repeated byte sequences with the number of occurrences.

Page 49: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

49

Run-length encoding

If a byte occurs at least 4 consecutive times, the sequence is encoded as the byte followed by a special flag (i.e., a special byte) and the number of its occurrences, e.g.
Uncompressed: ABCCCCCCCCDEFGGGHI
Compressed: ABC!8DEFGGGHI
If 4 Cs (CCCC) are encoded as C!0, 5 Cs as C!1, and so on, then runs of 4 to 259 bytes compress into only 3 bytes.
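A minimal sketch of the counting variant just described (the count byte stores length − 4, so C!0 means four Cs; the flag byte ‘!’ and the function name are illustrative, and escaping of literal ‘!’ bytes in the data is omitted):

FLAG = ord("!")

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 259:
            j += 1                                   # extend the current run (max 259)
        run = j - i
        if run >= 4:
            out += bytes([data[i], FLAG, run - 4])   # <byte><flag><length-4>
        else:
            out += data[i:j]                         # short runs are copied literally
        i = j
    return bytes(out)

print(rle_encode(b"ABCCCCCCCCDEFGGGHI"))   # b'ABC!\x04DEFGGGHI' (8 Cs -> count byte 4)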

Page 50: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

50

Run-length encoding

Problems:
Ineffective with text (except, perhaps, “Ahhhhhhhhhhhhhhhhh!” or “Noooooooooo!”).
Needs an escape byte (‘!’): fine for text, not for arbitrary binary data. Solution?

Used in ITU-T Group 3 fax compression (a combination of RLE and Huffman); compression ratio: 10:1 or better.

Page 51: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types
Page 52: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types
Page 53: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

53

Lempel-Ziv-Welch algorithm

Method due to Ziv and Lempel in 1977 and 1978, with an improvement by Welch in 1984; hence it is named LZW compression.

used in UNIX compress, GIF, and V.42bis compression modems.

Motivation: suppose we want to encode an English text. We first encode Webster’s English dictionary, which contains about 159,000 entries, so 18 bits surely suffice for a word (2^18 = 262,144). Then why don’t we just transmit each word in the text as an 18-bit number, and so obtain a compressed version of the text?

Page 54: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

54

LZW

Problem: we need to preload a copy of the dictionary before decoding, and the dictionary is huge (159,000 entries).

Solution: build the dictionary on-line while decoding, so no bandwidth or storage is wasted on preloading. Build the dictionary adaptively: words not used in the text are not incorporated into the dictionary.

Page 55: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

55

LZW

Example: encode “^WED^WE^WEE^WEB^WET” using LZW.

Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes for the characters run from 0 to 255). We assume that the dictionary is loaded with these 256 codes initially.

While scanning the string, sub-string patterns are encoded into the dictionary. Patterns that appear again after the first time are then substituted by their index into the dictionary.

Page 56: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

56

Example 1 (compression)

Input: ^WED^WE^WEE^WEB^WET$   ($ marks the end of input)

Buffer              New dict entry   Index   Output
^W                  ^W               256     ^
WE                  WE               257     W
ED                  ED               258     E
D^                  D^               259     D
^W → ^WE            ^WE              260     256 (^W)
E^                  E^               261     E
^W → ^WE → ^WEE     ^WEE             262     260 (^WE)
E^ → E^W            E^W              263     261 (E^)
WE → WEB            WEB              264     257 (WE)
B^                  B^               265     B
^W → ^WE → ^WET     ^WET             266     260 (^WE)
T$                                           T

(An arrow means the buffer already matched a dictionary entry and was extended with the next symbol before a new entry was created.)

Page 57: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

57

Example 1 (compression)

(Same trace as above. Note that the symbols in the Output column should be interpreted as numeric codes.)

Page 58: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

58

Example 1 (compression)

(Same trace as above.)

Page 59: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

59

LZW (encoding rules)

Rule 1: whenever the pattern in the buffer is in the dictionary, we try to extend it.

Rule 2: whenever a pattern is added to the dictionary, we output the codeword of the pattern before the last extension.

Page 60: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

60

Example 1

Original: 19 symbols, 8-bit code each → 19 × 8 = 152 bits
LZW’ed: 12 codes, 9 bits each → 12 × 9 = 108 bits
Some compression achieved. Compression effectiveness is more substantial for long messages.
The dictionary is not explicitly stored. In a sense, the compressor is encoding the dictionary in the output stream implicitly.

Page 61: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

61

LZW decompression

“^WED{256}E{260}{261}{257}B{260}T” is read in (recall that these are codes).

Assume the decoder also knows the first 256 entries of the dictionary.

While scanning the code stream for decoding, the dictionary is rebuilt. When an index is read, the corresponding dictionary entry is retrieved and spliced into position.

The decoder deduces what the buffer of the encoder has gone through and rebuilds the dictionary.

Page 62: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

62

Example 1 (decompression)

Input codes: ^ W E D {256} E {260} {261} {257} B {260} T   ($ marks the end)

Code read   Output   New dict entry   Index
^           ^
W           W        ^W               256
E           E        WE               257
D           D        ED               258
{256}       ^W       D^               259
E           E        ^WE              260
{260}       ^WE      E^               261
{261}       E^       ^WEE             262
{257}       WE       E^W              263
B           B        WEB              264
{260}       ^WE      B^               265
T           T        ^WET             266

Decoded string: ^WED^WE^WEE^WEB^WET

Page 63: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

63

LZW (decoding rule)

Whenever there are 2 codes C1 and C2 in the buffer, add a new entry to the dictionary: the symbols of C1 + the first symbol of C2.

Page 64: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

Compression code:

w = NIL;
while ( read a symbol k ) {
    if ( wk exists in dictionary )
        w = wk;                      // extend the buffer
    else {
        output code for w;
        add wk to dictionary;
        w = k;
    }
}
output code for w;                   // flush whatever is left in the buffer

Decompression code:

read a code k;
output symbol represented by k;
w = k;
while ( read a code k ) {
    ent = dictionary entry for k;    // (a full decoder must also handle a code
                                     //  that is not yet in the dictionary)
    output symbol(s) in entry ent;
    add w + ent[0] to dictionary;
    w = ent;
}
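A runnable Python rendering of the pseudocode above (a sketch: single-character codes 0-255 preload the dictionary as on the slides, and the decoder handles the one case the pseudocode glosses over, a code that is defined in the very step that uses it):

def lzw_compress(text):
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w, out = "", []
    for k in text:
        if w + k in dictionary:
            w += k                          # extend the buffer
        else:
            out.append(dictionary[w])
            dictionary[w + k] = next_code   # record the new pattern
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])           # flush the final buffer
    return out

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            ent = dictionary[k]
        else:                               # code defined in this very step
            ent = w + w[0]
        out.append(ent)
        dictionary[next_code] = w + ent[0]  # rebuild the encoder's dictionary
        next_code += 1
        w = ent
    return "".join(out)

codes = lzw_compress("^WED^WE^WEE^WEB^WET")
print(codes)                  # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
print(lzw_decompress(codes))  # ^WED^WE^WEE^WEB^WET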

Page 65: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

65

Example 2

ababcbababaaaaaaa

Page 66: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

66

Example 2 (compression)

ababcbababaaaaaaa$   Initial dictionary: a = 1, b = 2, c = 3

Buffer              New dict entry   Index   Output
ab                  ab               4       1 (a)
ba                  ba               5       2 (b)
ab → abc            abc              6       4 (ab)
cb                  cb               7       3 (c)
ba → bab            bab              8       5 (ba)
ba → bab → baba     baba             9       8 (bab)
aa                  aa               10      1 (a)
aa → aaa            aaa              11      10 (aa)
aa → aaa → aaaa     aaaa             12      11 (aaa)
a$                                           1 (a)

Output stream: 1 2 4 3 5 8 1 10 11 1
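For Example 2 the initial dictionary has only three entries, so here is a hypothetical variant of the compressor sketched earlier that takes the initial dictionary as a parameter; it reproduces the output column above:

def lzw_compress_with(text, initial):
    dictionary = dict(initial)
    next_code = max(initial.values()) + 1
    w, out = "", []
    for k in text:
        if w + k in dictionary:
            w += k
        else:
            out.append(dictionary[w])
            dictionary[w + k] = next_code
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])
    return out

print(lzw_compress_with("ababcbababaaaaaaa", {"a": 1, "b": 2, "c": 3}))
# [1, 2, 4, 3, 5, 8, 1, 10, 11, 1]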

Page 67: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

67

LZW: Pros and Cons

LZW (advantages):
effective at exploiting character-repetition redundancy and high-usage-pattern redundancy
an adaptive algorithm (no need to learn the statistical properties of the symbols beforehand)
a one-pass procedure

Page 68: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

68

LZW: Pros and Cons

LZW (disadvantages):
Messages must be long for effective compression, since it takes time for the dictionary to build up.
If the message is not homogeneous (its redundancy characteristics shift during the message), compression efficiency drops.

Page 69: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

69

LZW (Implementation)

Decompression is generally faster than compression. Why?
Usually uses 12-bit codes → 4096 table entries.
A hash table is needed for compression.

Page 70: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

70

LZW and GIF

11 colors: create a color map of 11 entries:

color             code
[r0, g0, b0]      0
[r1, g1, b1]      1
[r2, g2, b2]      2
…                 …
[r10, g10, b10]   10

Convert the 320 x 200 image into a sequence of codes, then apply LZW to the code stream.

Page 71: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

GIF: 5280 bytes (0.66 bpp)
JPEG: 16327 bytes (2.04 bpp)
(bpp = bits per pixel)

Page 72: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

72

GIF image

Page 73: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

73

JPEG image

Page 74: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

74

V.42bis

Based on the LZW algorithm.
Dictionary entries can be recycled.
Transparent mode: compression is disabled but the dictionary is still updated (used when transmitting already-compressed files).
“bis” is a suffix meaning a modified standard, derived from the Latin “bis”, meaning twice.

Page 75: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

75

An experiment

Hayes Microcomputer Products, Inc.’s implementation of V.42bis compression, tested on 4 different types of files:
a 176,369-byte English essay (Essays)
a 210,931-byte set of USENET news articles (News)
a 119,779-byte Pascal source (Pascal)
a 201,746-byte LaTeX manual source (Tex)

Page 76: 1 Chapter 3: Compression (Part 1). 2 Storage size for media types

76

Experiment result: compression ratios by dictionary size

Test file   512    1024   2048   4096
Essays      0.57   0.49   0.45   0.43
News        0.70   0.64   0.60   0.56
Pascal      0.52   0.41   0.35   0.31
Tex         0.59   0.50   0.45   0.42