TRANSCRIPT

Chapter 3: Compression (Part 1)
Storage size for media types

Text
- object type: ASCII, EBCDIC, BIG5, GB
- size/bandwidth: 2 KB per page

Image
- object type: bitmapped graphics, still photos, faxes
- size/bandwidth: simple image (640x480): 1 MB per image

Audio
- object type: noncoded stream of digitized audio or voice
- size/bandwidth: voice/phone (8 kHz, 8 bits, mono): 8 KB/s; audio CD (44.1 kHz, 16 bits, stereo): 176 KB/s

Animation
- object type: synched image and music stream at 15-19 frames/s
- size/bandwidth: 6.5 MB/s for 320x640 pixels/frame (16-bit color) at 16 frames/s

Video
- object type: digital video at 25-30 frames/s
- size/bandwidth: 27.7 MB/s for 640x480 pixels/frame (24-bit color) at 30 frames/s
A typical encyclopedia

- 50,000 pages of text (2 KB/page): 0.1 GB
- 3,000 color pictures (on average 640x480x24 bits = 1 MB/picture): 3 GB
- 500 maps/drawings (on average 640x480x16 bits = 0.6 MB/map): 0.3 GB
- 60 minutes of stereo sound (176 KB/s): 0.6 GB
- 30 animations, on average 2 minutes in duration (640x320x16 pixels/frame = 6.5 MB/s): 23.4 GB
- 50 digitized movies, on average 1 min. in duration (640x480x24 bits/frame at 30 frames/s = 27.6 MB/s): 82.8 GB

TOTAL: 110 GB (or 170 CDs, a stack of CDs about 2 feet high)
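These figures can be checked with a few lines of Python; a back-of-the-envelope sketch (assuming, as the slide's rounding suggests, GB = 10^9 bytes):

    GB, MB, KB = 10**9, 10**6, 10**3

    text      = 50_000 * 2 * KB          # 0.1 GB
    pictures  = 3_000 * 1 * MB           # 3 GB
    maps      = 500 * 0.6 * MB           # 0.3 GB
    sound     = 60 * 60 * 176 * KB       # 0.6 GB
    animation = 30 * 2 * 60 * 6.5 * MB   # 23.4 GB
    movies    = 50 * 1 * 60 * 27.6 * MB  # 82.8 GB

    total = text + pictures + maps + sound + animation + movies
    print(round(total / GB, 1))          # 110.2, i.e., ~110 GB as on the slide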
Compressing an encyclopedia

Typical compression ratios:
- Text: 1:2
- Color images: 1:15
- Maps: 1:10
- Stereo sound: 1:6
- Animation: 1:50
- Motion video: 1:50

[Bar chart: uncompressed vs. compressed size for text, color images, maps, stereo sound, animation and video, in GB on a log scale from 0.01 to 100.]

Overall: 110 GB -> 2.5 GB.
Compression and encoding

We can view compression as an encoding method which transforms a sequence of bits/bytes into an artificial format such that the output (in general) takes less space than the input.

[Diagram: "fat data" goes into a compressor/encoder; the "fat" (redundancy) is squeezed out, leaving "slim data".]
Redundancy

Compression is made possible by exploiting redundancy in source data.

For example, redundancy is found in:
- character distribution
- character repetition
- high-usage patterns
- positional redundancy

(These categories overlap to some extent.)
Symbols and Encoding

We assume a piece of data (such as text, image, etc.) is represented by a sequence of symbols. Each symbol is encoded in a computer by a code (or value).

Examples:
- English text: abc (symbols), ASCII (coding)
- Chinese text: 多媒體 (symbols), BIG5 (coding)
- Image: color (symbols), RGB (coding)
Character distribution

Some symbols are used more frequently than others; in English text, 'e' and space occur most often.

Fixed-length encoding: use the same number of bits to represent each symbol. With fixed-length encoding, to represent n symbols we need ceil(log2 n) bits for each code.
Character distribution

Fixed-length encoding may not be the most space-efficient.

Example: "abacabacabacabacabacabacabac". With fixed-length encoding, we need 2 bits for each of the 3 symbols. Alternatively, we could use '1' to represent 'a', and '00' and '01' to represent 'b' and 'c', respectively. How much saving did we achieve?
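(Answer: the string is "abac" repeated 7 times, i.e., 28 symbols: 14 a's, 7 b's and 7 c's. Fixed-length encoding needs 28 x 2 = 56 bits; the variable-length code needs 14 x 1 + 7 x 2 + 7 x 2 = 42 bits, a saving of 14 bits, or 25%.)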
Character distribution

Technique: whenever symbols occur with different frequencies, we use variable-length encoding.

For your interest: in 1838, Samuel Morse and Alfred Vail designed the "Morse Code" for the telegraph. Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (3 dots), e.g., E "." and Z "--..". Morse and Vail determined the code by counting the letters in each bin of a printer's type box.
Morse code table

[Table of the Morse code for each character omitted.]

SOS = "...---..." (three dots, three dashes, three dots)
Character distribution

It can be shown that Morse Code can be improved by no more than 15%.

Principle: use fewer bits to represent frequently occurring symbols, and more bits to represent rarely occurring ones.
Character repetition

Occurs when a string of repetitions of a single symbol appears, e.g., "aaaaaaaaaaaaaabc".

Example: long strings of spaces/tabs in forms and tables.
Example: in graphics, it is common to have a line of pixels using the same color.

Technique: run-length encoding.
High-usage patterns

Certain patterns (i.e., sequences of symbols) occur very often.
Example: "qu", "th", ..., in English text.
Example: "begin", "end", ..., in Pascal source code.

Technique: use a single special symbol to represent a common sequence of symbols, e.g., one special symbol for 'th'; i.e., instead of using 2 codes for 't' and 'h', use only 1 code for 'th'.
Positional redundancy

A certain thing appears persistently at a predictable place; sometimes, symbols can be predicted.
Example: the left-uppermost pixel of my slides is always white.

Technique: tell me the "changes" only.
Example: differential pulse code modulation (DPCM), sketched below.
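To make the differential idea concrete, here is a minimal Python sketch (lossless delta coding only; real DPCM adds prediction and quantization on top of this):

    # Store the first sample, then only the change from each sample to the next.
    def dpcm_encode(samples):
        deltas = [samples[0]]
        for prev, cur in zip(samples, samples[1:]):
            deltas.append(cur - prev)
        return deltas

    def dpcm_decode(deltas):
        samples = [deltas[0]]
        for d in deltas[1:]:
            samples.append(samples[-1] + d)
        return samples

    pixels = [200, 201, 201, 202, 204, 204]
    print(dpcm_encode(pixels))  # [200, 1, 0, 1, 2, 0] -- small deltas are cheap to code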
Categories of compression techniques

Entropy encoding: regards the data stream as a simple digital sequence, ignoring its semantics. Lossless compression.

Source encoding: takes into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio. Lossy compression.
Categories of compression techniques

Hybrid encoding: a combination of the above two.

A categorization:
- Entropy encoding: Huffman coding, run-length coding, arithmetic coding
- Source coding: prediction (DPCM, DM), transformation (FFT, DCT), vector quantization
- Hybrid coding: JPEG, MPEG, H.261
Information theory

Entropy is a measure of uncertainty or randomness: the higher the entropy, the less order in the data.

According to Shannon, the entropy of an information source S is defined as

    H(S) = \sum_i p_i \log_2 (1 / p_i)

where p_i is the probability that symbol S_i in S occurs. The term \log_2(1/p_i) indicates the amount of information contained in S_i, i.e., the number of bits needed to encode it.
1
20
Information theory
The entropy is the lower bound for lossless compression: when the probability of each symbol is fixed, each symbol should be represented at least with H(S) bits on average.
A guideline: the code lengths for different symbols should vary according to the information carried by the symbol. The coding techniques based on this are called entropy encoding.
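A minimal Python sketch of the definition (the example distributions are the ones used on the following slides):

    from math import log2

    # H(S) = sum over symbols of p_i * log2(1/p_i):
    # the average number of bits needed per symbol.
    def entropy(probs):
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([1 / 256] * 256))    # 8.0 -- uniform gray-level image
    print(entropy([0.5, 0.5]))         # 1.0 -- half white, half black
    print(entropy([0.25, 0.25, 0.5]))  # 1.5 -- white 1/4, black 1/4, gray 1/2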
Information theory

For example, in an image with a uniform distribution of gray-level intensities, each p_i is 1/256. The number of bits needed to code each gray level is \log_2(1/(1/256)) = \log_2 256 = 8 bits. The entropy of this image is

    H(S) = \sum_{i=1}^{256} \frac{1}{256} \log_2 256 = 8
Information theory

Q: How about an image in which half of the pixels are white and the other half black?
A: p(white) = 0.5, p(black) = 0.5, so

    H(S) = \sum_{i=1}^{2} 0.5 \log_2 (1/0.5) = 1

Q: How about an image in which white occurs 1/4 of the time, black 1/4 of the time, and gray 1/2 of the time?
A: Entropy = 2 x (0.25 x 2) + 0.5 x 1 = 1.5.

Entropy encoding is about wisely using just enough bits to represent the occurrence of certain patterns.
Variable-length encoding

A problem with variable-length encoding is ambiguity in the decoding process. E.g., with 4 symbols a ('0'), b ('1'), c ('01'), d ('10'), the bit stream '0110' can be interpreted as abba, abd, cba, or cd.
Prefix code
The idea is to translate fixed-sized pieces of input data into variable-length encoding.
We wish to encode each symbol into a sequence of 0’s and 1’s so that no code for a symbol is the prefix of the code of any other symbol – the prefix property.
With the prefix property, we are able to decode a sequence of 0’s and 1’s into a sequence of symbols.
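To see why the prefix property makes decoding a single unambiguous left-to-right scan, here is a minimal Python sketch, using the earlier code a = '1', b = '00', c = '01':

    # Extend the current bit string until it matches a code; the prefix
    # property guarantees that at most one code can ever match.
    def prefix_decode(bits, code_to_symbol):
        out, cur = [], ''
        for b in bits:
            cur += b
            if cur in code_to_symbol:
                out.append(code_to_symbol[cur])
                cur = ''
        return ''.join(out)

    codes = {'1': 'a', '00': 'b', '01': 'c'}
    print(prefix_decode('10001', codes))  # 'abc'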
Shannon-Fano algorithm

Example: 5 characters are used in a text, with these statistics:

Symbol  A   B   C   D   E
Count   15  7   6   6   5

Algorithm (top-down approach):
1. Sort symbols according to their probabilities.
2. Divide the symbols into 2 half-sets, each with roughly equal counts.
3. All symbols in the first group are given '0' as the first bit of their codes; '1' for the second group.
4. Recursively perform steps 2-3 for each group, with the additional bits appended.
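A minimal Python sketch of the top-down split (how to pick the split point when the halves cannot be exactly equal is an assumption here; for these counts it reproduces the codes derived on the next slides):

    def shannon_fano(items):
        # items: list of (symbol, count) pairs, sorted by descending count
        if len(items) == 1:
            return {items[0][0]: ''}
        total = sum(c for _, c in items)
        # choose the split that makes the two groups' counts most equal
        split, best, running = 1, total, 0
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)
            if diff < best:
                best, split = diff, i
        codes = {s: '0' + c for s, c in shannon_fano(items[:split]).items()}
        codes.update({s: '1' + c for s, c in shannon_fano(items[split:]).items()})
        return codes

    counts = [('A', 15), ('B', 7), ('C', 6), ('D', 6), ('E', 5)]
    print(shannon_fano(counts))
    # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}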
Shannon-Fano algorithm

Example:
- Gp1: {A,B}; Gp2: {C,D,E}
  Codes so far:  A: 0..  B: 0..  C: 1..  D: 1..  E: 1..
- Subdivide Gp1: Gp11: {A}; Gp12: {B}
  Codes so far:  A: 00   B: 01   C: 1..  D: 1..  E: 1..
- Subdivide Gp2: Gp21: {C}; Gp22: {D,E}
  Codes so far:  A: 00   B: 01   C: 10   D: 11.  E: 11.
- Subdivide Gp22: Gp221: {D}; Gp222: {E}
  Final codes:   A: 00   B: 01   C: 10   D: 110  E: 111
Shannon-Fano algorithm

Result:

Symbol  Count  log2(1/p)  Code  Bits used
A       15     1.38       00    30
B       7      2.48       01    14
C       6      2.70       10    12
D       6      2.70       110   18
E       5      2.96       111   15
Shannon-Fano algorithm

The coding scheme in a tree representation:

[Binary tree: the root (weight 39) splits into {A,B} (22) and {C,D,E} (17); 22 splits into A (15) and B (7); 17 splits into C (6) and {D,E} (11); 11 splits into D (6) and E (5). Left branches are labeled 0, right branches 1.]

The total number of bits used for encoding is 89. An uncompressed code would need 3 bits for each symbol, a total of 117 bits.
Shannon-Fano algorithm

It can be shown that the number of bits used to encode a symbol on average with the Shannon-Fano algorithm is between H(S) and H(S) + 1; hence it is fairly close to the theoretical optimum.
Huffman code

Used in many compression programs, e.g., PKzip, ARJ.

Each code length in bits is approximately log2(1/p_j). Example: Pr[space] = 0.1919182 -> 2.4 bits; Pr[Z] = 0.0007836 -> 10.3 bits.
Huffman code

Algorithm (bottom-up approach), with a code sketch after the worked example below:
- Start with a forest of n trees, each having only 1 node corresponding to a symbol.
- Define the weight of a tree as the sum of the probabilities of all the symbols in the tree.
- Repeat the following steps until the forest has one tree left:
  - pick the 2 trees of lightest weight, say T1 and T2;
  - replace T1 and T2 by a new tree T such that T1 and T2 are T's children.
Huffman code

[Merging steps: start with the forest a(15/39), b(7/39), c(6/39), d(6/39), e(5/39). Merge the two lightest, d(6/39) and e(5/39), into a tree of weight 11/39; then merge b(7/39) and c(6/39) into 13/39; then merge 11/39 and 13/39 into 24/39; finally merge a(15/39) and 24/39 into the root.]
Huffman code

[Figure: the resulting tree from the previous example.]

The code of a symbol is obtained by traversing the tree from the root to the symbol, taking a '0' when a left branch is followed and a '1' when a right branch is followed.
Huffman code

Char  Code  Bits/char  # occ.  # bits used
A     0     1          15      15
B     100   3          7       21
C     101   3          6       18
D     110   3          6       18
E     111   3          5       15

Total: 87 bits

"ABAAAE" -> "0100000111"
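A minimal Python sketch of the bottom-up construction, using a min-heap as the forest (the 0/1 assignment at equal weights may differ from the table above, but the code lengths, and hence the 87-bit total, are the same):

    import heapq
    from itertools import count

    def huffman(freqs):
        tick = count()  # tie-breaker so the heap never compares two dicts
        heap = [(f, next(tick), {sym: ''}) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # the two lightest trees...
            f2, _, right = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in left.items()}
            merged.update({s: '1' + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, next(tick), merged))  # ...become one
        return heap[0][2]

    print(huffman({'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5}))
    # e.g. {'A': '0', 'E': '100', 'C': '101', 'D': '110', 'B': '111'}
    # code lengths 1, 3, 3, 3, 3 -- 87 bits in total, as in the table above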
Huffman code (advantages)

Decoding is trivial as long as the coding table is sent before the data; the overhead is relatively negligible if the data file is big.

Prefix property: no code is a prefix of any other code, so forward scanning is never ambiguous. Good for the decoder.
Huffman code (advantages)

"Optimal code size" (among coding schemes that use a fixed code for each symbol). The entropy is

    (15 * 1.38 + 7 * 2.48 + 6 * 2.70 + 6 * 2.70 + 5 * 2.96) / 39 = 85.26 / 39 ≈ 2.19 bits/symbol

while the average number of bits per symbol needed by Huffman coding is

    87 / 39 ≈ 2.23 bits/symbol
Huffman code (disadvantages)

A table is needed that lists each input symbol and its corresponding code.

We need to know the symbol frequency distribution (this might be OK for English text, but not for arbitrary data files), which requires two passes over the data.

The frequency distribution may change over the content of the source.
Arithmetic coding

Motivation: Huffman coding uses an integer number k of bits for each symbol, and k is at least 1. Sometimes, e.g., when sending bi-level images, compression does not help. In fact, Huffman coding is optimal (i.e., attains the theoretical information lower bound) when and only when the symbol probabilities are integral powers of 1/2.

Arithmetic coding works by representing a message by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to represent the interval increases.

Idea: suppose the alphabet is {X,Y}, with p(X) = 2/3 and p(Y) = 1/3. Map X and Y to subintervals of [0..1]. To encode length-2 messages, we map all possible messages to subintervals of [0..1]. Then we just send enough bits of the binary fraction that uniquely specifies the interval.
[Figure: X -> [0, 2/3), Y -> [2/3, 1). Length-2 messages: XX -> [0, 4/9), XY -> [4/9, 6/9), YX -> [6/9, 8/9), YY -> [8/9, 1). Sample binary fractions: 1/4 = .01, 2/4 = .10, 3/4 = .110, 15/16 = .11111.]
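A minimal Python sketch of the interval idea for this two-symbol alphabet (exact Fractions avoid floating-point trouble; emitting the shortest bit string whose binary interval fits inside the message's interval is one possible convention, chosen here because it reproduces the '110' of the decoding example below):

    import math
    from fractions import Fraction

    RANGES = {'X': (Fraction(0), Fraction(2, 3)),   # p(X) = 2/3
              'Y': (Fraction(2, 3), Fraction(1))}   # p(Y) = 1/3

    def message_interval(message):
        low, high = Fraction(0), Fraction(1)
        for sym in message:
            lo, hi = RANGES[sym]
            width = high - low
            low, high = low + width * lo, low + width * hi
        return low, high

    def interval_to_bits(low, high):
        # shortest bit string whose interval [k/2^n, (k+1)/2^n) fits in [low, high)
        n = 1
        while True:
            k = math.ceil(low * 2 ** n)
            if Fraction(k + 1, 2 ** n) <= high:
                return format(k, 'b').zfill(n)
            n += 1

    print(message_interval('YX'))                     # (Fraction(2, 3), Fraction(8, 9))
    print(interval_to_bits(*message_interval('YX')))  # '110'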
Arithmetic coding

Similarly, we can map all length-3 messages as follows:

XXX: [0, 8/27)
XXY: [8/27, 12/27)
XYX: [12/27, 16/27)
XYY: [16/27, 18/27)
YXX: [18/27, 22/27)
YXY: [22/27, 24/27)
YYX: [24/27, 26/27)
YYY: [26/27, 1)
Arithmetic coding

Decoding .110, using the interval table above:

- On receiving '1': the codeword is .1????, i.e., somewhere in the interval [0.5, 1]. First symbol: still undetermined.
- On receiving '11': the codeword is .11???, i.e., in [0.75, 1]. First symbol: Y.
- On receiving '110': the codeword is .110?, i.e., in [0.75, 0.875), which lies inside YX's interval. Symbols: YX.
- If no more bits are received, there is insufficient data to deduce a 3rd symbol. Hence '110' represents the symbol string "YX".
Arithmetic coding

The number of bits in a codeword is determined by the size of the interval:
- The first interval (XXX) has size 8/27 and needs 2 bits, i.e., 2/3 bits per symbol (efficient).
- The last interval (YYY) has size 1/27 and needs 5 bits.
- In general, it takes about log2(1/p) bits to represent an interval of size p.

The code length approaches optimal encoding as the message unit gets longer. Although arithmetic coding gives good compression efficiency, it is very CPU-demanding (in generating the code).
Run-length encoding (RLE)

Image, video and audio data streams often contain sequences of the same bytes, e.g., backgrounds in images and silence in sound files.

Substantial compression can be achieved by replacing repeated byte sequences with the number of occurrences.
Run-length encoding

If a byte occurs at least 4 consecutive times, the sequence is encoded as the byte followed by a special flag (a special byte) and the number of its occurrences, e.g.:

    Uncompressed: ABCCCCCCCCDEFGGGHI
    Compressed:   ABC!8DEFGGGHI

If 4 C's (CCCC) are encoded as C!0, 5 C's as C!1, and so on, then runs of 4 to 259 bytes compress into only 3 bytes.
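A minimal Python sketch of this scheme, with '!' as the flag and the count byte storing (run length - 4); escaping a literal '!' in the data, the problem raised on the next slide, is left out:

    FLAG = ord('!')

    def rle_encode(data: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 259:
                run += 1
            if run >= 4:
                out += bytes([data[i], FLAG, run - 4])  # 4..259 bytes -> 3 bytes
            else:
                out += data[i:i + run]                  # short runs pass through
            i += run
        return bytes(out)

    print(rle_encode(b'ABCCCCCCCCDEFGGGHI'))
    # b'ABC!\x04DEFGGGHI' -- count byte 4 means a run of 4 + 4 = 8 C's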
Run-length encoding

Problems:
- Ineffective with text (except, perhaps, "Ahhhhhhhhhhhhhhhhh!" or "Noooooooooo!").
- Needs an escape byte ('!'): OK with text, not OK with binary data. Solution?

RLE is used in ITU-T Group 3 fax compression, in combination with Huffman coding, giving compression ratios of 10:1 or better.
Lempel-Ziv-Welch algorithm

The method is due to Ziv and Lempel (1977, 1978), with an improvement by Welch in 1984; hence the name LZW compression.

Used in UNIX compress, GIF, and V.42bis compression modems.

Motivation: suppose we want to encode an English text. We first encode Webster's English dictionary, which contains about 159,000 entries; 18 bits therefore suffice for a word. Then why don't we just transmit each word in the text as an 18-bit number, and so obtain a compressed version of the text?
LZW

Problem: we would need to preload a copy of the dictionary before decoding, and the dictionary is huge (159,000 entries).

Solution: build the dictionary on-line, adaptively, while encoding and decoding. No bandwidth or storage is wasted on preloading, and words not used in the text are never incorporated into the dictionary.
LZW

Example: encode "^WED^WE^WEE^WEB^WET" using LZW.

Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes for the characters run from 0 to 255), and that the dictionary is loaded with these 256 codes initially.

While scanning the string, substring patterns are added to the dictionary. Patterns that appear after the first time are then substituted by their index into the dictionary.
Example 1 (compression)

Input: ^WED^WE^WEE^WEB^WET$ ($ marks the end of the input; the outputs should be interpreted as numeric codes)

Buffer grows to  New dict entry  Index  Output
^W               ^W              256    ^
WE               WE              257    W
ED               ED              258    E
D^               D^              259    D
^WE              ^WE             260    256  (^W seen at 256)
E^               E^              261    E
^WEE             ^WEE            262    260  (^WE seen at 260)
E^W              E^W             263    261  (E^ seen at 261)
WEB              WEB             264    257  (WE seen at 257)
B^               B^              265    B
^WET             ^WET            266    260  (^WE seen at 260)
T$               --              --     T
LZW (encoding rules)
Rule 1: whenever the pattern in the buffer is in the dictionary, we try to extend it.
Rule 2: whenever a pattern is added to the dictionary, we output the codeword of the pattern before the last extension.
Example 1

Original: 19 symbols, 8-bit code each: 19 x 8 = 152 bits.
LZW'ed: 12 codes, 9 bits each: 12 x 9 = 108 bits.

Some compression is achieved, and the effect is more substantial for long messages. The dictionary is not explicitly stored; in a sense, the compressor encodes the dictionary in the output stream implicitly.
LZW decompression

"^WED{256}E{260}{261}{257}B{260}T" is read in (recall that these are numeric codes).

Assume the decoder also knows the first 256 entries of the dictionary.

While scanning the code stream for decoding, the dictionary is rebuilt. When an index is read, the corresponding dictionary entry is retrieved and spliced into position. The decoder deduces what the buffer of the encoder has gone through and rebuilds the dictionary.
Example 1 (decompression)

Input: ^WED{256}E{260}{261}{257}B{260}T$

Buffer      Dict entry  Index  Output
^           --          --     ^
^W          ^W          256    W
WE          WE          257    E
ED          ED          258    D
D{256}      D^          259    ^W
{256}E      ^WE         260    E
E{260}      E^          261    ^WE
{260}{261}  ^WEE        262    E^
{261}{257}  E^W         263    WE
{257}B      WEB         264    B
B{260}      B^          265    ^WE
{260}T      ^WET        266    T
LZW (decoding rule)

Whenever there are 2 codes C1 C2 in the buffer, add a new entry to the dictionary: the symbols of C1 + the first symbol of C2.

Compression pseudocode:

    w = NIL;
    while ( read a symbol k ) {
        if ( wk exists in dictionary )
            w = wk;                      // extend; proceed to next symbol
        else {
            output code for w;
            add wk to dictionary;
            w = k;
        }
    }

Decompression pseudocode:

    read a code k;
    output symbol represented by k;
    w = k;
    while ( read a code k ) {
        ent = dictionary entry for k;
        output symbol(s) in entry ent;
        add w + ent[0] to dictionary;
        w = ent;
    }
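A runnable Python version of the pseudocode above (a sketch: the dictionary is seeded with the 256 single-byte strings, and the decompressor handles the corner case, glossed over in the pseudocode, where a code arrives one step before its dictionary entry is completed):

    def lzw_compress(text):
        dict_size = 256
        d = {chr(i): i for i in range(dict_size)}    # 256 single-char entries
        w, out = '', []
        for k in text:
            if w + k in d:
                w += k                               # extend the buffer
            else:
                out.append(d[w])                     # emit pattern before extension
                d[w + k] = dict_size                 # new dictionary entry
                dict_size += 1
                w = k
        if w:
            out.append(d[w])
        return out

    def lzw_decompress(codes):
        dict_size = 256
        d = {i: chr(i) for i in range(dict_size)}
        w = d[codes[0]]
        out = [w]
        for k in codes[1:]:
            ent = d[k] if k in d else w + w[0]       # code not in dictionary yet
            out.append(ent)
            d[dict_size] = w + ent[0]                # rebuild dictionary entry
            dict_size += 1
            w = ent
        return ''.join(out)

    codes = lzw_compress('^WED^WE^WEE^WEB^WET')
    print(codes)  # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
    assert lzw_decompress(codes) == '^WED^WE^WEE^WEB^WET'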
Example 2 (compression)

Input: ababcbababaaaaaaa$ (initial dictionary: a = 1, b = 2, c = 3)

Buffer grows to  New dict entry  Index  Output
ab               ab              4      1
ba               ba              5      2
abc              abc             6      4
cb               cb              7      3
bab              bab             8      5
baba             baba            9      8
aa               aa              10     1
aaa              aaa             11     10
aaaa             aaaa            12     11
a$               --              --     1
LZW: Pros and Cons

LZW advantages:
- effective at exploiting character-repetition redundancy and high-usage-pattern redundancy
- an adaptive algorithm (no need to learn the statistical properties of the symbols beforehand)
- a one-pass procedure
LZW: Pros and Cons

LZW disadvantages:
- Messages must be long for effective compression: it takes time for the dictionary to build up.
- If the message is not homogeneous (its redundancy characteristics shift during the message), compression efficiency drops.
LZW (Implementation)

Decompression is generally faster than compression. Why? (Compression must search the dictionary for the longest matching pattern, which is why a hash table is needed; decompression just indexes entries by code.)

Implementations usually use 12-bit codes, giving 4096 table entries.
LZW and GIF

For an image with 11 colors, create a color map of 11 entries:

color           code
[r0,g0,b0]      0
[r1,g1,b1]      1
[r2,g2,b2]      2
...             ...
[r10,g10,b10]   10

Convert the image into a sequence of codes, then apply LZW to the code stream.

Example (320 x 200 image):
- GIF: 5,280 bytes (0.66 bpp)
- JPEG: 16,327 bytes (2.04 bpp)

(bpp = bits per pixel)
[Example images: the GIF version and the JPEG version of the same picture.]
V.42bis

- Based on the LZW algorithm.
- Dictionary entries can be recycled.
- Transparent mode: compression is disabled, but the dictionary is still updated (used when transmitting already-compressed files).
- "bis" is a suffix meaning a modified standard, derived from the Latin "bis", meaning "twice".
An experiment

Hayes Microcomputer Products, Inc.'s implementation of V.42bis compression, tested on 4 different types of files:
- a 176,369-byte English essay (Essays)
- a 210,931-byte collection of USENET news articles (News)
- a 119,779-byte Pascal source (Pascal)
- a 201,746-byte LaTeX manual source (Tex)
Experiment result

Compression ratios (compressed size / original size) for various dictionary sizes:

Test files  512   1024  2048  4096
Essays      0.57  0.49  0.45  0.43
News        0.70  0.64  0.60  0.56
Pascal      0.52  0.41  0.35  0.31
Tex         0.59  0.50  0.45  0.42