Source Coding Compression

Upload: gyubeom-choi

Post on 03-Jun-2018


TRANSCRIPT

  • 8/12/2019 Source Coding Compression

    1/34

    Source Coding-Compression

    Most topics from Digital Communications by Simon Haykin

    Chapter 9

    9.1~9.4

  • 2/34

    Fundamental Limits on Performance

    Given an information source and a noisy channel:

    1) Limit on the minimum number of bits per symbol

    2) Limit on the maximum rate for reliable communication

    Shannon's theorems establish both limits.

  • 3/34

    Information Theory

    Let the source alphabet be S = {s_0, s_1, ..., s_{K-1}},

    with probabilities of occurrence P(s_k) = p_k for k = 0, 1, ..., K-1, where p_0 + p_1 + ... + p_{K-1} = 1.

    Assume a discrete memoryless source (DMS).

    What is the measure of information?

  • 4/34

    Uncertainty, Information, and Entropy

    (cont)

    Interrelations between information and uncertainty or surprise:

    No surprise, no information.

    If A is one surprise and B is another surprise, what is the total information of A and B occurring simultaneously?

    The amount of information may be related to the inverse of the probability of occurrence:

    Info ∝ 1/Prob.

    Info(A and B) = Info(A) + Info(B)

    I(s_k) = log(1/p_k)
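As a small numerical sketch of this measure (using base-2 logarithms, the customary choice in these slides; the function names are ours):

```python
import math

def information(p):
    """Amount of information I(s_k) = log2(1/p_k) of a symbol with probability p_k."""
    return math.log2(1 / p)

def entropy(probs):
    """Entropy H(S): the average information per symbol, in bits."""
    return sum(p * information(p) for p in probs)

print(information(0.5))                    # a fair coin flip carries 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 4 equally likely symbols: 2.0 bits
```

Note how a rarer symbol carries more information, matching the inverse-probability intuition above.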

  • 5/34

    Property of Information

    1) I(s_k) = 0 for p_k = 1

    2) I(s_k) ≥ 0 for 0 ≤ p_k ≤ 1

    3) I(s_k) > I(s_i) for p_k < p_i

    4) I(s_k s_i) = I(s_k) + I(s_i) if s_k and s_i are statistically independent

    * Custom is to use logarithm of base 2

  • 6/34

  • 7/34

  • 8/34

    Average Length

    For a code C with associated probabilities p(c), the average length is defined as

    l_a(C) = Σ_{c ∈ C} p(c) l(c)

    We say that a prefix code C is optimal if, for all prefix codes C', l_a(C) ≤ l_a(C').

  • 9/34

    Relationship to Entropy

    Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,

    H(S) ≤ l_a(C)

    Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,

    l_a(C) ≤ H(S) + 1

  • 10/34

    Coding Efficiency

    Coding efficiency:

    η = L_min / L_a

    where L_a is the average code-word length.

    From Shannon's theorem, L_a ≥ H(S), so L_min = H(S).

    Thus η = H(S) / L_a.
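A small worked check of these definitions, using an illustrative four-symbol source (the same probabilities and code-word lengths as the Huffman example later in these slides):

```python
import math

p = [0.1, 0.2, 0.2, 0.5]        # symbol probabilities
lengths = [3, 3, 2, 1]          # code-word lengths for a = 000, b = 001, c = 01, d = 1

H = sum(pi * math.log2(1 / pi) for pi in p)       # entropy H(S)
La = sum(pi * li for pi, li in zip(p, lengths))   # average code-word length L_a
eta = H / La                                      # coding efficiency

print(La)               # 1.8 bits/symbol
print(round(eta, 3))    # 0.978: close to, but below, 1
```

Note that L_a lands between H(S) and H(S) + 1, as the bounds on the previous slide require.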

  • 11/34

    Kraft McMillan Inequality

    Theorem (Kraft-McMillan): For any uniquely decodable code C,

    Σ_{c ∈ C} 2^{-l(c)} ≤ 1

    Also, for any set of lengths L such that

    Σ_{l ∈ L} 2^{-l} ≤ 1

    there is a prefix code C such that l(c_i) = l_i for i = 1, ..., |L|.

    NOTE: the Kraft-McMillan inequality does not tell us whether a given code is prefix-free or not.
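The inequality is easy to check numerically; a minimal sketch (the function name is ours):

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over the code-word lengths of a code."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([3, 3, 2, 1]))   # 1.0: these lengths exactly fill the code space
print(kraft_sum([1, 1, 2]))      # 1.25 > 1: no uniquely decodable code has these lengths
```

The first set of lengths is exactly the one produced by the Huffman example later in these slides.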

  • 12/34

  • 13/34

    Prefix Codes

    A prefix code is a variable-length code in which no codeword is a prefix of another codeword.

    e.g. a = 0, b = 110, c = 111, d = 10

    Can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges.

    [Figure: binary code tree with leaves a, b, c, d and 0/1 labels on the edges]

  • 14/34

    Some Prefix Codes for Integers

    n   Binary   Unary    Split
    1   ..001    0        1|
    2   ..010    10       10|0
    3   ..011    110      10|1
    4   ..100    1110     110|00
    5   ..101    11110    110|01
    6   ..110    111110   110|10

    Many other fixed prefix codes:

    Golomb, phased-binary, subexponential, ...
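The unary column above, for instance, can be generated by a one-line encoder (a sketch; the function name is ours):

```python
def unary(n):
    """Unary code for an integer n >= 1: (n - 1) ones followed by a terminating zero."""
    return "1" * (n - 1) + "0"

print([unary(n) for n in range(1, 7)])   # ['0', '10', '110', '1110', '11110', '111110']
```

Each codeword ends in its only 0, so no codeword can be a prefix of a longer one: unary is a prefix code.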

  • 15/34

    Data compression implies sending or storing a

    smaller number of bits. Although many methods are

    used for this purpose, in general these methods can

    be divided into two broad categories: lossless and

    lossy methods.

    Data compression methods

  • 16/34

    Run Length Coding

  • 17/34

    Introduction: What is RLE?

    A compression technique that represents data as (value, run length) pairs,

    where a run length is defined as the number of consecutive equal values.

    e.g. 1110011111 → values 1, 0, 1 with run lengths 3, 2, 5

  • 18/34

    Introduction

    Compression effectiveness depends on the input.

    Must have consecutive runs of values in order to maximize compression.

    Best case: all values are the same; a run of any length can be represented using two values.

    Worst case: no repeating values; the compressed data is twice the length of the original!

    Should only be used in situations where we know for sure that we have repeating values.
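A minimal run-length encoder sketch (the function name is ours), illustrating the best and worst cases above:

```python
def rle_encode(values):
    """Encode a sequence as (value, run_length) pairs."""
    pairs = []
    for v in values:
        if pairs and pairs[-1][0] == v:
            pairs[-1][1] += 1          # extend the current run
        else:
            pairs.append([v, 1])       # start a new run
    return [tuple(p) for p in pairs]

# Best case: 16 equal values compress to a single pair.
print(rle_encode([0] * 16))            # [(0, 16)]
# Worst case: no repeats, so the output holds twice as many numbers as the input.
print(rle_encode([0, 1, 2, 3]))        # [(0, 1), (1, 1), (2, 1), (3, 1)]
```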

  • 19/34

    Run-length encoding example

  • 20/34

    Run-length encoding for two symbols

  • 21/34

    Encoder Results

    Input: 4,5,5,2,7,3,6,9,9,10,10,10,10,10,10,0,0

    Output: 4,1,5,2,2,1,7,1,3,1,6,1,9,2,10,6,0,2 (trailing -1 values mark where the valid output ends)

    Best Case:

    Input: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

    Output: 0,16 (followed by -1 padding)

    Worst Case:

    Input: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

    Output: 0,1,1,1,2,1,3,1,4,1,5,1,6,1,7,1,8,1,9,1,10,1,11,1,12,1,13,1,14,1,15,1

  • 22/34

    Huffman Coding

  • 23/34

  • 24/34

    Huffman Codes

    Huffman Algorithm:

    Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s).

    Repeat:

    Select the two trees whose roots have minimum weights p1 and p2.

    Join them into a single tree by adding a root with weight p1 + p2.
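The loop above can be sketched with a binary heap holding the forest (a sketch, not the textbook's code; names are ours):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code table from a {symbol: probability} mapping."""
    tick = count()   # tie-breaker so the heap never has to compare trees directly
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two minimum-weight roots
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: branch on 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
```

For these probabilities the resulting code-word lengths are 3, 3, 2, 1 for a, b, c, d, matching the example on the next slide (the exact bit patterns may differ with tie-breaking).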

  • 25/34

    Example

    p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

    Step 1: merge a(.1) and b(.2) into a subtree of weight (.3)

    Step 2: merge (.3) and c(.2) into a subtree of weight (.5)

    Step 3: merge (.5) and d(.5) into the root (1.0)

    Resulting code: a = 000, b = 001, c = 01, d = 1

  • 26/34

    Encoding and Decoding

    Encoding: start at the leaf of the Huffman tree and follow the path to the root. Reverse the order of the bits and send.

    Decoding: start at the root of the Huffman tree and take a branch for each bit received. On reaching a leaf, output the message and return to the root.

    [Figure: the Huffman tree from the previous slide, with leaves a(.1), b(.2), c(.2), d(.5) and 0/1 edge labels]

    There are even faster methods that can process 8 or 32 bits at a time.
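The decoding walk can be sketched directly from the code table of the earlier example (a = 000, b = 001, c = 01, d = 1); because the code is prefix-free, matching the buffered bits against the table is equivalent to reaching a leaf:

```python
def huffman_decode(bits, codes):
    """Decode a bit string by accumulating bits until they match a codeword."""
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:        # reached a leaf: emit the symbol, return to the root
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

codes = {"a": "000", "b": "001", "c": "01", "d": "1"}
print(huffman_decode("000011", codes))   # "acd"
```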

  • 27/34

    Huffman codes Pros & Cons

    Pros:

    The Huffman algorithm generates an optimal prefix code.

    Cons:

    If the ensemble changes, the frequencies and probabilities change, and the optimal coding changes.

    e.g. in text compression, symbol frequencies vary with context.

    Re-computing the Huffman code by running through the entire file in advance?!

    Saving/transmitting the code too?!

  • 28/34

  • 29/34

    Lempel-Ziv Algorithms

    LZ77 (Sliding Window)

    Variants: LZSS (Lempel-Ziv-Storer-Szymanski)

    Applications: gzip, Squeeze, LHA, PKZIP, ZOO

    LZ78 (Dictionary Based)

    Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress)

    Applications: compress, GIF, CCITT (modems), ARC, PAK

    Traditionally LZ77 compressed better but ran slower; the gzip version is almost as fast as any LZ78.

  • 30/34

    Lempel Ziv encoding

    Lempel Ziv (LZ) encoding is an example of a category of algorithms called dictionary-based encoding. The idea is to create a dictionary (a table) of strings used during the communication session. If both the sender and the receiver have a copy of the dictionary, then previously encountered strings can be substituted by their index in the dictionary to reduce the amount of information transmitted.

  • 31/34

    Compression

    In this phase there are two concurrent events: building an indexed dictionary and compressing a string of symbols. The algorithm extracts the smallest substring that cannot be found in the dictionary from the remaining uncompressed string. It then stores a copy of this substring in the dictionary as a new entry and assigns it an index value. Compression occurs when the substring, except for the last character, is replaced with the index found in the dictionary. The process then inserts the index and the last character of the substring into the compressed string.
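The compression phase described above can be sketched as an LZ78-style encoder that emits (dictionary index, next character) pairs (a sketch; names are ours):

```python
def lz_compress(text):
    """LZ78-style compression: emit (dictionary index, next char) pairs."""
    dictionary = {"": 0}               # index 0 stands for the empty string
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:  # keep growing the current substring
            phrase += ch
        else:                          # smallest substring not yet in the dictionary
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                         # input ended inside an already-known phrase
        out.append((dictionary[phrase], ""))
    return out

print(lz_compress("ABAABA"))   # [(0, 'A'), (0, 'B'), (1, 'A'), (2, 'A')]
```

Here "ABAABA" splits into the new substrings A, B, AA, BA, each encoded as the index of its known prefix plus its last character.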

  • 32/34

    An example of Lempel Ziv encoding


  • 33/34

    Decompression

    Decompression is the inverse of the compression process. The process extracts the substrings from the compressed string and replaces each index with the corresponding entry in the dictionary, which is empty at first and built up gradually. The idea is that when an index is received, there is already an entry in the dictionary corresponding to that index.
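Decompression can be sketched as the exact inverse, rebuilding the same dictionary as it consumes the (index, character) pairs (a sketch matching the encoder form assumed above; names are ours):

```python
def lz_decompress(pairs):
    """Rebuild the dictionary while decoding LZ78-style (index, char) pairs."""
    dictionary = {0: ""}                 # same empty-string entry as the encoder
    out = []
    for index, ch in pairs:
        phrase = dictionary[index] + ch  # the entry already exists when its index arrives
        out.append(phrase)
        dictionary[len(dictionary)] = phrase
    return "".join(out)

print(lz_decompress([(0, "A"), (0, "B"), (1, "A"), (2, "A")]))   # "ABAABA"
```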

  • 34/34

    An example of Lempel Ziv decoding