information theory and coding
TRANSCRIPT
Information theory and Coding.
Wadih Sawaya
Communication systems
The Shannon's paradigm
General Introduction.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
2. Communication Systems
• Communication systems are designed to transmit information generated by
a source to some destination.
• Between the source and the destination there is a communication channel
affected by various disturbances.
Figure: Block diagram of a communication system (the Shannon's paradigm):
SOURCE → CHANNEL → RECEIVER, with disturbances acting on the channel.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
3. Communication Systems
SOURCE → CHANNEL → RECEIVER, with disturbances.
Information is emitted from the source by means of a sequence of symbols.
The user of the information has to reproduce the exact emitted
sequence in order to extract the information.
The presence of the disturbed channel may introduce changes
in the emitted sequence.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
4. Communication Systems
The designer of a communication system will be
asked to:
1. ensure a high quality of transmission, with an "as low as possible"
error rate in the reproduced sequence
  - Different user requirements may lead to different criteria of acceptability.
  - Ex: speech transmission, data, audio/video, …
2. provide the highest information rate through the channel, because:
  - The use of a channel is costly.
  - The channel employs different limited resources (time, frequency, power, …).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
5. Communication Systems
• Source: delivers information as a sequence of source symbols.
• Source coding: provides on average the shortest description of the emitted sequences ⇒ higher information rate.
• Channel: generates disturbances.
• Channel coding: protects the information from errors induced by the channel, by deliberately adding redundancy to the information ⇒ higher quality of transmission.
SOURCE → Source Coding → Channel Coding → CHANNEL → Channel Decoding → Source Decoding → User
Figure: Extension of the Shannon’s paradigm
Course Contents
Part I – An Information measure.
Part II – Source Coding
Part III – The Communication Channel.
Part IV – Channel Coding.
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
P R E A M B L E
In 1948 C. E. Shannon developed a "Mathematical Theory of Communication", called
information theory. This theory deals with the most fundamental aspects of a communication
system. It relies on probability theory and is primarily concerned with encoders and
decoders, in terms of their functional role and in terms of the existence of encoders and decoders
that achieve a given level of performance. The latter aspect of this theory is established by
means of two fundamental theorems.
As in any mathematical theory, this theory deals only with mathematical models and not
with physical sources and physical channels. To proceed we will study the simplest classes of
mathematical models of sources and channels. Naturally the choice of these models will be
influenced by the most important existing real physical sources and physical channels.
After understanding the theory we will focus on the practical implementation of channel
coding and decoding, guided by the important relationships established by the theory, which
appear to be useful indications of the tradeoffs that exist in constructing encoders and decoders.
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
9. An Information measure
• A discrete source delivers a sequence of symbols from the alphabet {x1, x2, …, xM}.
  - Each symbol of this sequence is thus a random outcome taking its value from the finite alphabet {x1, x2, …, xM}.
• To construct a mathematical model we consider the set X of all possible
outcomes as the alphabet of the source, say {x1, x2, …, xM}.
  - Each outcome s = xi corresponds to one particular symbol of the set.
  - A probability measure Pk is associated with each symbol:
    Pk = P(s = xk),  1 ≤ k ≤ M ;   Σ_{k=1}^{M} Pk = 1
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
10. An Information measure
• If a symbol emitted by a source is known exactly, there would be no need to
transmit it.
• The information content carried by one particular symbol is thus strictly
related to its uncertainty.
  - Example: In the city of Madrid, in July, the weather prediction "Rain" contains much more information than the prediction "Sunny".
• The information content of one symbol xi is a decreasing function of the
probability of its realization:
  Q(xi) > Q(xj)  ⇔  P(xi) < P(xj)
• The information content associated with two independent symbols xi and xj is the sum of their two individual information contents:
  P(xi, xj) = P(xi) P(xj)  ⇔  Q(xi, xj) = Q(xi) + Q(xj)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
11. An Information Measure
• The mathematical function that satisfies these two conditions is
indeed the logarithm function.
• Each symbol xi has its information content defined by:
  Q(xi) ≜ log_a ( 1 / Pi )
• The base (a) of the logarithm determines the unit of the measure assigned to
the information content. When the base a = 2, the unit is the "bit".
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
12. An Information Measure
• Examples:
1) The correct identification of one of two equally likely symbols, that is, P(x1) = P(x2 ),
conveys an amount of information equal to Q(x1) = Q(x2) = log22 = 1 bit of
information.
2) The information content of each outcome when tossing a fair coin is Q(“Head”) =
Q(“Tail”) = log22 = 1 bit of information.
3) Consider the Bernoulli distribution (probability measure of two possible events
"1" and "0") with P(X="0") = 2/3 and P(X="1") = 1/3. The information content of each
outcome is:
  Q("0") = log2( 1 / (2/3) ) = 0.585 bits     Q("1") = log2( 1 / (1/3) ) = 1.585 bits
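As an illustration (not part of the original slides), a short Python sketch that reproduces these numbers; the helper name information_content is ours:

```python
import math

def information_content(p, base=2):
    """Information content Q(x) = log_base(1/p) of an outcome with probability p."""
    return math.log(1.0 / p, base)

# Examples 1 and 2: two equally likely outcomes (fair coin)
print(information_content(0.5))            # 1.0 bit
# Example 3: Bernoulli source with P("0") = 2/3, P("1") = 1/3
print(round(information_content(2/3), 3))  # 0.585 bits
print(round(information_content(1/3), 3))  # 1.585 bits
```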
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
13. Entropy of a finite alphabet
• We define the Entropy of a finite alphabet as the average
information content over all its possible outcomes:
  H(X) = Σ_{k=1}^{M} Pk Q(xk) = Σ_{k=1}^{M} Pk log( 1 / Pk )
• The entropy characterizes the source alphabet on average and is measured in bits/symbol.
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
14. Entropy of a finite alphabet
Example 1:
- Alphabet: {x1, x2, x3, x4}
- Probabilities: P1 = 1/2 ; P2 = 1/4 ; P3 = P4 = 1/8
⇒ Entropy: H(X) = Σ_{k=1}^{4} Pk log2( 1 / Pk ) = 1.75 bits/symbol
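A small Python sketch of the entropy computation for this example (the helper entropy is an illustrative name, not from the course):

```python
import math

def entropy(probs, base=2):
    """H(X) = sum_k P_k * log(1/P_k); terms with P_k = 0 contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits/symbol (Example 1)
print(entropy([1/4] * 4))              # 2.0 bits/symbol: equally likely symbols give log2(M)
```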
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
15. Entropy of a finite alphabet
• Example 2:
- Alphabet of M equally likely distributed symbols: Pk = 1/M, ∀ k ∈ {1, …, M}
⇒ Entropy: H(X) = Σ_{k=1}^{M} (1/M) log2(M) = log2(M) bits/symbol
• Example 3:
- Binary alphabet: {0, 1}
- p0 = Px ; p1 = 1 − Px
⇒ Entropy: H(X) = Px log2( 1 / Px ) + (1 − Px) log2( 1 / (1 − Px) ) ≜ Hf(Px)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
16. Entropy of a finite alphabet
Figure: Entropy of a binary alphabet, Hf(Px) in bits/symbol, as a function of the probability Px (maximum of 1 bit/symbol at Px = 0.5).
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
17. Entropy of a finite alphabet
• The maximum occurs for Px = 0.5, that is, when the two
symbols are equally likely. This result is fairly general:
  - Theorem 1: The entropy H(X) of a discrete alphabet of M symbols satisfies the inequality
    H(X) ≤ log M
    with equality when the symbols are equally likely.
  - Exercise: Prove theorem 1.
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
18. Conditional Entropy
• We now extend the definition to a random variable given
another one: the conditional entropy H(X/Y) is defined as:
  H(X/Y) = Σ_{k=1}^{M_X} Σ_{l=1}^{M_Y} P(X=xk, Y=yl) log( 1 / P(X=xk / Y=yl) )
  - Example: joint probabilities P(X, Y):
           X=1     X=2     X=3     X=4
    Y=1    1/8     1/16    1/32    1/32
    Y=2    1/16    1/8     1/32    1/32
    Y=3    1/16    1/16    1/16    1/16
    Y=4    1/4     0       0       0
Determine H(X), H(Y) and H(X/Y) ?
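A possible numeric answer, computed with a short Python sketch from the joint table above (the printed values, H(X) = 1.75, H(Y) = 2 and H(X/Y) = 1.375 bits, follow directly from the table):

```python
from fractions import Fraction as F
import math

# Joint probabilities P(X, Y) from the table (rows: Y = 1..4, columns: X = 1..4)
joint = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(probs):
    """Entropy of a marginal distribution in bits."""
    return sum(float(p) * math.log2(1.0 / float(p)) for p in probs if p > 0)

p_x = [sum(row[i] for row in joint) for i in range(4)]   # marginal of X
p_y = [sum(row) for row in joint]                        # marginal of Y

# H(X/Y) = sum_{x,y} P(x,y) * log2( P(y) / P(x,y) ), since P(x/y) = P(x,y)/P(y)
h_x_given_y = sum(float(p) * math.log2(float(p_y[j] / p))
                  for j, row in enumerate(joint) for p in row if p > 0)

print(H(p_x))         # H(X)   = 1.75  bits
print(H(p_y))         # H(Y)   = 2.0   bits
print(h_x_given_y)    # H(X/Y) = 1.375 bits
```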
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
19. Relative Entropy or Kullback Leibler divergence.
• The entropy of a random variable is a measure of its uncertainty or
the amount of information needed on the average to describe it.
• Relative entropy is a measure of the distance between two distributions. It measures the inefficiency of assuming that the distribution is q when the true one is p.
  - Definition: The relative entropy or Kullback-Leibler divergence between two probability mass functions p(x) and q(x) is defined as:
    D(p‖q) = Σ_{x ∈ X} p(x) log( p(x) / q(x) )
  - Example: Determine D(p‖q) for p(0) = p(1) = 1/2 and q(0) = 3/4, q(1) = 1/4.
  - Relative entropy is always non-negative and is zero if and only if q = p.
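A quick sketch of the example computation (illustrative code, not from the slides):

```python
import math

def kl_divergence(p, q, base=2):
    """D(p||q) = sum_x p(x) * log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/2]
q = [3/4, 1/4]
print(kl_divergence(p, q))   # ~0.2075 bits
print(kl_divergence(q, p))   # ~0.1887 bits (the divergence is not symmetric)
print(kl_divergence(p, p))   # 0.0: zero if and only if q = p
```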
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
20. Mutual Information
• The mutual information is a measure of the amount of
information that one random variable contains about another
random variable.
  - Definition: Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y):
    I(X;Y) = Σ_{k=1}^{M_X} Σ_{l=1}^{M_Y} p(X=xk, Y=yl) log( p(X=xk, Y=yl) / ( p(X=xk) p(Y=yl) ) )
  - Theorem 2:
    I(X;Y) = H(X) − H(X/Y)
    I(X;Y) = H(Y) − H(Y/X)
    I(X;Y) = H(X) + H(Y) − H(X,Y)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
21. Mutual Information
• From theorem 2 the mutual information is in the form:
  I(X;Y) = Σ_{k=1}^{M_X} Σ_{l=1}^{M_Y} p(X=xk, Y=yl) log( p(X=xk / Y=yl) / p(X=xk) )
• The relationship between all these entropies is expressed in a Venn diagram:
Figure: Venn diagram relating H(X), H(Y), H(X/Y), H(Y/X), I(X;Y) and H(X,Y).
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
22. Mutual Information
• Example: You have a jar containing 30 red cubes, 20 red
spheres, 10 white cubes and 40 white spheres. In order to
quantify the amount of information that the geometrical form
contains about the color you have to determine the mutual
information between the two random variables.
• We will emphasize later on the mutual information as the
amount of information that can reliably pass through a
communication channel .
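A short sketch of that computation for the jar example (the joint probabilities are the counts divided by 100; the result works out to about 0.125 bits):

```python
import math

# Joint distribution over (shape, color) from the jar: counts / 100
joint = {("cube", "red"): 0.30, ("sphere", "red"): 0.20,
         ("cube", "white"): 0.10, ("sphere", "white"): 0.40}

p_shape = {"cube": 0.40, "sphere": 0.60}
p_color = {"red": 0.50, "white": 0.50}

# I(shape; color) = sum p(s,c) * log2( p(s,c) / (p(s) p(c)) )
mi = sum(p * math.log2(p / (p_shape[s] * p_color[c])) for (s, c), p in joint.items())
print(mi)   # ~0.125 bits: the shape tells us a little about the color
```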
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
23. Chain Rules
• Definition: The joint entropy of a pair of discrete random variables (X,Y) with joint distribution p(x,y) is defined as:
  H(X,Y) = Σ_{k=1}^{M_X} Σ_{l=1}^{M_Y} P(X=xk, Y=yl) log( 1 / P(X=xk, Y=yl) )
• Theorem 3 (Chain rule):
  H(X,Y) = H(X) + H(Y/X)
• Definition: The conditional mutual information of random variables X and Y given Z is defined by:
  I(X;Y/Z) = H(X/Z) − H(X/Y,Z)
• Theorem 4 (Chain rule for mutual information):
  I(X1, X2 ; Y) = I(X1 ; Y) + I(X2 ; Y / X1)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
24. Information inequalities
• Using Jensen's inequality (for a convex function f of a random variable X, E[f(X)] ≥ f(E[X])), we can prove the following inequality:
  D(p‖q) ≥ 0
• and then:
  I(X;Y) ≥ 0
• Conditioning reduces entropy:
  H(X/Y) ≤ H(X)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
25. Data Processing Inequality
• Definition: Random variables X, Y and Z form a Markov chain (X→Y→Z) if:
  P(x, y, z) = P(x) P(y/x) P(z/y)
  - Markovity implies conditional independence:
    P(x, z / y) = P(x/y) P(z/y)
• Theorem 5 (Data processing inequality): If X→Y→Z, then:
  I(X;Y) ≥ I(X;Z)
  - No processing of Y, deterministic or random, can increase the information that Y contains about X.
  - If X→Y→Z, then:
    I(X;Y/Z) ≤ I(X;Y)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
26. The discrete stationary source
• We have studied until now the average information content of a set of all
possible outcomes recognized as the alphabet of the discrete source.
• We are interested in the information content per symbol in a long sequence of symbols delivered by the discrete source, regardless of whether the emitted symbols are correlated in time or not.
• The source can be identified as a stochastic process. A source is stationary if it has the same statistics no matter what the time origin is.
• Let (X1, X2, …, Xk) be a sequence of k non-independent random variables emitted by a source with an alphabet of size M.
  - The entropy of the k-dimensional alphabet is H(X^k) = H(X1, X2, …, Xk).
  - The entropy per symbol of a sequence of k symbols is fairly defined as:
    H_k(X) = (1/k) H(X^k)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
27. The discrete stationary source
• Definition: the entropy rate of the source is defined as the average information content per source symbol, that is:
  H_∞(X) = lim_{k→∞} (1/k) H(X^k)   bits/symbol
• Theorem 6: For a stationary source this limit exists and is equal to the limit of the conditional entropy lim_{k→∞} H(Xk / Xk−1, …, X1).
  - For a discrete memoryless source (DMS), each symbol emitted is independent of all previous ones and the entropy rate of the source is equal to the entropy of the alphabet of the source:
    H_∞(X) = H(X)
  - Otherwise one can show the relation:
    0 ≤ H_∞(X) ≤ H(X)
TELECOM LILLE 1 - Février 2010 Information Theory and Channel Coding
28. Entropy of a continuous ensemble
• The symbol delivered by the source is a continuous random variable x taking values in the set of real numbers, with a probability density function p(x).
• The entropy of a continuous alphabet with probability density p(x) is:
  H(X) = − ∫_{−∞}^{+∞} p(x) log2 p(x) dx
  Remark: This entropy is not necessarily positive, not necessarily finite.
• Theorem 7: Let x be a continuous random variable with probability density function p(x). If x has a finite variance σx², then H(X) exists and satisfies the inequality:
  H(X) ≤ (1/2) log2( 2πe σx² )
  with equality if and only if X ~ N(µ, σx²).
Part I – An Information measure.
Part II – Source Coding
Part III – The Communication Channel.
Part IV – Channel Coding.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
30. Coding of the source alphabet.
• Suppose that we want to transmit each symbol, using a binary
channel (a channel able to communicate binary symbols).
• The role of the source encoder is to represent each symbol of the
source by a finite string of digits (a codeword).
• Efficient communication would involve transmitting a symbol in
the shortest possible time. This implies representing the symbol
with as short a codeword as possible.
  - More generally, the best source coding is the one that has on average the shortest description length for each message to be transmitted by the source.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
31. Coding of the source alphabet.
• Each symbol is assigned a codeword, with a possibly different length. The average length over all codewords is:
  n̄ ≜ Σ_{k=1}^{M} Pk nk
  where nk is the length (number of digits) of the codeword representing the symbol xk of probability Pk.
• The source encoder must be conceived in order to convey messages with an as small as possible average length of binary codeword strings (concise messages).
• The source encoder must also be uniquely decodable. In other words, any sequence of codewords has only one possible sequence of source symbols producing it.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
32. Coding of the source alphabet.
• Example:
the binary sequence 010010 could correspond to any of the
five messages: x1x3x2x1, x1x3x1x3, x1x4x3, x2x1x1x3 or x2x1x2x1
⇒ this code is ambiguous, and is not uniquely decipherable.
Symbol codeword
x1 0
x2 01
x3 10
x4 100
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
33. Coding of the source alphabet.
• Condition that ensures unique decipherability: « no codeword is a prefix of a longer codeword ». Codes satisfying this constraint are called prefix codes.
• Theorem 8 (Kraft Inequality): If the integers n1, n2, …, nK satisfy the inequality
  Σ_{k=1}^{K} 2^{−nk} ≤ 1
  then a prefix binary code exists with these integers as codeword lengths.
  Note: The theorem does not say that any code whose lengths satisfy this inequality is a prefix code.
• Example of a prefix code (built on a binary tree):
  Symbol   Code Word
  x1       0
  x2       10
  x3       110
  x4       111
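As an aside, a two-line Python check of the Kraft inequality, applied to the prefix code above and to the ambiguous code of slide 32 (illustrative sketch):

```python
def kraft_sum(lengths, q=2):
    """Sum of q^(-n_k) over the proposed codeword lengths."""
    return sum(q ** (-n) for n in lengths)

# The prefix code above: lengths 1, 2, 3, 3
print(kraft_sum([1, 2, 3, 3]))   # 1.0   -> a prefix code with these lengths exists
# The ambiguous code of slide 32: lengths 1, 2, 2, 3
print(kraft_sum([1, 2, 2, 3]))   # 1.125 > 1 -> no prefix code with these lengths
```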
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
34. Bound on optimal codelength.
• Theorem 9: A binary code satisfying the prefix constraint can be found
for any alphabet of entropy H(X) with an average codeword length
satisfying the inequality:
  H(X) ≤ n̄ < H(X) + 1
• We can define the efficiency of a code as:
  ε ≜ H(X) / n̄
• Exercise: Prove theorem 9.
  - Hint: 1) Prove that H(X) − n̄ ≤ 0
          2) Choose nk to be the integer satisfying: 2^{−nk} ≤ P(xk) < 2^{−nk + 1}
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
35. Source Coding example: The Huffman Coding algorithm.
• A method for the construction of such a code is given by Huffman.
1. Arrange the symbols in order of decreasing probability.
2. Group the last two symbols xM and xM-1 into an equivalent symbol, with probability PM + PM-1.
3. Repeat steps 1 and 2 until only one “symbol” is left.
4. Associate the binary digits 0 and 1 to each pair of branches in the tree departing from intermediate nodes.
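The steps above can be sketched in a few lines of Python using a heap; this is an illustrative implementation (the names are ours), applied below to the example of the next slide:

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for symbols 0..len(probs)-1 (a sketch using a heap)."""
    # Each heap entry: (probability, tie-breaker, list of symbols in this subtree)
    heap = [(p, k, [k]) for k, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = {k: "" for k in range(len(probs))}
    tie = len(probs)
    while len(heap) > 1:
        p1, _, grp1 = heapq.heappop(heap)   # two least probable groups
        p2, _, grp2 = heapq.heappop(heap)
        for k in grp1:
            codes[k] = "0" + codes[k]       # prepend the branch labels
        for k in grp2:
            codes[k] = "1" + codes[k]
        heapq.heappush(heap, (p1 + p2, tie, grp1 + grp2))
        tie += 1
    return codes

probs = [0.45, 0.35, 0.1, 0.1]              # slide 36 example
codes = huffman_code(probs)
avg_len = sum(p * len(codes[k]) for k, p in enumerate(probs))
print(codes)     # codeword lengths 1, 2, 3, 3 (the 0/1 labels may differ from the slide)
print(avg_len)   # 1.75 digits/symbol
```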
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
36. Huffman Coding algorithm.
• Example:
  Symbol   Probability   Huffman code   Fixed-length code
  x1       0.45          0              00
  x2       0.35          10             01
  x3       0.1           110            10
  x4       0.1           111            11

  H(X) = 1.712 bits/symbol
  * Huffman code:       n̄ = 1.75 digits/sym,  ε = 98 %
  * Fixed-length code:  n̄ = 2 digits/sym,     ε = 85 %

  Huffman coding tree: x3 (0.1) and x4 (0.1) are grouped first (0.2), the result is grouped with x2 (0.35) to give 0.55, which is finally grouped with x1 (0.45); the branch labels 0 and 1 give the codewords above.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
37. The Asymptotic Equipartition Property (AEP)
• The AEP is the analog of the weak law of large numbers,
which states that for independent, identically distributed
(i.i.d.) random variables, the sample mean (1/n) Σ_{i=1}^{n} xi will approach
its statistical mean E[X] with probability 1, as n tends
toward infinity.
• Theorem 10: If X1, X2, …, Xn are i.i.d. ~ p(x), then:
  −(1/n) log p(x1, x2, …, xn) → H(X)   in probability
  - Definition: The "typical set" A_ε^(n) is the set defined as:
    A_ε^(n) = { (x1, x2, …, xn) : 2^{−n(H(X)+ε)} ≤ p(x1, x2, …, xn) ≤ 2^{−n(H(X)−ε)} }
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
38. The Asymptotic Equipartition Property (AEP)
• Theorem 11: If (x1, x2, …, xn) ∈ A_ε^(n), then:
1. H(X) − ε ≤ −(1/n) log p(x1, x2, …, xn) ≤ H(X) + ε
2. Pr{ A_ε^(n) } > 1 − ε   for n sufficiently large
3. |A_ε^(n)| ≤ 2^{n(H(X)+ε)}
4. |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)}   for n sufficiently large
|A| denotes the number of elements in the set A.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
39. The Asymptotic Equipartition Property (AEP)
- Data compression
- Theorem 12: Let X1, X2, …, Xn be i.i.d. ~ p(x), and let ε > 0. There exists
a code which maps sequences (x1, …, xn) into binary strings such that:
  Σ_{x^n} P(x1, …, xn) · (1/n) n̄(x1, …, xn) ≤ H(X) + ε
  where n̄(x1, …, xn) is the length of the codeword assigned to (x1, …, xn).
Figure: partition of X^n into the typical set A_ε^(n) and the non-typical set.
From property 3 above, |A_ε^(n)| ≤ 2^{n(H(X)+ε)}:
=> Indexing a typical sequence requires no more than n(H+ε) + 1 binary elements, prefixed by 0.
=> Indexing a non-typical sequence requires no more than n·log|X| elements, prefixed by 1.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
40. Encoding the stationary source
• Until now we didn't take into account the possible interdependency between symbols emitted at different times.
• Let us recall the entropy per symbol in a sequence of length k:
  H_k(X) = (1/k) H(X^k)
• Theorem 13: It is possible to encode sequences of k source symbols into a prefix condition code in such a way that the average number of digits per source symbol n̄ satisfies:
  H_k(X) ≤ n̄ < H_k(X) + 1/k
• Increasing the block length k makes the code more efficient, and thus: for any δ > 0 it is possible to choose k large enough so that n̄ satisfies:
  H_∞(X) ≤ n̄ < H_∞(X) + δ
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
41. Huffman Coding algorithm (2).
• Example: Huffman code for source X, k = 1:
  Symbol   Probability   Code
  x1       0.45          0
  x2       0.35          10
  x3       0.2           11

  H(X) = 1.518 bits/sym
  n̄ = 1.55 bits/symbol
  ε = 97.9 %

  Huffman coding tree: x2 (0.35) and x3 (0.2) are grouped first (0.55), then the result is grouped with x1 (0.45).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
42. Huffman Coding algorithm (2).
Example: Huffman code for source Y = X^k, k = 2:

  Symbol Y   Probability   Code
  (x1,x1)    0.2025        10
  (x1,x2)    0.1575        010
  (x2,x1)    0.1575        001
  (x2,x2)    0.1225        011
  (x1,x3)    0.09          111
  (x3,x1)    0.09          0000
  (x2,x3)    0.07          1100
  (x3,x2)    0.07          0001
  (x3,x3)    0.04          1101

Huffman Coding of alphabet Y:
  H(Y) = 2 × H(X) = 3.036 bits/sym
  n̄_k = 3.0675 bits/sym
Average length per symbol from set X:
  n̄ = n̄_k / k = 1.534 bits/sym
  ε = 99 %
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
43. Huffman Coding algorithm
Exercise: Let X be the source alphabet with X= {A,B,C,D,E}, and
probabilities 0.35, 0.1, 0.15, 0.2, 0.2 respectively. Construct the binary
Huffman code for this alphabet and compute its efficiency.
Part I – An Information measure.
Part II – Source Coding
Part III – The Communication Channel.
Part IV – Channel Coding.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
45. Introduction
• A communication channel is used to connect the source of information and
its user.
Between the channel encoder output and the input of the demodulator we may consider a continuous channel with discrete input alphabet.
• As a practical example, the AWGN ("Additive" White Gaussian Noise) channel is well known, and is completely characterized by the probability distribution of the noise.
Between the channel encoder output and the channel decoder input, we may consider a discrete channel.
• The input and output of the channel are discrete alphabets. A practical example is the binary channel.
Figure: Discrete Source → Source Encoder → Channel Encoder → Modulator → Transmission Channel → Demodulator → Channel Decoder → Source Decoder → User.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
46. The discrete memoryless channel.
• A discrete channel is characterized by :
  - An input alphabet: X = { xi }, i = 1, …, N_X
  - An output alphabet: Y = { yj }, j = 1, …, N_Y
  - A set of conditional probabilities p_ij ≜ P( yj / xi )
Figure: transition diagram of the discrete channel, with transition probabilities p_ij from each input xi to each output yj.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
47. Discrete memoryless channel.
• p_ij ≜ P( yj / xi ) represents the probability of receiving the symbol yj, given that the symbol xi has been transmitted.
• The channel is memoryless:
  P( y1, y2, …, yn / x1, x2, …, xn ) = Π_{i=1}^{n} P( yi / xi )
  - x1, …, xn and y1, …, yn represent n consecutive transmitted and received symbols respectively.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
48. Discrete memoryless channel.
• Example 1:
  - The binary channel: N_X = N_Y = 2
  - Obviously we have the relationship Σ_{j=1}^{N_Y} p_ij = 1, that is:
    p11 + p12 = 1   and   p21 + p22 = 1
  - When p12 = p21 = p the channel is called the binary symmetric channel (BSC).
Figure: transition diagrams of the binary channel (probabilities p11, p12, p21, p22) and of the BSC (crossover probability p, correct transition probability 1 − p).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
49. Discrete memoryless channel.
• We define the channel matrix P by:
  P ≜ [ p11     p12     …   p1N_Y
        p21     p22     …   p2N_Y
        ⋮       ⋮            ⋮
        pN_X1   pN_X2   …   pN_XN_Y ]
  - The sum of the elements in each row of P is 1:
    Σ_{j=1}^{N_Y} p_ij = 1
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
50. Discrete memoryless channel.
• Example 2:
• The noiseless channel: NX = NY = N
� The symbols of the input alphabet are in one-to-one correspondence with the
symbols of the output alphabet.
  p_ij = 1 if i = j, 0 if i ≠ j   ⇒   P = I_N (the N × N identity matrix)
Figure: transition diagram of the noiseless channel (each input xi is connected only to the corresponding output yi).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
51. Discrete memoryless channel.
• Example 3:
� The useless channel: NX = NY = N
  - The matrix P has identical rows. The useless channel completely scrambles all input symbols, so that the received symbol does not give any useful information to decide upon the transmitted one:
    P( yj / xi ) = 1/N   ∀ i, j   ⇒   P = (1/N) · [N × N matrix of all ones]
  - Indeed P( yj ) = Σ_i P( yj / xi ) P( xi ) = (1/N) Σ_i P( xi ) = 1/N, so that
    P( yj / xi ) = P( yj )   ∀ i, j   ⇔   P( xi / yj ) = P( xi )
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
52. Conditional Entropy
• Definition:
�The conditional entropy H(X|Y) measures the average information quantity needed to specify the input symbol X when the output (or received) symbol is
known.
• This conditional entropy represents the average amount of information that has been
lost in the channel, and it is called equivocation.
• Examples:
The noiseless channel: H(X|Y) = 0
� No loss in the channel.
The useless channel: H(X|Y) = H(X)
� All transmitted information is lost on the channel
  H(X/Y) ≜ Σ_{i=1}^{N_X} Σ_{j=1}^{N_Y} P( xi, yj ) log( 1 / P( xi / yj ) )   bits/sym
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
53. The average mutual information
• Consider a source with alphabet X transmitting through a channel having the same
input alphabet.
• A basic point is the knowledge of the average information flow that can reliably pass
through the channel.
• Remark: We can define the average information at the output end of the channel:
  H(Y) ≜ Σ_{j=1}^{N_Y} P( yj ) log( 1 / P( yj ) )   bits/sym
Figure: emitted message → CHANNEL → received message, with information lost in the channel.
  Average information flow = Entropy of the input alphabet − Average information lost in the channel
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
54. The average mutual information
• We define the average information flow (the average mutual information
between X and Y) through the channel:
  I(X;Y) ≜ H(X) − H(X/Y)   bits/sym
  - Note that:
    I(X;Y) = H(X) − H(X/Y) = H(Y) − H(Y/X)
  - Remark: The mutual information has a more general definition than "an information flow". It is the average information provided about the set X by the set Y, excluding all average information about X from X itself (the average self-information is H(X)).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
55. The average mutual information
• Application to the BSC channel, with p12 = p21 = p and P(x1) = 1 − P(x2):
  I(X;Y) = H(Y) − H(Y/X)
1. H(Y/X) ≜ Σ_{i=1}^{N_X} Σ_{j=1}^{N_Y} P( xi, yj ) log( 1 / P( yj / xi ) )   bits/sym
   = P(x1,y1) log(1/P(y1/x1)) + P(x1,y2) log(1/P(y2/x1)) + P(x2,y1) log(1/P(y1/x2)) + P(x2,y2) log(1/P(y2/x2))
   Since P( yj , xi ) = P( yj / xi ) P( xi ) = p × P( xi ) for i ≠ j and (1 − p) × P( xi ) for i = j:
   H(Y/X) = p log2( 1/p ) + (1 − p) log2( 1/(1 − p) ) = Hf(p)
Figure: BSC transition diagram (crossover probability p, correct transition probability 1 − p).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
56. Mutual Information
2. P( yj ) = Σ_i p( yj / xi ) P( xi )
   H(Y) = P(y1) log( 1/P(y1) ) + P(y2) log( 1/P(y2) )
   ⇒ P(y1) = p + P(x1)(1 − 2p)   and   P(y2) = (1 − p) − P(x1)(1 − 2p)
3. I(X;Y) = H(Y) − Hf(p), and we plot I(X;Y) as a function of P(x1) for different values of p.
Figure: Mutual information of a BSC (bits/symbol) as a function of P(x1), for p = 0, 0.1, 0.2, 0.3 and 0.5.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
57. Capacity of a discrete memoryless channel.
• Considering the set of curves I(X;Y) as a function of P(x1) we can observe
that the maximum of I(X;Y) is always obtained for P(x1) = P(x2) = 0.5,
i.e. when the input symbols are equally likely.
• The channel capacity is defined as the maximum information flow
through the channel that a communication system can theoretically
expect.
• This maximum is achieved for a given probability distribution of the
input symbols. The maximum value of I(X;Y) is called the channel capacity C:
  C ≜ Max_{P(x)} I(X;Y)
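A small numeric sketch (not from the slides): for the BSC the maximization over P(x1) gives equally likely inputs and C = 1 − Hf(p), which the brute-force search below confirms for p = 0.1:

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def bsc_mutual_information(px1, p):
    """I(X;Y) = H(Y) - H_f(p) for a BSC with crossover p and input P(x1) = px1."""
    py1 = p + px1 * (1 - 2 * p)              # P(y1), as on slide 56
    return binary_entropy(py1) - binary_entropy(p)

p = 0.1
# Brute-force maximization of I(X;Y) over the input distribution
best_px1 = max((i / 1000 for i in range(1001)), key=lambda q: bsc_mutual_information(q, p))
print(best_px1)                              # 0.5: equally likely inputs
print(bsc_mutual_information(0.5, p))        # ~0.531 bits/symbol
print(1 - binary_entropy(p))                 # same value: C = 1 - H_f(p)
```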
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
58. Capacity of a discrete memoryless channel.
• For BSC this capacity is obtained when the channel input symbols
are equally likely.
• This result can be extended for more general case of symmetric
discrete memoryless channels (NX inputs) .
• Theorem 14: For a symmetric discrete memoryless channel,
capacity is achieved by using the inputs with equal probability.
  P( xi ) = 1 / N_X   for all i = 1, …, N_X
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
59. Capacity of a discrete memoryless channel.
• Example: The Binary Symmetric Channel.
Figure: a BPSK modulator (pulse h(t), carrier cos(2πf0t)), the AWGN channel and the matched filter h(−t) form, between the binary input {1, 0} and the binary output {1, 0}, an equivalent BSC with crossover probability
  p = Q( √( 2 Eb / N0 ) )
and input probabilities P(0) = P(1) = 0.5.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
60. Capacity of a discrete memoryless channel
• Example: The Binary Symmetric Channel
Figure: I(X;Y) of the BSC (bits/symbol) versus SNR (dB), for P(x1) = 0.25, P(x2) = 0.75 and for P(x1) = P(x2) = 0.5; the latter curve is the capacity of the BSC.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
61. Capacity of the additive Gaussian channel
• The channel disturbance has the form of a continuous Gaussian random variable ν with variance σν², added to the transmitted signal: Y = X + ν, with ν ~ N(0, σν²).
  - The assumption that the noise is Gaussian is desirable from the mathematical point of view, and is reasonable in a wide variety of physical settings.
• In order to study the capacity of the AWGN channel, we drop the hypothesis of a discrete input alphabet and we consider the input X as a continuous random variable with variance σX², probability density p_X(x) and output density p_Y(y).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
62. Capacity of the additive Gaussian channel
• We recall the expression of the capacity:
  C ≜ Max_{P(x)} I(X;Y),   with   I(X;Y) = H(Y) − H(Y/X)
• Theorem 15: The capacity of a discrete-time, continuous additive Gaussian channel is achieved when the continuous input has a Gaussian probability distribution:
  C = (1/2) log2( 1 + σX² / σν² )
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
63. Capacity of a bandlimited Gaussian channel with waveform input
• We deal now with a waveform input signal in a channel bandlimited to the frequency interval (−B, +B).
• The noise is white and Gaussian with two-sided power spectral density N0/2. In the band (−B, +B), the noise mean power is σν² = (N0/2)·(2B) = N0·B.
• For a zero-mean, stationary input, each sample will have a variance σX² equal to the signal power P, i.e. σX² = P.
• Using the sampling theorem we can represent the signal using at least 2B samples per second. Transmitting one sample every 1/2B seconds, we express the capacity in bits/sec as:
  Cs = B log2( 1 + P / (N0·B) )   bits/sec
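A minimal sketch of this formula; the numerical values (band, noise density, SNR) are illustrative assumptions, not from the course:

```python
import math

def awgn_capacity_bits_per_sec(P, N0, B):
    """C_s = B * log2(1 + P / (N0 * B)), capacity of the bandlimited AWGN channel."""
    return B * math.log2(1 + P / (N0 * B))

# Illustrative (hypothetical) numbers: a 3 kHz band with SNR P/(N0*B) = 1000 (30 dB)
B = 3000.0           # Hz
N0 = 1e-9            # W/Hz (noise spectral density, assumed value)
P = 1000 * N0 * B    # signal power chosen so that the SNR is 30 dB
print(awgn_capacity_bits_per_sec(P, N0, B))   # ~29.9 kbit/s
```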
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
64. Capacity of a bandlimited Gaussian channel with waveform input
Figure: AWGN capacity C = (1/2) log2(1 + SNR) in bits/symbol, plotted versus SNR (dB).
2
Part I – An Information measure.
Part II – Source Coding
Part III – The Communication Channel.
Part IV – Channel Coding.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
66. The noisy channel coding theorem
• In its more general definition, channel coding is the operation of mapping each sequence emitted from a source to another sequence belonging to the set of all possible sequences that the channel can convey. The functional role of channel coding in a communication system is to ensure reliable communication. The performance limits of this coding are stated in the fundamental channel coding theorem.
• The noisy channel coding theorem introduced by C. E. Shannon in 1948 is one of the most important results in information theory.
• In imprecise terms, this theorem states that if a noisy channel has capacity Cs in bits per second, and if binary data enters the channel encoder at a rate Rs < Cs , then by an appropriate design of the encoder and decoder, it is possible to reproduce the emitted data after decoding with a probability of error as small as desired.
• Hence the noise no longer appears to be a limiting parameter on the quality of a communication system, but rather on the information rate that can be transmitted through the channel.
Figure: Source → Channel Coding → Channel.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
67. The noisy channel coding theorem
• This result highlights the significance of the channel capacity. Let us recall the average information rate that passes through the channel:
  I(X;Y) ≜ H(X) − H(X/Y)   bits/sym
  - The equivocation H(X/Y) represents the amount of information lost in the channel, where X and Y are its input and output alphabets respectively.
  - The capacity C is defined as the maximum of I(X;Y). The maximum is taken over all input distributions [P(x1), P(x2), …].
  - If an attempt is made to transmit at a higher rate than C, say C + r, then there will necessarily be an equivocation equal to or greater than r.
• Theorem 16: Let a discrete channel have a capacity C and a discrete source have an entropy rate R. If R ≤ C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors (or an arbitrarily small equivocation). If R > C there is no method of encoding which gives an equivocation less than R − C (Shannon 1948).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
68. The noisy channel coding theorem
• To prove the theorem, Shannon shows that a code having this desired property must exist in a certain group of codes. Shannon proposed to average the frequency of errors over this group of codes, and shows that this average can be made arbitrarily small.
• Hence the noisy channel coding theorem states the existence of such a code but does not exhibit a way of constructing it.
• Consider a source with entropy rate R ≤ C. Consider then a random mapping of each sequence of the source to a possible channel sequence. One can then compute the average error probability over an ensemble of long sequences of the channel. This gives rise to an upper-bounded average error probability:
  P(e) < 2^{−n·E(R)}
• E(R) is a convex ∪, decreasing function of R, with 0 < R < C and n the length of the emitted sequences.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
69. The noisy channel coding theorem
• In order to make this bound as small as desired, the exponent factor has to
be as large as possible. A typical behavior of E(R) is shown in figures below.
• The average probability can be made as small as desired by increasing E(R) :
Figure: typical behavior of the exponent E(R) versus the rate R (0 < R < C): E(R) increases when R decreases, and the curve moves up when the channel capacity increases (curves shown for rates R1, R2 and capacities C1, C2).
Reducing R is not a desirable solution as it is antinomic with the objective of transmitting a higher information rate.
Higher capacity is achieved with a greater signal to noise ratio. Again, this solution is not adequate since power is costly and, in almost all applications power is limited.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
70. The noisy channel coding theorem
• The informal proof by Shannon of the noisy channel coding theorem considers randomly
chosen long sequences of channel symbols.
� Thus it is obvious that the average error probability could be rendered arbitrarily small by choosing long sequences of codewords n.
  - In addition, the theorem considers randomly chosen codewords. Practically this appears to be incompatible with reality, unless a genius observer delivers to the user of the information the rule of coding (the mapping) for each received sequence.
  - The number of codewords and the number of possible received sequences are exponentially increasing functions of n. Thus for large n, it is impractical to store the codewords in the encoder and decoder when a deterministic rule of mapping is adopted.
  - We shall continue our study on channel coding by discussing techniques that avoid these difficulties; progressively, after introducing simple coding techniques, we will emphasize concatenated codes (known as turbo codes), which approach capacity limits as they behave like random codes.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
71. Improving transmission reliability: Channel Coding
• The role of channel coding in a digital communication system is essential in order to improve the error probability at the receiver. In almost all practical applications, channel coding is indubitably required to achieve reliable communication, especially in digital mobile communication.
Figure: error probability Pe versus Eb/N0 (dB) for uncoded QPSK, TCM and 6D TCM. Gc (2.5 dB) is the coding gain at Pe = 10^-4 for the first code; G'c (3.75 dB) is the coding gain at Pe = 10^-5 for the second code.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
72. Linear Binary Codes
• Data and codewords are formed by binary digits 0 and 1.
  - The channel input alphabet is binary and accepts symbols 0 and 1.
  - If the output of the channel is binary we will deal essentially with the BSC.
  - If the output of the channel is continuous we will deal essentially with the AWGN channel.
  - We assume an ideal source coding, i.e., each digit at the output of the source block conveys an information amount of 1 bit: P(0) = P(1) = 0.5
• We will present two families of binary channel coding:
  - Block codes
  - Convolutional codes
Figure: Source and source coding → Channel Coding → Channel (binary sequences such as {0,1,1,0,…} in, {1,0,1,0,…} out).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
73. Channel Coding techniques: Linear binary block codes
• A block of n digits (codeword) generated by the encoder
depends only on the corresponding block of k bits generated
by the source.
• The code is defined by the set of all 2k sequences of length n
(codewords) generated by the encoder and is referred as an (n,
k) code .
Figure: Block encoder of rate ρ = k/n ≤ 1, mapping u = (u1, u2, …, uk) to x = (x1, x2, …, xn).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
74. Channel Coding techniques: Linear binary block codes
• In a BSC (binary symmetric channel) the received n-sequence is:
  y = x ⊕ e
• e = [e1 e2 … en] is a binary n-sequence representing the error vector. If ei = 1 then an error has occurred at digit i.
Figure: Source (& source coding) → Channel Encoder → Modulator → Transmission Channel → Demodulator → Channel Decoder → User; u = [u1 u2 … uk] at Rs bit/s, x = [x1 x2 … xn] at Rs/ρ symbol/s, with ρ = k/n ≤ 1.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
75. Channel Coding techniques: Linear binary block codes
• Examples:
  - Repetition Code (3, 1):   x1 = u1, x2 = u1, x3 = u1
  - Parity Check code (3, 2): x1 = u1, x2 = u2, x3 = u1 ⊕ u2
  where ⊕ denotes the modulo-2 sum.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
76. Channel Coding techniques: Linear binary block codes
  - Hamming Code (7, 4):
    xi = ui,  i = 1, 2, 3, 4
    x5 = u1 ⊕ u2 ⊕ u3
    x6 = u2 ⊕ u3 ⊕ u4
    x7 = u1 ⊕ u2 ⊕ u4
  - The encoding rule can be represented by the generator matrix G: x = uG
    G = [ 1 0 0 0 1 0 1
          0 1 0 0 1 1 1
          0 0 1 0 1 1 0
          0 0 0 1 0 1 1 ]
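An illustrative Python sketch of the encoding rule x = uG over GF(2), using the generator matrix reconstructed above; it also checks by enumeration that the code has 16 codewords and minimum distance 3:

```python
from itertools import product

# Generator matrix of the (7,4) Hamming code as reconstructed above
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]

def encode(u):
    """Systematic encoding x = uG, with arithmetic modulo 2."""
    return [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

codewords = [encode(list(u)) for u in product([0, 1], repeat=4)]
print(encode([1, 0, 1, 1]))                      # one codeword
# For a linear code, d_min equals the minimum weight of a nonzero codeword
d_min = min(sum(x) for x in codewords if any(x))
print(len(codewords), d_min)                     # 16 codewords, d_min = 3
```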
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
77. Channel Coding techniques: Linear binary block codes
• A systematic encoder is an encoder where all the k information digits belong to the codeword. Thus G assumes the canonical form:
  G = [ I_k   P ]
  where I_k is the k × k identity matrix and P is a k × (n − k) matrix which specifies the parity check equations.
• A systematic encoder introduces r = (n − k) redundant binary digits.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
78. Properties of linear block codes
• Property 1: Any linear combination of codewords is a codeword.
  a. The block code consists of all possible sums of the rows of the generator matrix.
  b. The sum of two codewords is a codeword.
• Property 2: The n-sequence of all zeros is always a codeword.
• Property 3: A block code is a commutative group under the ⊕ operation.
  a. The all-zeros codeword is the identity element of the code.
  b. If x1, x2 and x3 are codewords then:
     (x1 ⊕ x2) ⊕ x3 = x1 ⊕ (x2 ⊕ x3)
     x1 ⊕ x2 = 0 ⇒ x1 = x2
     x1 ⊕ x2 = x2 ⊕ x1
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
79. Hamming distance
• We define the Hamming distance between two codewords as the
number of places where they differ.
  dH(x, x') = Σ_{i=1}^{n} ( xi ⊕ x'i )
  - One can verify that the Hamming distance is a metric that indeed satisfies the triangle inequality dH(x1,x3) ≤ dH(x1,x2) + dH(x2,x3).
• The minimum distance of a linear block code is:
  dH,min ≜ Min_{x, x' : x ≠ x'} dH(x, x')
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
80. Error detecting capabilities
• Consider a systematic encoder transmitting over a BSC.
• The received sequence contains independent random errors caused by the
channel noise.
• Using the first k received symbols y1, …, yk, an algebraic decoder computes the (n − k) parity equations and compares them to the last (n − k) received symbols y_{k+1}, …, y_n:
  y = x ⊕ e,   e = [e1 e2 … en]
  xi = ui,   1 ≤ i ≤ k
  xi = Σ_{j=1}^{k} g_ji u_j,   k + 1 ≤ i ≤ n   ((n − k) parity equations)
  y'i = Σ_{j=1}^{k} g_ji y_j,   k + 1 ≤ i ≤ n
  The decoder then examines y'i ⊕ yi for k + 1 ≤ i ≤ n.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
81. Maximum likelihood detection in an AWGN channel
• Consider a communication system with channel coding and decoding
processes and having these properties:
� The channel is memoryless and the noise is AWGN
� The channel's input alphabet is binary and its output is the set of real numbers.
  - The source coding is ideal in the sense that each binary digit delivered by the block "source and source coding" conveys an amount of information of 1 bit (P(0) = P(1) = 0.5).
Figure: Source and Source Coding → Channel Encoder → BPSK modulator (pulse h(t), carrier cos(2πf0t)) → AWGN channel → matched filter h(−t) → Maximum Likelihood Detection → User; x is the transmitted codeword and r the received sequence.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
82. Lower bound on error probability
• Considering only nearest-neighbor errors we have:
  Pe ≥ Nmin · Q( dE,min / √(2 N0) )
  where dE,min is the minimum Euclidean distance between two sequences and Nmin is the average number of nearest neighbors in the code separated by dE,min.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
83. Lower bound on error probability
• Considering again the BPSK modulation case with symbols (−A, +A) one can easily show that:
  d²E,min = 4 × dH,min × A²,   with   A² = (k/n) Eb = ρ Eb
• The lower bound can then be expressed as follows:
  Pe ≥ Nmin · Q( √( dH,min · ρ · 2 Eb / N0 ) ),   ρ = k/n
• Comments: A code having a greater minimum Hamming distance may exhibit better asymptotic performance. But when Nmin is very large one can experience significant losses in the global performance. In addition this bound may be loose for small values of SNR, as errors may occur between codewords separated by a distance greater than the minimum Euclidean distance.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
84. Lower bound on error probability
Figure: Lower bound on word error probability Pe versus Eb/N0 (dB) for uncoded BPSK, the Hamming (7,4) code and the Golay (23,12) code.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
85. Maximum likelihood detection in BSC Channel
• Consider again the communication system with channel coding and decoding, but consider now that the decoding process takes place after demodulation.
  - The channel is memoryless and the noise is AWGN.
  - The channel's input and output alphabets are binary.
  - The source coding is ideal in the sense that each binary digit delivered by the block "source and source coding" conveys an amount of information of 1 bit (P(0) = P(1) = 0.5).
Figure: Source and Source Coding → Channel Encoder → BPSK modulator (pulse h(t), carrier cos(2πf0t)) → AWGN channel → matched filter h(−t) and hard decision {1, 0, …} → Channel Decoding → User.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
86. Maximum likelihood detection in BSC channel
• The channel encoder delivers a codeword x of the (n, k) code, and BPSK demodulation delivers a binary n-sequence y; the equivalent BSC has transition probability p:
  p = Q( √( 2 (k/n) Eb / N0 ) )
• We have then y = x ⊕ e where e = [e1 e2 … en] is a sequence of errors, ei = 1 when an error occurs at position i, ei = 0 otherwise. The ML detection of codewords in a BSC is given by:
  x̂ = x(m)   ⇔   P( y / x(m) ) = Max over x(l) of P( y / x(l) )
  with   P( y / x(l) ) = p^{d_l} (1 − p)^{n − d_l}
  where d_l = dH( y, x(l) ) is the Hamming distance between the received sequence and the codeword x(l).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
87. Maximum likelihood detection in BSC Channel
• The ML detection criterion in the BSC can be expressed after taking the logarithm of the conditional probabilities. As p < 1/2, P( y / x(l) ) is a monotonically decreasing function of d_l. Therefore, the ML criterion in the BSC reduces to the following rule:
  x̂ = x(m)   ⇔   dH( y, x(m) ) = Min over x(l) of dH( y, x(l) )
• As ML detection in the BSC amounts to selecting the codeword closest, in terms of Hamming distance, to the received binary sequence, the minimum Hamming distance appears once more to be an influential parameter for the error performance of linear block codes.
  - In designing a good linear binary code one must search for codes maximizing the minimum Hamming distance and having a small average number of nearest neighbors.
  - The receiver operating on received binary sequences (after demodulation, i.e. a BSC channel) is known as a hard-decision decoder.
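A brute-force sketch of this hard-decision ML rule, assuming the (7,4) Hamming code of slide 76 as the codebook (an assumption of this example, not a general decoder):

```python
from itertools import product

# Hard-decision ML decoding in a BSC: pick the codeword closest in Hamming distance.
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]
encode = lambda u: [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]
codebook = [encode(list(u)) for u in product([0, 1], repeat=4)]

def hamming_distance(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

def ml_decode_bsc(y):
    """argmin over codewords of d_H(y, x), equivalent to max P(y|x) when p < 1/2."""
    return min(codebook, key=lambda x: hamming_distance(y, x))

x = encode([1, 0, 1, 1])
y = x[:]
y[2] ^= 1                      # introduce a single channel error
print(ml_decode_bsc(y) == x)   # True: a single error is corrected (d_min = 3)
```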
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
88. Hard v/s Soft decoding
  Hard decoder: hard input, hard output.
  Soft decoder: soft input, hard output.
Figure: Lower bounds on the word error probability Pe of the (7,4) Hamming code versus Eb/N0 (dB), for hard decoding and soft decoding.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
89. Hard v/s Soft decoding
• We will now study the correcting capabilities of an ML receiver in a BSC. We will then introduce a practical method of decoding linear block codes, which can be stated as an "error controlling and correcting" technique, and derive its detection capabilities. Correction and detection capabilities are both related to the minimum Hamming distance.
Figure: Capacity of BPSK in AWGN (bits/symbol) versus Eb/No (dB), for the soft-decision channel and the hard-decision channel (BSC).
90. Error correcting and detecting capabilities
• Theorem 17: A linear block code (n, k) with minimum
Hamming distance dH,min can correct all error vectors of
weight not greater than t = ⌊(dH,min − 1)/2⌋, where ⌊a⌋ denotes the integer part of a.
• Theorem 18: A linear block code (n,k) with minimum
distance dH,min detects all error vectors of weight not greater
than (dH,min – 1 ).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
91. Error correcting and detecting capabilities
• A code which has a correction capability equal to t
is often denoted as an (n, k, t) code.
• The Parity Check code has a minimum distance 2.
It can detect all single errors but cannot correct
any.
• The (7, 4) Hamming code has a minimum distance
3. It is expected to correct all single errors.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
92. Cyclic codes
• A linear (n, k) block code is a cyclic code if and only if any cyclic shift
of a code word produces another code word.
• Cyclic codes are parity-check codes that present a large
amount of algebraic structure.
• Cyclic codes have peculiar properties that allow easy
encoding operations and simple decoding algorithms.
� Cyclic codes are of great practical interest.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
93. Cyclic Codes.
� The Hamming code (7, 4) is a cyclic code. For instance, there are six different cyclic shifts of the code word 0111010:
1110100 1101001 1010011 0100111 1001110 0011101
  - They all belong to the set of code words.
  - In dealing with cyclic codes it is useful to represent a binary sequence of n digits as a polynomial in the indeterminate Z.
  - A code word x = [x_{n−1}, x_{n−2}, …, x_0] is represented as follows:
    x(Z) = x_{n−1} Z^{n−1} ⊕ x_{n−2} Z^{n−2} ⊕ … ⊕ x_1 Z ⊕ x_0
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
94. BCH codes
• Bose-Chaudhuri-Hocquenghem codes.
• This class of cyclic codes is one of the most useful
for correcting random errors mainly because the
decoding algorithms can be implemented with an
acceptable amount of complexity.
• For any pair of positive integers m and t, there is a
binary BCH code with the following parameters:
  n = 2^m − 1,   n − k ≤ m·t,   dH,min ≥ 2t + 1
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
95. BCH codes
• This code can correct all combinations of t or fewer
errors.
• These codes are interesting because of the
flexibility in choice of parameters (block length and
code rate), and the available decoding algorithms
that can be implemented.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
96. Reed- Solomon Codes
• A subclass of BCH codes generalized to the non binary case
(symbols belonging to a set of cardinality q = 2m ).
• Each symbol can be represented as a binary m-tuple, and the
code can be considered a special type of binary code.
• The parameters of an RS code are:
  - Symbol: m binary digits
  - Block length: n = 2^m − 1 symbols
  - Parity checks: n − k = 2t symbols
• These codes are capable of correcting all combinations of t or
fewer symbol errors. They are well suited for correction of
burst binary errors.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
97. Convolutional Codes
• A sequential machine:
• The register content u_{i−1} u_{i−2} defines the state of the machine at instant i:
  State   u_{i−1} u_{i−2}
  S0      0 0
  S1      1 0
  S2      0 1
  S3      1 1
Figure: rate-1/2 convolutional encoder: the input u_i and the register contents u_{i−1}, u_{i−2} are combined by modulo-2 adders (+) to produce the two output digits x1 and x2.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
98. Convolutional Codes
• Transition rules from state to state: each transition is labeled (u_i / x1, x2).
Figure: state diagram with the four states S0, S1, S2, S3.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
99. Convolutional Codes
Figure: state diagram of the convolutional encoder; each branch between the states S0, S1, S2, S3 carries a label (u_i / x1, x2), the eight branch labels being (0/0,0), (1/0,0), (0/1,1), (1/1,1), (0/1,0), (1/0,1), (0/0,1) and (1/1,0).
100. Convolutional Codes
• Thus a convolutional code (n, k, ν) is a code of rate k/n with 2^ν states in its trellis representation.
• A convolutional code has a minimum Hamming distance dH,min (also called the free distance d_free).
• This distance can be evaluated by inspecting the trellis structure when the latter has a relatively simple behavior.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
101. Viterbi Algorithm
• The Viterbi Algorithm can be used to decode a convolutionally coded sequence, taking advantage of the inherent trellis structure of the code.
• The Viterbi Algorithm is proved to be MLSE (maximum likelihood sequence estimation) and asymptotically optimal.
• For a BPSK modulation and a convolutional code (n, k) with free distance d_free, the lower bound on the error probability is:
  Pe ≥ N_free · Q( √( 2 d_free (k/n) Eb / N0 ) )
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
102. Viterbi Algorithm
• The practical implementation of the Viterbi Algorithm makes convolutional codes widely used in communication systems.
• High performance is achieved with a small amount of complexity.
• For a given constraint length, decoder complexity grows linearly with the sequence length n (a direct computation of the MLSE would have a complexity growing exponentially with n).
• The soft-decoding Viterbi Algorithm is commonly used, and improves performance by up to 3 dB over the hard-decoding technique.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
ANNEXE 1: Comparison between digital modulations
• We define the spectral efficiency η of the transmitted waveform signaling:
  η ≜ R / B   bit/sec/Hz
• Let
  Cs = B log2( 1 + Eb R / (N0 B) )   bits/sec,   and   ηc ≜ Cs / B   bit/sec/Hz
• Theorem A.1: To transmit information reliably on an additive white Gaussian noise channel with spectral efficiency η, any digital communication system requires a signal-to-noise ratio satisfying:
  Eb / N0 ≥ ( 2^η − 1 ) / η
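A numeric sketch of theorem A.1 (illustrative code): it evaluates the minimum Eb/N0 for a few spectral efficiencies and recovers the −1.6 dB limit as η → 0:

```python
import math

def min_ebn0_db(eta):
    """Minimum Eb/N0 (in dB) for reliable transmission at spectral efficiency eta."""
    return 10 * math.log10((2 ** eta - 1) / eta)

for eta in (0.001, 0.5, 1.0, 2.0, 4.0):
    print(eta, round(min_ebn0_db(eta), 2))
# eta -> 0 gives the ultimate Shannon limit of about -1.59 dB;
# higher spectral efficiencies demand a larger Eb/N0.
```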
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
ANNEXE 1: Comparison between digital modulations
• For B → ∞ (η → 0) the limit of Cs = B log2( 1 + Eb R / (N0 B) ) yields:
  C∞ = ( R Eb / N0 ) log2 e
  - As the information source rate R must be less than the channel capacity (to ensure an error-free transmission, i.e. Pe → 0):
    R ≤ C∞   ⇒   Eb / N0 ≥ 1 / log2 e = 0.693   ⇒   Eb / N0 ≥ −1.6 dB
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
ANNEXE 1: Comparison between digital modulations
Figure: Spectral efficiency (bit/s/Hz, log scale) versus Eb/N0 (dB) at Pe = 10^-5, showing the channel capacity boundary Eb/N0 ≥ (2^η − 1)/η delimiting the region where error-free transmission is possible, the −1.6 dB limit, the bandwidth-limited region (BPSK, QPSK, 8PSK, 16PSK, 16QAM, 64QAM) and the power-limited region (orthogonal signals with coherent detection, M = 8, 16, 32, 64).
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
ANNEXE 2: The noisy channel coding theorem - Shannon's informal interpretation
• Let us consider a source with alphabet X matched to the channel in a way that the source
achieves the channel capacity C, the entropy rate of the source being H(X).
� Each source’s sequence of length n is a codeword and is represented by a point in the figure
below.
  - We know that for large n, there are approximately 2^{nH(X)} typical input sequences x having probability 2^{−nH(X)}, and similarly 2^{nH(Y)} typical output sequences y having probability 2^{−nH(Y)}, and finally 2^{nH(X,Y)} typical pairs (x, y).
  - For each output sequence y, there are 2^{n[H(X,Y) − H(Y)]} = 2^{nH(X|Y)} input sequences x such that (x, y) is a typical pair. S will be the set of 2^{nH(X|Y)} input sequences x associated with y.
Figure: 2^{nH(X)} typical input sequences and 2^{nH(Y)} typical output sequences; each output y is associated with a set S of 2^{nH(X|Y)} input sequences.
TELECOM LILLE 1 - Février 2010 Information Theory and Coding
ANNEXE 2: The noisy channel coding theorem - Shannon's informal interpretation
• Let us consider now another source with entropy rate R, R ≤ C ≤ H(X), delivering sequences or codewords of length n. This source will have 2^{nR} high probability sequences. We wish to associate each of these sequences with one of the possible channel inputs in such a way as to get an arbitrarily small error probability. One way is to randomly associate each source sequence to a channel input sequence, and calculate the frequency of errors.
• If a codeword x(i) is transmitted through the channel and the sequence y is received, an
error in decoding is possible only if at least one codeword x(j) , j ≠ i belongs to the set S
associated with y:
  P{ at least one x(j), j ≠ i, belongs to S } ≤ Σ_{j=1, j≠i}^{2^{nR}} P{ x(j) ∈ S }
  ≤ 2^{nR} · 2^{nH(X|Y)} / 2^{nH(X)} = 2^{nR} · 2^{−nC} → 0   as n → ∞