
Information and Communication Theory

Lecture 6

Channel Coding

Mário A. T. Figueiredo

DEEC, Instituto Superior Técnico, University of Lisbon, Portugal

2021


Channel Coding

Message W ∈ {1, ...,M}

Channel input alphabet X ; output alphabet Y.

Encoder: f : {1, ..., M} → X^n.

Decoder: g : Y^n → {1, ..., M}.

Message estimate: Ŵ = g(Y^n)

Memoryless channel model: (X , p(y|x),Y)

p(y1, ..., yn | x1, ..., xn) = p(y^n | x^n) = ∏_{i=1}^n p(yi | xi)

An (M,n) code: ({1, ...,M}, f, g)
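To make these objects concrete, here is a minimal Python sketch of the W → X^n → Y^n → Ŵ chain. The channel and code are assumptions for illustration (not from the slides): a BSC with crossover probability 0.1 and a toy (2, 3) repetition code.

```python
# Minimal sketch of the (M, n) coding chain over a memoryless channel.
# Assumed example: BSC with alpha = 0.1 and a (2, 3) repetition code.
import random

ALPHA = 0.1          # BSC crossover probability (assumed)
M, N = 2, 3          # an (M, n) = (2, 3) code

def encoder(w):
    """f : {1, ..., M} -> X^n  (message 1 -> 000, message 2 -> 111)."""
    return [w - 1] * N

def channel(xn):
    """Memoryless channel: each symbol flipped independently w.p. ALPHA,
    so p(y^n | x^n) factorizes as a product of p(y_i | x_i)."""
    return [x ^ (random.random() < ALPHA) for x in xn]

def decoder(yn):
    """g : Y^n -> {1, ..., M}  (majority vote for this toy code)."""
    return 2 if sum(yn) > N / 2 else 1

w = random.randint(1, M)      # message W
xn = encoder(w)               # codeword X^n = f(W)
yn = channel(xn)              # channel output Y^n
w_hat = decoder(yn)           # estimate W^ = g(Y^n)
print(w, xn, yn, w_hat)
```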


Channels

Memoryless channel model: (X , p(y|x),Y)

Channel matrix: |X| × |Y|, with P = [P_{i,j}], where P_{i,j} = P(Y = j | X = i).

Channel capacity: C = max_{p(x)} I(X;Y), the maximum mutual information over all input distributions.

Example: noiseless binary channel (Y = X):

I(X;Y ) = H(Y ) = H(X).

C = max_{p(x)} I(X;Y) = max_{p(x)} H(X) = 1 bit/symbol

A noiseless binary channel can transmit 1 bit/symbol.
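The maximization over p(x) can be carried out numerically for any finite channel matrix. The sketch below uses the Blahut-Arimoto iteration (an algorithm not covered in these slides) and checks the result on the noiseless binary channel (C = 1 bit/symbol) and on a BSC with an assumed crossover probability of 0.1.

```python
# Numerical capacity C = max_{p(x)} I(X;Y) of a finite channel via the
# Blahut-Arimoto iteration (a sketch; not an algorithm from the slides).
# Channel matrices are |X| x |Y| row-stochastic lists of lists.
from math import log2

def capacity(P, iters=2000):
    nx, ny = len(P), len(P[0])
    r = [1.0 / nx] * nx                       # input distribution p(x)
    I = 0.0
    for _ in range(iters):
        q = [sum(r[i] * P[i][j] for i in range(nx)) for j in range(ny)]
        # D[i] = D( p(y|x=i) || q ): how distinguishable input i is at the output
        D = [sum(p * log2(p / q[j]) for j, p in enumerate(P[i]) if p > 0)
             for i in range(nx)]
        I = sum(r[i] * D[i] for i in range(nx))   # current I(X;Y) estimate
        w = [r[i] * 2 ** D[i] for i in range(nx)]
        r = [wi / sum(w) for wi in w]             # re-weight the inputs
    return I, r

# Noiseless binary channel: C = 1 bit/symbol, uniform input.
print(capacity([[1.0, 0.0], [0.0, 1.0]]))
# BSC with alpha = 0.1 (assumed): C = 1 - H(0.1) ~= 0.531 bits/symbol.
print(capacity([[0.9, 0.1], [0.1, 0.9]]))
```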


Binary Symmetric Channel

Binary symmetric channel: (X, p(y|x), Y), with X = Y = {0, 1} and crossover probability P(Y ≠ X | X = x) = α.

H(Y |X = x) = H(α, 1− α), for x = 0 or x = 1.

H(Y |X) = H(α, 1− α)

I(X;Y ) = H(Y )−H(α, 1− α)

Capacity: let P(X = 0) = β

C = max_β [H(Y) − H(α, 1 − α)] = 1 − H(α, 1 − α) bits/symbol

...achieved for β = 1/2.

For α = 0 or α = 1, C = 1; for α = 1/2, C = 0.
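As a quick numerical check (with an assumed α = 0.2, not a value from this slide), one can sweep β = P(X = 0) and confirm that I(X;Y) peaks at β = 1/2 with value 1 − H(α, 1 − α):

```python
# Numerical check of the BSC capacity C = 1 - H(alpha, 1 - alpha) by
# sweeping the input distribution P(X=0) = beta. Sketch; alpha = 0.2 assumed.
from math import log2

def H2(p):
    """Binary entropy H(p, 1-p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def I_bsc(beta, alpha):
    p_y0 = beta * (1 - alpha) + (1 - beta) * alpha   # P(Y = 0)
    return H2(p_y0) - H2(alpha)                      # H(Y) - H(Y|X)

alpha = 0.2
best = max((I_bsc(b / 1000, alpha), b / 1000) for b in range(1001))
print(best)                      # maximum I(X;Y) and the maximizing beta (0.5)
print(1 - H2(alpha))             # closed form: 1 - H(alpha)
```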


Binary Erasure Channel

Binary erasure channel: (X, p(y|x), Y), with X = {0, 1}, Y = {0, 1, ∗}, and erasure probability δ.

H(X|Y = y) = 0, for y = 0 or y = 1; H(X|Y = ∗) = H(X)

H(X|Y) = δ H(X)

I(X;Y) = H(X) − H(X|Y) = (1 − δ) H(X)

Capacity: let P(X = 0) = β

C = max_β (1 − δ) H(X) = 1 − δ bits/symbol

...achieved for β = 1/2.

For δ = 0, C = 1; for δ = 1, C = 0.
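A similar sketch for the erasure channel, sweeping β with an assumed δ = 0.3 and using I(X;Y) = (1 − δ) H(X):

```python
# Numerical check of the BEC capacity C = 1 - delta by sweeping beta = P(X=0).
# Sketch; delta = 0.3 is an arbitrary assumed value.
from math import log2

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

delta = 0.3
best = max(((1 - delta) * H2(b / 1000), b / 1000) for b in range(1001))
print(best)          # ~ (0.7, 0.5): I(X;Y) is maximized at beta = 1/2
print(1 - delta)     # closed form
```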


Noisy Typewriter

Noisy typewriter: (X, p(y|x), Y), with X = Y = {1, 2, 3, 4}; each input is received either unchanged or as the next symbol, each with probability 1/2.

H(Y |X = x) = 1, for x = 1, 2, 3, 4. H(Y |X) = 1

I(X;Y ) = H(Y )−H(Y |X) = H(Y )− 1

Capacity:

C = max_{p(x)} [H(Y) − 1] = log 4 − 1 = 1 bit/symbol

...achieved for p(x) uniform.


Properties of Channel Capacity

Because I(X;Y ) ≥ 0, then C ≥ 0

Because I(X;Y ) = H(X)−H(X|Y ) ≤ H(X),

C = max_{p(x)} I(X;Y) ≤ max_{p(x)} H(X) ≤ log |X|

Because I(X;Y ) = H(Y )−H(Y |X) ≤ H(Y ),

C = max_{p(x)} I(X;Y) ≤ max_{p(x)} H(Y) ≤ log |Y|

Corollary: C ≤ min{log |X |, log |Y|}


Exercises

Compute the capacity of a series connection of two binary symmetric channels. Hint: consider the equivalent BSC.

Consider the parallel of two independent channels (X1, p1(y|x), Y1) and (X2, p2(y|x), Y2), i.e., the channel

(X1 × X2, p1(y1|x1) p2(y2|x2), Y1 × Y2).

What is the capacity of this channel?

Consider N channels with |X| = |Y| and non-maximal capacity, i.e., C < log |X|, connected in series. Show that the capacity of the resulting channel converges to zero as N goes to infinity. Hint: use the data processing inequality.


Exercises

Compute the capacity and the maximizing p(x) for the Z channel.

Consider a channel obtained by taking two conditionally independent looks, Y1 and Y2, at the output of a channel of capacity C, for each input X. Show that the resulting capacity C′ ≤ 2C. Hint: begin by showing that I(X;Y1, Y2) = 2I(X;Y1) − I(Y1;Y2).

A symmetric channel is one in which every row of the channel matrix is a permutation of every other row and every column is a permutation of every other column. Show that in this case the capacity is

C = log |Y| − H(any row of the channel matrix).

Show that the same result applies to weakly symmetric channels, where the columns are only required to sum to the same number.


Channel Coding

Conditional probability of error (for i ∈ {1, ...,M})

λ_i = P(g(Y^n) ≠ i | X^n = f(i)) = ∑_{y^n ∈ Y^n} P(Y^n = y^n | X^n = f(i)) · 1_{g(y^n) ≠ i}

where 1_A = 1 if A is true and 1_A = 0 if A is false.

Maximum probability of error: λ^(n) = max_{i ∈ {1,...,M}} λ_i

Average probability of error: P_e^(n) = (1/M) ∑_{i=1}^M λ_i

Probability of error: P(g(Y^n) ≠ W)

If the messages are equiprobable: P(g(Y^n) ≠ W) = P_e^(n)

Of course, P_e^(n) ≤ λ^(n) and P(g(Y^n) ≠ W) ≤ λ^(n)
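These quantities can be computed exactly for small codes by enumerating Y^n. The sketch below assumes a (2, 3) repetition code on a BSC with crossover probability 0.1 (neither is specified on this slide):

```python
# Exact lambda_i, lambda^(n), and P_e^(n) for a small (M, n) code on a BSC,
# by enumerating all y^n in Y^n. Assumed example: (2, 3) repetition code,
# alpha = 0.1.
from itertools import product

ALPHA = 0.1
codewords = {1: (0, 0, 0), 2: (1, 1, 1)}             # f(i)

def decode(yn):
    """g(y^n): majority vote."""
    return 2 if sum(yn) >= 2 else 1

def p_given(yn, xn):
    """p(y^n | x^n) = prod_i p(y_i | x_i) for the BSC."""
    p = 1.0
    for y, x in zip(yn, xn):
        p *= ALPHA if y != x else 1 - ALPHA
    return p

lam = {i: sum(p_given(yn, xn) for yn in product((0, 1), repeat=3)
              if decode(yn) != i)
       for i, xn in codewords.items()}
print(lam)                                   # conditional error probabilities
print(max(lam.values()))                     # lambda^(n)
print(sum(lam.values()) / len(lam))          # P_e^(n)
```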


Channel Coding

The rate of an (M,n) code

R = (log2 M) / n

A rate R is achievable if there is a sequence of (⌈2^{nR}⌉, n) codes such that

lim_{n→∞} λ^(n) = 0

The (operational) capacity of a channel is

C_oper = sup{R : R is achievable}

The channel coding theorem essentially states that

C_oper = C = max_{p(x)} I(X;Y)


Channel Coding: Examples

Consider a quaternary source: M = 4, thus log2 M = 2 bits.

Using a binary noiseless channel (C = 1 bit/transmission), we need n = 2 transmissions to send each symbol:

R = (log2 M) / n = 2/2 = 1 bit/transmission

Is this rate achievable? Yes, because the channel is noiseless.

What if C < 1? Rate 1 is no longer achievable!

Using a noiseless quaternary channel (|X| = |Y| = 4): we only need n = 1 transmission,

R = (log2 M) / 1 = 2 bits/transmission

Is this rate achievable? Yes, because the channel is noiseless: C = 2.


Channel Coding: Example

Consider a binary symmetric channel whose crossover probability α gives C = 1 − H(α, 1 − α) ≈ 0.21 bits/transmission.

Thus, R = 0.25 is not achievable; R = 0.2 is achievable.

Examples of sequences (⌈2^{nR}⌉, n), for R = 0.2:

(2^1, 5), (2^2, 10), (2^3, 15), ...; e.g., use 10-bit codewords to send 2 bits

Examples of sequences (⌈2^{nR}⌉, n), for R = 0.25:

(2^1, 4), (2^2, 8), (2^3, 12), ...; e.g., use 8-bit codewords to send 2 bits

The channel coding theorem states that

– There is a sequence of (⌈2^{0.2 n}⌉, n) codes s.t. lim_{n→∞} λ^(n) = 0.

– For any sequence of (⌈2^{0.25 n}⌉, n) codes, lim_{n→∞} λ^(n) ≠ 0.


Channel Coding: Example

The noisy typewriter is a simple example of the theorem.

Capacity: C = 1 bit/transmission.

Input and output alphabets X = Y = {1, 2, 3, 4}.

In this case, R = C = 1 is achievable.

Codes (2^n, n) have λ^(n) = 0, for any n.

Encoder (for n = 1, thus M = 2^n = 2): f(1) = 1; f(2) = 3.

Decoder: g(1) = g(2) = 1; g(3) = g(4) = 2.

Since λ^(n) = 0, C = 1 is the zero-error capacity.
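A short simulation of this zero-error code, assuming the usual noisy-typewriter model in which each input goes to itself or to the next symbol (mod 4) with probability 1/2:

```python
# The zero-error code from the slide: f(1)=1, f(2)=3, g(1)=g(2)=1, g(3)=g(4)=2.
import random

f = {1: 1, 2: 3}                       # encoder (n = 1)
g = {1: 1, 2: 1, 3: 2, 4: 2}           # decoder

def typewriter(x):
    """Noisy typewriter on {1, 2, 3, 4}: x -> x or its successor, w.p. 1/2."""
    return x if random.random() < 0.5 else x % 4 + 1

errors = 0
for _ in range(10000):
    w = random.randint(1, 2)
    if g[typewriter(f[w])] != w:
        errors += 1
print(errors)    # always 0: the two output "fans" {1,2} and {3,4} never overlap
```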


Asymptotic Equipartition: Motivation 1

Toss a fair coin 100 times: X1, ..., X100 ∈ {0, 1}.

...there are 2^100 ≈ 1.27×10^30 possible outcomes,

...each with probability 2^−100 ≈ 7.9×10^−31

The overwhelming majority has close to 50/50 heads/tails

How overwhelming? Let S = X1 + ··· + X100,

P(S ∈ {47, ..., 53}) = 2^−100 ∑_{j=47}^{53} C(100, j) ≈ 0.52

How many sequences are in this set?

|{(x1, ..., x100) : S ∈ {47, ..., 53}}| = ∑_{j=47}^{53} C(100, j) ≈ 6.54×10^29

...fraction of the total: ≈ 6.54×10^29 / 2^100 ≈ 0.52


Asymptotic Equipartition: Motivation 2

Unfair coin (P(heads) = P(Xi = 1) = 0.9), 100 tosses: X1, ..., X100.

...there are 2^100 ≈ 10^30 possible outcomes,

The overwhelming majority has close to 90/10 heads/tails

How overwhelming? Let S = X1 + ··· + X100,

P(S ∈ {87, ..., 93}) = ∑_{j=87}^{93} C(100, j) 0.9^j 0.1^{100−j} ≈ 0.76

How many sequences are in this set?

|{(x1, ..., x100) : S ∈ {87, ..., 93}}| = ∑_{j=87}^{93} C(100, j) ≈ 8.3×10^15

...fraction of the total: ≈ 8.3×10^15 / 2^100 ≈ 6.5×10^−15
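The counts on this slide and the previous one can be reproduced exactly with binomial coefficients; a small sketch:

```python
# Exact versions of the counts on the two motivation slides (100 tosses of a
# fair coin, and of a coin with P(heads) = 0.9).
from math import comb

total = 2 ** 100
fair_count = sum(comb(100, j) for j in range(47, 54))
print(fair_count, fair_count / total)          # ~6.54e29, P ~ 0.52

biased_count = sum(comb(100, j) for j in range(87, 94))
biased_prob = sum(comb(100, j) * 0.9 ** j * 0.1 ** (100 - j)
                  for j in range(87, 94))
print(biased_count, biased_count / total)      # ~8.3e15, fraction ~6.5e-15
print(biased_prob)                             # ~0.76
```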


Asymptotic Equipartition: Law of Large Numbers

Consider X1,...,Xn i.i.d. with E[Xi] = µ.

Weak law of large numbers (WLLN) (Bernoulli, 1713)

lim_{n→∞} (1/n)(X1 + ··· + Xn) = µ  (in probability)

Applying to log p(X1, ..., Xn),

−(1/n) log p(X1, ..., Xn) = −(1/n) ∑_{i=1}^n log p(Xi) → E[−log p(X)] = H(X), as n → ∞

This convergence is in probability: for any ε > 0,

lim_{n→∞} P[ |−(1/n) log p(X1, ..., Xn) − H(X)| < ε ] = 1


Asymptotic Equipartition Property (AEP)

Definition: for n i.i.d. samples x1, ..., xn of X ∈ X, the set of ε-typical sequences A_ε^(n) (called the typical set) is

A_ε^(n) = {(x1, ..., xn) : |−(1/n) log p(x1, ..., xn) − H(X)| ≤ ε}

The condition can also be written as

2^{−n(H(X)+ε)} ≤ p(x1, ..., xn) ≤ 2^{−n(H(X)−ε)}

AEP theorem: for any ε > 0 and n sufficiently large,

P[(x1, ..., xn) ∈ A_ε^(n)] ≥ 1 − ε

(1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X)+ε)}


AEP Corollary

AEP theorem: for any ε > 0 and n sufficiently large,

P[(x1, ..., xn) ∈ A_ε^(n)] ≥ 1 − ε

(1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X)+ε)}

Corollary: if H(X) < log |X|, then for a sufficiently small ε,

lim_{n→∞} |A_ε^(n)| / |X^n| ≤ lim_{n→∞} 2^{n(H(X)+ε−log |X|)} = 0,

if ε < log |X| − H(X).

Typical set: vanishingly small volume with arbitrarily high probability.


AEP: Example 1

Toss a fair coin n times: X1, ..., Xn ∈ {0, 1}; H(X) = 1

...there are 2^n possible outcomes,

...each with probability p(x1, ..., xn) = 2^−n

With ε = 0.02,

A_0.02^(n) = {(x1, ..., xn) : 2^{−1.02 n} ≤ p(x1, ..., xn) ≤ 2^{−0.98 n}} = {0, 1}^n

0.98 · 2^{0.98 n} ≤ |A_0.02^(n)| ≤ 2^{1.02 n}   (in fact, |A_0.02^(n)| = 2^n)

1 = P[(x1, ..., xn) ∈ A_0.02^(n)] ≥ 0.98

For maximum entropy (H(X) = 1), the AEP is uninformative.


AEP: Example 2

Toss an unfair coin (probability of heads 0.8) n times: X1, ..., Xn.

Entropy: H(X) ≈ 0.72

With ε = 0.02,

A_0.02^(n) = {(x1, ..., xn) : 2^{−0.74 n} ≤ p(x1, ..., xn) ≤ 2^{−0.70 n}}

|A_0.02^(n)| ≤ 2^{0.74 n}

P[(x1, ..., xn) ∈ A_0.02^(n)] ≥ 0.98, for n large enough

For non-maximum entropy, the AEP is very informative:

|A_0.02^(n)| / |{0, 1}^n| ≤ 2^{−0.26 n}   (e.g., for n = 100, 2^{−26} ≈ 10^−8)
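Because typicality for a binary i.i.d. source depends only on the number of ones, |A_ε^(n)| and P[A_ε^(n)] can be computed exactly from binomial terms. A sketch for this example (p = 0.8, ε = 0.02); note that the probability bound ≥ 1 − ε only kicks in for fairly large n:

```python
# Size and probability of the typical set for a Bernoulli(0.8) source with
# eps = 0.02: a sequence is typical iff its number of ones k satisfies
# |H_hat - H| <= eps, where H_hat = -(k/n)log2(p) - ((n-k)/n)log2(1-p).
# Log-domain arithmetic avoids underflow for large n.
from math import log2, lgamma, log, exp

p, eps = 0.8, 0.02
H = -p * log2(p) - (1 - p) * log2(1 - p)          # ~0.7219 bits

def log_comb(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

for n in (100, 1000, 10000):
    ks = [k for k in range(n + 1)
          if abs(-(k / n) * log2(p) - ((n - k) / n) * log2(1 - p) - H) <= eps]
    # log2 |A|: factor out the largest binomial term (smallest k here).
    log2_size = log2(sum(exp(log_comb(n, k) - log_comb(n, ks[0]))
                         for k in ks)) + log_comb(n, ks[0]) / log(2)
    prob = sum(exp(log_comb(n, k) + k * log(p) + (n - k) * log(1 - p))
               for k in ks)
    # |A| stays below 2^{n(H+eps)}; P[A] approaches 1 only for large n.
    print(n, round(log2_size, 1), round(n * (H + eps), 1), round(prob, 3))
```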


Interlude: AEP and Source Coding

Source X ∈ X; order-n extension of the source: X^n = (X1, ..., Xn) ∈ X^n.

Coding method: given a sequence (x1, ..., xn),

– if (x1, ..., xn) ∈ A_ε^(n), code it using ⌈n(H(X) + ε)⌉ bits;

...enough, because |A_ε^(n)| ≤ 2^{n(H(X)+ε)}

– if (x1, ..., xn) ∉ A_ε^(n), code it using ⌈log |X^n|⌉ = ⌈n log |X|⌉ bits;

...enough, because there are at most |X^n| = |X|^n such sequences

– to distinguish the two cases, use a 1-bit prefix.

Length of this coding scheme:

l_C(x1, ..., xn) = 1 + ⌈n(H(X) + ε)⌉,  if (x1, ..., xn) ∈ A_ε^(n)
l_C(x1, ..., xn) = 1 + ⌈n log |X|⌉,    if (x1, ..., xn) ∉ A_ε^(n)


Interlude: AEP and Source Coding

Length of the coding scheme in the previous slide:

l_C(x1, ..., xn) < 2 + n(H(X) + ε),  if (x1, ..., xn) ∈ A_ε^(n)
l_C(x1, ..., xn) < 2 + n log |X|,    if (x1, ..., xn) ∉ A_ε^(n)

Expected length L[C] = E[l_C(X1, ..., Xn)] (in bits per n symbols): for 0 < ε ≪ 1 and sufficiently large n, using P[A_ε^(n)] ≤ 1 and 1 − P[A_ε^(n)] ≤ ε,

L[C] < P[A_ε^(n)] (2 + n(H(X) + ε)) + (1 − P[A_ε^(n)]) (2 + n log |X|)
     ≤ 2 + n(H(X) + ε) + ε (2 + n log |X|)
     = 2(1 + ε) + n (H(X) + ε + ε log |X|)

Normalizing to bits/symbol, i.e., dividing by n:

L[C]/n ≤ H(X) + ε + ε log |X| + 2(1 + ε)/n,

...that is, L[C]/n can be arbitrarily close to H(X).
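A Monte Carlo sketch of the average length of this two-part code for a Bernoulli(0.8) source (the source, ε = 0.05, and n = 2000 are assumptions for illustration, not values from the slide); only code lengths are measured, no index tables are built:

```python
# Average length (bits/symbol) of the two-part typical-set code for a
# Bernoulli(0.8) source. Assumed parameters: eps = 0.05, n = 2000.
import random
from math import log2, ceil

p, eps, n, trials = 0.8, 0.05, 2000, 200
H = -p * log2(p) - (1 - p) * log2(1 - p)

def code_length(xs):
    neg_log_p = sum(-log2(p) if x else -log2(1 - p) for x in xs)
    typical = abs(neg_log_p / n - H) <= eps
    if typical:
        return 1 + ceil(n * (H + eps))   # 1-bit flag + index into A_eps^(n)
    return 1 + n                         # 1-bit flag + raw n-bit description

avg = sum(code_length([random.random() < p for _ in range(n)])
          for _ in range(trials)) / trials
print(avg / n, H)    # bits/symbol vs. H(X): close to H(X) + eps for large n
```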


Channel Coding Theorem

Consider a discrete memoryless channel with capacity C. Then,

1) Any R < C is achievable: there exist sequences of (⌈2^{nR}⌉, n) codes such that lim_{n→∞} λ^(n) = 0.

2) Any sequence of (⌈2^{nR}⌉, n) codes with lim_{n→∞} λ^(n) = 0 must have R ≤ C.

Intuition: for large n, every channel looks like a noisy typewriter.


Channel Coding Theorem: Overview of the Proof

Take p*(x) = arg max_{p(x)} I(X;Y), and p(x^n) = ∏_{i=1}^n p*(xi)

For every x^n, consider H(Y^n|X^n = x^n); the conditional typical set is

A_ε^(n)(x^n) = {y^n : |−(1/n) log p(y^n|x^n) − H(Y|X)| ≤ ε}

For arbitrarily small ε and large n, the AEP states that

|A_ε^(n)(x^n)| ≈ 2^{n H(Y|X)}

The unconditional typical set is

A_ε^(n) = {y^n : |−(1/n) log p(y^n) − H(Y)| ≤ ε}

For arbitrarily small ε and large n, the AEP states that

|A_ε^(n)| ≈ 2^{n H(Y)}


Channel Coding Theorem: Overview of the Proof

To have (asymptotically as n→∞) error-free communication:

– the different A_ε^(n)(x^n) must be disjoint;

– all A_ε^(n)(x^n) must be inside A_ε^(n).

The maximum number of codewords we can have is thus

M = 2^{nR} ≤ |A_ε^(n)| / |A_ε^(n)(x^n)| ≈ 2^{n(H(Y)−H(Y|X))} = 2^{n I(X;Y)} ≤ 2^{nC}

...which leads to R ≤ C.

This was not a rigorous proof; if you're interested in the details, see the recommended reading.


Repetition Codes

Unlike for source coding (where we have Huffman codes), building capacity-approaching channel codes is harder.

Simplest code: repetition; e.g., (⌈2^{n/3}⌉, n) codes, rate R = 1/3.

For n = 3, we have (2, 3)-codes, thus M = 2 words, W ∈ {0, 1},

encoder f(0) = 000, f(1) = 111

decoder g(y^3) = arg min_{i∈{0,1}} dH(y^3, f(i)),

where dH is the Hamming distance (the number of bits in which the words differ): minimum distance decoding.

For higher n, we have (2^2, 6) codes (M = 4), ..., (2^5, 15) codes (M = 32), ...
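The word error probability of an r-fold repetition code with majority (minimum-distance) decoding can be written in closed form as P(Binomial(r, α) > r/2); a sketch with an assumed α = 0.1 shows that the error vanishes only as the rate R = 1/r goes to zero:

```python
# Exact error probability of an r-fold repetition code on a BSC with
# majority decoding. Assumed alpha = 0.1; odd r avoids ties.
from math import comb

alpha = 0.1
for r in (3, 5, 9, 15):
    p_err = sum(comb(r, k) * alpha ** k * (1 - alpha) ** (r - k)
                for k in range(r // 2 + 1, r + 1))
    print(r, 1 / r, p_err)       # repetitions, rate, error probability
```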


Error Correction and Error Detection

A binary encoder f : {1, ..., M} → {0, 1}^n defines a set of codewords:

{f(1), f(2), ..., f(M)}.

Minimum distance decoding of received word y^n:

g(y^n) = arg min_{i∈{1,...,M}} dH(y^n, f(i))

Minimum distance of the code:

dmin = min_{i≠j} dH(f(i), f(j))

Error correction: a code corrects up to ⌊(dmin − 1)/2⌋ errors.

Error detection: a code detects up to dmin − 1 errors.

Exercise: show that a repetition code corrects up to (1 − R)/(2R) errors and detects up to (1 − R)/R errors.
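A small sketch computing dmin and the resulting correction/detection guarantees for a given codebook (the 3-fold repetition code {000, 111} is used as an example):

```python
# Minimum distance of a codebook and the corresponding guarantees:
# corrects floor((dmin-1)/2) errors, detects dmin-1 errors.
from itertools import combinations

def d_hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

codebook = ["000", "111"]
dmin = min(d_hamming(a, b) for a, b in combinations(codebook, 2))
print(dmin, (dmin - 1) // 2, dmin - 1)   # dmin, correctable, detectable
```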


Hamming Codes

Binary linear codes are built on binary linear algebra.

Before proceeding, we need binary arithmetic: ({0, 1}, +, ×)

– addition: 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0;

– multiplication: 0 × 0 = 0, 0 × 1 = 0, 1 × 0 = 0, 1 × 1 = 1;

– both are clearly commutative: a + b = b + a and a × b = b × a;

– both are associative: (a + b) + c = a + (b + c) and (a × b) × c = a × (b × c);

– distributive property: a × (b + c) = a × b + a × c.

In binary arithmetic, a+ b = a− b.

Based on binary arithmetic, we may build binary linear algebra, withbinary vectors and matrices.

Can be extended to other Galois fields GF(q); e.g., GF(3): ternary arithmetic, with modulo-3 addition and multiplication.


Hamming Codes

Generalizes the idea of parity check for error detection/correction.

A Hamming(n, k) code is (in the previous notation) a (2^k, n) code.

Rate of a Hamming(n, k) code: R = k/n.

Classical example: Hamming(7, 4) generator matrix:

G = [ 1 0 0 0 1 1 0
      0 1 0 0 1 0 1
      0 0 1 0 0 1 1
      0 0 0 1 1 1 1 ] = [ I4 | A ]

Generation of codeword x from message: example m = (1101):

x = mG = (1101)G = (1101100)

where the vector-matrix product is in binary arithmetic.
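A sketch of this GF(2) encoding, reproducing (1101) → (1101100) with the generator matrix above:

```python
# GF(2) encoding with the Hamming(7,4) generator matrix from the slide:
# x = mG with all arithmetic mod 2.
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

def encode(m):
    """Row vector times G over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

print(encode([1, 1, 0, 1]))     # [1, 1, 0, 1, 1, 0, 0]
```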


Hamming Codes

Generation of codeword x from message: example m = (1101):

x = mG = (1101)G = (1101100)

Checking codewords: a parity-check matrix H such that

H G^T = 0  ⇒  H x^T = H (mG)^T = H G^T m^T = 0

For G = [ I4 | A ], take H = [ A^T | I3 ]; then

H G^T = A^T I4 + I3 A^T = A^T + A^T = 0

For the matrix G in the previous slide,

H = [ A^T | I3 ] = [ 1 1 0 1 1 0 0
                     1 0 1 1 0 1 0
                     0 1 1 1 0 0 1 ]

...the columns are the (2^3 − 1 = 7) 3-bit binary words, except (000).


Hamming Codes

Let x + e be a received codeword, with error vector e.

Checking: H(x + e)^T = H x^T + H e^T = H e^T   (since H x^T = 0)

No errors are detected if and only if H e^T = 0. Possible cases:

– zero errors: H e^T = 0;

– one error: H e^T ≠ 0; it is one of the columns of H;

– two errors: H e^T ≠ 0; it is the sum of two different columns of H, which is never zero because all columns are distinct.

Any two errors are detected, but three errors may go undetected, since the sum of any two columns equals another column.

Exercise: show that for a Hamming(7, 4) code, dmin = 3; thus it corrects 1 error and detects up to 2 errors.


Hamming Codes

Minimum distance of Hamming(7, 4) code is 3.

Thus it can correct 1 error; how?

Permute the columns of H (and similarly of G) into

H = [ 0 0 0 1 1 1 1
      0 1 1 0 0 1 1
      1 0 1 0 1 0 1 ]

Check x + e, assuming only one error in position, say 5,

(H(xT + eT ))T = eHT = (0000100)HT = (101)

...precisely the binary word for 5, the error position.
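A sketch of this single-error correction with the permuted H: the syndrome of the received word is read as a binary number giving the error position (the particular codeword used below is an arbitrary choice):

```python
# Single-error correction with the permuted parity-check matrix from the
# slide: the syndrome H e^T reads out the error position in binary.
H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def syndrome(v):
    return [sum(h * vi for h, vi in zip(row, v)) % 2 for row in H]

# The 16 codewords of this (permuted) Hamming(7,4) code: zero-syndrome words.
codewords = [[(w >> b) & 1 for b in range(7)]
             for w in range(128)
             if not any(syndrome([(w >> b) & 1 for b in range(7)]))]

x = codewords[5]                        # any codeword
y = x.copy()
y[4] ^= 1                               # flip bit at (1-indexed) position 5
s = syndrome(y)
pos = s[0] * 4 + s[1] * 2 + s[2]        # read the syndrome as a binary number
print(s, pos)                           # [1, 0, 1] -> position 5
y[pos - 1] ^= 1                         # correct the error
print(y == x)                           # True
```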


Hamming Codes

General Hamming(n, k) codes.

For some r ≥ 2: n = 2^r − 1 and k = 2^r − r − 1.

The Hamming(7, 4) code: r = 3, n = 2^3 − 1 = 7, k = 2^3 − 3 − 1 = 4.

Columns of H: all n = 2^r − 1 nonzero binary words of r bits.

Put H in systematic form H = [ A^T | I_r ] and build G = [ I_k | A ].

Rate: R = k/n = (2^r − r − 1)/(2^r − 1)

Exercise: show that, for any r, dmin = 3.

Remarkably, lim_{r→∞} (2^r − r − 1)/(2^r − 1) = 1

The repetition code also has minimum distance 3, but R = 1/3.

Error-correcting codes are a huge R&D area, without which modern communications would not be possible.


Exercises

Show that the repetition code of rate R = 1/3 is a Hamming(n, k) code. Find r, n, k, and the matrices H and G.

Consider a Hamming(7, 4) code in systematic form. Decode the word (1011011).

A Hamming code is a particular case of the more general family of linear codes, i.e., codes whose codewords are generated as x = mG. Show that for any binary linear code,

a) the zero word is a valid codeword;

b) dmin equals the weight (number of 1s) of the minimum-weight nonzero codeword.

Assuming a Hamming(7, 4) code is used on a BSC with probability of error α, what is the probability of an erroneous decoding?


Gaussian Channel

Gaussian channel: X = Y = R, Y = X + Z, with Z ~ N(0, N) independent of X.

Mutual information:

I(X;Y) = h(Y) − h(Y|X)
       = h(Y) − h(X + Z|X)
       = h(Y) − h(Z)
       = h(Y) − (1/2) log(2πeN),

since differential entropy is shift-invariant, and Z is Gaussian and independent of X.

Since adding a constant to X does not affect I(X;Y), assume E[X] = 0; thus var[X] = E[X^2] is the power.


Gaussian Channel

Gaussian channel: X = Y = R, Y = X + Z.

Mutual information:

I(X;Y) = h(Y) − (1/2) log(2πeN)
       ≤ (1/2) log(2πe(N + E[X^2])) − (1/2) log(2πeN)
       = (1/2) log(1 + E[X^2]/N)

Without a constraint on E[X^2], I(X;Y) can be arbitrarily large.

With a power constraint E[X^2] ≤ P,

C = max_{f_X : E[X^2] ≤ P} (1/2) log(1 + E[X^2]/N) = (1/2) log(1 + P/N),

achieved for f_X = N(0, P). P/N = SNR, the signal-to-noise ratio.
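For reference, a few values of C = (1/2) log2(1 + P/N) in bits per channel use (the SNR values below are arbitrary examples, not from the slide):

```python
# Gaussian channel capacity C = (1/2) log2(1 + P/N) for a few assumed SNRs.
from math import log2

for snr_db in (0, 10, 20, 30):
    snr = 10 ** (snr_db / 10)            # P/N on a linear scale
    print(snr_db, "dB ->", 0.5 * log2(1 + snr), "bits/use")
```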


Coding for a Gaussian Channel

An (M,n) code for a Gaussian channel, under power constraint P .

– a set of message indices W ∈ {1, ..., M};

– an encoder f : {1, ..., M} → R^n, i.e., f(i) = [f1(i), ..., fn(i)] ∈ R^n, with

‖f(i)‖^2 = ∑_{j=1}^n fj(i)^2 ≤ nP;

– a decoder g : R^n → {1, ..., M}.

Conditional, average, and maximum probability of error (λ_i, P_e^(n), λ^(n)) are defined as in the discrete channel.

Rate R is achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes satisfying the power constraint P, such that lim_{n→∞} λ^(n) = 0.

The (operational) capacity is: C_oper = sup{R : R is achievable}.

The Gaussian channel theorem: C_oper = C.


Coding for a Gaussian Channel

Outline of the proof of the Gaussian channel theorem.

We know that Y^n = X^n + Z^n and that E[‖X^n‖^2] ≤ nP; thus E[‖Y^n‖^2] ≤ n(P + N).

All the received vectors lie, with high probability (w.h.p.), in a sphere of radius √(n(N + P)).

Each received vector lies, w.h.p., in a sphere of radius √(nN) around the transmitted codeword f(i).

The volume of a radius-r sphere in n dimensions is V(r) = c_n r^n.

The maximum number of (asymptotically) non-intersecting spheres is

M = 2^{nR} ≤ (n(N + P))^{n/2} / (nN)^{n/2} = 2^{(n/2) log((P+N)/N)} = 2^{(n/2) log(1 + P/N)}

...thus R ≤ C.


Sphere Packing

This picture becomes accurate for large n, since "in high dimensions, Gaussian distributions are soap bubbles." [1]

[1] www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/


Exercises

Consider the multi-path channel with two outputs Y1 = X + Z1 and Y2 = X + Z2 for the same input X, where the noises Z1 and Z2 follow a joint Gaussian probability density function with zero mean and covariance matrix

K = σ^2 [ 1 ρ
          ρ 1 ]

where σ^2 is the noise variance and ρ the correlation coefficient. Find the capacity of the channel. What is the capacity for ρ = 1, ρ = 0, ρ = −1? Interpret the results.

Continuous channel with discrete input: consider a channel with input X ∈ {0, 1} and output Y = X + Z, where Z is uniform on [0, a]. Assuming a > 1, find the capacity of the channel. Repeat for a < 1 and interpret the result.


Recommended Reading

T. Cover and J. Thomas, "Elements of Information Theory", John Wiley & Sons, 2006 (Sections 7.1 to 7.6, 7.11, 9.1).

https://en.wikipedia.org/wiki/Hamming_code
