ECE 723 – Information Theory and Coding

Dr. Steve Hranilovic
Department of Electrical and Computer Engineering
McMaster University
Email: [email protected]

Winter 2004

Portions of this document are © Copyright by Steve Hranilovic 2004.


Department of Electrical and Computer Engineering, McMaster University
ECE 723 – Winter 2004
Course Outline

Information Theory and Coding
Instructor: Dr. Steve Hranilovic
Email: [email protected]
Web: http://www.ece.mcmaster.ca/~hranilovic
Lectures: Thursdays 9am-12pm (starting Jan. 8th), CRL/B102
Office Hours: By appointment in office ITB-147.

Course Description: This course will provide an introductory look into the broad areas of information theory and coding theory. As stated in the course text,

Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H) and what is the ultimate transmission rate of communication (answer: the channel capacity C).

In addition to discussing these fundamental performance limits, the course will also present coding techniques which approach these limits.

Tentative Outline (time permitting):

• Entropy: entropy, relative entropy, mutual information, chain rules, data processing inequality, the asymptotic equipartition property, entropy rates for stochastic processes.

• Data Compression: the source coding theorem, Kraft inequality, Shannon-Fano codes, Huffman codes, universal source codes.

• Channel Capacity: discrete channels, random coding bound and converse, Gaussian channels, parallel Gaussian channels and “water-pouring”, bandlimited channels.

• Error Control Coding: linear block codes and their properties, hard-decision decoding, convolutional codes, Viterbi decoding algorithm, iterative decoding.

Course Text: T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991. (on reserve, Thode Library)

Reference Texts: S.B. Wicker, Error Control Systems for Digital Communication and Storage, Prentice-Hall, 1995. (on reserve, Thode)
R.G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, 1968. (on reserve, Thode)

Assessment: Mini-Project I – 15%, Midterm – 30%, Mini-Project II – 15%, Final – 40%.

Permitted Aids: Course notes, handwritten class notes and a non-programmable calculator will be permitted for the midterm and final exams.


Policy Reminders

Senate and the Faculty of Engineering require all course outlines to include the following reminders:

The Faculty of Engineering is concerned with ensuring an environment that is free of all adverse discrimination. If there is a problem that cannot be resolved by discussion among the persons concerned, individuals are reminded that they should contact the Department Chair, the Sexual Harassment Officer or the Human Rights Consultant as soon as possible.

Students are reminded that they should read and comply with the Statement on Academic Ethics and the Senate Resolutions on Academic Dishonesty as found in the Senate Policy Statements distributed at registration and available in the Senate Office.

Academic dishonesty consists of misrepresentation by deception or by other fraudulent means and can result in serious consequences, e.g. the grade of zero on an assignment, loss of credit with a notation on the transcript (notation reads: “Grade of F assigned for academic dishonesty”), and/or suspension or expulsion from the university.

It is your responsibility to understand what constitutes academic dishonesty. For information on the various kinds of academic dishonesty please refer to the Academic Integrity Policy, specifically Appendix 3, located at

http://www.mcmaster.ca/senate/academic/ac_integrity.htm

The following illustrates only three forms of academic dishonesty:

1. Plagiarism, e.g. the submission of work that is not one’s own or for which other credit has been obtained.

2. Improper collaboration in group work.

3. Copying or using unauthorized aids in tests and examinations.


Note to the Reader:

These lecture notes provide a brief overview of the topics covered in a one semester (12 week) course on information theory and coding intended for first year graduate students.

I have drawn on a number of sources to produce these course notes. These notes are based on the course text by Cover and Thomas, Gallager’s classic text on information theory, course notes provided by Prof. Frank R. Kschischang of the University of Toronto, as well as other sources which are referenced in the text.

These course notes are a perpetual work in progress. Please report any typographical or other errors to the author by email.

Note that these lecture notes are not intended to replace the course text or the references. They merely serve as a starting point for learning and applying the material. Students are encouraged to read the course text or other references in information and coding theory to gain a more complete understanding of the area.

I hope you find these notes useful in your studies of information theory and coding.

Steve Hranilovic
January 8, 2004.


Chapter 1

Introduction

Please refer to the course outline to get details of the course administration, grading, and lecture times. Also, please consult the course homepage [1] regularly to download course resources, references, and course projects, as well as to be aware of recent issues in the course.

This chapter presents an overview of the topics covered in the course. It provides motivation for further study and places the work in the context of communications theory in general.

[1] http://www.ece.mcmaster.ca/~hranilovic/teaching/ece723/ece723.html


1.1 A Short History

This is a brief history of the events leading up to and following Shannon’s pioneering work [2]. In no way is this list meant to be comprehensive; it is merely to establish the context of the discoveries of Shannon and others which are presented in this course.

1838 Samuel Morse and Alfred Vail – code book derived to assign sequences of long and short electrical current pulses to each letter of the alphabet. Their intuition was to assign the shortest sequences to the most frequently used letters. Relative frequencies of letters were estimated by counting the number of types in the various compartments of a printer’s toolbox. Within 15% of optimal code rate.

1858 Discovery of noise and limited transmit power – It is conjectured that large transmit voltages caused the failure of the first transatlantic telegraph cable.

1874 Thomas Edison – “quadruplex” telegraph system used two intensities of current as well as two directions. Rate increases to two bits per symbol.

1876 Alexander Graham Bell demonstrates the telephone at the Centennial Exhibition in Philadelphia.

1924 Harry Nyquist – Nyquist rate and reconstruction of bandlimited signals from their samples. Also stated the formula R = K log m, where R is the rate of transmission, K is a measure of the number of symbols per second and m is the number of message amplitudes available. The amount of information that can be transmitted is proportional to the product of bandwidth and time of transmission.

1928 R.V.L. Hartley (inventor of the oscillator) – In a paper entitled “Transmission of Information”, proposed the formula H = n log s, where H is the “information” of the message, s is the number of possible symbols, and n is the length of the message in symbols.

[2] This historical time-line is adapted from the introductory chapters of the books by John R. Pierce, entitled An Introduction to Information Theory: Symbols, Signals and Noise (second edition, Dover Publications Inc., New York, NY, 1980), and John G. Proakis’ book, Digital Communications (fourth edition, McGraw-Hill, Boston, MA, 2001).

McMaster University ECE 723 – Information Theory and Coding

Page 7: ECE 723 – Information Theory and Coding

CHAPTER 1. INTRODUCTION 7

1938 C.E. Shannon, in his Master’s thesis A Symbolic Analysis of Relay and Switching Circuits at MIT, makes the link for the first time between Boolean algebra and the construction of logic circuits.

During WWII Wiener and Kolmogorov – optimal linear filter for the estimation of a signal in the presence of additive noise.

1948 C.E. Shannon – efficient source representation, reliable information transmission, digitization – foundation of communication and information theory. Made the startling discovery that arbitrarily reliable communications are possible at non-zero rates. Prior to Shannon, it was conventionally believed that in order to get arbitrarily low probability of error, the transmission rate must go to zero. His Mathematical Theory of Communication reconciled the work of a vast number of researchers and proved to be the foundation of modern communications theory.

1950 R. Hamming – developed a family of error-correcting codes to mitigate channel impairments.

1952 D. Huffman – efficient source encoding.

1950-60’s Muller, Reed, Solomon, Bose, Ray-Chaudhuri, Hocquenghem – Algebraic Codes.

1970’s Wozencraft, Reiffen, Fano, Forney, Viterbi – Convolutional Codes.

1980’s Ungerboeck, Forney, Wei – coded modulation.

1990’s Berrou, Glavieux, Gallager – near capacity achieving coding schemes: Turbo Codes, Low-Density Parity Check Codes and Iterative Decoding.


1.2 Course Overview

Generalized communication system:

[Block diagram: Information Source → Transmitter → Channel (with Noise) → Receiver → Destination.]

• Information Source – This consists of any source of data we wish to transmit or store.

• Transmitter – Performs the task of mapping the data source to the channel alphabet in an efficient manner.

• Receiver – Performs mapping from channel to data to ensure “reliable” reception.

• Destination – Data sink.

Question: Under what conditions can the output of the source be conveyed reliably to the destination? What is reliable? Low prob. of error? Low distortion (high fidelity)?


Expanded communication system:

[Block diagram: Source → Source Encoder → Channel Encoder → Channel (with Noise) → Channel Decoder → Source Decoder → Destination. The source and channel encoders form the transmitter; the channel and source decoders form the receiver. The source is characterized by its entropy (H) and the channel by its capacity (C).]

Results:

• Shannon realized that the information source can be represented by a universal representation termed the bit.

• No loss in optimality in separation of source and channel codes (Joint Source-Channel Coding Theorem).


Source Encoder

• map from source to bits

• “matched” to the information source

• Goal is to get an efficient representation of the source (i.e., least number of bits per second, minimum distortion, etc.)

Channel Encoder

• map from bits to channel

• depends on channel available (channel model, bandwidth, noise, distortion, etc.)

  – In communications theory we work with hypothetical channels which in some way capture the essential features of the physical world.

• Goal is to get reliable transmission

1.2.1 Source Encoder

Goal: To achieve an economical representation (i.e., small number of binary symbols) of the source on average.

Example: An urn contains 8 numbered balls. One ball is selected. How many binary symbols are required to represent the outcome?

Outcome:        1   2   3   4   5   6   7   8
Representation: 000 001 010 011 100 101 110 111

Answer: Require 3 binary symbols to represent any given outcome.
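A minimal sketch of this counting argument (in Python; the variable names are mine): for M equally likely outcomes, a fixed-length representation needs ⌈log2 M⌉ binary symbols per outcome.

```python
import math

M = 8  # number of equally likely outcomes (the numbered balls in the urn)

# A fixed-length representation needs ceil(log2(M)) binary symbols per outcome.
bits_needed = math.ceil(math.log2(M))
print(bits_needed)  # -> 3

# Enumerate the 3-bit codewords assigned to outcomes 1..8.
codewords = {i + 1: format(i, f"0{bits_needed}b") for i in range(M)}
print(codewords)  # {1: '000', 2: '001', ..., 8: '111'}
```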


Example: Consider a horse race with 8 horses. It was determined that the probability of horse i winning is

Pr[horse i wins] = (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

Answer 1: Let’s try the code of the previous example.

Outcome   Probability   Representation 1
0         1/2           000
1         1/4           001
2         1/8           010
3         1/16          011
4         1/64          100
5         1/64          101
6         1/64          110
7         1/64          111

Note that we require 3 binary symbols to represent any given outcome. Note also that the average number of binary symbols to represent a given outcome is also ℓ̄1 = 3 binary symbols.

Answer 2: What if we allow the length of each representation to vary amongst the outcomes? For example, a Huffman code for this source would give the following representation:

Outcome   Probability   Representation 2
0         1/2           0
1         1/4           10
2         1/8           110
3         1/16          1110
4         1/64          111100
5         1/64          111101
6         1/64          111110
7         1/64          111111


Note: Each bit of each codeword can be thought of as asking a “yes-or-no” question about the outcome. Bit 1 asks the question, “Did horse 0 win?” (0 = yes and 1 = no). Similarly for the remaining bits. We will see this again when we discuss Huffman source codes in greater detail.

The average number of binary symbols required to represent the source, ℓ̄2, can be computed as

ℓ̄2 = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + (4/64)(6) = 2 binary symbols

which is less than ℓ̄1 = 3 binary symbols. In fact, the code above provides the minimum average codeword length of any representation for the source.

Definition: The source entropy, H(X), is defined as

H(X) = Σ_{x ∈ SX} p(x) log2(1/p(x))

bits. As we will show later in the course, the most economical representation has an average codeword length ℓ̄ satisfying

H(X) ≤ ℓ̄ < H(X) + 1

For the source considered in the example,

H(X) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + (4/64) log 64 = 2 bits

Thus, the above Huffman code for the source is optimal.
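A short numerical check (a sketch in Python; the probabilities and codeword lengths are those of the example above) that the average codeword length of the Huffman code equals the entropy:

```python
import math

# Winning probabilities for the 8 horses and the Huffman codeword lengths
# from Representation 2 above.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]

# Source entropy H(X) = sum of p(x) * log2(1/p(x)).
H = sum(p * math.log2(1 / p) for p in probs)

# Average codeword length of the variable-length (Huffman) code.
avg_len = sum(p * l for p, l in zip(probs, code_lengths))

print(H, avg_len)  # both evaluate to 2.0, so H(X) <= avg_len < H(X) + 1 holds with equality
```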

One time versus average performance

Note that although ℓ̄2 < ℓ̄1, the variance of the codeword lengths in Representation 2 is larger than in Representation 1.

If we had a “one-off” experiment where it was necessary to encode only a single outcome, it is possible that Representation 1 may outperform Representation 2.


Example:

In the English language the frequency of the letter “E” is approximately 13%, nearly independently of the writer (and it is 5 times more likely than the next most frequent letter). We can develop a coder for this case which minimizes the average number of bits required to represent the source, but it must be noted that this only minimizes the average number of bits required.

In 1939, E.V. Wright published a book of more than 50,000 words, entitled Gadsby, in which he did not use the letter “E” at all! He even avoided abbreviations such as Mr. and Mrs. which, when expanded, contain the letter “E”. Here is an excerpt:

Upon this basis I am going to show you how a bunch of bright young folks did find a champion; a man with boys and girls of his own; a man of so dominating and happy individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story about a small town. It is not a gossipy yarn; nor is it a dry, monotonous account, full of such customary “fill-ins” as “romantic moonlight casting murky shadows down a long, winding country road.” Nor will it say anything about tinklings lulling distant folds; robins carolling at twilight, nor any “warm glow of lamplight” from a cabin window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth as it is today; and a practical discarding of that worn-out notion that “a child don’t know anything.”

In this case the encoder designed to minimize the average representation length for conventional language would not do very well since the behavior of the text in Gadsby is not “typical”.

• Information theory and coding deal with the “typical” or expected behavior of the source.

• Entropy is a measure of the average uncertainty associated with the source.

• We will demonstrate later in the course that, for sequences of outcomes from a source, nearly all of the probability mass is contained in a relatively small set termed the typical set.


• The most likely outcomes need not be typical (Example: Bernoulli source), but there is a collection of low-probability events all with nearly the same probability.

• We will show that the entropy, H(X), is a measure of the size of the typical set, i.e., H(X) is combinatorial in nature.

1.2.2 Channel Encoder

Goal: To achieve an economical (high rate) and reliable (low probability of error) transmission of bits over a channel.

• With a channel code we add redundancy to the transmitted data sequence which allows for the correction of errors that are introduced by the channel.

• Alternatively, we can view a channel code as imposing a structure on the set of transmitted codewords which is exploited at the receiver to improve detection.

Example: Nature’s coding: DNA and RNA contain a coded recipe to synthesize proteins in organisms. Error protection in replication of DNA.


[Figure: the set of input vectors to the noisy channel and the set of all possible received vectors.]

• Each codeword which is transmitted is corrupted by the channel. Each transmitted codeword corresponds to a set of possible received vectors (set of “typical” outcomes).

• Specify a code (i.e., a set of codewords) so that at the receiver it is possible to distinguish which element was sent with high probability (i.e., the probability of overlap of regions is small).

• The channel coding theorem tells us the maximum number of such codewords we can define and still maintain completely distinguishable outputs.

Shannon’s Channel Coding Theorem: There is a quantity called the capacity, C, of a channel such that for every rate R < C there exists a sequence of (2^{nR}, n) codes (2^{nR} codewords, each using the channel n times) such that Pr[error] → 0 as n → ∞. Conversely, for any sequence of codes, if Pr[error] → 0 as n → ∞, then R ≤ C.


Example: Binary Symmetric Channel

[Figure: binary symmetric channel transition diagram, with inputs {0, 1} mapped to outputs {0, 1}; the bit is preserved with probability (1−p) and flipped with probability p. Accompanying plot: capacity C versus crossover probability p, largest at p = 0 and p = 1 and zero at p = 1/2.]

• Input channel alphabet = output channel alphabet = {0, 1}.

• Assume independent channel uses (i.e., no memory).

• The channel randomly flips the bit with probability p.

• For p = 0 or p = 1, C = 1 bit/channel use (noiseless channel or inversion channel).

• Worst case is p = 1/2, in which case the input and the output are statistically independent → C = 0 (see the sketch below).
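As a hedged aside, the standard closed form C = 1 − H_b(p) for the BSC (where H_b is the binary entropy function) is derived later in the course; a minimal Python sketch evaluating the capacity curve suggested by the figure:

```python
import math

def binary_entropy(p: float) -> float:
    """H_b(p) = -p log2(p) - (1-p) log2(1-p), with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of the binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
    print(f"p = {p:4.2f}  C = {bsc_capacity(p):.3f} bits/channel use")
# C = 1 at p = 0 and p = 1, and C = 0 at p = 1/2, matching the bullets above.
```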

Question: How do we devise codes which perform well on this channel?

Repetition Code

In this code, we take the bit to be transmitted and repeat it 2m + 1 times, for some integer m ≥ 0. The code consists of two possible codewords,

C = {000···0, 111···1}, each of length 2m + 1.

At the receiver, decoding is done by a majority voting scheme: if there are more 0’s than 1’s in the received codeword then declare 0 transmitted, else 1.


Consider the case when m = 1, i.e., codewords of length 3.

Received codewords         Received codewords
decoding to message ‘0’    decoding to message ‘1’
000                        111
001                        110
010                        101
100                        011

It is clear that as long as the number of flipped bits is less than half the length of the repetition codewords, it is possible to recover the message exactly.

As m → ∞ we would expect the channel to flip a proportion p < 1/2 of the bits. In fact, we will show that as m → ∞ it is possible to make the probability that more than a proportion p of the bits is flipped negligibly small (by the weak law of large numbers).

[Figure: probability distribution of the proportion of bits flipped; as m increases, the distribution concentrates around p, which lies below 1/2.]

Therefore, Pr[error] → 0 as m → ∞; however, the rate → 0 as well! Therefore, this is not an efficient code. Shannon demonstrated that there exist codes which are capacity achieving at non-zero rates.
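A small Monte Carlo sketch of this behaviour (in Python; the crossover probability p = 0.1 and the trial count are illustrative choices, not values from the notes), using majority-vote decoding of the repetition code over a BSC:

```python
import random

def simulate_repetition(m: int, p: float, trials: int = 20000) -> float:
    """Estimate Pr[error] for the length-(2m+1) repetition code on a BSC(p)
    with majority-vote decoding. By symmetry it suffices to send bit 0."""
    n = 2 * m + 1
    errors = 0
    for _ in range(trials):
        flips = sum(1 for _ in range(n) if random.random() < p)
        if flips > n // 2:  # majority of bits flipped -> decoder outputs 1
            errors += 1
    return errors / trials

random.seed(0)
for m in (1, 3, 10, 30):
    n = 2 * m + 1
    print(f"n = {n:3d}  rate = {1/n:.3f}  Pr[error] ~ {simulate_repetition(m, 0.1):.4f}")
# Pr[error] shrinks toward 0 as m grows, but the rate 1/(2m+1) shrinks as well.
```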

Hamming Code

Consider forming a code with a little more algebraic structure. Define a code as the span of a basis set over the binary field (i.e., take all additions modulo 2). A (7, 4) Hamming code is defined as the set of all linear combinations of the four rows of the generator matrix,

G = [ 1 1 1 0 0 0 0
      0 1 0 1 1 0 0
      0 0 1 1 0 1 0
      0 0 0 1 1 1 1 ]

i.e., for x ∈ {0, 1}^4 (binary vectors of length 4), every codeword is c = xG. Thus, the rate of this code is 4/7. The set of all possible codewords is

C = { 0000000  1110000  0101100  1011100
      0011010  1101010  0110110  1000110
      0001111  1111111  0100011  1010011
      0010101  1100101  0111001  1001001 }

Let c ∈ C be represented as c = (u1, u2, u3, u4, p1, p2, p3).

Decoding is done by multiplying the received vector by the parity check matrix, H. The code C is the kernel or null-space of H^T, that is, for c ∈ C, cH^T = 0. For the (7, 4) Hamming code presented above, the parity check matrix is

H = [ 1 0 1 1 1 0 0
      1 1 0 1 0 1 0
      0 1 1 1 0 0 1 ]

This decoding essentially solves the parity check equations given by each row of H, namely

1. u1 + u3 + u4 + p1 = 0

2. u1 + u2 + u4 + p2 = 0

3. u2 + u3 + u4 + p3 = 0

These equations can be represented in a Venn diagram. Consider all the possible single bit errors. Say p1 is in error. It is clear that only equation 1 will not be satisfied, while equations 2 and 3 will be. Additionally, if u2 is corrupted, then equation 1 will be satisfied and equations 2 and 3 will not be. In this way the Hamming code is able to detect and correct every single bit error.

[Venn diagram: three overlapping circles labelled Eqn. 1, Eqn. 2 and Eqn. 3. u4 lies in all three; u1 lies in Eqns. 1 and 2, u2 in Eqns. 2 and 3, u3 in Eqns. 1 and 3; p1, p2 and p3 lie only in Eqns. 1, 2 and 3 respectively.]

Thus, the (7, 4) Hamming code is a single error correcting code operating at rate 4/7 ≈ 0.57. By comparison, the single error correcting repetition code of length three operates at a rate of 1/3.
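A short verification sketch (in Python; the helper names are mine, while G and H are the matrices given above): it checks that every codeword satisfies cH^T = 0 and that syndrome decoding corrects any single-bit error.

```python
import itertools

# Generator and parity check matrices of the (7,4) Hamming code from the notes.
G = [[1,1,1,0,0,0,0],
     [0,1,0,1,1,0,0],
     [0,0,1,1,0,1,0],
     [0,0,0,1,1,1,1]]
H = [[1,0,1,1,1,0,0],
     [1,1,0,1,0,1,0],
     [0,1,1,1,0,0,1]]

def encode(x):
    """c = xG over GF(2)."""
    return [sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """s = rH^T over GF(2); s = 0 exactly when r is a codeword."""
    return tuple(sum(r[j] * H[k][j] for j in range(7)) % 2 for k in range(3))

# The syndrome of a single-bit error in position j is column j of H.
col_to_pos = {tuple(H[k][j] for k in range(3)): j for j in range(7)}

for x in itertools.product([0, 1], repeat=4):
    c = encode(list(x))
    assert syndrome(c) == (0, 0, 0)   # every codeword lies in the null space of H^T
    for j in range(7):                # try every possible single-bit error
        r = c[:]
        r[j] ^= 1
        s = syndrome(r)
        r[col_to_pos[s]] ^= 1         # flip the bit indicated by the syndrome
        assert r == c                 # the error has been corrected
print("all 16 codewords and all single-bit errors decoded correctly")
```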

1.3 Review of Probability Theory

In order to study information theory and coding it is necessary to have some background in probability theory. Data transmission on these channels is viewed as a random experiment, while the entropy and mutual information are computed with respect to the underlying random variables at the source and receiver. Here we present only a brief description of some of the essential concepts required to understand the initial sections of the course. Additional theory will be introduced as required in the course. More complete references on probability theory and stochastic processes can be found in undergraduate [3] as well as graduate [4] level texts.

[3] A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, second edition, Addison-Wesley Publishing Company, Reading, MA, 1994.

[4] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, fourth edition, McGraw-Hill Companies, 2002.


1.3.1 Discrete Probability Models

Discrete probability models consist of a random experiment with a finite or countable number of outcomes. For example: toss of a die, flip of a coin, number of data packets arriving in a time interval, etc.

The sample space, S, of the experiment is the set of all possible outcomes and contains a finite or countable number of elements. Let S = {ζ1, ζ2, . . .}.

An event is a subset of S. Events consisting of a single outcome are termed elementary events.

Example: S is the certain event and ∅ is the null or impossible event. Note that S and ∅ are events of every sample space.

Every event A ⊆ S has a real number, p(A), assigned to it which is the probability of event A. Probabilities must satisfy the axioms:

1. p(A) ≥ 0

2. p(S) = 1

3. for A, B ⊆ S, if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B).

Formally, a probability space is defined by the triple (S, B, p) where B is the Borel field of all events of S. For the countable discrete models considered here, we let B be the power set of S, i.e., the set of all subsets of S. We can assign probabilities to all subsets of S so that the axioms of probability are satisfied (note that this is not the case for continuous random variables, where B is defined as the smallest Borel field that includes all half lines x ≤ xi for any xi).

Since S is countable, the probability of every event can be written in terms of the probabilities of elementary events, ζi, namely p({ζi}).

Random Variables: A random variable, X(ζ), is a function which assigns a real number to every outcome ζi ∈ S, i.e.,

X : S → R


Notice that X is neither random nor a variable!

Let SX be the set of all values taken by X and define

pX(x) := p[{ζi : X(ζi) = x}].     (1.1)

Notation: X(ζ) will be abbreviated to X to simplify notation, with the link between random variables and the underlying random ensemble taken as given. We further abuse notation by writing (1.1) as

pX(x) := Pr[X = x].

By the axioms of probability,

pX(x) ≥ 0   and   Σ_{x ∈ SX} pX(x) = 1.

The values pX(x) are called the probability mass function (pmf) of X.

Notation: For convenience we will abbreviate pX(x) to p(x). Although this is an abuse of notation, it will be clear from the context which random variable is referred to.

Vector Random Variables: The above can be extended to the case of vector random variables. Vector random variables assign a vector of real numbers to each outcome of S. For example, consider the vector random variable Z consisting of all two-tuples of the form (X, Y). This random vector can be viewed as the combination of two random variables describing each of the coordinates. The pmf of Z is often termed the joint pmf of X and Y and can be written as

pZ(x, y) = pX,Y(x, y) = Pr[X = x, Y = y]

The pmf’s of the coordinates, p(x) and p(y), are termed the marginals of pX,Y(x, y). Let SX and SY denote the range of values for each coordinate of Z. The marginals can then be written as

p(x) = Σ_{y ∈ SY} pX,Y(x, y)

and

p(y) = Σ_{x ∈ SX} pX,Y(x, y).

Notation: We similarly abuse notation in the case of vector random variables and let pX,Y(x, y) be denoted by p(x, y).

1.3.2 Conditional Probability and Independence

Take two random variables X and Y defined on the same probability space. The conditional probability mass function, pX|Y(xk|yj), describes the probability of the event [X = xk] given that the event [Y = yj] has occurred. Formally it can be defined as

pX|Y(xk|yj) = Pr[X = xk | Y = yj] = Pr[X = xk, Y = yj] / Pr[Y = yj] = p(xk, yj) / p(yj)

whenever p(yj) > 0.

Notation: We will again simplify notation and represent pX|Y(x|y) as p(x|y).

The random variables X and Y are said to be independent if their joint distribution can be factored as

∀(x, y) ∈ SX × SY,   p(x, y) = p(x)p(y).

McMaster University ECE 723 – Information Theory and Coding

Page 23: ECE 723 – Information Theory and Coding

CHAPTER 1. INTRODUCTION 23

For independent X and Y,

p(x|y) = p(x, y)/p(y) = p(x)p(y)/p(y) = p(x).

Notice that knowledge of Y does not impact the distribution of X. In a future lecture we will demonstrate that when X and Y are independent, Y provides no information about X.

1.3.3 Expected Value

The expected value or mean of a random variable X is defined as

E[X] = Σ_{xk ∈ SX} xk p(xk).

A function of X, f(X), is itself a random variable, and its expected value is

E[f(X)] = Σ_{xk ∈ SX} f(xk) p(xk).
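A compact sketch tying this review together (in Python; the joint pmf values are illustrative, not from the notes): marginals, a conditional pmf, an independence check, and expected values for a small discrete joint distribution.

```python
from itertools import product

# Illustrative joint pmf p(x, y) on SX = {0, 1}, SY = {0, 1} (values sum to 1).
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.2}
SX = {x for x, _ in p_xy}
SY = {y for _, y in p_xy}

# Marginals: p(x) = sum over y of p(x, y); p(y) = sum over x of p(x, y).
p_x = {x: sum(p_xy[(x, y)] for y in SY) for x in SX}
p_y = {y: sum(p_xy[(x, y)] for x in SX) for y in SY}

# Conditional pmf: p(x | y) = p(x, y) / p(y), defined whenever p(y) > 0.
p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for x, y in p_xy if p_y[y] > 0}

# Independence check: p(x, y) = p(x) p(y) for all (x, y).
independent = all(abs(p_xy[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in product(SX, SY))

# Expected values: E[X] and E[f(X)] for the function f(x) = x**2.
E_X = sum(x * p_x[x] for x in SX)
E_X2 = sum((x ** 2) * p_x[x] for x in SX)

print(p_x, p_y, independent, E_X, E_X2)
```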
