Information Theory, Coding and Cryptography, Unit-1, by Arun Pratap Singh


  • UNIT : I

    INFORMATION THEORY, CODING & CRYPTOGRAPHY (MCSE 202)

    PREPARED BY ARUN PRATAP SINGH 5/26/14 MTECH 2nd SEMESTER


    INTRODUCTION TO INFORMATION THEORY :

    Information theory is a branch of science that deals with the analysis of a communications

    system

    We will study digital communications using a file (or network protocol) as the channel

    Claude Shannon published a landmark paper in 1948 that marked the beginning of information theory

    We are interested in communicating information from a source to a destination

    In our case, the messages will be a sequence of binary digits

    Does anyone know the term for a binary digit? (It is called a bit.)

    One detail that makes communicating difficult is noise

    Noise introduces uncertainty

    Suppose I wish to transmit one bit of information. What are all of the possibilities?

    tx 0, rx 0 - good

    tx 0, rx 1 - error

    tx 1, rx 0 - error

    tx 1, rx 1 - good

    Two of the cases above have errors; this is where probability fits into the picture

    In the case of steganography, the noise may be due to attacks on the hiding algorithm
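    To make the four transmit/receive cases concrete, here is a minimal sketch (not from the notes; the crossover probability is an illustrative assumption) that simulates a binary symmetric channel in Python:

```python
import random

def bsc_transmit(bits, p_error, seed=0):
    """Send bits through a binary symmetric channel: each bit is flipped
    independently with probability p_error (the crossover probability)."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < p_error else b for b in bits]

tx = [0, 1, 1, 0, 1, 0, 0, 1] * 1000
rx = bsc_transmit(tx, p_error=0.1)
errors = sum(t != r for t, r in zip(tx, rx))
print(f"observed bit error rate = {errors / len(tx):.3f}")  # close to 0.1
```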


    INFORMATION MEASURES :

    Any information source, analog or digital, produces an output that is random in nature. If it were not

    random, i.e., the output were known exactly, there would be no need to transmit it. We live in an analog

    world and most sources are analog sources, for example, speech, temperature fluctuations, etc. The

    discrete sources are man-made sources, for example, a source (say, a man) that generates a sequence of

    letters from a finite alphabet (typing his email).

    Before we go on to develop a mathematical measure of information, let us develop an intuitive feel for it. Read the following sentences:


    (A) Tomorrow, the sun will rise from the East

    (B) The phone will ring in the next one hour

    (C) It will snow in Delhi this winter

    The three sentences carry different amounts of information. In fact, the first sentence hardly carries any

    information. Everybody knows that the sun rises in the East and the probability of this happening again is

    almost unity.

    Sentence (B) appears to carry more information than sentence (A). The phone may ring, or it may not. There is a finite probability that the phone will ring in the next one hour.

    The last sentence probably made you read it over twice. This is because it has never snowed in Delhi, and the probability of snowfall is very low. It is interesting to note that the amount of information carried by the sentences listed above has something to do with the probability of occurrence of the events stated in the sentences. And we observe an inverse relationship: sentence (A), which talks about an event whose probability of occurrence is very close to 1, carries almost no information; sentence (C), which has a very low probability of occurrence, appears to carry a lot of information. The other interesting thing to note is that the length of a sentence has nothing to do with the amount of information it conveys. In fact, sentence (A) is the longest but carries the minimum information.

    We will now develop a mathematical measure of information.
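    Anticipating that measure, here is a minimal sketch of the standard self-information I(x) = -log2 P(x); the probabilities assigned to sentences (A), (B) and (C) are illustrative assumptions, not values from the notes:

```python
import math

def self_information(p):
    """Self-information in bits of an event that occurs with probability p."""
    return -math.log2(p)

# Illustrative probabilities for sentences (A), (B), (C)
events = {"(A) sun rises in the East": 0.9999,
          "(B) phone rings in the next hour": 0.5,
          "(C) it snows in Delhi this winter": 0.0001}

for name, p in events.items():
    print(f"{name}: {self_information(p):.2f} bits")
# The less probable the event, the more information its occurrence carries.
```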


    REVIEW PROBABILITY THEORY :

    We could choose one of several technical definitions for probability, but for our purposes it refers to an

    assessment of the likelihood of the various possible outcomes in an experiment or some other situation

    with a random outcome.

    Why Probability Theory?

    Information is exchanged in a computer network in a random way, and events that modify the

    behavior of links and nodes in the network are also random

    We need a way to reason in quantitative ways about the likelihood of events in a network, and to

    predict the behavior of network components.

    Example 1:

    Measure the time between two packet arrivals into the cable of a local area network.

    Determine how likely it is that the interarrival time between any two packets is less than T sec.

    A probability model is a mathematical model used to quantify the likelihood of events taking place in an experiment in which events are random.

    It consists of:

    A sample space: The set of all possible outcomes of a random experiment.

    The set of events: Subsets of the sample space.

    The probability measure: Defined according to a probability law for all the events of the

    sample space.
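    Returning to Example 1 above, here is a minimal sketch (not from the notes) of both the analytic and the simulated answer, under the common assumption that packet arrivals form a Poisson process so that interarrival times are exponentially distributed; the rate and threshold values are illustrative:

```python
import math
import random

# Assumption (not from the notes): arrivals form a Poisson process with
# rate lam packets/second, so interarrival times are exponential.
lam = 200.0      # hypothetical arrival rate (packets per second)
T = 0.005        # threshold of interest, in seconds

analytic = 1 - math.exp(-lam * T)     # P(interarrival < T)

rng = random.Random(1)
samples = [rng.expovariate(lam) for _ in range(100_000)]
empirical = sum(s < T for s in samples) / len(samples)

print(f"P(interarrival < {T} s): analytic {analytic:.3f}, simulated {empirical:.3f}")
```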


    RANDOM VARIABLES :

    In probability and statistics, a random variable, aleatory variable or stochastic variable is

    a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical

    sense).

    A random variable's possible values might represent the possible outcomes of a yet-to-be-performed

    experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain

    (for example, as a result of incomplete information or imprecise measurements). They may also

    conceptually represent either the results of an "objectively" random process (such as rolling a die), or

    the "subjective" randomness that results from incomplete knowledge of a quantity. The meaning of the


    probabilities assigned to the potential values of a random variable is not part of probability theory itself,

    but instead related to philosophical arguments over the interpretation of probability.

    The mathematical function describing the possible values of a random variable and their associated

    probabilities is known as a probability distribution. Random variables can be discrete, that is, taking

    any of a specified finite or countable list of values, endowed with a probability mass function,

    characteristic of a probability distribution; or continuous, taking any numerical value in an interval or

    collection of intervals, via a probability density function that is characteristic of a probability distribution;

    or a mixture of both types. The realizations of a random variable, that is, the results of randomly

    choosing values according to the variable's probability distribution function, are called random variates.

    Example :

    The possible outcomes for one coin toss can be described by the sample space Ω = {heads, tails}. We can introduce a real-valued random variable Y that models a $1 payoff for a successful bet on heads as follows:

    Y = 1 if the outcome is heads, and Y = 0 otherwise.

    If the coin is equally likely to land on either side, then Y has a probability mass function given by:

    P(Y = 1) = 1/2, P(Y = 0) = 1/2.
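    A minimal simulation sketch of this example (function and variable names are illustrative, not from the notes):

```python
import random
from collections import Counter

rng = random.Random(42)

def Y():
    """Random variable: $1 payoff for a successful bet on heads."""
    return 1 if rng.choice(["heads", "tails"]) == "heads" else 0

samples = [Y() for _ in range(100_000)]
pmf = {value: count / len(samples) for value, count in Counter(samples).items()}
print(pmf)   # both values close to 0.5 for a fair coin
```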


    RANDOM PROCESS :

    In probability theory, a stochastic process or sometimes random process (widely used) is a

    collection of random variables; this is often used to represent the evolution of some random value, or

    system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic

    system). Instead of describing a process which can only evolve in one way (as in the case, for example,

    of solutions of an ordinary differential equation), in a stochastic or random process there is some

    indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely

    many) directions in which the process may evolve.

    In the simple case of discrete time, as opposed to continuous time, a stochastic process involves

    a sequence of random variables and the time series associated with these random variables (for

    example, see Markov chain, also known as discrete-time Markov chain). Another basic type of a

    stochastic process is a random field, whose domain is a region of space, in other words, a random

    function whose arguments are drawn from a range of continuously changing values. One approach to

    stochastic processes treats them as functions of one or several deterministic arguments (inputs, in

    most cases regarded as time) whose values (outputs) are random variables: non-deterministic (single)

    quantities which have certain probability distributions. Random variables corresponding to various

    times (or points, in the case of random fields) may be completely different. The main requirement is

    that these different random quantities all have the same type. Type refers to the codomain of the

    function. Although the random values of a stochastic process at different times may be independent

    random variables, in most commonly considered situations they exhibit complicated statistical

    correlations.
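    As an illustration (a sketch, not from the notes), the following generates sample paths of one of the simplest discrete-time random processes, a symmetric random walk; even with the same initial condition, different realizations evolve differently:

```python
import random

def random_walk(steps, seed):
    """Symmetric random walk: a discrete-time stochastic process X_0, X_1, ..."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(steps):
        x += rng.choice([-1, +1])
        path.append(x)
    return path

# Same initial condition (X_0 = 0), yet the realizations differ:
print(random_walk(10, seed=1))
print(random_walk(10, seed=2))
```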


    MUTUAL INFORMATION :

    In probability theory and information theory, the mutual information or (formerly) transinformation of

    two random variables is a measure of the variables' mutual dependence. The most common unit of

    measurement of mutual information is the bit.

    Formally, the mutual information of two discrete random variables X and Y can be defined as:

    I(X; Y) = Σ_y Σ_x p(x, y) log[ p(x, y) / (p(x) p(y)) ]

    where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.

    In the case of continuous random variables, the summation is replaced by a definite double integral:

    I(X; Y) = ∫_Y ∫_X p(x, y) log[ p(x, y) / (p(x) p(y)) ] dx dy

    where p(x, y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.

    If the log base 2 is used, the units of mutual information are the bit.

    Intuitively, mutual information measures the information that X and Y share: it measures how

    much knowing one of these variables reduces uncertainty about the other. For example,

    if X and Y are independent, then knowing X does not give any information about Y and vice

    versa, so their mutual information is zero. At the other extreme, if X is a deterministic function

    of Y and Y is a deterministic function of X then all information conveyed by X is shared with Y:

    knowing X determines the value of Y and vice versa. As a result, in this case the mutual

    information is the same as the uncertainty contained in Y (or X) alone, namely

    the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and

    as the entropy of Y. (A very special case of this is when X and Y are the same random

    variable.)

    Mutual information is a measure of the inherent dependence expressed in the joint

    distribution of X and Y relative to the joint distribution of X and Y under the assumption of

    independence. Mutual information therefore measures dependence in the following

    sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see

    in one direction: if X and Y are independent, then p(x, y) = p(x) p(y), and therefore:

    log[ p(x, y) / (p(x) p(y)) ] = log 1 = 0.

    Moreover, mutual information is nonnegative (i.e. I(X; Y) ≥ 0) and symmetric (i.e. I(X; Y) = I(Y; X)).

    Equivalently, mutual information can be expressed in terms of entropies:

    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y).
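    As a numerical illustration of these properties (a minimal sketch; the joint distributions below are assumed for illustration), mutual information can be computed directly from a joint PMF:

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of a joint PMF given as a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables -> I(X;Y) = 0
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Y is a deterministic copy of X -> I(X;Y) = H(X) = 1 bit
copy = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))  # 0.0
print(mutual_information(copy))         # 1.0
```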


    ENTROPY :

    In information theory, entropy is a measure of the uncertainty in a random variable.[1] In this context,

    the term usually refers to the Shannon entropy, which quantifies the expected value of

    the information contained in a message.[2] Entropy is typically measured in bits, nats,

    or bans.[3] Shannon entropy is the average unpredictability in a random variable, which is equivalent

    to its information content. Shannon entropy provides an absolute limit on the best


    possible lossless encoding or compression of any communication, assuming that the communication

    may be represented as a sequence of independent and identically distributed random variables.

    A single toss of a fair coin has an entropy of one bit. A series of two fair coin tosses has an entropy of two bits; in general, a series of n fair coin tosses has an entropy of n bits. This random selection between two

    outcomes in a sequence over time, whether the outcomes are equally probable or not, is often referred

    to as a Bernoulli process. The entropy of such a process is given by the binary entropy function. The

    entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty,

    and hence the entropy rate, is lower. This is because, if asked to predict the next outcome, we could

    choose the most frequent result and be right more often than wrong. The difference between what we

    know, or predict, and the information that the unfair coin toss reveals to us is less than one heads-or-

    tails "message", or bit, per toss.


    SHANNON'S THEOREM :

    Shannon's theorem, proved by Claude Shannon in 1948, describes the maximum possible efficiency of

    error correcting methods versus levels of noise interference and data corruption.

    The theorem doesn't describe how to construct the error-correcting method; it only tells us how good the best possible method can be. Shannon's theorem has wide-ranging applications in both communications and data storage.

    Considering all possible multi-level and multi-phase encoding techniques, the Shannon-Hartley theorem states that the channel capacity C, meaning the theoretical tightest upper bound on the information rate (excluding error-correcting codes) of clean (or arbitrarily low bit-error-rate) data that can be sent with a given average signal power S through an analog communication channel subject to additive white Gaussian noise of power N, is:

    C = B log2(1 + S/N)

    Where

    C is the channel capacity in bits per second;


    B is the bandwidth of the channel in hertz (passband bandwidth in case of a modulated

    signal);

    S is the average received signal power over the bandwidth (in case of a modulated signal,

    often denoted C, i.e. modulated carrier), measured in watts (or volts squared);

    N is the average noise or interference power over the bandwidth, measured in watts (or volts

    squared); and

    S/N is the signal-to-noise ratio (SNR) or the carrier-to-noise ratio (CNR) of the

    communication signal to the Gaussian noise interference expressed as a linear power ratio

    (not as logarithmic decibels).

    Example :

    If the SNR is 20 dB and the bandwidth available is 4 kHz, which is appropriate for telephone communications, then C = 4000 × log2(1 + 100) = 4000 × log2(101) ≈ 26.63 kbit/s. Note that the value S/N = 100 corresponds to an SNR of 20 dB.

    If it is required to transmit at 50 kbit/s and a bandwidth of 1 MHz is used, then the minimum SNR required is given by 50 = 1000 × log2(1 + S/N), so S/N = 2^(C/B) - 1 = 0.035, corresponding to an SNR of -14.5 dB. This shows that it is possible to transmit using signals which are actually much weaker than the background noise level.
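    The two worked examples above can be reproduced with a short sketch (function names are illustrative):

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

def min_snr_db(rate_bps, bandwidth_hz):
    """Minimum SNR (in dB) needed to support rate_bps over bandwidth_hz."""
    snr = 2 ** (rate_bps / bandwidth_hz) - 1
    return 10 * math.log10(snr)

print(capacity(4_000, 10 ** (20 / 10)))   # ~26630 bit/s for 4 kHz at 20 dB SNR
print(min_snr_db(50_000, 1_000_000))      # ~-14.5 dB for 50 kbit/s over 1 MHz
```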

    Shannon's law is any statement defining the theoretical maximum rate at which error-free digits can be transmitted over a bandwidth-limited channel in the presence of noise.

    Shannon's theorem puts a limit on the transmission data rate, not on the error probability:

    It is theoretically possible to transmit information at any rate Rb ≤ C with an arbitrarily small error probability by using a sufficiently complicated coding scheme.

    For an information rate Rb > C, it is not possible to find a code that can achieve an arbitrarily small error probability.


    Noisy channel coding theorem and capacity :

    Claude Shannon's development of information theory during World War II provided the next big

    step in understanding how much information could be reliably communicated through noisy

    channels. Building on Hartley's foundation, Shannon's noisy channel coding theorem (1948)

    describes the maximum possible efficiency of error-correcting methods versus levels of noise

    interference and data corruption.[5][6] The proof of the theorem shows that a randomly constructed

    error-correcting code is essentially as good as the best possible code; the theorem is proved

    through the statistics of such random codes.

    Shannon's theorem shows how to compute a channel capacity from a statistical description of a

    channel, and establishes that given a noisy channel with capacity C and information transmitted

    at a line rate R, then if R < C,

    there exists a coding technique which allows the probability of error at the receiver to be made

    arbitrarily small. This means that theoretically, it is possible to transmit information nearly without

    error up to nearly a limit of C bits per second.

    The converse is also important. If R > C,

    the probability of error at the receiver increases without bound as the rate is increased. So no

    useful information can be transmitted beyond the channel capacity. The theorem does not

    address the rare situation in which rate and capacity are equal.


    The Shannon-Hartley theorem establishes what that channel capacity is for a finite-

    bandwidth continuous-time channel subject to Gaussian noise. It connects Hartley's result with

    Shannon's channel capacity theorem in a form that is equivalent to specifying the M in Hartley's

    line rate formula in terms of a signal-to-noise ratio, but achieving reliability through error-correction

    coding rather than through reliably distinguishable pulse levels.

    If there were such a thing as a noise-free analog channel, one could transmit unlimited amounts

    of error-free data over it per unit of time (Note: An infinite-bandwidth analog channel can't transmit

    unlimited amounts of error-free data, without infinite signal power). Real channels, however, are

    subject to limitations imposed by both finite bandwidth and nonzero noise.

    So how do bandwidth and noise affect the rate at which information can be transmitted over an

    analog channel?

    Surprisingly, bandwidth limitations alone do not impose a cap on maximum information rate. This

    is because it is still possible for the signal to take on an indefinitely large number of different

    voltage levels on each symbol pulse, with each slightly different level being assigned a different

    meaning or bit sequence. If we combine both noise and bandwidth limitations, however, we do

    find there is a limit to the amount of information that can be transferred by a signal of a bounded

    power, even when clever multi-level encoding techniques are used.

    In the channel considered by the Shannon-Hartley theorem, noise and signal are combined by

    addition. That is, the receiver measures a signal that is equal to the sum of the signal encoding

    the desired information and a continuous random variable that represents the noise. This addition

    creates uncertainty as to the original signal's value. If the receiver has some information about

    the random process that generates the noise, one can in principle recover the information in the

    original signal by considering all possible states of the noise process. In the case of the Shannon-Hartley theorem, the noise is assumed to be generated by a Gaussian process with a known

    variance. Since the variance of a Gaussian process is equivalent to its power, it is conventional

    to call this variance the noise power.

    Such a channel is called the Additive White Gaussian Noise channel, because Gaussian noise is

    added to the signal; "white" means equal amounts of noise at all frequencies within the channel

    bandwidth. Such noise can arise both from random sources of energy and also from coding and

    measurement error at the sender and receiver respectively. Since sums of independent Gaussian

    random variables are themselves Gaussian random variables, this conveniently simplifies

    analysis, if one assumes that such error sources are also Gaussian and independent.

    Implications of the theorem

    Comparison of Shannon's capacity to Hartley's law -

    Comparing the channel capacity to the information rate from Hartley's law, we can find the effective number of distinguishable levels M:

    2B log2(M) = B log2(1 + S/N),  so  M = √(1 + S/N).

    The square root effectively converts the power ratio back to a voltage ratio, so the number of

    levels is approximately proportional to the ratio of rms signal amplitude to noise standard

    deviation.

    This similarity in form between Shannon's capacity and Hartley's law should not be interpreted to

    mean that M pulse levels can be literally sent without any confusion; more levels are needed, to

    allow for redundant coding and error correction, but the net data rate that can be approached with

    coding is equivalent to using that M in Hartley's law.

    Alternative forms

    Frequency-dependent (colored noise) case-

    In the simple version above, the signal and noise are fully uncorrelated, in which case S + N is

    the total power of the received signal and noise together. A generalization of the above equation

    for the case where the additive noise is not white (or that the S/N is not constant with frequency

    over the bandwidth) is obtained by treating the channel as many narrow, independent Gaussian

    channels in parallel:

    C = ∫_0^B log2(1 + S(f)/N(f)) df

    where

    C is the channel capacity in bits per second;

    B is the bandwidth of the channel in Hz;

    S(f) is the signal power spectrum

    N(f) is the noise power spectrum

    f is frequency in Hz.

    Note: the theorem only applies to Gaussian stationary process noise. This formula's way of

    introducing frequency-dependent noise cannot describe all continuous-time noise processes. For

    example, consider a noise process consisting of adding a random wave whose amplitude is 1 or

    -1 at any point in time, and a channel that adds such a wave to the source signal. Such a wave's

    frequency components are highly dependent. Though such a noise may have a high power, it is

    fairly easy to transmit a continuous signal with much less power than one would need if the

    underlying noise was a sum of independent noises in each frequency band.
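    A minimal numerical sketch of this parallel-channel formula, using the midpoint rule; the signal and noise spectra below are hypothetical, chosen only for illustration:

```python
import math

def capacity_colored(S, N, bandwidth_hz, steps=10_000):
    """Numerically integrate C = ∫ log2(1 + S(f)/N(f)) df over [0, B]
    using the midpoint rule. S and N are power spectral densities (functions of f)."""
    df = bandwidth_hz / steps
    return sum(math.log2(1 + S(f) / N(f)) * df
               for f in (df * (k + 0.5) for k in range(steps)))

# Hypothetical spectra: flat signal, noise rising with frequency (assumed values)
B = 4_000.0
S = lambda f: 1e-6                      # W/Hz
N = lambda f: 1e-8 * (1 + f / B)        # W/Hz

print(f"capacity ~ {capacity_colored(S, N, B):.0f} bit/s")
```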


    Approximations -

    For large or small and constant signal-to-noise ratios, the capacity formula can be approximated:

    If S/N >> 1, then the capacity is approximately logarithmic in power:

    C ≈ B log2(S/N) ≈ 0.332 · B · SNR (in dB),

    where SNR (in dB) = 10 log10(S/N).

    Similarly, if S/N << 1, the capacity is approximately linear in power:

    C ≈ 1.44 · B · (S/N).


    REDUNDANCY :

    The rate r of a source is its average entropy per symbol. The absolute rate R is the logarithm of the cardinality of the message space, or alphabet. (This formula is sometimes called the Hartley function.) This is the maximum possible rate of information that can be transmitted with that alphabet. (The logarithm should be taken to a base appropriate for the unit of measurement in use.) The absolute rate is equal to the actual rate if the source is memoryless and has a uniform distribution.

    The absolute redundancy can then be defined as

    D = R - r,

    the difference between the absolute rate and the rate.

    The quantity D/R is called the relative redundancy and gives the maximum possible data compression ratio, when expressed as the percentage by which a file size can be decreased. (When expressed as a ratio of original file size to compressed file size, the quantity R : r gives the maximum compression ratio that can be achieved.)

    Complementary to the concept of relative redundancy is efficiency, defined as r/R, so that r/R + D/R = 1. A memoryless source with a uniform distribution has zero redundancy (and thus 100% efficiency), and cannot be compressed.

    A measure of redundancy between two variables is the mutual information or a normalized variant. A measure of redundancy among many variables is given by the total correlation.

    Redundancy of compressed data refers to the difference between the expected compressed data length of n messages, L(M^n), (or expected data rate L(M^n)/n) and the entropy nr (or entropy rate r). (Here we assume the data is ergodic and stationary, e.g., a memoryless source.) Although the rate difference L(M^n)/n - r can be arbitrarily small as n is increased, the actual difference L(M^n) - nr cannot, although it can be theoretically upper-bounded by 1 in the case of finite-entropy memoryless sources.
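    A small sketch tying these definitions together for a memoryless source (the symbol probabilities are assumed for illustration):

```python
import math

def source_stats(pmf):
    """Rate (entropy per symbol), absolute rate, redundancy and efficiency
    of a memoryless source with the given symbol probabilities."""
    r = -sum(p * math.log2(p) for p in pmf if p > 0)   # rate, bits/symbol
    R = math.log2(len(pmf))                            # absolute rate
    return {"rate": r, "absolute_rate": R,
            "redundancy": R - r, "efficiency": r / R}

# Hypothetical 4-symbol source
print(source_stats([0.5, 0.25, 0.125, 0.125]))  # rate 1.75, absolute rate 2.0
print(source_stats([0.25] * 4))                 # uniform: zero redundancy
```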

    HUFFMAN CODING :

    In computer science and information theory, Huffman coding is an entropy encoding algorithm used

    for lossless data compression. The term refers to the use of a variable-length code table for encoding

    a source symbol (such as a character in a file) where the variable-length code table has been derived

    in a particular way based on the estimated probability of occurrence for each possible value of the

    source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and

    published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".


    Huffman coding uses a specific method for choosing the representation for each symbol, resulting in

    a prefix code (sometimes called "prefix-free codes", that is, the bit string representing some particular

    symbol is never a prefix of the bit string representing any other symbol) that expresses the most

    common source symbols using shorter strings of bits than are used for less common source symbols.

    Huffman was able to design the most efficient compression method of this type: no other mapping of

    individual source symbols to unique strings of bits will produce a smaller average output size when

    the actual symbol frequencies agree with those used to create the code. The running time of Huffman's method is fairly efficient: it takes O(n log n) operations to construct a code for n source symbols. A method was later found to design a Huffman code in linear time if the input probabilities (also known as weights) are sorted.

    We will now study an algorithm for constructing efficient source codes for a DMS with source symbols that are not equally probable. A variable length encoding algorithm was suggested by Huffman in 1952, based on the source symbol probabilities P(x_i), i = 1, 2, ..., L. The algorithm is optimal in the sense that the average number of bits it requires to represent the source symbols is a minimum, and it also meets the prefix condition. The steps of the Huffman coding algorithm are given below (see the sketch that follows):
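    The steps appear as figures on the original slides and are not reproduced here; the following is a minimal sketch of the standard Huffman procedure (repeatedly merge the two least probable symbols), with illustrative symbol probabilities:

```python
import heapq

def huffman_code(probabilities):
    """Build a binary Huffman code for {symbol: probability}.
    Repeatedly merge the two least probable nodes until one tree remains."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable node
        p2, _, codes2 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Illustrative DMS with unequal symbol probabilities (assumed values)
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)   # a prefix-free code: "a" gets the shortest codeword, "c" and "d" the longest
print(f"average length = {avg_len:.2f} bits/symbol")  # 1.90, close to the entropy (~1.85)
```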


    RANDOM VARIABLES :

    A random variable, usually written X, is a variable whose possible values are numerical outcomes of a

    random phenomenon. There are two types of random variables, discrete and continuous.

    DISCRETE RANDOM VARIABLES :

    A discrete random variable is one which may take on only a countable number of distinct values such as

    0, 1, 2, 3, 4, .... Discrete random variables are usually (but not necessarily) counts. If a random variable can

    take only a finite number of distinct values, then it must be discrete. Examples of discrete random

    variables include the number of children in a family, the Friday night attendance at a cinema, the number

    of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

    The probability distribution of a discrete random variable is a list of probabilities associated with each of

    its possible values. It is also sometimes called the probability function or the probability mass function.

    (Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

    Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X

    = xi) = pi. The probabilities pi must satisfy the following:

    1: 0 < pi < 1 for each i

    2: p1 + p2 + ... + pk = 1.

    Example :

    Suppose a variable X can take the values 1, 2, 3, or 4.

    The probabilities associated with each outcome are described by the following table:

    Outcome      1     2     3     4
    Probability  0.1   0.3   0.4   0.2

    The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) +

    P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1

    = 0.9, by the complement rule.

    This distribution may also be described by a probability histogram (not reproduced here).
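    A quick check of these computations, using the distribution from the table above:

```python
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

p_2_or_3 = pmf[2] + pmf[3]            # P(X = 2 or X = 3)
p_greater_1 = 1 - pmf[1]              # complement rule: P(X > 1)
print(round(p_2_or_3, 1), round(p_greater_1, 1))   # 0.7 0.9
```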

    CONTINUOUS RANDOM VARIABLES :

    A continuous random variable is one which takes an infinite number of possible values. Continuous

    random variables are usually measurements. Examples include height, weight, the amount of sugar in an

    orange, the time required to run a mile.

    (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

    A continuous random variable is not defined at specific values. Instead, it is defined over an interval of

    values, and is represented by the area under a curve (in advanced mathematics, this is known as an

    integral). The probability of observing any single value is equal to 0, since the number of values which

    may be assumed by the random variable is infinite.

    Suppose a random variable X may take all values over an interval of real numbers. Then the probability

    that X is in the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve,

    which represents a function p(x), must satisfy the following:

    1: The curve has no negative values (p(x) > 0 for all x)

    2: The total area under the curve is equal to 1.

    A curve meeting these requirements is known as a density curve.


    A Gaussian random variable is completely determined by its mean and variance.

    The function that is frequently used for the area under the tail of the Gaussian pdf (probability density function) is denoted by Q(x):

    Q(x) = (1/√(2π)) ∫_x^∞ exp(-t²/2) dt


    The Q-function, which has no closed-form expression, is a standard form for expressing error probabilities.
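    A minimal numerical sketch of Q(x), using the identity Q(x) = 0.5 erfc(x/√2):

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

for x in (0.0, 1.0, 2.0, 3.0):
    print(f"Q({x}) = {Q(x):.4e}")
# Q(0) = 0.5, Q(1) ≈ 0.1587, Q(2) ≈ 0.0228, Q(3) ≈ 1.35e-3
```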

    BOUNDS ON TAIL PROBABILITY :

    We consider general bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation).

    In probability theory, the Chernoff bound, named after Herman Chernoff, gives exponentially decreasing

    bounds on tail distributions of sums of independent random variables. It is a sharper bound than the


    known first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, the Chernoff bound requires that the variates be independent, a condition that neither the Markov nor the Chebyshev inequality requires.
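    To see how much sharper the Chernoff bound is, the following sketch compares the three bounds for the tail of a sum of independent Bernoulli(1/2) variates; the Chernoff-type bound used here is the Hoeffding form exp(-2t²/n), and the parameter values are illustrative:

```python
import math

n = 100                  # number of independent Bernoulli(1/2) variates (assumed)
mu = n / 2               # mean of the sum
var = n / 4              # variance of the sum
a = 75                   # tail threshold: bound P(X >= a)

t = a - mu
markov = mu / a                          # first-moment (Markov) bound
chebyshev = var / t ** 2                 # second-moment (Chebyshev) bound
chernoff = math.exp(-2 * t ** 2 / n)     # Chernoff-Hoeffding bound for sums of 0/1 variates

print(f"Markov    <= {markov:.3f}")      # 0.667
print(f"Chebyshev <= {chebyshev:.3f}")   # 0.040
print(f"Chernoff  <= {chernoff:.2e}")    # 3.73e-06 (exponentially smaller)
```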
