Information Theory, Coding and Cryptography, Unit-1, by Arun Pratap Singh


  • UNIT : I

    INFORMATION THEORY, CODING & CRYPTOGRAPHY (MCSE 202)

    PREPARED BY ARUN PRATAP SINGH 5/26/14 MTECH 2nd SEMESTER


    INTRODUCTION TO INFORMATION THEORY :

    Information theory is a branch of science that deals with the analysis of a communications

    system

    We will study digital communications using a file (or network protocol) as the channel

    Claude Shannon published a landmark paper in 1948 that marked the beginning of information theory

    We are interested in communicating information from a source to a destination

    In our case, the messages will be a sequence of binary digits

    Does anyone know the term for a binary digit? (It is called a bit.)

    One detail that makes communicating difficult is noise

    Noise introduces uncertainty

    Suppose I wish to transmit one bit of information. What are all of the possibilities?

    tx 0, rx 0 - good

    tx 0, rx 1 - error

    tx 1, rx 0 - error

    tx 1, rx 1 - good

    Two of the cases above have errors; this is where probability fits into the picture

    In the case of steganography, the noise may be due to attacks on the hiding algorithm
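    To make the four transmit/receive cases concrete, here is a minimal sketch (not from the notes; the crossover probability is an illustrative assumption) that simulates a binary symmetric channel in Python:

```python
import random

def bsc_transmit(bits, p_error, seed=0):
    """Send bits through a binary symmetric channel: each bit is flipped
    independently with probability p_error (the crossover probability)."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < p_error else b for b in bits]

tx = [0, 1, 1, 0, 1, 0, 0, 1] * 1000
rx = bsc_transmit(tx, p_error=0.1)
errors = sum(t != r for t, r in zip(tx, rx))
print(f"observed bit error rate = {errors / len(tx):.3f}")  # close to 0.1
```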


    INFORMATION MEASURES :

    Any information source, analog or digital, produces an output that is random in nature. If it were not

    random, i.e., the output were known exactly, there would be no need to transmit it. We live in an analog

    world and most sources are analog sources, for example, speech, temperature fluctuations, etc. The

    discrete sources are man-made sources, for example, a source (say, a man) that generates a sequence of

    letters from a finite alphabet (typing his email).

    Before we go on to develop a mathematical measure of information, let us develop an intuitive feel for it. Read the following sentences:


    (A) Tomorrow, the sun will rise from the East

    (B) The phone will ring in the next one hour

    (C) It will snow in Delhi this winter

    The three sentences carry different amounts of information. In fact, the first sentence hardly carries any

    information. Everybody knows that the sun rises in the East and the probability of this happening again is

    almost unity.

    Sentence (B) appears to carry more information than sentence (A). The phone may ring, or it may not. There is a finite probability that the phone will ring in the next one hour.

    The last sentence probably made you read it over twice. This is because it has never snowed in Delhi, and the probability of snowfall is very low. It is interesting to note that the amount of information carried by the sentences listed above has something to do with the probability of occurrence of the events stated in the sentences. And we observe an inverse relationship: sentence (A), which talks about an event whose probability of occurrence is very close to 1, carries almost no information; sentence (C), which has a very low probability of occurrence, appears to carry a lot of information. The other interesting thing to note is that the length of a sentence has nothing to do with the amount of information it conveys. In fact, sentence (A) is the longest but carries the minimum information.

    We will now develop a mathematical measure of information.
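    Anticipating that measure, here is a minimal sketch of the standard self-information I(x) = -log2 P(x); the probabilities assigned to sentences (A), (B) and (C) are illustrative assumptions, not values from the notes:

```python
import math

def self_information(p):
    """Self-information in bits of an event that occurs with probability p."""
    return -math.log2(p)

# Illustrative probabilities for sentences (A), (B), (C)
events = {"(A) sun rises in the East": 0.9999,
          "(B) phone rings in the next hour": 0.5,
          "(C) it snows in Delhi this winter": 0.0001}

for name, p in events.items():
    print(f"{name}: {self_information(p):.2f} bits")
# The less probable the event, the more information its occurrence carries.
```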


    REVIEW PROBABILITY THEORY :

    We could choose one of several technical definitions for probability, but for our purposes it refers to an

    assessment of the likelihood of the various possible outcomes in an experiment or some other situation

    with a random outcome.

    Why Probability Theory?

    Information is exchanged in a computer network in a random way, and events that modify the

    behavior of links and nodes in the network are also random

    We need a way to reason in quantitative ways about the likelihood of events in a network, and to

    predict the behavior of network components.

    Example 1:

    Measure the time between two packet arrivals into the cable of a local area network.

    Determine how likely it is that the interarrival time between any two packets is less than T sec.

    A probability model is a mathematical model used to quantify the likelihood of events taking place in an experiment in which events are random.

    It consists of:

    A sample space: The set of all possible outcomes of a random experiment.

    The set of events: Subsets of the sample space.

    The probability measure: Defined according to a probability law for all the events of the

    sample space.
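    Returning to Example 1 above, here is a minimal sketch (not from the notes) of both the analytic and the simulated answer, under the common assumption that packet arrivals form a Poisson process so that interarrival times are exponentially distributed; the rate and threshold values are illustrative:

```python
import math
import random

# Assumption (not from the notes): arrivals form a Poisson process with
# rate lam packets/second, so interarrival times are exponential.
lam = 200.0      # hypothetical arrival rate (packets per second)
T = 0.005        # threshold of interest, in seconds

analytic = 1 - math.exp(-lam * T)     # P(interarrival < T)

rng = random.Random(1)
samples = [rng.expovariate(lam) for _ in range(100_000)]
empirical = sum(s < T for s in samples) / len(samples)

print(f"P(interarrival < {T} s): analytic {analytic:.3f}, simulated {empirical:.3f}")
```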


    RANDOM VARIABLES :

    In probability and statistics, a random variable, aleatory variable or stochastic variable is

    a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical

    sense).

    A random variable's possible values might represent the possible outcomes of a yet-to-be-performed

    experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain

    (for example, as a result of incomplete information or imprecise measurements). They may also

    conceptually represent either the results of an "objectively" random process (such as rolling a die), or

    the "subjective" randomness that results from incomplete knowledge of a quantity. The meaning of the


    probabilities assigned to the potential values of a random variable is not part of probability theory itself,

    but instead related to philosophical arguments over the interpretation of probability.

    The mathematical function describing the possible values of a random variable and their associated

    probabilities is known as a probability distribution. Random variables can be discrete, that is, taking

    any of a specified finite or countable list of values, endowed with a probability mass function,

    characteristic of a probability distribution; or continuous, taking any numerical value in an interval or

    collection of intervals, via a probability density function that is characteristic of a probability distribution;

    or a mixture of both types. The realizations of a random variable, that is, the results of randomly

    choosing values according to the variable's probability distribution function, are called random variates.

    Example :

    The possible outcomes for one coin toss can be described by the sample space Ω = {heads, tails}. We can introduce a real-valued random variable Y that models a $1 payoff for a successful bet on heads as follows:

    Y = 1 if the outcome is heads, and Y = 0 otherwise.

    If the coin is equally likely to land on either side, then Y has a probability mass function given by:

    P(Y = 1) = 1/2, P(Y = 0) = 1/2.
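    A minimal simulation sketch of this example (function and variable names are illustrative, not from the notes):

```python
import random
from collections import Counter

rng = random.Random(42)

def Y():
    """Random variable: $1 payoff for a successful bet on heads."""
    return 1 if rng.choice(["heads", "tails"]) == "heads" else 0

samples = [Y() for _ in range(100_000)]
pmf = {value: count / len(samples) for value, count in Counter(samples).items()}
print(pmf)   # both values close to 0.5 for a fair coin
```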


    RANDOM PROCESS :

    In probability theory, a stochastic process or sometimes random process (widely used) is a

    collection of random variables; this is often used to represent the evolution of some random value, or

    system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic

    system). Instead of describing a process which can only evolve in one way (as in the case, for example,

    of solutions of an ordinary differential equation), in a stochastic or random process there is some

    indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely

    many) directions in which the process may evolve.

    In the simple case of discrete time, as opposed to continuous time, a stochastic process involves

    a sequence of random variables and the time series associated with these random variables (for

    example, see Markov chain, also known as discrete-time Markov chain). Another basic type of a

    stochastic process is a random field, whose domain is a region of space, in other words, a random

    function whose arguments are drawn from a range of continuously changing values. One approach to

    stochastic processes treats them as functions of one or several deterministic arguments (inputs, in

    most cases regarded as time) whose values (outputs) are random variables: non-deterministic (single)

    quantities which have certain probability distributions. Random variables corresponding to various

    times (or points, in the case of random fields) may be completely different. The main requirement is

    that these different random quantities all have the same type. Type refers to the codomain of the

    function. Although the random values of a stochastic process at different times may be independent

    random variables, in most commonly considered situations they exhibit complicated statistical

    correlations.
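    As an illustration (a sketch, not from the notes), the following generates sample paths of one of the simplest discrete-time random processes, a symmetric random walk; even with the same initial condition, different realizations evolve differently:

```python
import random

def random_walk(steps, seed):
    """Symmetric random walk: a discrete-time stochastic process X_0, X_1, ..."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(steps):
        x += rng.choice([-1, +1])
        path.append(x)
    return path

# Same initial condition (X_0 = 0), yet the realizations differ:
print(random_walk(10, seed=1))
print(random_walk(10, seed=2))
```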


    MUTUAL INFORMATION :

    In probability theory and information theory, the mutual information or (formerly) transinformation of

    two random variables is a measure of the variables' mutual dependence. The most common unit of

    measurement of mutual information is the bit.

    Formally, the mutual information of two discrete random variables X and Y can be defined as:

    I(X; Y) = Σ_y Σ_x p(x, y) log[ p(x, y) / (p(x) p(y)) ]

    where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.

    In the case of continuous random variables, the summation is replaced by a definite double integral:

    I(X; Y) = ∫_Y ∫_X p(x, y) log[ p(x, y) / (p(x) p(y)) ] dx dy

    where p(x, y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.

    If the log base 2 is used, the units of mutual information are the bit.

    Intuitively, mutual information measures the information that X and Y share: it measures how

    much knowing one of these variables reduces uncertainty about the other. For example,

    if X and Y are independent, then knowing X does not give any information about Y and vice

    versa, so their mutual information is zero. At the other extreme, if X is a deterministic function

    of Y and Y is a deterministic function of X then all information conveyed by X is shared with Y:

    knowing X determines the value of Y and vice versa. As a result, in this case the mutual

    information is the same as the uncertainty contained in Y (or X) alone, namely

    the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and

    as the entropy of Y. (A very special case of this is when X and Y are the same random

    variable.)

    Mutual information is a measure of the inherent dependence expressed in the joint

    distribution of X and Y relative to the joint distribution of X and Y under the assumption of

    independence. Mutual information therefore measures dependence in the following

    sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see

    in one direction: if X and Y are independent, then p(x, y) = p(x) p(y), and therefore:

    log[ p(x, y) / (p(x) p(y)) ] = log 1 = 0.

    Moreover, mutual information is nonnegative (i.e. I(X; Y) ≥ 0) and symmetric (i.e. I(X; Y) = I(Y; X)).

    Equivalently, mutual information can be expressed in terms of entropies:

    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y).
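    As a numerical illustration of these properties (a minimal sketch; the joint distributions below are assumed for illustration), mutual information can be computed directly from a joint PMF:

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of a joint PMF given as a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables -> I(X;Y) = 0
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Y is a deterministic copy of X -> I(X;Y) = H(X) = 1 bit
copy = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))  # 0.0
print(mutual_information(copy))         # 1.0
```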


    ENTROPY :

    In information theory, entropy is a measure of the uncertainty in a random variable.[1] In this context,

    the term usually refers to the Shannon entropy, which quantifies the expected value of

    the information contained in a message.[2] Entropy is typically measured in bits, nats,

    or bans.[3] Shannon entropy is the average unpredictability in a random variable, which is equivalent

    to its information content. Shannon entropy provides an absolute limit on the best


    possible lossless encoding or compression of any communication, assuming that the communication

    may be represented as a sequence of independent and identically distributed random variables.

    A single toss of a fair coin has an entropy of one bit. A series of two fair coin tosses has an entropy of two bits; in general, a series of n fair coin tosses has an entropy of n bits. This random selection between two

    outcomes in a sequence over time, whether the outcomes are equally probable or not, is often referred

    to as a Bernoulli process. The entropy of such a process is given by the binary entropy function. The

    entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty,

    and hence the entropy rate, is lower. This is because, if asked to predict the next outcome, we could

    choose the most frequent result and be right more often than wrong. The difference between what we

    know, or predict, and the information that the unfair coin toss reveals to us is less than one heads-or-

    tails "message", or bit, per toss.


    SHANNON'S THEOREM :

    Shannon's theorem, proved by Claude Shannon in 1948, describes the maximum possible efficiency of

    error correcting methods versus levels of noise interference and data corruption.

    The theorem doesn't describe how to construct the error-correcting method; it only tells us how good the best possible method can be. Shannon's theorem has wide-ranging applications in both communications and data storage.

    Considering all possible multi-level and multi-phase encoding techniques, the Shannon-Hartley theorem states that the channel capacity C, meaning the theoretical tightest upper bound on the information rate (excluding error-correcting codes) of clean (or arbitrarily low bit-error-rate) data that can be sent with a given average signal power S through an analog communication channel subject to additive white Gaussian noise of power N, is:

    C = B log2(1 + S/N)

    Where

    C is the channel capacity in bits per second;


    B is the bandwidth of the channel in hertz (passband bandwidth in case of a modulated

    signal);

    S is the average received signal power over the bandwidth (in case of a modulated signal,

    often denoted C, i.e. modulated carrier), measured in watts (or volts squared);

    N is the average noise or interference power over the bandwidth, measured in watts (or volts

    squared); and

    S/N is the signal-to-noise ratio (SNR) or the carrier-to-noise ratio (CNR) of the

    communication signal to the Gaussian noise interference expressed as a linear power ratio

    (not as logarithmic decibels).

    Example :

    If the SNR is 20 dB and the bandwidth available is 4 kHz, which is appropriate for telephone communications, then C = 4000 × log2(1 + 100) = 4000 × log2(101) ≈ 26.63 kbit/s. Note that the value S/N = 100 corresponds to an SNR of 20 dB.

    If it is required to transmit at 50 kbit/s and a bandwidth of 1 MHz is used, then the minimum SNR required is given by 50 = 1000 × log2(1 + S/N), so S/N = 2^(C/B) - 1 = 0.035, corresponding to an SNR of -14.5 dB. This shows that it is possible to transmit using signals which are actually much weaker than the background noise level.
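    The two worked examples above can be reproduced with a short sketch (function names are illustrative):

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

def min_snr_db(rate_bps, bandwidth_hz):
    """Minimum SNR (in dB) needed to support rate_bps over bandwidth_hz."""
    snr = 2 ** (rate_bps / bandwidth_hz) - 1
    return 10 * math.log10(snr)

print(capacity(4_000, 10 ** (20 / 10)))   # ~26630 bit/s for 4 kHz at 20 dB SNR
print(min_snr_db(50_000, 1_000_000))      # ~-14.5 dB for 50 kbit/s over 1 MHz
```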

    Shannon's law is any statement defining the theoretical maximum rate at which error-free digits can be transmitted over a bandwidth-limited channel in the presence of noise.

    Shannon's theorem puts a limit on the transmission data rate, not on the error probability:

    It is theoretically possible to transmit information at any rate Rb ≤ C with an arbitrarily small error probability by using a sufficiently complicated coding scheme.

    For an information rate Rb > C, it is not possible to find a code that can achieve an arbitrarily small error probability.


    Noisy channel coding theorem and capacity :

    Claude Shannon's development of information theory during World War II provided the next big

    step in understanding how much information could be reliably communicated through noisy

    channels. Building on Hartley's foundation, Shannon's noisy channel coding theorem (1948)

    describes the maximum possible efficiency of error-correcting methods versus levels of noise

    interference and data corruption.[5][6] The proof of the theorem shows that a randomly constructed

    error-correcting code is essentially as good as the best possible code; the theorem is proved

    through the statistics of such random codes.

    Shannon's theorem shows how to compute a channel capacity from a statistical description of a

    channel, and establishes that given a noisy channel with capacity C and information transmitted

    at a line rate R, then if R < C,

    there exists a coding technique which allows the probability of error at the receiver to be made

    arbitrarily small. This means that theoretically, it is possible to transmit information nearly without

    error up to nearly a limit of C bits per second.

    The converse is also important. If R > C,

    the probability of error at the receiver increases without bound as the rate is increased. So no

    useful information can be transmitted beyond the channel capacity. The theorem does not

    address the rare situation in which rate and capacity are equal.


    The Shannon-Hartley theorem establishes what that channel capacity is for a finite-

    bandwidth continuous-time channel subject to Gaussian noise. It connects Hartley's result with

    Shannon's channel capacity theorem in a form that is equivalent to specifying the M in Hartley's

    line rate formula in terms of a signal-to-noise ratio, but achieving reliability through error-correction

    coding rather than through reliably distinguishable pulse levels.

    If there were such a thing as a noise-free analog channel, one could transmit unlimited amounts

    of error-free data over it per unit of time (Note: An infinite-bandwidth analog channel can't transmit

    unlimited amounts of error-free data, without infinite signal power). Real channels, however, are

    subject to limitations imposed by both finite bandwidth and nonzero noise.

    So how do bandwidth and noise affect the rate at which information can be transmitted over an

    analog channel?

    Surprisingly, bandwidth limitations alone do not impose a cap on maximum information rate. This

    is because it is still possible for the signal to take on an indefinitely large number of different

    voltage levels on each symbol pulse, with each slightly different level being assigned a different

    meaning or bit sequence. If we combine both noise and bandwidth limitations, however, we do

    find there is a limit to the amount of information that can be transferred by a signal of a bounded

    power, even when clever multi-level encoding techniques are used.

    In the channel considered by the Shannon-Hartley theorem, noise and signal are combined by

    addition. That is, the receiver measures a signal that is equal to the sum of the signal encoding

    the desired information and a continuous random variable that represents the noise. This addition

    creates uncertainty as to the original signal's value. If the receiver has some information about

    the random process that generates the noise, one can in principle recover the information in the

    original signal by considering all possible states of the noise process. In the case of the Shannon-Hartley theorem, the noise is assumed to be generated by a Gaussian process with a known

    variance. Since the variance of a Gaussian process is equivalent to its power, it is conventional

    to call this variance the noise power.

    Such a channel is called the Additive White Gaussian Noise channel, because Gaussian noise is

    added to the signal; "white" means equal amounts of noise at all frequencies within the channel

    bandwidth. Such noise can arise both from random sources of energy and also from coding and

    measurement error at the sender and receiver respectively. Since sums of independent Gaussian

    random variables are themselves Gaussian random variables, this conveniently simplifies

    analysis, if one assumes that such error sources are also Gaussian and independent.

    Implications of the theorem

    Comparison of Shannon's capacity to Hartley's law -

    Comparing the channel capacity to the information rate from Hartley's law, we can find the effective number of distinguishable levels M:

    2B log2(M) = B log2(1 + S/N),  so  M = √(1 + S/N).

    The square root effectively converts the power ratio back to a voltage ratio, so the number of

    levels is approximately proportional to the ratio of rms signal amplitude to noise standard

    deviation.

    This similarity in form between Shannon's capacity and Hartley's law should not be interpreted to

    mean that M pulse levels can be literally sent without any confusion; more levels are needed, to

    allow for redundant coding and error correction, but the net data rate that can be approached with

    coding is equivalent to using that M in Hartley's law.

    Alternative forms

    Frequency-dependent (colored noise) case-

    In the simple version above, the signal and noise are fully uncorrelated, in which case S + N is

    the total power of the received signal and noise together. A generalization of the above equation

    for the case where the additive noise is not white (or that the S/N is not constant with frequency

    over the bandwidth) is obtained by treating the channel as many narrow, independent Gaussian

    channels in parallel:

    C = ∫_0^B log2(1 + S(f)/N(f)) df

    where

    C is the channel capacity in bits per second;

    B is the bandwidth of the channel in Hz;

    S(f) is the signal power spectrum

    N(f) is the noise power spectrum

    f is frequency in Hz.

    Note: the theorem only applies to Gaussian stationary process noise. This formula's way of

    introducing frequency-dependent noise cannot describe all continuous-time noise processes. For

    example, consider a noise process consisting of adding a random wave whose amplitude is 1 or

    -1 at any point in time, and a channel that adds such a wave to the source signal. Such a wave's

    frequency components are highly dependent. Though such a noise may have a high power, it is

    fairly easy to transmit a continuous signal with much less power than one would need if the

    underlying noise was a sum of independent noises in each frequency band.
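    A minimal numerical sketch of this parallel-channel formula, using the midpoint rule; the signal and noise spectra below are hypothetical, chosen only for illustration:

```python
import math

def capacity_colored(S, N, bandwidth_hz, steps=10_000):
    """Numerically integrate C = ∫ log2(1 + S(f)/N(f)) df over [0, B]
    using the midpoint rule. S and N are power spectral densities (functions of f)."""
    df = bandwidth_hz / steps
    return sum(math.log2(1 + S(f) / N(f)) * df
               for f in (df * (k + 0.5) for k in range(steps)))

# Hypothetical spectra: flat signal, noise rising with frequency (assumed values)
B = 4_000.0
S = lambda f: 1e-6                      # W/Hz
N = lambda f: 1e-8 * (1 + f / B)        # W/Hz

print(f"capacity ~ {capacity_colored(S, N, B):.0f} bit/s")
```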


    Approximations -

    For large or small and constant signal-to-noise ratios, the capacity formula can be approximated:

    If S/N >> 1, then the capacity is approximately logarithmic in power:

    C ≈ B log2(S/N) ≈ 0.332 · B · SNR (in dB),

    where SNR (in dB) = 10 log10(S/N).

    Similarly, if S/N << 1, the capacity is approximately linear in power:

    C ≈ 1.44 · B · (S/N).


    REDUNDANCY :

    The rate r of a source is its average entropy per symbol. The absolute rate R is the logarithm of the cardinality of the message space, or alphabet. (This formula is sometimes called the Hartley function.) This is the maximum possible rate of information that can be transmitted with that alphabet. (The logarithm should be taken to a base appropriate for the unit of measurement in use.) The absolute rate is equal to the actual rate if the source is memoryless and has a uniform distribution.

    The absolute redundancy can then be defined as

    D = R - r,

    the difference between the absolute rate and the rate.

    The quantity D/R is called the relative redundancy and gives the maximum possible data compression ratio, when expressed as the percentage by which a file size can be decreased. (When expressed as a ratio of original file size to compressed file size, the quantity R : r gives the maximum compression ratio that can be achieved.)

    Complementary to the concept of relative redundancy is efficiency, defined as r/R, so that r/R + D/R = 1. A memoryless source with a uniform distribution has zero redundancy (and thus 100% efficiency), and cannot be compressed.

    A measure of redundancy between two variables is the mutual information or a normalized variant. A measure of redundancy among many variables is given by the total correlation.

    Redundancy of compressed data refers to the difference between the expected compressed data length of n messages, L(M^n), (or expected data rate L(M^n)/n) and the entropy nr (or entropy rate r). (Here we assume the data is ergodic and stationary, e.g., a memoryless source.) Although the rate difference L(M^n)/n - r can be arbitrarily small as n is increased, the actual difference L(M^n) - nr cannot, although it can be theoretically upper-bounded by 1 in the case of finite-entropy memoryless sources.
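    A small sketch tying these definitions together for a memoryless source (the symbol probabilities are assumed for illustration):

```python
import math

def source_stats(pmf):
    """Rate (entropy per symbol), absolute rate, redundancy and efficiency
    of a memoryless source with the given symbol probabilities."""
    r = -sum(p * math.log2(p) for p in pmf if p > 0)   # rate, bits/symbol
    R = math.log2(len(pmf))                            # absolute rate
    return {"rate": r, "absolute_rate": R,
            "redundancy": R - r, "efficiency": r / R}

# Hypothetical 4-symbol source
print(source_stats([0.5, 0.25, 0.125, 0.125]))  # rate 1.75, absolute rate 2.0
print(source_stats([0.25] * 4))                 # uniform: zero redundancy
```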

    HUFFMAN CODING :

    In computer science and information theory, Huffman coding is an entropy encoding algorithm used

    for lossless data compression. The term refers to the use of a variable-length code table for encoding

    a source symbol (such as a character in a file) where the variable-length code table has been derived

    in a particular way based on the estimated probability of occurrence for each possible value of the

    source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and

    published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".


    Huffman coding uses a specific method for choosing the representation for each symbol, resulting in

    a prefix code (sometimes called "prefix-free codes", that is, the bit string representing some particular

    symbol is never a prefix of the bit string representing any other symbol) that expresses the most

    common source symbols using shorter strings of bits than are used for less common source symbols.

    Huffman was able to design the most efficient compression method of this type: no other mapping of

    individual source symbols to unique strings of bits will produce a smaller average output size when

    the actual symbol frequencies agree with those used to create the code. The running time of Huffman's method is fairly efficient: it takes O(n log n) operations to construct a code for n source symbols. A method was later found to design a Huffman code in linear time if the input probabilities (also known as weights) are sorted.

    We will now study an algorithm for constructing efficient source codes for a DMS with source symbols that are not equally probable. A variable length encoding algorithm was suggested by Huffman in 1952, based on the source symbol probabilities P(x_i), i = 1, 2, ..., L. The algorithm is optimal in the sense that the average number of bits it requires to represent the source symbols is a minimum, and it also meets the prefix condition. The steps of the Huffman coding algorithm are given below (see the sketch that follows):
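    The steps appear as figures on the original slides and are not reproduced here; the following is a minimal sketch of the standard Huffman procedure (repeatedly merge the two least probable symbols), with illustrative symbol probabilities:

```python
import heapq

def huffman_code(probabilities):
    """Build a binary Huffman code for {symbol: probability}.
    Repeatedly merge the two least probable nodes until one tree remains."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable node
        p2, _, codes2 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Illustrative DMS with unequal symbol probabilities (assumed values)
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)   # a prefix-free code: "a" gets the shortest codeword, "c" and "d" the longest
print(f"average length = {avg_len:.2f} bits/symbol")  # 1.90, close to the entropy (~1.85)
```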


    RANDOM VARIABLES :

    A random variable, usually written X, is a variable whose possible values are numerical outcomes of a

    random phenomenon. There are two types of random variables, discrete and continuous.

    DISCRETE RANDOM VARIABLES :

    A discrete random variable is one which may take on only a countable number of distinct values such as

    0, 1, 2, 3, 4, .... Discrete random variables are usually (but not necessarily) counts. If a random variable can

    take only a finite number of distinct values, then it must be discrete. Examples of discrete random

    variables include the number of children in a family, the Friday night attendance at a cinema, the number

    of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

    The probability distribution of a discrete random variable is a list of probabilities associated with each of

    its possible values. It is also sometimes called the probability function or the probability mass function.

    (Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

    Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X

    = xi) = pi. The probabilities pi must satisfy the following:

    1: 0 < pi < 1 for each i

    2: p1 + p2 + ... + pk = 1.

    Example :

    Suppose a variable X can take the values 1, 2, 3, or 4.

    The probabilities associated with each outcome are described by the following table:

    Outcome      1     2     3     4
    Probability  0.1   0.3   0.4   0.2

    The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) +

    P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1

    = 0.9, by the complement rule.

    This distribution may also be described by a probability histogram (not reproduced here).
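    A quick check of these computations, using the distribution from the table above:

```python
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

p_2_or_3 = pmf[2] + pmf[3]            # P(X = 2 or X = 3)
p_greater_1 = 1 - pmf[1]              # complement rule: P(X > 1)
print(round(p_2_or_3, 1), round(p_greater_1, 1))   # 0.7 0.9
```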

    CONTINUOUS RANDOM VARIABLES :

    A continuous random variable is one which takes an infinite number of possible values. Continuous

    random variables are usually measurements. Examples include height, weight, the amount of sugar in an

    orange, the time required to run a mile.

    (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

    A continuous random variable is not defined at specific values. Instead, it is defined over an interval of

    values, and is represented by the area under a curve (in advanced mathematics, this is known as an

    integral). The probability of observing any single value is equal to 0, since the number of values which

    may be assumed by the random variable is infinite.

    Suppose a random variable X may take all values over an interval of real numbers. Then the probability

    that X is in the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve,

    which represents a function p(x), must satisfy the following:

    1: The curve has no negative values (p(x) > 0 for all x)

    2: The total area under the curve is equal to 1.

    A curve meeting these requirements is known as a density curve.


    A Gaussian random variable is completely determined by its mean and variance.

    The function that is frequently used for the area under the tail of the Gaussian pdf (probability density function) is denoted by Q(x):

    Q(x) = (1/√(2π)) ∫_x^∞ exp(-t²/2) dt


    The Q-function, which has no closed-form expression, is a standard form for expressing error probabilities.
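    A minimal numerical sketch of Q(x), using the identity Q(x) = 0.5 erfc(x/√2):

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

for x in (0.0, 1.0, 2.0, 3.0):
    print(f"Q({x}) = {Q(x):.4e}")
# Q(0) = 0.5, Q(1) ≈ 0.1587, Q(2) ≈ 0.0228, Q(3) ≈ 1.35e-3
```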

    BOUNDS ON TAIL PROBABILITY :

    We consider general bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation).

    In probability theory, the Chernoff bound, named after Herman Chernoff, gives exponentially decreasing

    bounds on tail distributions of sums of independent random variables. It is a sharper bound than the


    known first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, the Chernoff bound requires that the variates be independent, a condition that neither the Markov nor the Chebyshev inequality requires.
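    To see how much sharper the Chernoff bound is, the following sketch compares the three bounds for the tail of a sum of independent Bernoulli(1/2) variates; the Chernoff-type bound used here is the Hoeffding form exp(-2t²/n), and the parameter values are illustrative:

```python
import math

n = 100                  # number of independent Bernoulli(1/2) variates (assumed)
mu = n / 2               # mean of the sum
var = n / 4              # variance of the sum
a = 75                   # tail threshold: bound P(X >= a)

t = a - mu
markov = mu / a                          # first-moment (Markov) bound
chebyshev = var / t ** 2                 # second-moment (Chebyshev) bound
chernoff = math.exp(-2 * t ** 2 / n)     # Chernoff-Hoeffding bound for sums of 0/1 variates

print(f"Markov    <= {markov:.3f}")      # 0.667
print(f"Chebyshev <= {chebyshev:.3f}")   # 0.040
print(f"Chernoff  <= {chernoff:.2e}")    # 3.73e-06 (exponentially smaller)
```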
