
    EE514A Information Theory I

    Fall 2013

    K. Mohan, Prof. J. Bilmes

    University of Washington, Seattle

    Department of Electrical Engineering

    Fall Quarter, 2013

    http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

    Lecture 0 - Sep 26th, 2013


    1 Probability basics

    2 Basics of Convexity

    3 Basic Probabilistic Inequalities

    4 Concentration Inequalities

    5 Relative Entropy Inequalities


    Sample Space and Events

    Random Experiment: An experiment whose outcome is not known in advance.

    Sample Space (Ω): The set of all possible outcomes of a random experiment. E.g. flip a fair coin: the sample space is {H, T}. E.g. roll a die: the sample space is {1, 2, 3, 4, 5, 6}.

    Events: Any subset of the sample space. E.g. roll a die: E1 = {outcome is 1 or 2}, E2 = {outcome is 4}.


    Probability

    Frequentist viewpoint: To compute the probability of an event A, count the number of occurrences of A in N random experiments. Then

    P(A) = lim_{N→∞} N(A)/N.

    E.g. a coin was tossed 100 times and 51 heads were counted, giving the estimate P(A) ≈ 51/100. By the W.L.L.N. (see later slides) this estimate converges to the actual probability of heads; if it is a fair coin, this should be 0.5.

    If A and B are two disjoint events, N(A ∪ B) = N(A) + N(B). Thus, the frequentist viewpoint implies P(A ∪ B) = P(A) + P(B) if A, B are disjoint events.
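    A quick numeric illustration of the frequentist estimate (a minimal Python sketch; the sample sizes and seed are my own arbitrary choices, not from the slides):

        # Frequentist estimate of P(heads): fraction of heads in N tosses.
        import random

        random.seed(0)
        for N in (100, 10_000, 1_000_000):
            heads = sum(random.random() < 0.5 for _ in range(N))
            print(N, heads / N)   # approaches 0.5 as N grows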


    Axioms of Probability:

    1. P(∅) = 0

    2. P(Ω) = 1

    3. If A1, A2, . . . , Ak is a collection of mutually disjoint events (i.e. Ai ⊆ Ω, Ai ∩ Aj = ∅ for i ≠ j), then P(∪_{i=1}^k Ai) = Σ_{i=1}^k P(Ai).

    E.g.: Two simultaneous fair coin tosses. Sample space: {(H, H), (H, T), (T, H), (T, T)}. P(at least one head) = P((H, H) ∪ (H, T) ∪ (T, H)) = 0.75.

    E.g. Roll of a die. P(outcome ≤ 3) = P(outcome = 1) + P(outcome = 2) + P(outcome = 3) = 0.5.
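    A small sketch that checks these two examples by enumerating the finite sample spaces (plain Python; the variable names are mine):

        # Enumerate outcomes of two fair coin tosses and of one die roll,
        # then compute event probabilities by counting (all outcomes equally likely).
        from itertools import product

        coin_space = list(product("HT", repeat=2))           # 4 outcomes
        p_at_least_one_head = sum("H" in w for w in coin_space) / len(coin_space)
        print(p_at_least_one_head)                            # 0.75

        die_space = range(1, 7)
        p_outcome_le_3 = sum(x <= 3 for x in die_space) / 6
        print(p_outcome_le_3)                                 # 0.5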


    Conditional Probability, Independence, R.V.

    Conditional Probability: Let Ω be a sample space associated with a random experiment. Let A, B be two events with P(B) > 0. Then define the conditional probability of A given B as:

    P(A|B) = P(A ∩ B) / P(B)

    Independence: Two events A, B are independent if P(A ∩ B) = P(A)P(B). Stated another way, A, B are independent if P(A|B) = P(A) and P(B|A) = P(B). Note that A, B being independent (with nonzero probabilities) necessarily means that A, B are not mutually exclusive, i.e. A ∩ B ≠ ∅.


    Independence and mutual exclusiveness

    Example: Consider simultaneous flips of two fair coins. Event A: the first coin was a head. Event B: the second coin was a tail. Note that A and B are indeed independent events but are not mutually exclusive. On the other hand, let Event C: the first coin was a tail. Then A and C are not independent. In fact they are mutually exclusive, i.e. knowing that one has occurred (e.g. heads) implies the other has not (tails).
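    A sketch that checks these definitions numerically for the two-coin example above (Python; the event functions are my own encoding of the slide's events):

        # Check independence / mutual exclusiveness for two fair coin flips.
        # Events: A = first coin is H, B = second coin is T, C = first coin is T.
        from itertools import product

        space = list(product("HT", repeat=2))

        def P(event):
            return sum(1 for w in space if event(w)) / len(space)

        A = lambda w: w[0] == "H"
        B = lambda w: w[1] == "T"
        C = lambda w: w[0] == "T"

        print(P(lambda w: A(w) and B(w)), P(A) * P(B))  # 0.25 == 0.25 -> A, B independent
        print(P(lambda w: A(w) and C(w)), P(A) * P(C))  # 0.0 != 0.25 -> A, C dependent (mutually exclusive)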


    Bayes' rule

    A simple consequence of the definition of conditional probability is the very useful Bayes' rule:

    P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

    Let's say we are given P(A|B1) and we need to compute P(B1|A). Let B1, . . . , Bk be disjoint events the probability of whose union is 1. Then,

    P(B1|A) = P(A|B1)P(B1) / P(A)
            = P(A|B1)P(B1) / Σ_{i=1}^k P(A ∩ Bi)
            = P(A|B1)P(B1) / Σ_{i=1}^k P(A|Bi)P(Bi)

    With additional information on P(Bi) and P(A|Bi), the required conditional probability is easily computed.
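    A hedged numeric sketch of this computation (Python; the prior and likelihood values below are invented purely to illustrate the formula):

        # Bayes' rule with a partition B1, B2, B3:
        # P(B1|A) = P(A|B1)P(B1) / sum_i P(A|Bi)P(Bi).  Numbers are hypothetical.
        priors      = [0.5, 0.3, 0.2]        # P(Bi), must sum to 1
        likelihoods = [0.9, 0.5, 0.1]        # P(A|Bi)

        p_A = sum(l * p for l, p in zip(likelihoods, priors))
        p_B1_given_A = likelihoods[0] * priors[0] / p_A
        print(p_B1_given_A)                  # 0.45 / 0.62 ≈ 0.726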


    Random Variables

    Random Variable: A random variable is a mapping from the sample space to the real line, i.e. X : Ω → R. Let x ∈ R; then P(X(w) = x) = P({w : X(w) = x}). Although a random variable is a function, in practice it is simply written as X. It is very common to directly define probabilities over the range of the random variable.

    Cumulative Distribution Function (CDF): Every random variable X has associated with it a CDF F : R → [0, 1] such that P(X ≤ x) = F(x). If X is a continuous random variable, then its probability density function is given by f(x) = (d/dx) F(x).


    Expectation

    Probability Mass Function (p.m.f.): Every discrete random variable has associated with it a probability mass function, which outputs the probability of the random variable taking a particular value. E.g. if X denotes the R.V. corresponding to the roll of a die, P(X = i) = 1/6 is the p.m.f. associated with X.

    Expectation: The expectation of a discrete random variable X with probability mass function p(X) is given by E[X] = Σ_x p(x) x. Thus expectation is the weighted average of the values that a random variable can take (where the weights are given by the probabilities). E.g. let X be a Bernoulli random variable with P(X = 0) = p. Then E[X] = p · 0 + (1 − p) · 1 = 1 − p.
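    A minimal sketch of the weighted-average view of expectation (Python; the Bernoulli parameter below is an arbitrary choice):

        # E[X] = sum_x p(x) * x, for a fair die and a Bernoulli variable with P(X=0) = p.
        def expectation(pmf):
            """pmf: dict mapping value -> probability."""
            return sum(p * x for x, p in pmf.items())

        die = {i: 1 / 6 for i in range(1, 7)}
        print(expectation(die))          # 3.5

        p = 0.3                          # arbitrary: P(X = 0) = 0.3
        bern = {0: p, 1: 1 - p}
        print(expectation(bern))         # 1 - p = 0.7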


    More on Expectation

    Conditional Expectation: Let X, Y be two random variables. Given that Y = y, we can talk of the expectation of X conditioned on the information that Y = y:

    E[X|Y = y] = Σ_x P(X = x|Y = y) x

    Law of Total Expectation: Let X, Y be two random variables. The Law of Total Expectation then states that:

    E[X] = E[E[X|Y]]

    The above equation can be interpreted as: the expectation of X can be computed by first computing the expectation over the randomness present only in X (but not in Y), and next computing the expectation with respect to the randomness in Y. Thus, the total expectation w.r.t. X can be computed by evaluating a sequence of partial expectations.
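    A small sketch verifying E[X] = E[E[X|Y]] on a toy joint distribution (Python; the joint pmf is invented for illustration):

        # Verify the law of total expectation on a toy joint pmf over (X, Y).
        joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.2}   # (x, y) -> P(X=x, Y=y)

        # Direct E[X].
        E_X = sum(p * x for (x, y), p in joint.items())

        # E[E[X|Y]]: for each y, compute E[X|Y=y], then average over P(Y=y).
        E_E_X_given_Y = 0.0
        for y0 in {y for (_, y) in joint}:
            p_y = sum(p for (x, y), p in joint.items() if y == y0)
            E_X_given_y = sum(p * x for (x, y), p in joint.items() if y == y0) / p_y
            E_E_X_given_Y += p_y * E_X_given_y

        print(E_X, E_E_X_given_Y)   # both 0.5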


    Stochastic Process

    Stochastic Process: Defined as an indexed collection of random variables. A stochastic process is denoted by {Xi}_{i=1}^N.

    E.g. in N flips of a fair coin, each flip is a random variable. Thus the collection of these N random variables forms a stochastic process.

    Independent and Identically Distributed (i.i.d.): A stochastic process is said to be i.i.d. if each of the random variables Xi has the same CDF and in addition the random variables (X1, X2, . . . , XN) are independent. Independence implies that P(X1 = x1, X2 = x2, . . . , XN = xN) = P(X1 = x1)P(X2 = x2) · · · P(XN = xN).


    Binomial Distribution: Consider the i.i.d. stochastic process {Xi}_{i=1}^n, where Xi is a Bernoulli random variable, i.e. Xi ∈ {0, 1} and P(Xi = 1) = p. Let X = Σ_{i=1}^n Xi, and consider the probability P(X = k) for some 0 ≤ k ≤ n. Summing the probability p^k (1 − p)^(n−k) of each of the C(n, k) ways of choosing which k of the n variables equal 1,

    P(X = k) = C(n, k) p^k (1 − p)^(n−k).
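    A quick sketch comparing this formula with a Monte Carlo estimate (Python; n, p, k, and the seed are arbitrary choices):

        # Binomial pmf P(X = k) = C(n, k) p^k (1-p)^(n-k) vs. simulation of n Bernoulli flips.
        import random
        from math import comb

        random.seed(0)
        n, p, k, trials = 10, 0.3, 3, 200_000

        exact = comb(n, k) * p**k * (1 - p)**(n - k)
        hits = sum(sum(random.random() < p for _ in range(n)) == k for _ in range(trials))
        print(exact, hits / trials)   # the two numbers should be close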


    Convex Set

    A set C is a convex set if for any two elements x, y ∈ C and any 0 ≤ λ ≤ 1, λx + (1 − λ)y ∈ C.


    Convex Function

    A function f : X → Y is convex if X is a convex set and for any two x, y ∈ X and any 0 ≤ λ ≤ 1 it holds that f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

    Examples: e^x, −log(x), x log x.

    Second order condition

    A twice-differentiable function f is convex if and only if its Hessian satisfies ∇²f ⪰ 0.

    For functions of one variable, this reduces to checking whether the second derivative is non-negative.
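    A rough numeric illustration of the one-variable second order condition (Python, using finite differences; the test functions and grid are my own choices, and this is only a sanity check, not a proof of convexity):

        # Check convexity of f on an interval by testing a finite-difference second derivative.
        import math

        def looks_convex(f, lo, hi, n=1000, h=1e-4):
            xs = [lo + (hi - lo) * i / n for i in range(1, n)]
            return all((f(x + h) - 2 * f(x) + f(x - h)) / h**2 >= -1e-6 for x in xs)

        print(looks_convex(math.exp, -5, 5))                      # True
        print(looks_convex(lambda x: -math.log(x), 0.01, 10))     # True
        print(looks_convex(lambda x: x * math.log(x), 0.01, 10))  # True
        print(looks_convex(math.sin, 0, math.pi))                 # False (concave there)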


    First order condition

    A differentiable f : X → Y is convex if and only if it is bounded below by its first order approximation at any point:

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩   for all x, y ∈ X


    Operations preserving convexity

    Intersection of convex sets is convex. E.g.: every polyhedron is an intersection of a finite number of half-spaces and is therefore convex.

    Sum of convex functions is convex. E.g. the squared 2-norm of a vector, ‖x‖₂², is the sum of quadratic functions of the form y² and hence is convex.

    If h, g are convex functions and h is increasing, then the composition f = h ∘ g is convex. E.g. f(x) = exp(Ax + b) is convex.

    Max of convex functions is convex. E.g. f(y) = max_{x ∈ C} xᵀy.

    Negative entropy: f(x) = Σ_i xi log(xi) is convex.


    Cauchy-Schwarz Inequality

    Theorem 3.1

    Let X and Y be any two random variables. Then it holds that |E[XY]| ≤ √(E[X²] E[Y²]).

    Proof.

    Let U = X/√(E[X²]), V = Y/√(E[Y²]). Note that E[U²] = E[V²] = 1. We need to show that |E[UV]| ≤ 1. For any two U, V it holds that |UV| ≤ (|U|² + |V|²)/2. Now, |E[UV]| ≤ E[|UV|] ≤ (1/2) E[|U|² + |V|²] = 1.
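    A numeric spot check of |E[XY]| ≤ √(E[X²]E[Y²]) on simulated correlated variables (Python; the distribution and seed are arbitrary):

        # Spot-check Cauchy-Schwarz by simulation.
        import random
        from math import sqrt

        random.seed(1)
        n = 100_000
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [0.7 * x + 0.3 * random.gauss(0, 1) for x in xs]   # correlated with X

        def E(vals):
            return sum(vals) / len(vals)

        lhs = abs(E([x * y for x, y in zip(xs, ys)]))
        rhs = sqrt(E([x * x for x in xs]) * E([y * y for y in ys]))
        print(lhs, "<=", rhs)   # inequality holds (up to sampling noise)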


    Jensen's Inequality

    Theorem 3.2 (Jensen's inequality)

    Let X be any random variable and f : R → R be a convex function. Then it holds that f(E[X]) ≤ E[f(X)].

    Proof.

    The proof uses the convexity of f. First note, from the first order condition for convexity, that for any two x, y: f(x) ≥ f(y) + f′(y)(x − y). Now let x = X and y = E[X]. Then we have that f(X) ≥ f(E[X]) + f′(E[X])(X − E[X]). Taking expectations on both sides of the previous inequality, it follows that E[f(X)] ≥ f(E[X]).
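    A short sketch comparing f(E[X]) with E[f(X)] for a convex f (Python; the distribution of X is an arbitrary example):

        # Jensen's inequality f(E[X]) <= E[f(X)] with the convex function f(x) = exp(x)
        # and a small discrete X (values/probabilities chosen arbitrarily).
        from math import exp

        pmf = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}       # value -> probability
        E_X  = sum(p * x for x, p in pmf.items())
        E_fX = sum(p * exp(x) for x, p in pmf.items())
        print(exp(E_X), "<=", E_fX)                  # e^0.4 ≈ 1.49 <= ≈ 2.79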


    Applications of Jensen's

    General mathematical inequalities: Cauchy-Schwarz, Hölder's inequality, Minkowski's inequality, AM-GM-HM inequality (HW 0).

    Economics - exchange rate (HW 0).

    Statistics - inequalities related to moments: Lyapunov's inequality and sample vs. population standard deviations (HW 0).

    Information theory - relative entropy, lower bounds, etc.


    Application of Jensen's Inequality

    Note that Jensen's inequality even holds for compositions. Let g be any function and f be a convex function. Then f(E[g(X)]) ≤ E[f(g(X))].

    Lemma 3.3 (Lyapunov's inequality)

    Let X be any random variable and let 0 < s < t. Then (E[|X|^s])^{1/s} ≤ (E[|X|^t])^{1/t}.

    Proof.

    Since s < t, the function f(x) = x^{t/s} is convex on R+. Thus, (E[|X|^s])^{t/s} = f(E[|X|^s]) ≤ E[f(|X|^s)] = E[|X|^t]. The result now follows by taking the t-th root on both sides of the inequality.


    Application of Jensen's: AM-GM inequality

    As a consequence of Lyapunov's inequality, the following often-used bound follows: E[|X|] ≤ √(E[|X|²]). The AM-GM inequality also follows from Jensen's. Let f(x) = −log x. Let a1, a2, . . . , an ≥ 0 and pi ≥ 0 with 1ᵀp = 1. Let X be a random variable with distribution p over a1, a2, . . . , an. Then,

    −Σ_i pi log ai = E[f(X)] ≥ f(E[X]) = −log(Σ_i pi ai).

    Or in other words,

    Π_i ai^{pi} ≤ Σ_i pi ai

    Setting pi = 1/n for i = 1, 2, . . . , n, the AM-GM inequality follows:

    (Π_{i=1}^n ai)^{1/n} ≤ (Σ_i ai)/n.
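    A quick numeric check of the weighted AM-GM bound Π ai^{pi} ≤ Σ pi ai (Python; the values and weights are arbitrary):

        # Weighted AM-GM: prod(a_i ** p_i) <= sum(p_i * a_i), with weights summing to 1.
        from math import prod

        a = [1.0, 4.0, 9.0, 0.5]              # arbitrary positive numbers
        p = [0.1, 0.2, 0.3, 0.4]              # arbitrary weights, sum to 1

        gm = prod(ai ** pi for ai, pi in zip(a, p))
        am = sum(pi * ai for ai, pi in zip(a, p))
        print(gm, "<=", am)                   # ≈ 1.93 <= 3.8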


    Application of Jensen's: GM-HM inequality

    The GM-HM inequality also follows from Jensen's. I.e., for any a1, a2, . . . , an > 0 it holds that:

    (Π_{i=1}^n ai)^{1/n} ≥ n / (Σ_{i=1}^n 1/ai)

    Exercise: Show this.


    Markov's Inequality

    Theorem 4.1 (Markov's inequality)

    Let X be any non-negative random variable. Then for any t > 0 it holds that

    P(X > t) ≤ E[X]/t

    Proof.

    E[X] = ∫_0^∞ x f(x) dx = ∫_0^t x f(x) dx + ∫_t^∞ x f(x) dx ≥ ∫_t^∞ x f(x) dx ≥ t P(X > t)
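    A simulation sketch comparing P(X > t) with E[X]/t for a non-negative X (Python; the exponential distribution, t, and seed are arbitrary choices):

        # Markov's inequality: P(X > t) <= E[X]/t for non-negative X, checked by simulation.
        import random

        random.seed(2)
        n = 100_000
        xs = [random.expovariate(1.0) for _ in range(n)]   # non-negative, E[X] = 1

        t = 3.0
        p_tail = sum(x > t for x in xs) / n
        bound = (sum(xs) / n) / t
        print(p_tail, "<=", bound)    # ≈ 0.05 <= ≈ 0.33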


    Chebyshev's Inequality

    Theorem 4.2 (Chebyshev's inequality)

    Let X be any random variable. Then for any s > 0 it holds that

    P(|X − E[X]| > s) ≤ Var[X]/s²

    Proof.

    This is an application of Markov's inequality. Set Z = |X − E[X]|² and apply Markov's inequality to Z with t = s². Note that E[Z] = Var[X].


    Application to Weak Law of Large Numbers

    As a consequence of Chebyshev's inequality we have the Weak Law of Large Numbers.

    WLLN

    Let X1, X2, . . . , Xn be i.i.d. random variables. Then,

    (1/n) Σ_{i=1}^n Xi →_p E[Xi]   as n → ∞

    Proof.

    W.l.o.g. assume Xi has mean 0 and variance 1. Then the variance of X̄ = (1/n) Σ_{i=1}^n Xi equals 1/n. By Chebyshev's inequality, as n → ∞, X̄ concentrates around its mean w.h.p.
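    A simulation sketch of the sample mean concentrating, with the Chebyshev bound alongside (Python; the fair-coin distribution, deviation s, repetitions, and sample sizes are arbitrary choices):

        # WLLN via Chebyshev: P(|mean_n - mu| > s) <= Var[X] / (n * s^2) shrinks as n grows.
        import random

        random.seed(3)
        mu, s, reps = 0.5, 0.05, 2000          # fair-coin mean, deviation, repetitions

        for n in (100, 1000, 10_000):
            bad = 0
            for _ in range(reps):
                mean_n = sum(random.random() < 0.5 for _ in range(n)) / n
                bad += abs(mean_n - mu) > s
            chebyshev = 0.25 / (n * s**2)      # Var = 1/4 for a fair coin
            print(n, bad / reps, "<=", min(1.0, chebyshev))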


    Relative entropy or KL-divergence

    Let X and Y be two discrete random variables with probability distributions p(X), q(Y). Then the relative entropy between X and Y is given by

    D(X‖Y) = E_{p(X)}[log (p(X)/q(X))]
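    A minimal sketch computing this sum for two discrete distributions over the same alphabet (Python; the distributions are arbitrary, and the natural logarithm is used):

        # Relative entropy D(p||q) = sum_x p(x) * log(p(x)/q(x)) for discrete p, q.
        from math import log

        def kl_divergence(p, q):
            """p, q: dicts mapping symbol -> probability (q(x) > 0 wherever p(x) > 0)."""
            return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

        p = {"a": 0.5, "b": 0.3, "c": 0.2}
        q = {"a": 0.4, "b": 0.4, "c": 0.2}
        print(kl_divergence(p, q))    # >= 0, and 0 iff p == q
        print(kl_divergence(p, p))    # 0.0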


    A tighter bound

    Lemma 5.2

    Let X, Y be binary R.V.s with probability of success p, q respectively. Then

    D(p‖q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) ≥ 2(p − q)².

    Proof.

    Note that x(1 − x) ≤ 1/4 for all x. Hence,

    D(p‖q) = ∫_q^p (p/x − (1 − p)/(1 − x)) dx
           = ∫_q^p (p(1 − x) − (1 − p)x) / (x(1 − x)) dx
           ≥ 4 ∫_q^p (p − x) dx
           = 2(p − q)²
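    A numeric spot check of D(p‖q) ≥ 2(p − q)² over a grid of Bernoulli parameters (Python; natural log as in the proof above, grid chosen arbitrarily):

        # Check D(p||q) = p*log(p/q) + (1-p)*log((1-p)/(1-q)) >= 2*(p - q)^2 on a grid.
        from math import log

        def binary_kl(p, q):
            return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

        grid = [i / 20 for i in range(1, 20)]           # 0.05, 0.10, ..., 0.95
        ok = all(binary_kl(p, q) >= 2 * (p - q) ** 2 - 1e-12 for p in grid for q in grid)
        print(ok)   # True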


    Summary

    We reviewed the basics of probability: sample space, random variables, independence, conditional probability, expectation, stochastic processes, etc.

    Convexity is fundamental to some important probabilistic inequalities, e.g. Jensen's inequality and the relative entropy inequality.

    Jensen's inequality has many applications, a simple one being the AM-GM-HM inequality.

    Markov's inequality is fundamental to concentration inequalities. The Weak Law of Large Numbers essentially follows from a variant of Markov's inequality (Chebyshev's inequality).
