
Probability & Information Theory

Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan

Large-Scale ML, Fall 2016


Outline

1. Random Variables & Probability Distributions
2. Multivariate & Derived Random Variables
3. Bayes' Rule & Statistics
4. Application: Principal Components Analysis
5. Technical Details of Random Variables
6. Common Probability Distributions
7. Common Parametrizing Functions
8. Information Theory
9. Application: Decision Trees & Random Forest


Random Variables

A random variable x is a variable that can take on different values randomly.
  E.g., Pr(x = x1) = 0.1, Pr(x = x2) = 0.3, etc.
  Technically, x is a function that maps events to real values.
A random variable must be coupled with a probability distribution P that specifies how likely each value is.
x ~ P(θ) means "x has distribution P parametrized by θ."

Probability Mass and Density Functions

If x is discrete, P(x = x) denotes a probability mass function P_x(x) = Pr(x = x).
  E.g., the outcome of a fair die has the discrete uniform distribution with P(x) = 1/6.
If x is continuous, p(x = x) denotes a probability density function p_x(x).
  Is p_x(x) a probability? No, it is the "rate of increase in probability at x":
    Pr(a ≤ x ≤ b) = ∫_[a,b] p(x) dx
  p_x(x) can be greater than 1.
    E.g., the continuous uniform distribution on [a, b] has p(x) = 1/(b − a) if x ∈ [a, b]; 0 otherwise.

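To make the density-vs-probability distinction concrete, here is a small sketch (not from the slides; it assumes NumPy) checking that a Uniform(0, 0.25) density equals 4 everywhere on its support, yet still integrates to 1:

    import numpy as np

    # Uniform density on [a, b]: p(x) = 1/(b - a) on [a, b], 0 elsewhere
    a, b = 0.0, 0.25
    def p(x):
        return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

    xs = np.linspace(-0.5, 0.5, 100_001)
    print(p(np.array([0.1])))                  # density value 4.0 > 1
    print(np.sum(p(xs)) * (xs[1] - xs[0]))     # Riemann sum ~ 1.0: still a valid density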

Marginal Probability

Consider a probability distribution over a set of variables, e.g., P(x, y).
The probability distribution over a subset of the random variables is called the marginal probability distribution:
  P(x = x) = Σ_y P(x, y)  or  ∫ p(x, y) dy
This is also called the sum rule of probability.

Conditional Probability

Conditional distribution:
  P(x = x | y = y) = P(x = x, y = y) / P(y = y)
  Defined only when P(y = y) > 0.
Product rule of probability:
  P(x^(1), ..., x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), ..., x^(i−1))
  E.g., P(a, b, c) = P(a | b, c) P(b | c) P(c)

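As an illustration (not from the slides; the joint-table numbers below are made up), this sketch builds a small joint distribution P(x, y) with NumPy and recovers a marginal via the sum rule and a conditional via the definition above:

    import numpy as np

    # Example joint distribution P(x, y) over x in {0,1}, y in {0,1,2} (made-up numbers)
    P_xy = np.array([[0.10, 0.20, 0.10],
                     [0.25, 0.15, 0.20]])
    assert np.isclose(P_xy.sum(), 1.0)

    P_x = P_xy.sum(axis=1)               # sum rule: P(x) = sum_y P(x, y)
    P_y = P_xy.sum(axis=0)
    P_x_given_y0 = P_xy[:, 0] / P_y[0]   # P(x | y=0) = P(x, y=0) / P(y=0)
    print(P_x, P_x_given_y0)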

Independence and Conditional Independence

We say a random variable x is independent of y iff
  P(x | y) = P(x),
which implies P(x, y) = P(x) P(y). Denoted by x ⊥ y.
We say x is conditionally independent of y given z iff
  P(x | y, z) = P(x | z),
which implies P(x, y | z) = P(x | z) P(y | z). Denoted by x ⊥ y | z.


Expectation

The expectation (or expected value or mean) of some function f with respect to x is the "average" value that f takes on:¹
  E_{x~P}[f(x)] = Σ_x P_x(x) f(x)  or  ∫ p_x(x) f(x) dx = μ_{f(x)}
Expectation is linear: E[a f(x) + b] = a E[f(x)] + b for deterministic a and b.
E[E[f(x)]] = E[f(x)], as E[f(x)] is deterministic.

¹ The bracket [·] here is used to distinguish it from the parentheses inside and has nothing to do with functionals.

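A quick numerical sketch of the definition and of linearity (not from the slides; it assumes NumPy and uses a fair six-sided die with f(x) = x²):

    import numpy as np

    # Fair die: P(x) = 1/6 for x in {1,...,6}; take f(x) = x**2
    xs = np.arange(1, 7)
    P = np.full(6, 1/6)
    E_f = np.sum(P * xs**2)                    # E[f(x)] = sum_x P(x) f(x)

    a, b = 3.0, -2.0
    E_af_b = np.sum(P * (a * xs**2 + b))       # E[a f(x) + b]
    print(E_f, E_af_b, a * E_f + b)            # linearity: the last two agree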

Expectation over Multiple Variables

Defined over the joint probability distribution, e.g.,
  E[f(x, y)] = Σ_{x,y} P_{x,y}(x, y) f(x, y)  or  ∫∫ p_{x,y}(x, y) f(x, y) dx dy
E[f(x) | y = y] = ∫ p_{x|y}(x | y) f(x) dx is called the conditional expectation.
E[f(x) g(y)] = E[f(x)] E[g(y)] if x and y are independent. [Proof]


Variance

The variance measures how much the values of f deviate from its expected value as x varies:
  Var[f(x)] = E[(f(x) − E[f(x)])²] = σ²_{f(x)}
σ_{f(x)} is called the standard deviation.
Var[a f(x) + b] = a² Var[f(x)] for deterministic a and b. [Proof]

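A small simulation sketch (not from the slides; it assumes NumPy) of the scaling rule Var[a f(x) + b] = a² Var[f(x)]:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)          # samples of x
    f = np.exp(x)                         # any function f(x)
    a, b = 3.0, 5.0

    print(np.var(a * f + b))              # ~ a**2 * Var[f(x)]; the shift b has no effect
    print(a**2 * np.var(f))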

Covariance I

Covariance gives some sense of how much two values are linearly related to each other:
  Cov[f(x), g(y)] = E[(f(x) − E[f(x)])(g(y) − E[g(y)])]
  If the sign is positive, both variables tend to take on high values simultaneously.
  If the sign is negative, one variable tends to take on a high value while the other takes on a low one.
If x and y are independent, then Cov(x, y) = 0. [Proof]
  The converse is not true, as x and y may be related in a nonlinear way.
  E.g., y = sin(x) and x ~ Uniform(−π, π).

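The sin example can be checked numerically. The sketch below (not from the slides; it assumes NumPy) shows that Cov(x, y) ≈ 0 for y = sin(x) with x ~ Uniform(−π, π), even though y is a deterministic function of x:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-np.pi, np.pi, size=1_000_000)
    y = np.sin(x)                          # fully dependent on x, but not linearly

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    print(cov_xy)                          # close to 0: zero covariance does not imply independence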

Covariance II

Var(ax + by) = a² Var(x) + b² Var(y) + 2ab Cov(x, y) [Proof]
  Var(x + y) = Var(x) + Var(y) if x and y are independent.
Cov(ax + b, cy + d) = ac Cov(x, y) [Proof]
Cov(ax + by, cw + dv) = ac Cov(x, w) + ad Cov(x, v) + bc Cov(y, w) + bd Cov(y, v) [Proof]



Multivariate Random Variables I

A multivariate random variable is denoted by x = [x_1, ..., x_d]^T.
  Normally, the x_i's (attributes, variables, or features) are dependent on each other.
  P(x) is a joint distribution of x_1, ..., x_d.
The mean of x is defined as μ_x = E(x) = [μ_{x_1}, ..., μ_{x_d}]^T.
The covariance matrix of x is defined as:

  Σ_x = [ σ²_{x_1}      σ_{x_1,x_2}   ...   σ_{x_1,x_d} ]
        [ σ_{x_2,x_1}   σ²_{x_2}      ...   σ_{x_2,x_d} ]
        [ ...           ...           ...   ...         ]
        [ σ_{x_d,x_1}   σ_{x_d,x_2}   ...   σ²_{x_d}    ]

where σ_{x_i,x_j} = Cov(x_i, x_j) = E[(x_i − μ_{x_i})(x_j − μ_{x_j})] = E(x_i x_j) − μ_{x_i} μ_{x_j}, and

  Σ_x = Cov(x) = E[(x − μ_x)(x − μ_x)^T] = E(xx^T) − μ_x μ_x^T

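As an illustration (not from the slides; it assumes NumPy), the covariance matrix of samples of a multivariate x can be estimated directly from the identity Σ_x = E(xx^T) − μ_x μ_x^T:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[2.0, 0.8], [0.8, 1.0]],
                                size=200_000)          # n x d samples of x

    mu = X.mean(axis=0)                                # estimate of mu_x
    Sigma = (X.T @ X) / len(X) - np.outer(mu, mu)      # E(x x^T) - mu mu^T
    print(Sigma)                                       # ~ [[2.0, 0.8], [0.8, 1.0]]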

Multivariate Random Variables II

Σ_x is always symmetric.
Σ_x is always positive semidefinite. [Homework]
Σ_x is nonsingular iff it is positive definite.
Σ_x being singular implies that x has either:
  deterministic, independent, or non-linearly dependent attributes causing zero rows, or
  redundant attributes causing linear dependency between rows.


Derived Random Variables

Let y = f(x; w) = w^T x be a random variable transformed from x.
  μ_y = E(w^T x) = w^T E(x) = w^T μ_x
  σ²_y = w^T Σ_x w [Homework]
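A numerical sketch of these two identities for y = w^T x (not from the slides; it assumes NumPy, and the vectors below are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)
    mu_x = np.array([1.0, -2.0, 0.5])
    Sigma_x = np.array([[2.0, 0.3, 0.0],
                        [0.3, 1.0, 0.4],
                        [0.0, 0.4, 1.5]])
    w = np.array([0.5, -1.0, 2.0])

    X = rng.multivariate_normal(mu_x, Sigma_x, size=500_000)
    y = X @ w                                   # samples of y = w^T x
    print(y.mean(),  w @ mu_x)                  # mu_y    = w^T mu_x
    print(y.var(),   w @ Sigma_x @ w)           # sigma^2 = w^T Sigma_x w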


What Does Pr(x = x) Mean?

1. Bayesian probability: it is a degree of belief, or a qualitative level of certainty.
2. Frequentist probability: if we can draw samples of x, then the proportion of samples having the value x equals Pr(x = x).


Bayes' Rule

  P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_{y'} P(x | y = y') P(y = y')

Bayes' rule is so important in statistics (and in ML as well) that each term has a name:

  posterior of y = (likelihood of y) × (prior of y) / evidence

Why is it so important?
  E.g., a doctor diagnoses whether you have a disease by letting x be the "symptom" and y be the "disease".
  P(x | y) and P(y) may be estimated from sample frequencies more easily.

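A hypothetical numerical illustration of the diagnosis example (the prevalence and test probabilities below are made up for this sketch, not taken from the slides):

    # Hypothetical numbers: P(disease), P(symptom | disease), P(symptom | no disease)
    p_d = 0.01          # prior of y = disease
    p_s_given_d = 0.9   # likelihood
    p_s_given_nd = 0.05

    evidence = p_s_given_d * p_d + p_s_given_nd * (1 - p_d)   # P(symptom)
    posterior = p_s_given_d * p_d / evidence                  # P(disease | symptom)
    print(posterior)    # ~0.154: far lower than the likelihood 0.9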

Point Estimation

Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data.
Let {x^(1), ..., x^(n)} be a set of n independent and identically distributed (i.i.d.) samples of a random variable x. A point estimator or statistic is a function of the data:
  θ̂_n = g(x^(1), ..., x^(n))
θ̂_n is called the estimate of θ.


Sample Mean and Covariance

Given the i.i.d. samples X = [x^(1), ..., x^(n)]^T ∈ R^{n×d}, what are the estimates of the mean and covariance of x?
The sample mean:
  μ̂_x = (1/n) Σ_{i=1}^{n} x^(i)
The sample covariance matrix:
  Σ̂_x = (1/n) Σ_{i=1}^{n} (x^(i) − μ̂_x)(x^(i) − μ̂_x)^T
  Its (i, j)-th entry is σ̂_{x_i,x_j} = (1/n) Σ_{s=1}^{n} (x^(s)_i − μ̂_{x_i})(x^(s)_j − μ̂_{x_j}).
If each x^(i) is centered (by subtracting μ̂_x first), then Σ̂_x = (1/n) X^T X.

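A sketch of the centered-data shortcut Σ̂_x = (1/n) X^T X (not from the slides; it assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))            # n x d matrix of i.i.d. samples

    mu_hat = X.mean(axis=0)                   # sample mean
    Xc = X - mu_hat                           # center each sample
    Sigma_hat = (Xc.T @ Xc) / len(X)          # (1/n) X^T X on centered data

    # same result as averaging the outer products explicitly
    Sigma_check = np.mean([np.outer(x - mu_hat, x - mu_hat) for x in X], axis=0)
    print(np.allclose(Sigma_hat, Sigma_check))  # True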


Principal Components Analysis (PCA) I

Given a collection of data points X = {x^(i)}_{i=1}^{N}, where x^(i) ∈ R^D.
Suppose we want to lossily compress X, i.e., to find a function f such that f(x^(i)) = z^(i) ∈ R^K, where K < D.
How do we keep the maximum information in X?

Principal Components Analysis (PCA) II

Let the x^(i)'s be i.i.d. samples of a random variable x.
Let f be linear, i.e., f(x) = W^T x for some W ∈ R^{D×K}.
Principal Components Analysis (PCA) finds K orthonormal vectors W = [w^(1), ..., w^(K)] such that the transformed variable z = W^T x has the most "spread out" attributes, i.e., each attribute z_j = (w^(j))^T x has the maximum variance Var(z_j).
  w^(1), ..., w^(K) are called the principal components.
Why do w^(1), ..., w^(K) need to be orthogonal to each other?
  Each w^(j) keeps information that cannot be explained by the others, so together they preserve the most information.
Why is ||w^(j)|| = 1 for all j?
  Only the direction matters; we don't want to maximize Var(z_j) simply by finding a long w^(j).


Solving W I

For simplicity, let's consider K = 1 first.
How do we evaluate Var(z_1)?
  Recall that z_1 = (w^(1))^T x implies σ²_{z_1} = (w^(1))^T Σ_x w^(1). [Homework]
How do we get Σ_x?
  Use an estimate: Σ̂_x = (1/N) X^T X (assuming the x^(i)'s are centered first).
Optimization problem to solve:
  arg max_{w^(1) ∈ R^D} (w^(1))^T X^T X w^(1), subject to ||w^(1)|| = 1
X^T X is symmetric and can therefore be eigendecomposed.
By the Rayleigh quotient, the optimal w^(1) is given by the eigenvector of X^T X corresponding to the largest eigenvalue.


Solving W II

Optimization problem for w^(2):
  arg max_{w^(2) ∈ R^D} (w^(2))^T X^T X w^(2), subject to ||w^(2)|| = 1 and (w^(2))^T w^(1) = 0
By the Rayleigh quotient again, w^(2) is the eigenvector corresponding to the 2nd largest eigenvalue.
In the general case where K > 1, w^(1), ..., w^(K) are the eigenvectors of X^T X corresponding to the K largest eigenvalues.
  Proof by induction. [Proof]

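Putting the derivation together, here is a minimal PCA sketch (not from the slides; it assumes NumPy, the helper name pca is illustrative, and np.linalg.eigh is used because X^T X is symmetric). It centers the data, eigendecomposes X^T X, and keeps the K leading eigenvectors as W:

    import numpy as np

    def pca(X, K):
        """Return (W, Z): the top-K principal components and the projected data."""
        Xc = X - X.mean(axis=0)                        # center the samples first
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]              # eigenvalues in decreasing order
        W = eigvecs[:, order[:K]]                      # D x K matrix of principal components
        return W, Xc @ W                               # Z = X W, the compressed representation

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    W, Z = pca(X, K=1)
    print(W, Z.var(axis=0))                            # direction of greatest variance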

Visualization

Figure: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. With these new axes, the estimated covariance matrix Σ̂_z = W^T Σ̂_x W ∈ R^{K×K} is always diagonal.


Sure and Almost Sure Events

Given a continuous random variable x, we have Pr(x = x) = 0 for any value x.
Will the event x = x ever occur? Yes!
An event A happens surely if it always occurs.
An event A happens almost surely if Pr(A) = 1 (e.g., Pr(x ≠ x) = 1).


Equality of Random Variables I

Definition (Equality in Distribution)
Two random variables x and y are equal in distribution iff Pr(x ≤ a) = Pr(y ≤ a) for all a.

Definition (Almost Sure Equality)
Two random variables x and y are equal almost surely iff Pr(x = y) = 1.

Definition (Equality)
Two random variables x and y are equal iff they map the same events to the same values.

Equality of Random Variables II

What's the difference between "equality in distribution" and "almost sure equality"?
Almost sure equality implies equality in distribution, but the converse is not true.
  E.g., let x and y be independent binary random variables with P_x(0) = P_x(1) = P_y(0) = P_y(1) = 0.5.
  They are equal in distribution.
  But Pr(x = y) = 0.5 ≠ 1.

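A simulation sketch of this example (not from the slides; it assumes NumPy and takes x and y independent, as the computation Pr(x = y) = 0.5 requires):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=1_000_000)   # fair binary variable
    y = rng.integers(0, 2, size=1_000_000)   # independent copy with the same distribution

    print(x.mean(), y.mean())                # both ~0.5: equal in distribution
    print((x == y).mean())                   # ~0.5, not 1: not almost surely equal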

Convergence of Random Variables I

Definition (Convergence in Distribution)
A sequence of random variables {x^(1), x^(2), ...} converges in distribution to x iff lim_{n→∞} P(x^(n) = x) = P(x = x).

Definition (Convergence in Probability)
A sequence of random variables {x^(1), x^(2), ...} converges in probability to x iff for any ε > 0, lim_{n→∞} Pr(|x^(n) − x| < ε) = 1.

Definition (Almost Sure Convergence)
A sequence of random variables {x^(1), x^(2), ...} converges almost surely to x iff Pr(lim_{n→∞} x^(n) = x) = 1.

Convergence of Random Variables II

What's the difference between convergence "in probability" and "almost surely"?
Almost sure convergence implies convergence in probability, but the converse is not true.
  lim_{n→∞} Pr(|x^(n) − x| < ε) = 1 leaves open the possibility that |x^(n) − x| > ε happens an infinite number of times.
  Pr(lim_{n→∞} x^(n) = x) = 1 guarantees that, almost surely, |x^(n) − x| > ε occurs only a finite number of times.


Distribution of Derived Variables I

Suppose y = f(x) and f⁻¹ exists. Does P(y = y) = P(x = f⁻¹(y)) always hold?
No, not when x and y are continuous.
  Suppose x ~ Uniform(0, 1) is continuous, with p(x) = c for x ∈ (0, 1).
  Let y = x/2 ~ Uniform(0, 1/2).
  If p_y(y) = p_x(2y), then
    ∫_{y=0}^{1/2} p_y(y) dy = ∫_{y=0}^{1/2} c dy = 1/2 ≠ 1,
  which violates the axioms of probability.


Distribution of Derived Variables II

Recall that Pr(y = y) = p_y(y) dy and Pr(x = x) = p_x(x) dx.
Since f may distort space, we need to ensure that
  |p_y(f(x)) dy| = |p_x(x) dx|.
We have
  p_y(y) = p_x(f⁻¹(y)) |∂f⁻¹(y)/∂y|   (or equivalently p_x(x) = p_y(f(x)) |∂f(x)/∂x|).
In the previous example: p_y(y) = 2 · p_x(2y).
In the multivariate case, we have
  p_y(y) = p_x(f⁻¹(y)) |det(J(f⁻¹)(y))|,
where J(f⁻¹)(y) is the Jacobian matrix of f⁻¹ at input y:
  J(f⁻¹)(y)_{i,j} = ∂f⁻¹_i(y)/∂y_j.

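A quick sketch (not from the slides; it assumes NumPy) of the univariate formula on the running example y = x/2, where p_y(y) = 2 · p_x(2y):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=1_000_000)
    y = x / 2.0                                  # derived variable y = f(x) = x/2

    # Empirical density of y on (0, 1/2) via a normalized histogram
    hist, edges = np.histogram(y, bins=50, range=(0.0, 0.5), density=True)
    print(hist.mean())                           # ~2.0 = |d f^{-1}(y)/dy| * p_x(2y)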


Page 85: Probability & Information Theory - GitHub Pages · Probability & Information Theory Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University,

Random Experiments

The value of a random variable x can be thought of as the outcome of a random experiment, which helps us define P(x)


Bernoulli Distribution (Discrete)

Let x ∈ {0, 1} be the outcome of tossing a coin; then
$$\mathrm{Bernoulli}(x = x; \rho) = \begin{cases} \rho, & \text{if } x = 1 \\ 1 - \rho, & \text{otherwise} \end{cases} \;=\; \rho^{x}(1-\rho)^{1-x}$$
Properties: [Proof]
  E(x) = ρ
  Var(x) = ρ(1 − ρ)


Categorical Distribution (Discrete)

Let x ∈ {1, …, k} be the outcome of rolling a k-sided dice; then
$$\mathrm{Categorical}(x = x; \boldsymbol{\rho}) = \prod_{i=1}^{k} \rho_i^{\,\mathbf{1}(x;\, x = i)}, \quad \text{where } \mathbf{1}^{\top}\boldsymbol{\rho} = 1$$
An extension of the Bernoulli distribution to k states


Multinomial Distribution (Discrete)

Let x ∈ ℝ^k be a random vector where x_i counts the number of times outcome i appears after rolling a k-sided dice n times:
$$\mathrm{Multinomial}(\mathbf{x} = \mathbf{x}; n, \boldsymbol{\rho}) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} \rho_i^{x_i}, \quad \text{where } \mathbf{1}^{\top}\boldsymbol{\rho} = 1 \text{ and } \mathbf{1}^{\top}\mathbf{x} = n$$
Properties: [Proof]
  E(x) = nρ
  Var(x) = n(diag(ρ) − ρρ^⊤)
  (i.e., Var(x_i) = nρ_i(1 − ρ_i) and Cov(x_i, x_j) = −nρ_i ρ_j for i ≠ j)
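A quick NumPy check (added, not from the slides) of the stated mean and covariance; n, ρ, and the sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 10, np.array([0.2, 0.3, 0.5])

# Draw many multinomial vectors and compare empirical moments with the formulas
X = rng.multinomial(n, rho, size=200_000)
print(X.mean(axis=0))                          # ~ n * rho = [2, 3, 5]
print(np.cov(X, rowvar=False))                 # ~ n * (diag(rho) - rho rho^T)
print(n * (np.diag(rho) - np.outer(rho, rho)))
```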


Normal/Gaussian Distribution (Continuous)

Theorem (Central Limit Theorem)
The sum x of many independent random variables is approximately normally/Gaussian distributed:
$$\mathcal{N}(x = x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right).$$

Holds regardless of the original distributions of the individual variables
μ_x = μ and σ²_x = σ²
To avoid inverting σ², we can parametrize the distribution using the precision β:
$$\mathcal{N}(x = x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\!\left(-\frac{\beta}{2}(x - \mu)^2\right)$$
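A small NumPy illustration (added, not from the slides) of the central limit theorem: sums of i.i.d. Uniform(0,1) draws already look Gaussian; the counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row sums 100 independent Uniform(0,1) variables; the sums are ~ Gaussian
s = rng.uniform(0.0, 1.0, size=(100_000, 100)).sum(axis=1)
print(s.mean(), s.var())          # ~ 100*0.5 = 50 and ~ 100/12 ≈ 8.33

# For a Gaussian, about 95% of the mass lies within mean ± 2 standard deviations
z = (s - s.mean()) / s.std()
print(np.mean(np.abs(z) < 2))     # ~ 0.95
```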


Confidence Intervals

Figure: Graph of $\mathcal{N}(\mu, \sigma^2)$.

The interval [μ − 2σ, μ + 2σ] covers about 95% of the probability mass, so we say it has about 95% confidence


Why Is Gaussian Distribution So Common?

1 It can model complicated systems
  E.g., Gaussian white noise
2 Out of all possible probability distributions (over real numbers) with the same variance, it encodes the maximum amount of uncertainty
  So we insert the least amount of prior knowledge into a model
3 It is numerically friendly
  E.g., continuous, differentiable, etc.


Properties

If $x \sim \mathcal{N}(\mu, \sigma^2)$, then $ax + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$ for any deterministic a, b [Proof]
  $z = \frac{x - \mu}{\sigma} \sim \mathcal{N}(0, 1)$ is the z-normalization or standardization of x
If $x^{(1)} \sim \mathcal{N}(\mu^{(1)}, \sigma^{2(1)})$ is independent of $x^{(2)} \sim \mathcal{N}(\mu^{(2)}, \sigma^{2(2)})$, then $x^{(1)} + x^{(2)} \sim \mathcal{N}(\mu^{(1)} + \mu^{(2)}, \sigma^{2(1)} + \sigma^{2(2)})$
  [Homework: $p_{x^{(1)}+x^{(2)}}(x) = \int p_{x^{(1)}}(x - y)\, p_{x^{(2)}}(y)\, dy$, the convolution]
  Not true if $x^{(1)}$ and $x^{(2)}$ are dependent
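A sampling sanity check (added, not from the slides) of the affine-transformation and independent-sum properties; the parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=500_000)      # x ~ N(1, 4)

a, b = 3.0, -1.0
y = a * x + b                                # should be ~ N(3*1 - 1, 9*4) = N(2, 36)
print(y.mean(), y.var())

x2 = rng.normal(-2.0, 1.5, size=500_000)     # drawn independently of x
s = x + x2                                   # should be ~ N(1 - 2, 4 + 2.25)
print(s.mean(), s.var())
```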


Multivariate Gaussian Distribution

When x is the sum of many random vectors:
$$\mathcal{N}(\mathbf{x} = \mathbf{x}; \boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^d \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$
μ_x = μ and Σ_x = Σ (Σ must be nonsingular)
If x ∼ N(μ, Σ), then each attribute x_i is univariate normal
  The converse is not true
  However, if x_1, …, x_d are independent and x_i ∼ N(μ_i, σ_i²), then x ∼ N(μ, Σ), where μ = [μ_1, …, μ_d]^⊤ and Σ = diag(σ_1², …, σ_d²)
What does the graph of N(μ, Σ) look like?


Bivariate Example I

Consider the Mahalanobis distance first:
$$\mathcal{N}(\boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^d \det(\Sigma)}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$$
The level sets closer to the center μ_x correspond to smaller Mahalanobis distances
Increasing Cov[x₁, x₂] stretches the level sets along the 45° axis
Decreasing Cov[x₁, x₂] stretches the level sets along the −45° axis

Figure: Level sets over (x₁, x₂) for Cov(x₁, x₂) = 0 with Var(x₁) = Var(x₂), Cov(x₁, x₂) = 0 with Var(x₁) > Var(x₂), Cov(x₁, x₂) > 0, and Cov(x₁, x₂) < 0.


Bivariate Example II

The height of $\mathcal{N}(\boldsymbol{\mu}, \Sigma) = \sqrt{\tfrac{1}{(2\pi)^d \det(\Sigma)}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$ in its graph is inversely proportional to the Mahalanobis distance

Figure: Graph of a bivariate Gaussian density over (x₁, x₂).

A multivariate Gaussian distribution is isotropic iff Σ = σI


Properties

If x ∼ N(μ, Σ), then $\mathbf{w}^{\top}\mathbf{x} \sim \mathcal{N}(\mathbf{w}^{\top}\boldsymbol{\mu}, \mathbf{w}^{\top}\Sigma\mathbf{w})$ for any deterministic w ∈ ℝ^d
More generally, given W ∈ ℝ^{d×k}, k ≤ d, we have $W^{\top}\mathbf{x} \sim \mathcal{N}(W^{\top}\boldsymbol{\mu}, W^{\top}\Sigma W)$, which is k-variate normal
The projection of x onto a k-dimensional space is still normal
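A NumPy sketch (added, not from the slides) that samples a bivariate Gaussian and checks the projection property for one arbitrary w:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=300_000)   # samples of x ~ N(mu, Sigma)
w = np.array([0.5, 2.0])
y = X @ w                                               # projected variable w^T x

print(y.mean(), y.var())          # empirical moments of the projection
print(w @ mu, w @ Sigma @ w)      # the predicted N(w^T mu, w^T Sigma w) parameters
```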


Exponential Distribution (Continuous)

In deep learning, we often want a probability distribution with a sharp point at x = 0
To accomplish this, we can use the exponential distribution:
$$\mathrm{Exponential}(x = x; \lambda) = \lambda\, \mathbf{1}(x;\, x \ge 0) \exp(-\lambda x)$$


Laplace Distribution (Continuous)

The Laplace distribution can be thought of as a "two-sided" exponential distribution centered at μ:
$$\mathrm{Laplace}(x = x; \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right)$$


Dirac Distribution (Continuous)

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single data point μ
This can be accomplished using the Dirac distribution:
$$\mathrm{Dirac}(x = x; \mu) = \delta(x - \mu),$$
where δ(·) is the Dirac delta function, which
  1 is zero-valued everywhere except at input 0
  2 integrates to 1


Empirical Distribution (Continuous)

Given a dataset $\mathbb{X} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$ where the x^(i)'s are i.i.d. samples of x
What is the distribution P(θ) that maximizes the likelihood of 𝕏?
If x is discrete, the distribution simply reflects the empirical frequency of values:
$$\mathrm{Empirical}(x = x; \mathbb{X}) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(x;\, x = x^{(i)})$$
If x is continuous, we have the empirical distribution:
$$\mathrm{Empirical}(\mathbf{x} = \mathbf{x}; \mathbb{X}) = \frac{1}{N}\sum_{i=1}^{N} \delta(\mathbf{x} - \mathbf{x}^{(i)})$$


Mixtures of Distributions

We may define a probability distribution by combining other, simpler probability distributions {P^(i)(θ^(i))}_i
E.g., the mixture model:
$$\mathrm{Mixture}(x = x; \boldsymbol{\rho}, \{\theta^{(i)}\}_i) = \sum_i P^{(i)}(x = x \mid c = i; \theta^{(i)})\, \mathrm{Categorical}(c = i; \boldsymbol{\rho})$$
The empirical distribution is a mixture distribution (where ρ_i = 1/N)
The component identity variable c is a latent variable, whose values are not observed


Gaussian Mixture Model

A mixture model is called a Gaussian mixture model iff $P^{(i)}(x = x \mid c = i; \theta^{(i)}) = \mathcal{N}^{(i)}(x = x \mid c = i; \boldsymbol{\mu}^{(i)}, \Sigma^{(i)})$, ∀i
Variants: Σ^(i) = Σ, or Σ^(i) = diag(σ), or Σ^(i) = σI
Any smooth density can be approximated by a Gaussian mixture model with enough components
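A generative-process sketch in NumPy (added, not from the slides): draw the latent component c from a Categorical, then draw x from the chosen Gaussian; the weights and component parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

rho    = np.array([0.3, 0.7])      # mixing weights (Categorical parameters)
mus    = np.array([-2.0, 3.0])     # component means
sigmas = np.array([0.5, 1.0])      # component standard deviations

c = rng.choice(2, size=100_000, p=rho)   # latent component identities (not observed in practice)
x = rng.normal(mus[c], sigmas[c])        # x | c = i  ~  N(mu_i, sigma_i^2)

print(x.mean(), rho @ mus)               # the mixture mean is sum_i rho_i * mu_i
```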



Parametrizing Functions

A probability distribution P(θ) is parametrized by θ
In ML, θ may be the output value of a deterministic function, called a parametrizing function


Logistic Function

The logistic function (a special case of a sigmoid function) is defined as:
$$\sigma(x) = \frac{\exp(x)}{\exp(x) + 1} = \frac{1}{1 + \exp(-x)}$$
It always takes on values in (0, 1)
Commonly used to produce the ρ parameter of a Bernoulli distribution


Softplus Function

The softplus function:
$$\zeta(x) = \log(1 + \exp(x))$$
A "softened" version of x⁺ = max(0, x)
Range: (0, ∞)
Useful for producing the β or σ parameter of a Gaussian distribution


Properties [Homework]

$1 - \sigma(x) = \sigma(-x)$
$\log \sigma(x) = -\zeta(-x)$
$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$
$\frac{d}{dx}\zeta(x) = \sigma(x)$
$\forall x \in (0, 1),\ \sigma^{-1}(x) = \log\!\left(\frac{x}{1 - x}\right)$
$\forall x > 0,\ \zeta^{-1}(x) = \log(\exp(x) - 1)$
$\zeta(x) = \int_{-\infty}^{x} \sigma(y)\, dy$
$\zeta(x) - \zeta(-x) = x$
$\zeta(-x)$ is the softened $x^{-} = \max(0, -x)$, so $x = x^{+} - x^{-}$
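A few of these identities can be spot-checked numerically; the sketch below (added, not from the slides) uses straightforward NumPy definitions of σ and ζ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))        # log(1 + exp(x))

x = np.linspace(-5.0, 5.0, 101)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))          # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(x)), -softplus(-x)))    # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(x) - softplus(-x), x))        # zeta(x) - zeta(-x) = x
```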


What’s Information Theory

Probability theory allows us to make uncertain statements and reason in the presence of uncertainty

Information theory allows us to quantify the amount of uncertainty


Self-Information

Given a random variable x, how much information do we receive when seeing an event x = x?
  1 Likely events should have low information
    E.g., we are less surprised by the outcome of tossing a biased coin
  2 Independent events should have additive information
    E.g., "two heads" should have twice as much information as "one head"
The self-information:
$$I(x = x) = -\log P(x = x)$$
Called a bit if the base-2 logarithm is used; called a nat if base e is used


Entropy

Self-information deals with a particular outcome
We can quantify the amount of uncertainty in an entire probability distribution using the entropy:
$$H(x \sim P) = \mathbb{E}_{x \sim P}[I(x)] = -\sum_x P(x)\log P(x) \quad \text{or} \quad -\int p(x)\log p(x)\, dx$$
Let $0 \log 0 = \lim_{x \to 0} x \log x = 0$
Called the Shannon entropy when x is discrete; the differential entropy when x is continuous

Figure: Shannon entropy H(x) over Bernoulli distributions with different ρ.


Average Code Length

Shannon entropy gives a lower bound on the number of "bits" needed on average to encode values drawn from a distribution P
Consider a random variable x ∼ Uniform having 8 equally likely states
  To send a value x to a receiver, we would encode it into 3 bits
  Shannon entropy: $H(x \sim \mathrm{Uniform}) = -8 \times \frac{1}{8}\log_2\frac{1}{8} = 3$
If the probabilities of the 8 states are $(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64})$ instead:
  H(x) = 2
  The encoding 0, 10, 110, 1110, 111100, 111101, 111110, 111111 gives an average code length of 2


Kullback-Leibler (KL) Divergence

How many extra "bits" are needed on average to transmit a value drawn from distribution P when we use a code that was designed for another distribution Q?
The Kullback-Leibler (KL) divergence (or relative entropy) from distribution Q to P:
$$D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P}\!\left[\log\frac{P(x)}{Q(x)}\right] = -\mathbb{E}_{x \sim P}[\log Q(x)] - H(x \sim P)$$
The term $-\mathbb{E}_{x \sim P}[\log Q(x)]$ is called the cross entropy
Since $H(x \sim P)$ does not depend on Q, we can solve
$$\operatorname*{argmin}_{Q} D_{\mathrm{KL}}(P \parallel Q)$$
by
$$\operatorname*{argmin}_{Q} -\mathbb{E}_{x \sim P}[\log Q(x)]$$

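A discrete-case sketch in NumPy (added, not from the slides) computing the KL divergence in bits for two hypothetical distributions, illustrating the "extra bits" reading and the asymmetry noted next:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions, using the 0 log 0 = 0 convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

p = np.array([0.8, 0.1, 0.1])
q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(p, q))   # extra bits per symbol when coding P-data with a Q-code
print(kl_divergence(p, p))   # 0.0
print(kl_divergence(q, p))   # generally differs from D_KL(P||Q): KL is asymmetric
```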

Properties

$D_{\mathrm{KL}}(P \parallel Q) \ge 0$, ∀P, Q
$D_{\mathrm{KL}}(P \parallel Q) = 0$ iff P and Q are equal almost surely
KL divergence is asymmetric, i.e., $D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$ in general

Figure: KL divergence for two normal distributions.


Minimizer of KL Divergence

Given P, we want to find the Q* that minimizes the KL divergence: should we use $Q^{*}_{\text{(from)}} = \operatorname*{argmin}_{Q} D_{\mathrm{KL}}(P \parallel Q)$ or $Q^{*}_{\text{(to)}} = \operatorname*{argmin}_{Q} D_{\mathrm{KL}}(Q \parallel P)$?
$Q^{*}_{\text{(from)}}$ places high probability where P has high probability
$Q^{*}_{\text{(to)}}$ places low probability where P has low probability

Figure: Approximating a mixture P of two Gaussians using a single Gaussian Q.



Decision Trees

Given a supervised dataset $\mathbb{X} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
Can we find a tree-like function f (i.e., a set of rules) such that $f(\mathbf{x}^{(i)}) = y^{(i)}$?


Training a Decision Tree

Start from the root, which corresponds to all data points $\{(\mathbf{x}^{(i)}, y^{(i)}) : \text{Rules} = \varnothing\}$
Recursively split leaf nodes until the data corresponding to the children are "pure" in labels
How to split? Find a cutting point (j, v) among all unseen attributes such that, after partitioning the corresponding data points $\mathbb{X}_{\text{parent}} = \{(\mathbf{x}^{(i)}, y^{(i)}) : \text{Rules}\}$ into two groups
$$\mathbb{X}_{\text{left}} = \{(\mathbf{x}^{(i)}, y^{(i)}) : \text{Rules} \cup \{x^{(i)}_{j} < v\}\} \quad \text{and} \quad \mathbb{X}_{\text{right}} = \{(\mathbf{x}^{(i)}, y^{(i)}) : \text{Rules} \cup \{x^{(i)}_{j} \ge v\}\},$$
the "impurity" of the labels drops the most, i.e., solve
$$\operatorname*{argmax}_{j,v}\ \Big(\mathrm{Impurity}(\mathbb{X}_{\text{parent}}) - \mathrm{Impurity}(\mathbb{X}_{\text{left}}, \mathbb{X}_{\text{right}})\Big)$$


Impurity Measure

$$\operatorname*{argmax}_{j,v}\ \Big(\mathrm{Impurity}(\mathbb{X}_{\text{parent}}) - \mathrm{Impurity}(\mathbb{X}_{\text{left}}, \mathbb{X}_{\text{right}})\Big)$$
What is Impurity(·)? Entropy is a common choice:
$$\mathrm{Impurity}(\mathbb{X}_{\text{parent}}) = H[\,y \sim \mathrm{Empirical}(\mathbb{X}_{\text{parent}})\,]$$
$$\mathrm{Impurity}(\mathbb{X}_{\text{left}}, \mathbb{X}_{\text{right}}) = \sum_{i=\text{left},\text{right}} \frac{|\mathbb{X}^{(i)}|}{|\mathbb{X}_{\text{parent}}|}\, H[\,y \sim \mathrm{Empirical}(\mathbb{X}^{(i)})\,]$$
In this case, $\mathrm{Impurity}(\mathbb{X}_{\text{parent}}) - \mathrm{Impurity}(\mathbb{X}_{\text{left}}, \mathbb{X}_{\text{right}})$ is called the information gain


Random Forests

A decision tree can be very deep
  Deeper nodes give more specific rules
  They are backed by less training data and may not be applicable to testing data
How do we ensure the generalizability of a decision tree, i.e., high prediction accuracy on testing data?
  1 Pruning (e.g., limiting the depth of the tree)
  2 Random forest: an ensemble of many (deep) trees


Training a Random Forest

1 Randomly pick M samples from the training set with replacement
    These are called the bootstrap samples
2 Grow a decision tree from the bootstrap samples. At each node:
    (a) Randomly select K features without replacement
    (b) Find the best cutting point (j, v) and split the node
3 Repeat steps 1 and 2 T times to get T trees
4 Aggregate the predictions made by the different trees via majority vote
Each tree is trained slightly differently because of Steps 1 and 2(a), providing different "perspectives" when voting (a usage sketch follows below)
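As a usage sketch (added, not from the slides), scikit-learn's RandomForestClassifier follows essentially this recipe: n_estimators plays the role of T, max_features the role of K, and bootstrap sampling is enabled by default; the toy data below is hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # hypothetical labels

# T = 100 trees, K = 2 features considered at each split, bootstrap sampling on
clf = RandomForestClassifier(n_estimators=100, max_features=2, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))                     # predictions aggregated by majority vote
```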


Decision Boundaries


Decision Trees vs. Random Forests

Cons of random forests:
  A less interpretable model
Pros:
  Less sensitive to the depth of the trees; the majority vote can "absorb" the noise from individual trees
  Can be parallelized, since each tree can grow independently
