UVA CS 6316: Machine Learning Lecture 13: Maximum Likelihood Estimation (MLE) Dr. Yanjun Qi University of Virginia Department of Computer Science


Page 1:

UVA CS 6316: Machine Learning

Lecture 13: Maximum Likelihood Estimation (MLE)

Dr. Yanjun Qi, University of Virginia

Department of Computer Science

Page 2:

Last: Probability Review

10/21/19 Dr. Yanjun Qi / UVA CS

• The big picture
• Events and event spaces
• Random variables
• Joint probability, marginalization, conditioning, chain rule, Bayes' rule, law of total probability, etc.

• Structural properties, e.g., Independence, conditional independence

• Maximum Likelihood Estimation

Page 3:

Sample Space and Events

• Ω: sample space
  • The result of an experiment / the set of all outcomes
  • If you toss a coin twice, Ω = {HH, HT, TH, TT}

• Event: a subset of Ω
  • "First toss is head" = {HH, HT}

• S: event space, a set of events
  • Contains the empty event and Ω

Page 4:

From Events to Random Variables

• A concise way of specifying attributes of outcomes
• Modeling students (Grade and Intelligence):
  • Ω = all possible students (sample space)
  • What are the events (subsets of the sample space)?
    • Grade_A = all students with grade A
    • Grade_B = all students with grade B
    • HardWorking_Yes = all students who work hard
  • Very cumbersome

• We need "functions" that map from Ω to an attribute space T:
  • P(H = YES) = P({student ∈ Ω : H(student) = YES})

Page 5:

If a quantity is hard to estimate directly from data, most likely we can estimate it via:

1. Joint probability: use the chain rule
2. Marginal probability: use the law of total probability
3. Conditional probability: use Bayes' rule

Page 6:

Today : Probability Review


• The big picture
• Events and event spaces
• Random variables
• Joint probability, marginalization, conditioning, chain rule, Bayes' rule, law of total probability, etc.

• Structural properties, e.g., Independence, conditional independence

• Maximum Likelihood Estimation

Page 7:

The Big Picture

Probability: Model (i.e., the data-generating process) → Observed Data

Estimation / learning / statistics / data mining (e.g., MLE): Observed Data → Model

Page 8:

Today

❑ Basic MLE
❑ MLE for Discrete RV
❑ MLE for Continuous RV (Gaussian)
❑ MLE connects to Normal Equation of LR
❑ More about Mean and Variance


Page 9:

A General Statement: Maximum Likelihood Estimation

Consider a sample set T = (X1, …, Xn) drawn from a probability distribution P(X | θ), where θ are the parameters.

If the Xi are independent, each with probability density function P(Xi | θ), the joint probability of the whole set is

  P(X1, …, Xn | θ) = ∏_{i=1}^{n} P(Xi | θ)

and this may be maximized with respect to θ to give the maximum likelihood estimates.

Page 10:

It is often convenient to work with the log of the likelihood function:

  log(L(θ)) = Σ_{i=1}^{n} log(P(Xi | θ))

The idea is to:

✓ Assume a particular model P(Xi | θ) with unknown parameters θ.
✓ We can then define the probability of observing a given event conditional on a particular set of parameters.
✓ We have observed a set of outcomes in the real world.
✓ It is then possible to choose the set of parameters most likely to have produced the observed results:

  θ̂ = argmax_θ P(X1, …, Xn | θ)

This is maximum likelihood. In most cases it is both consistent and efficient. It provides a standard against which to compare other estimation techniques.
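As a concrete (made-up) illustration of why the log is convenient in practice: for many observations, the raw likelihood is a product of small numbers and underflows in floating point, while the log-likelihood is a well-behaved sum. A minimal Python sketch:

```python
import math

def likelihood(probs):
    """Product of per-observation probabilities P(x_i | theta)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Sum of per-observation log-probabilities."""
    return sum(math.log(p) for p in probs)

# 1000 observations, each with probability 0.1 under some hypothetical model
probs = [0.1] * 1000

print(likelihood(probs))      # underflows to 0.0
print(log_likelihood(probs))  # about -2302.6, still usable
```

Since log is monotonic, the θ maximizing log(L(θ)) also maximizes L(θ), so nothing is lost by working in log space.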


Page 16:

Today

❑ Basic MLE
❑ MLE for Discrete RV
❑ MLE for Continuous RV (Gaussian)
❑ MLE connects to Normal Equation of LR


Page 17:

Discrete Random Variables

• Random variables (RVs) which may take on only a countable number of distinct values
  • E.g., the total number of heads X you get if you flip 100 coins

• X is an RV with arity k if it can take on exactly one value out of {x1, …, xk}
  • E.g., the possible values that X can take on are 0, 1, 2, …, 100

Page 18:

e.g., Coin Flips (cont.)

• You flip a coin
  • Head with probability p
  • Binary random variable
  • Bernoulli trial with success probability p

• You flip a coin k times
  • How many heads would you expect?
  • The number of heads X is a discrete random variable
  • Binomial distribution with parameters k and p

Page 19:

Review: Bernoulli Distribution, e.g., Coin Flips

• You flip n coins
  • How many heads would you expect?
  • Head with probability p
  • Number of heads X out of n trials
  • Each trial follows a Bernoulli distribution with parameter p

Page 20:

Calculating Likelihood


Page 21:

Defining Likelihood for Bernoulli

• Likelihood = p(data | parameter)

PMF (a function of xi):

  f(xi | p) = p^xi (1 − p)^(1 − xi)

Likelihood (a function of p):

  L(p) = ∏_{i=1}^{n} p^xi (1 − p)^(1 − xi) = p^x (1 − p)^(n − x)

→ E.g., for n independent tosses of a coin with unknown parameter p, the observed data are x heads out of n trials, where x = Σ_{i=1}^{n} xi.
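The two forms of L(p) above (the per-toss product and the collapsed form) can be checked numerically; a small sketch with a made-up sequence of tosses:

```python
def likelihood_product(xs, p):
    """L(p) = prod_i p^{x_i} (1-p)^{1-x_i}, over individual tosses."""
    L = 1.0
    for xi in xs:
        L *= p**xi * (1 - p)**(1 - xi)
    return L

def likelihood_collapsed(xs, p):
    """L(p) = p^x (1-p)^{n-x}, with x = total heads out of n tosses."""
    n, x = len(xs), sum(xs)
    return p**x * (1 - p)**(n - x)

tosses = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # 7 heads in 10 tosses (made up)
p = 0.6
print(likelihood_product(tosses, p))
print(likelihood_collapsed(tosses, p))  # same value: 0.6^7 * 0.4^3
```

The collapse works because each head contributes a factor p and each tail a factor (1 − p), regardless of order.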

Page 22:

Deriving the Maximum Likelihood Estimate for Bernoulli

Maximize the likelihood:

  L(p) = p^x (1 − p)^(n − x)

Equivalently, maximize the log-likelihood:

  log(L(p)) = log[p^x (1 − p)^(n − x)]

Or, minimize the negative log-likelihood:

  −l(p) = −log[p^x (1 − p)^(n − x)]

[Figures: the likelihood, the log-likelihood, and the negative log-likelihood plotted against potential HIV prevalences p ∈ [0, 1]; all three curves reach their extremum at the same value of p.]

Page 23:

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood:

  −l(p) = −log(L(p)) = −log[p^x (1 − p)^(n − x)]
        = −log(p^x) − log((1 − p)^(n − x))
        = −x log(p) − (n − x) log(1 − p)

Page 24:

Deriving the Maximum Likelihood Estimate for Bernoulli

Minimize the negative log-likelihood (here including the binomial coefficient, which does not depend on p and so does not affect the minimizer):

  −l(p) = −log(n choose x) − x log(p) − (n − x) log(1 − p)

Set the derivative to zero:

  d(−l(p))/dp = −x/p + (n − x)/(1 − p)

  0 = −x/p + (n − x)/(1 − p)
  0 = [−x(1 − p) + p(n − x)] / [p(1 − p)]
  0 = −x + px + pn − px
  0 = −x + pn

  p̂ = x/n

The proportion of positives!

→ MLE parameter estimation, i.e., the relative frequency of a binary event.
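The closed-form result p̂ = x/n can be sanity-checked by minimizing the negative log-likelihood numerically; a sketch with made-up data (x = 7 heads in n = 10 tosses) and a simple grid search:

```python
import math

def neg_log_likelihood(p, x, n):
    """-l(p) = -x log(p) - (n - x) log(1 - p), dropping the constant term."""
    return -x * math.log(p) - (n - x) * math.log(1 - p)

n, x = 10, 7
grid = [i / 1000 for i in range(1, 1000)]  # candidate p values in (0, 1)
p_hat = min(grid, key=lambda p: neg_log_likelihood(p, x, n))

print(p_hat)  # 0.7, the proportion of heads x/n
```

The binomial coefficient is dropped in the code for the same reason it drops out of the derivation: it is constant in p.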

Page 25:

Today

❑ Basic MLE
❑ MLE for Discrete RV
❑ MLE for Continuous RV (Gaussian)
❑ MLE connects to Normal Equation of LR
❑ More about Mean and Variance


Page 26:

Review: Continuous Random Variables

• Probability density function (pdf) instead of probability mass function (pmf)
  • For a discrete RV: probability mass function (pmf) P(X = xi)

• A pdf is any function f(x) that describes the probability density in terms of the input variable x.

Page 27:

Review: Probability of a Continuous RV

• Properties of a pdf:

  f(x) ≥ 0, ∀x
  ∫_{−∞}^{+∞} f(x) dx = 1

• The actual probability is obtained by taking the integral of the pdf
  • E.g., the probability of X being between 5 and 6 is

  P(5 ≤ X ≤ 6) = ∫_5^6 f(x) dx
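As a sketch of "probability = integral of the pdf", using a Gaussian pdf chosen purely for illustration (the slide does not fix a particular f): P(5 ≤ X ≤ 6) for X ~ N(5, 1), computed both by numerically integrating the pdf and via the closed-form normal CDF.

```python
import math

mu, sigma = 5.0, 1.0  # illustrative choice of distribution

def pdf(x):
    """Density of N(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def cdf(x):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Midpoint Riemann-sum integration of the pdf over [5, 6]
steps = 100_000
dx = 1.0 / steps
integral = sum(pdf(5 + (i + 0.5) * dx) for i in range(steps)) * dx

print(integral)         # ~0.3413
print(cdf(6) - cdf(5))  # same value, P(mu <= X <= mu + sigma)
```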

Page 28:

Review: Mean and Variance of an RV

• Mean (expectation): μ = E(X)
  • Discrete RVs: E(X) = Σ_{vi} vi · P(X = vi)
  • Continuous RVs: E(X) = ∫_{−∞}^{+∞} x f(x) dx

• For a function g of the RV:
  • Discrete RVs: E(g(X)) = Σ_{vi} g(vi) · P(X = vi)
  • Continuous RVs: E(g(X)) = ∫_{−∞}^{+∞} g(x) f(x) dx

Adapted from Carlos' probability tutorial

Page 29:

Review: Mean and Variance of an RV

• Variance: Var(X) = E((X − μ)²)
  • Discrete RVs: V(X) = Σ_{vi} (vi − μ)² · P(X = vi)
  • Continuous RVs: V(X) = ∫_{−∞}^{+∞} (x − μ)² f(x) dx

• Covariance: Cov(X, Y) = E((X − μx)(Y − μy)) = E(XY) − μx μy

Adapted from Carlos' probability tutorial

Page 30:


Single-Variate Gaussian Distribution

X ~ N(μ, σ²)

Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm

Page 31:


Bi-Variate Gaussian Distribution

• The mean of the normal pdf is at the peak value; contours of equal pdf form ellipses.
• The covariance matrix captures linear dependencies among the variables.

[Figure: surface plot of a bivariate normal pdf over (X1, X2).]

Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm

Page 32:

Multivariate Normal (Gaussian) PDFs

The only widely used continuous joint pdf is the multivariate normal (or Gaussian):

  f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is the mean vector, Σ is the covariance matrix, and |·| denotes the determinant.

• The mean of the normal pdf is at the peak value; contours of equal pdf form ellipses.
• The covariance matrix captures linear dependencies among the variables.

Page 33:

Example: the Bivariate Normal distribution

  f(x1, x2) = (1 / (2π |Σ|^(1/2))) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

with

  μ = [μ1, μ2]ᵀ

  Σ = [σ1²   σ12 ]   [σ1²    ρσ1σ2]
      [σ12   σ2² ] = [ρσ1σ2  σ2²  ]

and

  |Σ| = σ1² σ2² − σ12² = σ1² σ2² (1 − ρ²)

Page 34:

Surface Plots of the bivariate Normal distribution


Page 35:

Contour Plots of the bivariate Normal distribution


Page 36:

Scatter Plots of data from the bivariate Normal distribution


Page 37:

Trivariate Normal distribution

[Figure: samples from a trivariate normal distribution plotted over (x1, x2, x3).]

Page 38:

How to Estimate a 1D Gaussian: MLE

• In the 1D Gaussian case, the MLE simply sets the mean and the variance to the sample mean and the sample variance:

  μ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi

  σ̂² = (1/n) Σ_{i=1}^{n} (xi − x̄)²
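A minimal sketch of these two estimators on made-up data. Note the 1/n divisor: the MLE of the variance is the biased sample variance, not the 1/(n − 1) version.

```python
def gaussian_mle(xs):
    """MLE for a 1D Gaussian: sample mean and (1/n) sample variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # 1/n, not 1/(n-1)
    return mu_hat, var_hat

data = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0]  # made-up observations
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)
```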

Page 39:

How to Estimate a p-D Gaussian: MLE

⟨X1, X2, …, Xp⟩ ~ N(μ, Σ)

Page 40:

Today

❑ Basic MLE
❑ MLE for Discrete RV
❑ MLE for Continuous RV (Gaussian)
❑ MLE connects to Normal Equation of LR
❑ More about Mean and Variance


Page 41:

DETOUR: Probabilistic Interpretation of Linear Regression

• Let us assume that the target variable and the inputs are related by the equation

  yi = θᵀxi + εi

  where εi is an error term capturing unmodeled effects or random noise.

• Now assume that εi follows a Gaussian N(0, σ²); then we have:

  p(yi | xi; θ) = (1 / (√(2π) σ)) exp(−(yi − θᵀxi)² / (2σ²))

• By the IID assumption:

  L(θ) = ∏_{i=1}^{n} p(yi | xi; θ) = (1 / (√(2π) σ))ⁿ exp(−Σ_{i=1}^{n} (yi − θᵀxi)² / (2σ²))


Page 43:

DETOUR: Probabilistic Interpretation of Linear Regression

• By the IID (independent and identically distributed) assumption, we have the data likelihood

  L(θ) = ∏_{i=1}^{n} p(yi | xi; θ) = (1 / (√(2π) σ))ⁿ exp(−Σ_{i=1}^{n} (yi − θᵀxi)² / (2σ²))

  and the log-likelihood

  l(θ) = log(L(θ)) = n log(1 / (√(2π) σ)) − (1/σ²) · (1/2) Σ_{i=1}^{n} (yi − θᵀxi)²

Page 44:

We can learn θ by maximizing the probability / likelihood of generating the observed samples:

  l(θ) = log(L(θ)) = n log(1 / (√(2π) σ)) − (1/σ²) · (1/2) Σ_{i=1}^{n} (yi − θᵀxi)²

Since the first term does not depend on θ, maximizing l(θ) is equivalent to minimizing

  J(θ) = (1/2) Σ_{i=1}^{n} (yi − θᵀxi)²

Page 45:

  l(θ) = log(L(θ)) = n log(1 / (√(2π) σ)) − (1/σ²) · (1/2) Σ_{i=1}^{n} (yi − θᵀxi)²

  J(θ) = (1/2) Σ_{i=1}^{n} (yi − θᵀxi)²

Thus, under the assumption of independent Gaussian residuals, minimizing the residual squared error is equivalent to the MLE of θ!
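This equivalence can be illustrated numerically. The sketch below uses synthetic data and the simplest possible regression (one feature, no intercept, so the normal equation reduces to θ = Σxy / Σx²; all data and parameter choices are made up): the closed-form least-squares solution coincides with the θ found by directly searching for the minimizer of J(θ).

```python
import random

random.seed(0)
theta_true = 2.0
xs = [i / 10 for i in range(1, 51)]
ys = [theta_true * x + random.gauss(0, 0.1) for x in xs]  # Gaussian noise

def J(theta):
    """Residual sum of squares, (1/2) * sum (y_i - theta*x_i)^2."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares solution (1-D normal equation, no intercept)
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Grid search for the minimizer of J over candidate thetas
grid = [1.5 + i / 1000 for i in range(1001)]
theta_grid = min(grid, key=J)

print(theta_ls, theta_grid)  # both near 2.0
```

Minimizing J is the same computation whether we motivate it as least squares or as Gaussian MLE; the probabilistic view just explains why squared error is the natural loss under Gaussian noise.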


Page 47:

References

❑ Prof. Andrew Moore's review tutorial
❑ Prof. Nando de Freitas's review slides
❑ Prof. Carlos Guestrin's recitation slides

Page 48:

Today

❑ Basic MLE
❑ MLE for Discrete RV
❑ MLE for Continuous RV (Gaussian)
❑ MLE connects to Normal Equation of LR
❑ Extra: about Mean and Variance


Page 49:

Mean and Variance

• Correlation:

  ρ(X, Y) = Cov(X, Y) / (σx σy)

  −1 ≤ ρ(X, Y) ≤ 1

Page 50:

Properties

• Mean
  • E(X + Y) = E(X) + E(Y)
  • E(aX) = a E(X)
  • If X and Y are independent, E(XY) = E(X) · E(Y)

• Variance
  • V(aX + b) = a² V(X)
  • If X and Y are independent, V(X + Y) = V(X) + V(Y)
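Two of these properties can be checked by simulation (the distributions and sample size below are arbitrary choices): E(X + Y) = E(X) + E(Y) holds always, and Var(X + Y) = Var(X) + Var(Y) holds because X and Y are drawn independently.

```python
import random

random.seed(1)
N = 200_000
X = [random.gauss(1.0, 2.0) for _ in range(N)]    # mean 1, variance 4
Y = [random.uniform(0.0, 6.0) for _ in range(N)]  # mean 3, variance 3

def mean(zs):
    return sum(zs) / len(zs)

def var(zs):
    m = mean(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

S = [x + y for x, y in zip(X, Y)]
print(mean(S), mean(X) + mean(Y))  # both near 4
print(var(S), var(X) + var(Y))     # both near 7
```

For dependent X and Y, the variance identity fails and picks up a 2·Cov(X, Y) cross-term instead.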

Page 51:

Some more properties

• The conditional expectation of Y given X, when the value of X = x, is:

  E(Y | X = x) = ∫ y · p(y | x) dy

• The Law of Total Expectation, or Law of Iterated Expectation:

  E(Y) = E_X[E(Y | X)] = ∫ E(Y | X = x) p(x) dx

Page 52:

Some more properties

• The Law of Total Variance:

  Var(Y) = Var[E(Y | X)] + E[Var(Y | X)]