Expectation Maximization: A “Gentle” Introduction
DESCRIPTION
Expectation Maximization: A “Gentle” Introduction. Scott Morris, Department of Computer Science. Basic premise: given a set of observed data, X, what is the underlying model that produced X? Example: distributions – Gaussian, Poisson, Uniform.
TRANSCRIPT
Expectation Maximization: A “Gentle” Introduction
Scott Morris
Department of Computer Science
Basic Premise
• Given a set of observed data, X, what is the underlying model that produced X?
– Example: distributions – Gaussian, Poisson, Uniform
• Assume we know (or can intuit) what type of model produced the data
• Model has m parameters (Θ1..Θm)
– Parameters are unknown; we would like to estimate them
Maximum Likelihood Estimators (MLE)
• P(Θ|X) = probability that a given set of parameters is “correct”?
• Instead, define the “likelihood” of the parameters given the data, L(Θ|X)
• What if the data is continuous? Then use the probability density in place of the probability.
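As a sketch of the standard definition (notation mine, not the slide's): for N independent observations the likelihood is the probability of the data, or its density in the continuous case, viewed as a function of the parameters:

L(\Theta \mid X) = p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)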
MLE continued
• We are solving an optimization problem: find the Θ that maximizes L(Θ|X)
• Often we maximize the log of the likelihood instead.
– Why is this the same? Because log is monotonically increasing, the Θ that maximizes log L(Θ|X) also maximizes L(Θ|X), and the log turns products into sums.
• Any method that maximizes the likelihood function is called a Maximum Likelihood Estimator
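A minimal sketch in Python (my illustration, not from the deck), assuming i.i.d. samples from a Gaussian; for this model the log-likelihood has a closed-form maximizer, the sample mean and the biased sample standard deviation:

import math

def gaussian_log_likelihood(data, mu, sigma):
    # log L(mu, sigma | data): sum of the log-densities of the points
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [4.9, 5.1, 4.7, 5.3, 5.0]

# Closed-form MLE for a Gaussian: sample mean, biased sample std.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

# Perturbing the parameters should only lower the log-likelihood.
print(gaussian_log_likelihood(data, mu_hat, sigma_hat))
print(gaussian_log_likelihood(data, mu_hat + 0.5, sigma_hat))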
Simple Example: Least Squares Fit
• Input: N points in R^2
• Model: a single line, y = ax + b
– Parameters: a, b
• Origin? It is the Maximum Likelihood Estimator under i.i.d. Gaussian noise.
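A hedged sketch (not in the original slides) of the closed-form least-squares solution; minimizing the squared error maximizes the likelihood when the y-values carry i.i.d. Gaussian noise:

def least_squares_fit(points):
    # Normal equations for y = a*x + b, minimizing sum of (y - a*x - b)^2
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points scattered near the line y = 2x + 1
print(least_squares_fit([(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]))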
Expectation Maximization
• An elaborate technique for maximizing the likelihood function
• Often used when observed data is incomplete
– Due to problems in the observation process
– Due to unknown or difficult distribution function(s)
• Iterative process
• Still a local technique: it finds a local, not necessarily global, maximum
EM likelihood function
• Observed data X; assume missing data Y.
• Let Z = (X, Y) be the complete data
– Joint density function:
– p(z|Θ) = p(x,y|Θ) = p(y|x,Θ) p(x|Θ)
• Define a new likelihood function: L(Θ|Z) = p(X,Y|Θ)
• X and Θ are constants, so L(Θ|Z) is a random variable dependent on the random variable Y.
“E” Step of EM Algorithm
• Since L(Θ|Z) is itself a random variable, we can compute its expected value:
– Q(Θ, Θ^(i−1)) = E[ log p(X,Y|Θ) | X, Θ^(i−1) ]
• Can be thought of as computing the expected value of Y given the current estimate of Θ.
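Expanding the expectation as a sum over the possible values of the missing data (a reconstruction in standard notation, since the slide's equation did not survive extraction):

Q(\Theta, \Theta^{(i-1)}) = \sum_{y} \log p(X, y \mid \Theta) \, p(y \mid X, \Theta^{(i-1)})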
“M” Step of EM Algorithm
• Once we have the expectation computed, optimize Θ using the MLE:
– Θ^(i) = argmax_Θ Q(Θ, Θ^(i−1))
• Convergence – various results proving convergence are cited.
• Generalized EM – instead of finding the optimal Θ, choose any Θ that increases the likelihood.
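A schematic of the whole loop in Python (an illustrative sketch; e_step and m_step are hypothetical callables, not names from the deck):

def em(data, theta, e_step, m_step, max_iters=100, tol=1e-8):
    # e_step: computes the expected complete-data statistics given the
    #         data and the current parameter estimate (the "E" step)
    # m_step: returns the theta maximizing Q; a Generalized EM m_step
    #         may return any theta that merely increases Q
    for _ in range(max_iters):
        stats = e_step(data, theta)
        new_theta = m_step(data, stats)
        if all(abs(n - o) < tol for n, o in zip(new_theta, theta)):
            return new_theta  # parameters stopped moving: local optimum
        theta = new_theta
    return theta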
Mixture Models
• Assume a “mixture” of probability distributions:
– p(x|Θ) = Σ_{i=1..M} α_i p_i(x|θ_i), with mixing weights α_i ≥ 0 and Σ_i α_i = 1
• The log-likelihood function is difficult to optimize because it contains a log of a sum, so use a trick:
– Assume unobserved data items Y whose values inform us which distribution generated each item in X.
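Concretely (a reconstruction in standard mixture notation): the incomplete-data log-likelihood traps a sum inside the log, while conditioning on the labels y_j moves the log inside and decouples the components:

\log L(\Theta \mid X) = \sum_{j=1}^{N} \log \sum_{i=1}^{M} \alpha_i \, p_i(x_j \mid \theta_i), \qquad \log L(\Theta \mid X, Y) = \sum_{j=1}^{N} \log\left( \alpha_{y_j} \, p_{y_j}(x_j \mid \theta_{y_j}) \right)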
Update Equations
• After much derivation, estimates for the new parameters in terms of the old ones result. With p(i|x_j, Θ^old) the posterior probability that component i generated x_j:
– α_i^new = (1/N) Σ_j p(i|x_j, Θ^old)
– μ_i^new = Σ_j x_j p(i|x_j, Θ^old) / Σ_j p(i|x_j, Θ^old)
– Σ_i^new = Σ_j p(i|x_j, Θ^old) (x_j − μ_i^new)(x_j − μ_i^new)^T / Σ_j p(i|x_j, Θ^old)
• Here each component's Θ = (μ, Σ), where μ is the mean and Σ is the covariance matrix of a d-dimensional normal distribution
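A compact one-dimensional sketch of these updates in Python (my illustration; the d-dimensional case on the slide replaces the scalar variance with a covariance matrix Σ):

import math

def normal_pdf(x, mu, var):
    # Density of a normal distribution N(mu, var) evaluated at x
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, alphas, mus, variances, iters=50):
    m, n = len(alphas), len(xs)
    for _ in range(iters):
        # E step: responsibility r[j][i] = p(i | x_j, old parameters)
        r = []
        for x in xs:
            w = [alphas[i] * normal_pdf(x, mus[i], variances[i]) for i in range(m)]
            total = sum(w)
            r.append([wi / total for wi in w])
        # M step: plug the responsibilities into the update equations above
        for i in range(m):
            ni = sum(r[j][i] for j in range(n))
            alphas[i] = ni / n
            mus[i] = sum(r[j][i] * xs[j] for j in range(n)) / ni
            variances[i] = sum(r[j][i] * (xs[j] - mus[i]) ** 2 for j in range(n)) / ni
    return alphas, mus, variances

# Two well-separated clusters; EM should recover means near 1 and 5.
xs = [1.0, 1.2, 0.8, 4.9, 5.1, 5.3]
print(em_gmm_1d(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]))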