Expectation Maximization - University of Washington
TRANSCRIPT
Expectation Maximization
Machine Learning - CSE546, Carlos Guestrin, University of Washington
November 13, 2014. ©Carlos Guestrin 2005-2014
E.M.: The General Case
- E.M. widely used beyond mixtures of Gaussians
  - The recipe is the same…
- Expectation Step: Fill in missing data, given current values of parameters θ(t)
  - If variable z is missing (could be many variables)
  - Compute, for each data point xj, for each value i of z:
    - P(z=i | xj, θ(t))
- Maximization step: Find maximum likelihood parameters for the (weighted) "completed data":
  - For each data point xj, create k weighted data points
  - Set θ(t+1) as the maximum likelihood parameter estimate for this weighted data
- Repeat (a code sketch of this loop follows below)
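Since the formulas on this slide are handwritten in the original, here is a minimal Python sketch of the same recipe for the mixture-of-Gaussians case. The function name em_gmm, the fixed iteration count, and the small ridge added to the covariances are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """Minimal EM sketch for a k-component Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize theta^(0): mixing weights, means, covariances
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iters):
        # E-step: "fill in" the missing z by computing P(z=i | xj, theta^(t))
        resp = np.column_stack([
            pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=sigma[i])
            for i in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)   # Q(z=i | xj), the k weights per point

        # M-step: maximum likelihood parameters for the weighted "completed" data
        Nk = resp.sum(axis=0)                      # expected count per component
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return pi, mu, sigma
```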
The general learning problem with missing data
- Marginal likelihood - x is observed, z is missing:
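The formula itself is handwritten in the original slides; a standard reconstruction in the slide's notation (N data points, k possible values of z) is:

\[
\ell(\theta) \;=\; \sum_{j=1}^{N} \log P(x_j \mid \theta)
\;=\; \sum_{j=1}^{N} \log \sum_{i=1}^{k} P(z = i,\, x_j \mid \theta)
\]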
E-step
- x is observed, z is missing
- Compute probability of missing data given current choice of θ
  - Q(z|xj) for each xj
- e.g., probability computed during classification step
- corresponds to the "classification step" in K-means
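The computation referred to here is Bayes' rule applied with the current parameters; the slide's own formula is handwritten, so the following is a reconstruction:

\[
Q^{(t+1)}(z = i \mid x_j)
\;=\; P(z = i \mid x_j, \theta^{(t)})
\;=\; \frac{P(x_j \mid z = i, \theta^{(t)})\, P(z = i \mid \theta^{(t)})}
        {\sum_{i'} P(x_j \mid z = i', \theta^{(t)})\, P(z = i' \mid \theta^{(t)})}
\]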
Jensen's inequality
- Theorem: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)
Applying Jensen's inequality
- Use: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)
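The handwritten derivation most likely applies the inequality with P(z) taken to be Q(z | xj) and f(z) the ratio P(xj, z | θ) / Q(z | xj), giving the lower bound on the marginal log likelihood:

\[
\ell(\theta)
\;=\; \sum_{j} \log \sum_{z} Q(z \mid x_j)\,\frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
\;\ge\; \sum_{j} \sum_{z} Q(z \mid x_j)\, \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
\]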
The M-step maximizes lower bound on weighted data
- Lower bound from Jensen's:
- Corresponds to weighted dataset:
  - <x1, z=1> with weight Q(t+1)(z=1|x1)
  - <x1, z=2> with weight Q(t+1)(z=2|x1)
  - <x1, z=3> with weight Q(t+1)(z=3|x1)
  - <x2, z=1> with weight Q(t+1)(z=1|x2)
  - <x2, z=2> with weight Q(t+1)(z=2|x2)
  - <x2, z=3> with weight Q(t+1)(z=3|x2)
  - …
The M-step
- Maximization step:
- Use expected counts instead of counts:
  - If learning requires Count(x,z)
  - Use E_Q(t+1)[Count(x,z)]
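For discrete x and z, the expected count has the following standard form (my reconstruction, not the slide's handwriting):

\[
E_{Q^{(t+1)}}\!\left[\mathrm{Count}(x, z)\right]
\;=\; \sum_{j=1}^{N} Q^{(t+1)}(z \mid x_j)\; \mathbf{1}[x_j = x]
\]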
Convergence of EM
- Define potential function F(θ,Q):
- EM corresponds to coordinate ascent on F
  - Thus, maximizes lower bound on marginal log likelihood
  - We saw that M-step corresponds to fixing Q, max θ
  - E-step: fix θ and max Q
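The definition of F is handwritten on the slide; the standard choice consistent with the rest of the lecture is:

\[
F(\theta, Q)
\;=\; \sum_{j=1}^{N} \sum_{z} Q(z \mid x_j)\, \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
\;=\; \ell(\theta) - \sum_{j=1}^{N} \mathrm{KL}\!\left(Q(\cdot \mid x_j)\,\big\|\,P(\cdot \mid x_j, \theta)\right)
\;\le\; \ell(\theta)
\]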
M-step is easy
- Using potential function
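The handwritten argument presumably observes that, with Q fixed, the entropy term of F does not depend on θ, so maximizing F over θ is just a weighted complete-data maximum likelihood problem:

\[
\theta^{(t+1)}
\;=\; \arg\max_{\theta} F(\theta, Q^{(t+1)})
\;=\; \arg\max_{\theta} \sum_{j=1}^{N} \sum_{z} Q^{(t+1)}(z \mid x_j)\, \log P(x_j, z \mid \theta)
\]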
E-step also doesn't decrease potential function (1)
- Fixing θ to θ(t):
KL-divergence
- Measures distance between distributions
- KL = zero if and only if Q = P
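For completeness, the definition referenced here (the slide's formula is handwritten):

\[
\mathrm{KL}(Q \,\|\, P) \;=\; \sum_{z} Q(z)\, \log \frac{Q(z)}{P(z)} \;\ge\; 0,
\qquad \mathrm{KL}(Q \,\|\, P) = 0 \;\iff\; Q = P
\]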
E-step also doesn't decrease potential function (2)
- Fixing θ to θ(t):
E-step also doesn't decrease potential function (3)
- Fixing θ to θ(t)
- Maximizing F(θ(t),Q) over Q - set Q to the posterior probability:
- Note that
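The handwritten pieces most likely state the maximizer and the resulting equality: setting Q to the posterior makes the KL term in F vanish, so the lower bound touches the marginal log likelihood at θ(t):

\[
Q^{(t+1)}(z \mid x_j) \;=\; P(z \mid x_j, \theta^{(t)})
\qquad\Longrightarrow\qquad
F(\theta^{(t)}, Q^{(t+1)}) \;=\; \ell(\theta^{(t)})
\]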
EM is coordinate ascent
- M-step: Fix Q, maximize F over θ (a lower bound on the marginal log likelihood):
- E-step: Fix θ, maximize F over Q:
  - "Realigns" F with the likelihood:
What you should know
- K-means for clustering:
  - algorithm
  - converges because it's coordinate ascent
- EM for mixtures of Gaussians:
  - How to "learn" maximum likelihood parameters (locally max. like.) in the case of unlabeled data
- Be happy with this kind of probabilistic analysis
- Remember, E.M. can get stuck in local optima, and empirically it DOES
- EM is coordinate ascent
- General case for EM
Dimensionality Reduction: PCA
Machine Learning - CSE546, Carlos Guestrin, University of Washington
November 13, 2014
Dimensionality reduction
- Input data may have thousands or millions of dimensions!
  - e.g., text data has one dimension per vocabulary word
- Dimensionality reduction: represent data with fewer dimensions
  - easier learning - fewer parameters
  - visualization - hard to visualize more than 3D or 4D
  - discover the "intrinsic dimensionality" of data
    - high-dimensional data that is truly lower dimensional
Lower dimensional projections
- Rather than picking a subset of the features, we can construct new features that are combinations of existing features
- Let's see this in the unsupervised setting
  - just X, but no Y
Linear projection and reconstruction
[Figure: 2-D points in the (x1, x2) plane projected into 1 dimension, giving coordinate z1; reconstruction: knowing only z1, what was (x1, x2)?]
Principal component analysis - basic idea
- Project n-dimensional data into a k-dimensional space while preserving information:
  - e.g., project a space of 10000 words into 3 dimensions
  - e.g., project 3-d into 2-d
- Choose the projection with minimum reconstruction error
Linear projections, a review
- Project a point into a (lower dimensional) space:
  - point: x = (x1,…,xd)
  - select a basis - a set of basis vectors - (u1,…,uk)
    - we consider an orthonormal basis: ui·ui = 1, and ui·uj = 0 for i ≠ j
  - select a center - x̄, defines the offset of the space
  - best coordinates in the lower dimensional space defined by dot products: (z1,…,zk), zi = (x − x̄)·ui
    - these give minimum squared error
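The "minimum squared error" bullet refers to the following fact (a reconstruction of the handwritten note): for an orthonormal basis, the dot-product coordinates are exactly the least-squares coordinates,

\[
(z_1, \ldots, z_k)
\;=\; \arg\min_{z}\; \Big\| x - \Big(\bar{x} + \sum_{i=1}^{k} z_i\, u_i\Big) \Big\|^2,
\qquad z_i = (x - \bar{x}) \cdot u_i
\]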
PCA finds the projection that minimizes reconstruction error
- Given N data points: xi = (xi1,…,xid), i = 1…N
- Will represent each point as a projection:
  - where the coordinates and the center are as written out below
- PCA:
  - Given k << d, find (u1,…,uk) minimizing the reconstruction error:
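The handwritten projection, the "where" clauses, and the objective are likely the following, in the slide's notation:

\[
\hat{x}^i \;=\; \bar{x} + \sum_{j=1}^{k} z^i_j\, u_j,
\qquad z^i_j = (x^i - \bar{x}) \cdot u_j,
\qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x^i
\]
\[
\min_{u_1, \ldots, u_k}\; \sum_{i=1}^{N} \big\| x^i - \hat{x}^i \big\|^2
\]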
Understanding the reconstruction error
- Given k << d, find (u1,…,uk) minimizing the reconstruction error:
- Note that xi can be represented exactly by a d-dimensional projection:
- Rewriting the error:
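A hedged reconstruction of the two handwritten steps, the exact d-dimensional representation and the rewritten error (using orthonormality of the basis):

\[
x^i \;=\; \bar{x} + \sum_{j=1}^{d} z^i_j\, u_j
\quad\Longrightarrow\quad
\text{error} \;=\; \sum_{i=1}^{N} \Big\| \sum_{j=k+1}^{d} z^i_j\, u_j \Big\|^2
\;=\; \sum_{i=1}^{N} \sum_{j=k+1}^{d} \big( u_j \cdot (x^i - \bar{x}) \big)^2
\]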
Reconstruction error and covariance matrix
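The derivation on this slide is handwritten; its standard form, continuing from the rewritten error above, is:

\[
\text{error}
\;=\; \sum_{j=k+1}^{d} u_j^{\top} \Big[ \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^{\top} \Big] u_j
\;=\; N \sum_{j=k+1}^{d} u_j^{\top} \Sigma\, u_j,
\qquad
\Sigma \;=\; \frac{1}{N} \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^{\top}
\]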
Minimizing reconstruction error and eigenvectors
- Minimizing reconstruction error is equivalent to picking an orthonormal basis (u1,…,ud) minimizing:
- Eigenvector:
- Minimizing reconstruction error is equivalent to picking (uk+1,…,ud) to be the eigenvectors with the smallest eigenvalues
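Filling in the handwritten pieces: an eigenvector of Σ satisfies Σu = λu, and for an orthonormal eigenbasis each term of the error is an eigenvalue, so

\[
\Sigma u_j = \lambda_j u_j
\quad\Longrightarrow\quad
u_j^{\top} \Sigma\, u_j = \lambda_j,
\qquad
\text{error} \;=\; N \sum_{j=k+1}^{d} \lambda_j
\]

which is minimized by assigning the smallest eigenvalues to the discarded directions (uk+1,…,ud), i.e., keeping the top k eigenvectors.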
Basic PCA algorithm
- Start from m by n data matrix X
- Recenter: subtract mean from each row of X
  - Xc ← X − X̄
- Compute covariance matrix:
  - Σ ← 1/N XcT Xc
- Find eigenvectors and eigenvalues of Σ
- Principal components: the k eigenvectors with the highest eigenvalues
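A minimal Python sketch of this algorithm (illustrative only; the function name basic_pca is my own, and numpy's eigh is assumed for the symmetric eigendecomposition):

```python
import numpy as np

def basic_pca(X, k):
    """PCA via eigendecomposition of the covariance matrix (illustrative sketch)."""
    # Recenter: subtract the mean from each row of X
    Xc = X - X.mean(axis=0)
    # Covariance matrix: (1/N) Xc^T Xc
    Sigma = Xc.T @ Xc / X.shape[0]
    # Eigenvectors/eigenvalues of Sigma (eigh returns them in ascending order)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # Principal components: the k eigenvectors with the highest eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    U = eigvecs[:, order]      # d x k matrix of principal directions
    Z = Xc @ U                 # coordinates of each point in the principal subspace
    return U, Z, eigvals[order]
```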
PCA example
PCA example - reconstruction
(only the first principal component is used)
Eigenfaces [Turk, Pentland '91]
- Input images:
- Principal components:
Eigenfaces reconstruction
- Each image corresponds to adding 8 principal components:
Scaling up
- Covariance matrix can be really big!
  - Σ is d by d
  - say, only 10000 features
  - finding eigenvectors is very slow…
- Use singular value decomposition (SVD)
  - finds the top k eigenvectors
  - great implementations available, e.g., python, R, Matlab svd
SVD
- Write X = W S VT
  - X - data matrix, one row per datapoint
  - W - weight matrix, one row per datapoint - coordinates of xi in eigenspace
  - S - singular value matrix, diagonal matrix
    - in our setting each entry corresponds to eigenvalue λj
  - VT - singular vector matrix
    - in our setting each row is eigenvector vj
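One way to see the connection (my note, not on the slide): with the centered data Xc = W S VT and V orthonormal,

\[
\Sigma \;=\; \frac{1}{N} X_c^{\top} X_c \;=\; V\, \frac{S^2}{N}\, V^{\top}
\]

so the rows of VT are the eigenvectors of the covariance matrix, and the corresponding eigenvalues are the squared singular values up to the 1/N factor, λj = sj²/N.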
PCA using SVD algorithm
- Start from m by n data matrix X
- Recenter: subtract mean from each row of X
  - Xc ← X − X̄
- Call SVD algorithm on Xc - ask for k singular vectors
- Principal components: the k singular vectors with the highest singular values (rows of VT)
  - Coefficients become:
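A minimal numpy sketch of this variant (illustrative only; full_matrices=False is my choice of an economy-size SVD):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix (illustrative sketch)."""
    # Recenter: subtract the mean from each row of X
    Xc = X - X.mean(axis=0)
    # SVD: Xc = W S V^T ; numpy returns singular values in descending order
    W, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Principal components: the k singular vectors with the highest singular values
    components = Vt[:k]              # k x d, rows of V^T
    # Coefficients: coordinates of each point in the principal subspace
    Z = Xc @ components.T            # equivalently W[:, :k] * S[:k]
    return components, Z
```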