ECE 5984: Introduction to Machine Learning Dhruv Batra Virginia Tech Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4


Page 1: ECE 5984: Introduction to Machine Learning (s15ece5984/slides/L24_em_pca.pptx.pdf)

ECE 5984: Introduction to Machine Learning

Dhruv Batra Virginia Tech

Topics:
–  (Finish) Expectation Maximization
–  Principal Component Analysis (PCA)

Readings: Barber 15.1-15.4

Page 2:

Administrativia
•  Poster Presentation:
  –  May 8, 1:30-4:00pm
  –  310 Kelly Hall: ICTAS Building
  –  Print poster (or a bunch of slides)
  –  Format:
     •  Portrait
     •  Make 1 dimension = 36in
     •  Board size = 30x40
  –  Less text, more pictures.

(C) Dhruv Batra 2

Best Project Prize!

Page 3:

Administrativia
•  Final Exam
  –  When: May 11, 7:45-9:45am
  –  Where: In class
  –  Format: Pen-and-paper.
  –  Open-book, open-notes, closed-internet.
     •  No sharing.
  –  What to expect: a mix of
     •  Multiple Choice or True/False questions
     •  “Prove this statement”
     •  “What would happen for this dataset?”
  –  Material:
     •  Everything!
     •  Focus on the recent stuff.
     •  Exponentially decaying weights? Optimal policy?

(C) Dhruv Batra 3

Page 4:

Recap of Last Time

(C) Dhruv Batra 4

Page 5:

K-means

1.  Ask the user how many clusters they’d like (e.g., k=5).
2.  Randomly guess k cluster center locations.
3.  Each data point finds out which center it’s closest to.
4.  Each center finds the centroid of the points it owns…
5.  …and jumps there.
6.  …Repeat until terminated!

(C) Dhruv Batra 5. Slide Credit: Carlos Guestrin
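The six steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's reference code; in particular, the slide's step 2 just says "randomly guess," while this sketch uses a farthest-point heuristic for the remaining centers so the example behaves stably:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and recenter steps."""
    rng = np.random.default_rng(seed)
    # Step 2: guess k initial center locations. (Farthest-point heuristic,
    # an assumption here; the slide only says "randomly guess".)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iters):
        # Step 3: each data point finds the center it is closest to.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Steps 4-5: each center finds the centroid of the points it owns,
        # and jumps there (empty clusters keep their old center).
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        # Step 6: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```

On two well-separated blobs, the returned assignment recovers the blobs exactly.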

Page 6:

K-means as Co-ordinate Descent

•  Optimize the objective function:

$$\min_{\mu_1,\dots,\mu_k} \; \min_{a_1,\dots,a_N} \; F(\mu, a) = \min_{\mu_1,\dots,\mu_k} \; \min_{a_1,\dots,a_N} \; \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \, \lVert x_i - \mu_j \rVert^2$$

•  Fix µ, optimize a (or C): the Assignment Step
•  Fix a (or C), optimize µ: the Recenter Step

(C) Dhruv Batra 6. Slide Credit: Carlos Guestrin
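The coordinate-descent view can be checked numerically: each step, taken on its own, can only lower F(µ, a). A small sketch on synthetic data (the helper name `objective` is hypothetical, introduced just for this example):

```python
import numpy as np

def objective(X, centers, assign):
    """F(mu, a) = sum_i ||x_i - mu_{a_i}||^2, the quantity K-means minimizes."""
    return ((X - centers[assign]) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centers = X[:3].copy()                    # arbitrary initial centers
assign = rng.integers(0, 3, size=100)     # arbitrary initial assignment
f0 = objective(X, centers, assign)

# Assignment step (fix mu, optimize a): cannot increase F.
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
assign = d2.argmin(axis=1)
f1 = objective(X, centers, assign)
assert f1 <= f0

# Recenter step (fix a, optimize mu): cannot increase F either.
centers = np.array([X[assign == j].mean(axis=0) for j in range(3)])
f2 = objective(X, centers, assign)
assert f2 <= f1
```

Since F is bounded below by zero and never increases, the alternation converges.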

Page 7:

GMM

(C) Dhruv Batra 7

[Figure: contours of a Gaussian mixture model fit to 2-D data. Figure Credit: Kevin Murphy]

Page 8:

K-means vs GMM
•  K-Means demo: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
•  GMM demo: http://www.socr.ucla.edu/applets.dir/mixtureem.html

(C) Dhruv Batra 8

Page 9:

EM
•  Expectation Maximization [Dempster ’77]
•  Often looks like “soft” K-means
•  Extremely general
•  Extremely useful algorithm
  –  Essentially THE go-to algorithm for unsupervised learning
•  Plan
  –  EM for learning GMM parameters
  –  EM for general unsupervised learning problems

(C) Dhruv Batra 9

Page 10:

EM for Learning GMMs
•  Simple update rules
  –  E-Step: estimate P(z_i = j | x_i)
  –  M-Step: maximize the full likelihood weighted by the posterior

(C) Dhruv Batra 10
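These two update rules can be sketched for a 1-D mixture of Gaussians. This is an illustrative implementation, not the slides' code; the quantile-based initialization is an assumption made here so the example is deterministic:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, n_iters=100):
    """EM for a 1-D Gaussian mixture: E-step posteriors, weighted M-step."""
    pi = np.full(k, 1.0 / k)                        # mixing weights P(z = j)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic init (assumption)
    var = np.full(k, x.var())
    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(z_i = j | x_i)
        r = pi * gaussian_pdf(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the full likelihood weighted by the posterior
        nj = r.sum(axis=0)                          # effective points per component
        pi = nj / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return pi, mu, var
```

On data drawn from two well-separated Gaussians, the recovered means land close to the true component means.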

Page 11:

Gaussian Mixture Example: Start

11 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 12:

After 1st iteration

12 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 13:

After 2nd iteration

13 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 14:

After 3rd iteration

14 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 15:

After 4th iteration

15 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 16:

After 5th iteration

16 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 17:

After 6th iteration

17 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 18:

After 20th iteration

18 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 19:

General Mixture Models

(C) Dhruv Batra 19

P(x | z)       P(z)           Name
Gaussian       Categorical    GMM
Multinomial    Categorical    Mixture of Multinomials
Categorical    Dirichlet      Latent Dirichlet Allocation

Page 20:

Mixture of Bernoullis

(C) Dhruv Batra 20

[Figure: cluster prototypes learned by a mixture of Bernoullis, with mixing weights 0.12, 0.14, 0.12, 0.06, 0.13, 0.07, 0.05, 0.15, 0.07, 0.09]

Page 21:

The general learning problem with missing data

•  Marginal likelihood: x is observed, z is missing:

$$\ell\ell(\theta : D) = \log \prod_{i=1}^{N} P(x_i \mid \theta) = \sum_{i=1}^{N} \log P(x_i \mid \theta) = \sum_{i=1}^{N} \log \sum_{z} P(x_i, z \mid \theta)$$

(C) Dhruv Batra 21
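A tiny numeric illustration of summing out the missing z to get the marginal likelihood, for an assumed two-component 1-D Gaussian mixture (the parameter values here are made up for the example):

```python
import numpy as np

# Toy mixture: z in {0, 1} with P(z) = [0.3, 0.7]; x | z ~ N(mu_z, 1).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])

def log_px(x):
    """log P(x | theta) = log sum_z P(x, z | theta): sum out the missing z."""
    joint = pi * np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)  # P(x, z) per z
    return np.log(joint.sum())

x_data = np.array([-2.1, 1.8, 2.2])
ll = sum(log_px(x) for x in x_data)   # ll(theta : D) = sum_i log P(x_i | theta)
```

Note the log of a sum: the sum over z sits inside the log, which is exactly what makes this objective hard to optimize directly.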

Page 22:

Applying Jensen’s inequality

•  Use: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)

22 (C) Dhruv Batra

$$\ell\ell(\theta : D) = \sum_{i=1}^{N} \log \sum_{z} Q_i(z) \, \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$
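Jensen's inequality for the concave log, as used on this slide, is easy to sanity-check numerically (arbitrary made-up Q and f):

```python
import numpy as np

# Jensen's inequality for log (concave): log E_Q[f(z)] >= E_Q[log f(z)].
Q = np.array([0.2, 0.5, 0.3])        # any distribution over z
f = np.array([0.1, 2.0, 5.0])        # any positive values f(z)

lhs = np.log((Q * f).sum())          # log sum_z Q(z) f(z)
rhs = (Q * np.log(f)).sum()          # sum_z Q(z) log f(z)
assert lhs >= rhs
```

Applying this with f(z) = P(x_i, z | θ) / Q_i(z) moves the sum over z outside the log, which is what makes the lower bound tractable.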

Page 23:

Convergence of EM

•  Define potential function F(θ,Q):

•  EM corresponds to coordinate ascent on F –  Thus, maximizes lower bound on marginal log likelihood

23 (C) Dhruv Batra

$$\ell\ell(\theta : D) \ge F(\theta, Q) = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$

Page 24:

EM is coordinate ascent

•  E-step: Fix θ(t), maximize F over Q:

–  “Realigns” F with likelihood:

24 (C) Dhruv Batra

$$\ell\ell(\theta : D) \ge F(\theta, Q) = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$

$$Q_i^{(t)}(z) = P(z \mid x_i, \theta^{(t)}) \qquad \Rightarrow \qquad F(\theta^{(t)}, Q^{(t)}) = \ell\ell(\theta^{(t)} : D)$$

$$\begin{aligned}
F(\theta^{(t)}, Q) &= \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta^{(t)})}{Q_i(z)} \\
&= \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(z \mid x_i, \theta^{(t)}) \, P(x_i \mid \theta^{(t)})}{Q_i(z)} \\
&= \sum_{i=1}^{N} \sum_{z} Q_i(z) \log P(x_i \mid \theta^{(t)}) + \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(z \mid x_i, \theta^{(t)})}{Q_i(z)} \\
&= \ell\ell(\theta^{(t)} : D) - \sum_{i=1}^{N} \mathrm{KL}\big(Q_i(z) \,\big\|\, P(z \mid x_i, \theta^{(t)})\big)
\end{aligned}$$
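The E-step identity F(θ, Q) = ll(θ : D) − Σᵢ KL(Qᵢ ‖ P(z | xᵢ, θ)) can be verified numerically for a single observation of an assumed two-component 1-D mixture (parameter values made up for the example); setting Q to the posterior makes the bound tight:

```python
import numpy as np

pi = np.array([0.4, 0.6])            # current theta^(t): mixing weights
mu = np.array([-1.0, 3.0])           # current theta^(t): component means, var = 1
x = 0.5                              # one observed data point

joint = pi * np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)  # P(x, z | theta)
ll = np.log(joint.sum())                                        # log P(x | theta)
post = joint / joint.sum()                                      # P(z | x, theta)

def F(Q):
    """Lower bound F(theta, Q) = sum_z Q(z) log [P(x, z | theta) / Q(z)]."""
    return (Q * np.log(joint / Q)).sum()

# At the E-step optimum Q = posterior, the bound is tight: F = ll.
assert abs(F(post) - ll) < 1e-9

# For any other Q, the gap is exactly KL(Q || posterior).
Q = np.array([0.5, 0.5])
kl = (Q * np.log(Q / post)).sum()
assert abs((ll - F(Q)) - kl) < 1e-9
```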

Page 25:

EM is coordinate ascent

•  M-step: Fix Q(t), maximize F over θ

•  Corresponds to a weighted dataset:
  –  <x1, z=1> with weight Q(t+1)(z=1|x1)
  –  <x1, z=2> with weight Q(t+1)(z=2|x1)
  –  <x1, z=3> with weight Q(t+1)(z=3|x1)
  –  <x2, z=1> with weight Q(t+1)(z=1|x2)
  –  <x2, z=2> with weight Q(t+1)(z=2|x2)
  –  <x2, z=3> with weight Q(t+1)(z=3|x2)
  –  …

(C) Dhruv Batra 25

$$\ell\ell(\theta : D) \ge F(\theta, Q) = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$

$$\begin{aligned}
F(\theta, Q^{(t)}) &= \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log \frac{P(x_i, z \mid \theta)}{Q_i^{(t)}(z)} \\
&= \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log P(x_i, z \mid \theta) - \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log Q_i^{(t)}(z) \\
&= \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log P(x_i, z \mid \theta) + \underbrace{\sum_{i=1}^{N} H(Q_i^{(t)})}_{\text{constant}}
\end{aligned}$$
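With Q fixed, maximizing F over θ therefore reduces to weighted maximum likelihood on the replicated dataset. A sketch of the resulting mixing-weight and mean updates for a 1-D Gaussian mixture (the responsibilities Q below are made up for the example):

```python
import numpy as np

# M-step with Q fixed: a weighted MLE problem on the "replicated" dataset
# <x_i, z=j> with weight Q_i(z=j).
x = np.array([0.0, 1.0, 9.0, 10.0])
Q = np.array([[0.90, 0.10],          # Q_i(z) for each x_i (rows sum to 1)
              [0.80, 0.20],
              [0.10, 0.90],
              [0.05, 0.95]])

nj = Q.sum(axis=0)                   # effective number of points per component
pi = nj / len(x)                     # weighted MLE of the mixing weights
mu = (Q * x[:, None]).sum(axis=0) / nj  # weighted MLE of the means
```

Each mean is just a responsibility-weighted average of the data, which is why EM "often looks like soft K-means."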

Page 26:

EM Intuition

(C) Dhruv Batra 26

[Figure: successive lower bounds Q(θ, θ^t) and Q(θ, θ^{t+1}) touch the log-likelihood l(θ) at θ^t and θ^{t+1}; maximizing each bound yields the next iterate θ^{t+1}, θ^{t+2}]

Page 27:

What you should know
•  K-means for clustering:
  –  the algorithm
  –  converges because it is coordinate descent
•  EM for mixtures of Gaussians:
  –  how to “learn” maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data
•  EM is coordinate ascent
•  Remember, EM can get stuck in local optima, and empirically it DOES
•  The general case for EM

27 (C) Dhruv Batra Slide Credit: Carlos Guestrin

Page 28:

Tasks

(C) Dhruv Batra 28

Task                        Mapping    Output type
Classification              x → y      Discrete
Regression                  x → y      Continuous
Clustering                  x → c      Discrete ID
Dimensionality Reduction    x → z      Continuous

Page 29:

New Topic: PCA

Page 30:

Synonyms
•  Principal Component Analysis
•  Karhunen–Loève transform
•  Eigen-Faces
•  Eigen-<Insert-your-problem-domain>

•  PCA is a Dimensionality Reduction Algorithm

•  Other Dimensionality Reduction algorithms:
  –  Linear Discriminant Analysis (LDA)
  –  Independent Component Analysis (ICA)
  –  Local Linear Embedding (LLE)
  –  …

(C) Dhruv Batra 30

Page 31:

Dimensionality reduction
•  Input data may have thousands or millions of dimensions!
  –  e.g., images have 5M pixels
•  Dimensionality reduction: represent data with fewer dimensions
  –  easier learning: fewer parameters
  –  visualization: hard to visualize more than 3D or 4D
  –  discover the “intrinsic dimensionality” of the data
     •  high-dimensional data that is truly lower dimensional

Page 32:

Dimensionality Reduction •  Demo

–  http://lcn.epfl.ch/tutorial/english/pca/html/

–  http://setosa.io/ev/principal-component-analysis/

(C) Dhruv Batra 32

Page 33:

PCA / KL-Transform
•  De-correlation view:
  –  Make features uncorrelated
  –  No projection yet
•  Max-variance view:
  –  Project data to lower dimensions
  –  Maximize the variance in the lower dimensions
•  Synthesis / Min-error view:
  –  Project data to lower dimensions
  –  Minimize reconstruction error
•  All views lead to the same solution

(C) Dhruv Batra 33
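All three views can be exercised with one eigendecomposition of the sample covariance. A NumPy sketch on synthetic data (the specific dataset and the helper names are made up for illustration):

```python
import numpy as np

# Synthetic 3-D data with most variance concentrated in two directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.05]])
Xc = X - X.mean(axis=0)                      # center the data

# Eigendecomposition of the sample covariance, eigenvalues sorted descending.
cov = Xc.T @ Xc / len(Xc)
evals, evecs = np.linalg.eigh(cov)           # eigh returns ascending order
order = np.argsort(evals)[::-1]
evals, W = evals[order], evecs[:, order]

# De-correlation view: the projected features are uncorrelated.
Z = Xc @ W
cov_z = Z.T @ Z / len(Z)
assert np.allclose(cov_z, np.diag(evals), atol=1e-10)

# Max-variance view: the leading direction carries the largest variance.
assert evals[0] >= evals[1] >= evals[2]

# Min-error view: reconstructing from the top k directions leaves a mean
# squared error equal to the sum of the discarded eigenvalues.
k = 2
Xhat = Z[:, :k] @ W[:, :k].T
err = ((Xc - Xhat) ** 2).sum() / len(Xc)
assert np.isclose(err, evals[k:].sum())
```

The three asserts correspond one-to-one to the three views on the slide, and they all hold for the same eigenvector matrix W, which is the sense in which "all views lead to the same solution."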