

Page 1: L22 kmeans gmm - Virginia Tech

ECE 5424: Introduction to Machine Learning

Stefan Lee, Virginia Tech

Topics:
– Unsupervised Learning: K-means, GMM, EM

Readings: Barber 20.1-20.3

Page 2: L22 kmeans gmm - Virginia Tech

Tasks

(C)  Dhruv  Batra   2

Supervised Learning:
– Classification: x → y (y discrete)
– Regression: x → y (y continuous)

Unsupervised Learning:
– Clustering: x → c (c a discrete cluster ID)
– Dimensionality Reduction: x → z (z continuous)

Page 3: L22 kmeans gmm - Virginia Tech

Unsupervised Learning

• Learning only with X
  – Y not present in training data

• Some example unsupervised learning problems:
  – Clustering / Factor Analysis
  – Dimensionality Reduction / Embeddings
  – Density Estimation with Mixture Models

(C)  Dhruv  Batra   3

Page 4: L22 kmeans gmm - Virginia Tech

New  Topic:  Clustering

Slide  Credit:  Carlos  Guestrin 4

Page 5: L22 kmeans gmm - Virginia Tech

Synonyms

• Clustering

• Vector Quantization

• Latent Variable Models
• Hidden Variable Models
• Mixture Models

• Algorithms:
  – K-means
  – Expectation Maximization (EM)

(C)  Dhruv  Batra   5

Page 6: L22 kmeans gmm - Virginia Tech

Some  Data

6(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 7: L22 kmeans gmm - Virginia Tech

K-means

1. Ask  user  how  many  clusters  they’d  like.  

(e.g.  k=5)  

7(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 8: L22 kmeans gmm - Virginia Tech

K-means

1. Ask  user  how  many  clusters  they’d  like.  

(e.g.  k=5)  

2. Randomly  guess  k  cluster  Center  locations

8(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 9: L22 kmeans gmm - Virginia Tech

K-means

1. Ask  user  how  many  clusters  they’d  like.  

(e.g.  k=5)  

2. Randomly  guess  k  cluster  Center  locations

3. Each  datapoint  finds  out  which  Center  it’s  closest  to.  (Thus  each  Center  “owns”  a  set  of  datapoints)

9(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 10: L22 kmeans gmm - Virginia Tech

K-means

1. Ask  user  how  many  clusters  they’d  like.  

(e.g.  k=5)  

2. Randomly guess k cluster Center locations

3. Each  datapoint finds  out  which  Center  it’s  

closest  to.

4. Each  Center  finds  the  centroid of  the  points  it  owns

10(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 11: L22 kmeans gmm - Virginia Tech

K-means

1. Ask  user  how  many  clusters  they’d  like.  

(e.g.  k=5)  

2. Randomly guess k cluster Center locations

3. Each  datapoint finds  out  which  Center  it’s  

closest  to.

4. Each  Center  finds  the  centroid  of  the  points  it  owns

5. …Repeat  until  terminated!

11(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

Page 12: L22 kmeans gmm - Virginia Tech

K-means

• Randomly initialize k centers
  – \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}

• Assign:
  – Assign each point i ∈ {1, …, n} to the nearest center:

    C(i) \leftarrow \arg\min_j \|x_i - \mu_j\|^2

• Recenter:
  – \mu_j becomes the centroid of its points

12 (C) Dhruv Batra  Slide Credit: Carlos Guestrin
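A minimal sketch of these two alternating steps in Python (the variable names, the initialization scheme, and the stopping test are my own choices, not from the slides):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate Assign and Recenter until the centers stop moving."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iters):
        # Assign: C(i) <- argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        C = dists.argmin(axis=1)
        # Recenter: mu_j <- centroid of the points it owns (keep old center if empty)
        new_mu = np.array([X[C == j].mean(axis=0) if np.any(C == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, C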

Page 13: L22 kmeans gmm - Virginia Tech

K-means

• Demo
  – http://mlehman.github.io/kmeans-javascript/

(C)  Dhruv  Batra   13

Page 14: L22 kmeans gmm - Virginia Tech

What is K-means optimizing?

• Objective F(\mu, C): function of centers \mu and point allocations C:

  F(\mu, C) = \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2

  – with a 1-of-k encoding a_{ij} of the assignments:

    F(\mu, a) = \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2

• Optimal K-means:
  – \min_\mu \min_a F(\mu, a)

14 (C) Dhruv Batra
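For reference, the first form of the objective is trivial to evaluate for a given clustering; a tiny Python helper (the name is mine):

import numpy as np

def kmeans_objective(X, mu, C):
    # F(mu, C) = sum_i ||x_i - mu_{C(i)}||^2
    return float(((X - mu[C]) ** 2).sum())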

Page 15: L22 kmeans gmm - Virginia Tech

Coordinate  descent  algorithms

15(C)  Dhruv  Batra   Slide  Credit:  Carlos  Guestrin

• Want: \min_a \min_b F(a, b)

• Coordinate descent:
  – fix a, minimize over b
  – fix b, minimize over a
  – repeat

• Converges!!!
  – if F is bounded
  – to a (often good) local optimum
    • as we saw in the applet (play with it!)

• K-means is a coordinate descent algorithm!

Page 16: L22 kmeans gmm - Virginia Tech

K-means as Co-ordinate Descent

• Optimize objective function:

  \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} F(\mu, a) = \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2

• Fix \mu, optimize a (or C)

16 (C) Dhruv Batra  Slide Credit: Carlos Guestrin

Page 17: L22 kmeans gmm - Virginia Tech

K-means as Co-ordinate Descent

• Optimize objective function:

  \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} F(\mu, a) = \min_{\mu_1, \ldots, \mu_k} \min_{a_1, \ldots, a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2

• Fix a (or C), optimize \mu

17 (C) Dhruv Batra  Slide Credit: Carlos Guestrin
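Each coordinate step has a closed-form minimizer (standard, though not written out on these slides): with \mu fixed, the best assignment sends each point to its nearest center; with a fixed, setting the gradient with respect to \mu_j to zero gives the centroid of cluster j.

a_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{j'} \|x_i - \mu_{j'}\|^2 \\ 0 & \text{otherwise} \end{cases}
\qquad
\mu_j = \frac{\sum_{i=1}^{N} a_{ij}\, x_i}{\sum_{i=1}^{N} a_{ij}}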

Page 18: L22 kmeans gmm - Virginia Tech

One important use of K-means

• Bag-of-words models in computer vision

(C)  Dhruv  Batra   18

Page 19: L22 kmeans gmm - Virginia Tech

Bag  of  Words  model

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   19

Page 20: L22 kmeans gmm - Virginia Tech

Object Bag  of  ‘words’

Fei-Fei Li

Page 21: L22 kmeans gmm - Virginia Tech

Fei-Fei Li

Page 22: L22 kmeans gmm - Virginia Tech

Interest  Point  Features

Normalize  patch

Detect patches [Mikolajczyk and Schmid '02]

[Matas  et  al.  ’02]  

[Sivic  et  al.  ’03]

Compute SIFT descriptor [Lowe '99]

Slide  credit:  Josef  Sivic

Page 23: L22 kmeans gmm - Virginia Tech

Patch  Features

Slide  credit:  Josef  Sivic

Page 24: L22 kmeans gmm - Virginia Tech

dictionary  formation

Slide  credit:  Josef  Sivic

Page 25: L22 kmeans gmm - Virginia Tech

Clustering (usually k-means)

Vector  quantization

Slide  credit:  Josef  Sivic

Page 26: L22 kmeans gmm - Virginia Tech

Clustered  Image  Patches

Fei-Fei et al. 2005

Page 27: L22 kmeans gmm - Virginia Tech

Image  representation

[Figure: histogram of codeword frequencies for an image]

Fei-Fei Li
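Putting the pipeline together, a rough Python sketch of the vector-quantization step (function name and shapes are illustrative; the codebook is assumed to come from running K-means on pooled training descriptors, and descriptor extraction, e.g. SIFT, is not shown):

import numpy as np

def bow_histogram(descriptors, codebook):
    # descriptors: (M, d) local features from one image (e.g. SIFT)
    # codebook:    (k, d) cluster centers learned by K-means on training descriptors
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)                     # assign each patch to its nearest codeword
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                         # normalized codeword frequencies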

Page 28: L22 kmeans gmm - Virginia Tech

(One) bad case for k-means

• Clusters may overlap
• Some clusters may be "wider" than others

• GMM to the rescue!

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   28

Page 29: L22 kmeans gmm - Virginia Tech

GMM

(C)  Dhruv  Batra   29Figure  Credit:  Kevin  Murphy


Page 30: L22 kmeans gmm - Virginia Tech

Recall Multi-variate Gaussians

(C)  Dhruv  Batra   30
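For reference (the slide itself is a board/figure slide), the d-dimensional Gaussian density being recalled here is:

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)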

Page 31: L22 kmeans gmm - Virginia Tech

GMM

(C)  Dhruv  Batra   31


Figure  Credit:  Kevin  Murphy

Page 32: L22 kmeans gmm - Virginia Tech

Hidden Data Causes Problems #1

• Fully Observed (Log) Likelihood factorizes

• Marginal (Log) Likelihood doesn't factorize

• All parameters coupled!

(C)  Dhruv  Batra   32
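Written out for a GMM (my elaboration of the bullets above): with the component labels z_i observed, the log likelihood is a sum of per-point, per-component terms, whereas the marginal log likelihood has a sum inside the log that ties all parameters together.

\log p(x, z \mid \theta) = \sum_{i=1}^{N}\Big[\log \pi_{z_i} + \log \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})\Big]
\qquad\text{vs.}\qquad
\log p(x \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)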

Page 33: L22 kmeans gmm - Virginia Tech

GMM vs. Gaussian Joint Bayes Classifier

• On Board
  – Observed Y vs. Unobserved Z
  – Likelihood vs. Marginal Likelihood

(C)  Dhruv  Batra   33

Page 34: L22 kmeans gmm - Virginia Tech

Hidden  Data  Causes  Problems  #2

(C)  Dhruv  Batra   34


Figure  Credit:  Kevin  Murphy

Page 35: L22 kmeans gmm - Virginia Tech

Hidden Data Causes Problems #2

• Identifiability

(C) Dhruv Batra 35

[Figure: the data, and the likelihood plotted over (µ1, µ2)]

Figure  Credit:  Kevin  Murphy

Page 36: L22 kmeans gmm - Virginia Tech

Hidden Data Causes Problems #3

• Likelihood has singularities if one Gaussian "collapses"

(C) Dhruv Batra 36

[Figure: p(x) vs. x]
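Concretely (my elaboration, not text from the slide): if component j places its mean exactly on a data point x_n, its term in the likelihood grows without bound as its variance shrinks:

\pi_j\, \mathcal{N}\big(x_n \mid \mu_j = x_n,\ \sigma_j^2 I\big) = \frac{\pi_j}{(2\pi\sigma_j^2)^{d/2}} \;\to\; \infty \quad \text{as } \sigma_j \to 0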

Page 37: L22 kmeans gmm - Virginia Tech

Special case: spherical Gaussians and hard assignments

• If P(X|Z=k) is spherical, with the same \sigma for all classes:

  P(x_i \mid z = j) \propto \exp\!\left(-\frac{1}{2\sigma^2}\,\|x_i - \mu_j\|^2\right)

• If each x_i belongs to one class C(i) (hard assignment), marginal likelihood:

  \prod_{i=1}^{N} \sum_{j=1}^{k} P(x_i, y = j) \propto \prod_{i=1}^{N} \exp\!\left(-\frac{1}{2\sigma^2}\,\|x_i - \mu_{C(i)}\|^2\right)

• M(M)LE same as K-means!!!

Slide Credit: Carlos Guestrin  (C) Dhruv Batra 37
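Taking logs makes the last bullet explicit (a one-line step not shown on the slide): maximizing this hard-assignment likelihood over \mu and C is exactly minimizing the K-means objective from earlier.

\max_{\mu, C} \sum_{i=1}^{N} \Big(-\frac{1}{2\sigma^{2}}\,\|x_i - \mu_{C(i)}\|^{2}\Big) \;\equiv\; \min_{\mu, C} \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^{2} = \min_{\mu, C} F(\mu, C)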

Page 38: L22 kmeans gmm - Virginia Tech

The K-means GMM assumption

• There are k components

• Component i has an associated mean vector µi

[Figure: three component means µ1, µ2, µ3]

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   38

Page 39: L22 kmeans gmm - Virginia Tech

The K-means GMM assumption

• There are k components

• Component i has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Each data point is generated according to the following recipe:

[Figure: three component means µ1, µ2, µ3]

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   39

Page 40: L22 kmeans gmm - Virginia Tech

The K-means GMM assumption

• There are k components

• Component i has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Each data point is generated according to the following recipe:

1. Pick a component at random: Choose component i with probability P(y=i)

[Figure: component mean µ2]

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   40

Page 41: L22 kmeans gmm - Virginia Tech

The K-means GMM assumption

• There are k components

• Component i has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Each data point is generated according to the following recipe:

1. Pick a component at random: Choose component i with probability P(y=i)

2. Datapoint ∼ N(µi, σ²I)

[Figure: point x drawn from the Gaussian around µ2]

Slide  Credit:  Carlos  Guestrin(C)  Dhruv  Batra   41

Page 42: L22 kmeans gmm - Virginia Tech

The General GMM assumption

• There are k components

• Component i has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix Σi

Each data point is generated according to the following recipe:

1. Pick a component at random: Choose component i with probability P(y=i)

2. Datapoint ∼ N(µi, Σi)

[Figure: three component means µ1, µ2, µ3]

Slide Credit: Carlos Guestrin  (C) Dhruv Batra 42
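A minimal sketch of this generative recipe in Python (the number of components, mixing weights, means, and covariances below are made-up illustrative values, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D GMM: mixing weights P(y=i), means mu_i, covariances Sigma_i
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2), [[1.0, 0.5], [0.5, 1.0]]])

def sample_gmm(n):
    X = np.empty((n, 2))
    for t in range(n):
        i = rng.choice(len(pi), p=pi)                      # 1. pick component i with probability P(y=i)
        X[t] = rng.multivariate_normal(mus[i], Sigmas[i])  # 2. draw x ~ N(mu_i, Sigma_i)
    return X

X = sample_gmm(500)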