Classification of unlabeled data: Mixture models, K-means algorithm, and Expectation Maximization (EM). Based on slides by Reza Shadmehr, which are based on Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.


Page 1: Classification of unlabeled data:

Classification of unlabeled data:

Mixture models, K-means algorithm, and Expectation Maximization (EM)

Based on slides by Reza Shadmehr, which are based on Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.

Page 2: Classification of unlabeled data:

The problem of finding labels for unlabeled data

In nature, items often do not come with labels. How can we learn labels?

[Figure: two scatter plots of the same two-dimensional data (axes x1 and x2); left panel: unlabeled data, right panel: labeled data.]

Page 3: Classification of unlabeled data:

Example: image segmentation

Identify pixels that are white matter, gray matter, or outside of the brain.

[Figure: histogram of normalized pixel values (counts up to about 1500), with modes labeled "Outside the brain", "Gray matter", and "White matter".]

Page 4: Classification of unlabeled data:

Mixtures

If our data is not labeled, we can hypothesize that:

1. There are exactly m classes in the data: $y \in \{1, 2, \ldots, m\}$

2. Each class y occurs with a specific frequency: $P(y)$

3. Examples of class y are governed by a specific distribution: $p(\mathbf{x} \mid y)$

According to our hypothesis, each example x(i) must have been generated from a specific "mixture" distribution:

$$p(\mathbf{x}) = \sum_{j=1}^{m} P(y = j)\, p(\mathbf{x} \mid y = j)$$

We might hypothesize that the distributions are Gaussian:

$$p(\mathbf{x}) = \sum_{j=1}^{m} P(y = j)\, N(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

The parameters of the distributions are the mixing proportions $P(y = 1), \ldots, P(y = m)$ and the normal-distribution parameters $\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m$.
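To make the mixture density concrete, here is a minimal Python/NumPy/SciPy sketch (not part of the original slides) that evaluates $p(\mathbf{x})$ for hypothetical mixing proportions and Gaussian parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture with m = 3 classes (illustrative values only).
mixing = np.array([0.5, 0.3, 0.2])                  # P(y = j), sums to 1
means = [np.array([-6.0, 0.0]),
         np.array([-2.0, 3.0]),
         np.array([-4.0, -3.0])]                    # mu_j
covs = [np.eye(2), 2.0 * np.eye(2), np.eye(2)]      # Sigma_j

def mixture_density(x):
    """p(x) = sum_j P(y = j) * N(x | mu_j, Sigma_j)."""
    return sum(p * multivariate_normal(mu, S).pdf(x)
               for p, mu, S in zip(mixing, means, covs))

print(mixture_density(np.array([-5.0, 1.0])))
```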

Page 5: Classification of unlabeled data:

Mixture densities

Suppose there are only three classes. The data generation process will be something like this:

$$p(\mathbf{x}) = \sum_{j=1}^{3} P(y = j)\, N(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

[Figure: the class prior P(y) over y = 1, 2, 3 and the three class-conditional densities N(x | mu_1, Sigma_1), N(x | mu_2, Sigma_2), N(x | mu_3, Sigma_3); each data point is generated by a single Gaussian.]

If we had a classifier that could sort our data, we would get labels like this. Here, the idea is that if x is assigned to class i, we set the i-th component of vector z equal to one and keep the remaining components of z at zero. z is a multinomial random variable that represents the class x belongs to:

$$\mathbf{z}^{(n)} = \left( z_1^{(n)}, z_2^{(n)}, \ldots, z_m^{(n)} \right)^T, \qquad z_i^{(n)} = \begin{cases} 1 & \text{if } \mathbf{x}^{(n)} \text{ belongs to class } i \\ 0 & \text{otherwise} \end{cases}$$

x        y    z
x(1)     2    (0, 1, 0)
x(2)     3    (0, 0, 1)
x(3)     1    (1, 0, 0)
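As an illustration of this generative process, here is a small Python/NumPy sketch (hypothetical parameter values, not part of the slides) that draws a class from P(y), records it as a one-hot vector z, and then samples x from the corresponding Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class, 2-D mixture (illustrative values only).
mixing = np.array([0.5, 0.3, 0.2])                      # P(y = j)
means = np.array([[-6.0, 0.0], [-2.0, 3.0], [-4.0, -3.0]])
covs = np.array([np.eye(2), 2.0 * np.eye(2), np.eye(2)])

def sample(n):
    """Each data point is generated by a single Gaussian chosen by the latent z."""
    X, Z = [], []
    for _ in range(n):
        j = rng.choice(len(mixing), p=mixing)            # draw the class
        z = np.zeros(len(mixing)); z[j] = 1.0            # one-hot indicator z
        X.append(rng.multivariate_normal(means[j], covs[j]))
        Z.append(z)
    return np.array(X), np.array(Z)

X, Z = sample(5)
print(np.column_stack([X, Z]))                           # x alongside its label vector z
```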

Page 6: Classification of unlabeled data:

$$p(\mathbf{x}) = \sum_{i=1}^{3} P(z_i = 1)\, p(\mathbf{x} \mid z_i = 1)$$

[Figure: z is the hidden variable and x is the measured variable; the panels show the prior P(z), the class-conditional densities p(x | z_i = 1), and the posteriors p(z_i = 1 | x, mu_i, Sigma_i) for i = 1, 2, 3.]

Page 7: Classification of unlabeled data:

If our data were labeled, we could estimate the mean and variance of each Gaussian and compute the mixing proportion.

Our imagined classifier assigns each point a hard label:

$$z_i^{(n)} = \begin{cases} 1 & \text{if } \mathbf{x}^{(n)} \text{ belongs to class } i \\ 0 & \text{otherwise} \end{cases}$$

Number of times a particular class was present:

$$n_j = \sum_{i=1}^{N} z_j^{(i)}$$

Mixing proportion:

$$\hat{P}(y = j) = \frac{n_j}{N}$$

Mean of each class:

$$\hat{\boldsymbol{\mu}}_j = \frac{1}{n_j} \sum_{i=1}^{N} z_j^{(i)} \mathbf{x}^{(i)}$$

Variance of each class:

$$\hat{\boldsymbol{\Sigma}}_j = \frac{1}{n_j} \sum_{i=1}^{N} z_j^{(i)} \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_j \right) \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_j \right)^T$$

The K-means algorithm begins by making an initial guess about each class center $\hat{\boldsymbol{\mu}}_j$.

Step 1: Assign each data point to the class that has the closest center:

$$z_i^{(n)} = \begin{cases} 1 & \text{if } i = \arg\min_j \left\lVert \mathbf{x}^{(n)} - \hat{\boldsymbol{\mu}}_j \right\rVert^2 \\ 0 & \text{otherwise} \end{cases}$$

Step 2: Re-estimate the center of each class:

$$n_j = \sum_{i=1}^{N} z_j^{(i)}, \qquad \hat{\boldsymbol{\mu}}_j = \frac{1}{n_j} \sum_{i=1}^{N} z_j^{(i)} \mathbf{x}^{(i)}$$
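A compact sketch of these two steps in Python/NumPy (the data layout, an array `X` of shape (N, d), and the choice of random data points as initial centers are assumptions, not something the slides specify):

```python
import numpy as np

def kmeans(X, m, n_iter=20, seed=0):
    """Alternate Step 1 (assign to the closest center) and Step 2 (re-estimate centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)]   # initial guess for mu_j
    for _ in range(n_iter):
        # Step 1: hard assignment based on squared distance to each center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # shape (N, m)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

The hard labels returned here play the role of the indicator vectors z above; the mixing proportions would then be the fraction of points carrying each label.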

Page 8: Classification of unlabeled data:

K-means algorithm on an easy problem

[Figure: three scatter plots: the true labels, the trajectory of the K-means centers over the first 5 iterations (converged), and the resulting K-means classification.]

Page 9: Classification of unlabeled data:

K-means algorithm on a harder problem

[Figure: three scatter plots: the true labels, the trajectory of the K-means centers over the first 10 iterations (converged), and the resulting K-means classification.]

Page 10: Classification of unlabeled data:

The cost function for the K-means algorithm

$$J = \sum_{n=1}^{N} \sum_{j=1}^{m} z_j^{(n)} \left\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right\rVert^2 = \sum_{n=1}^{N} \sum_{j=1}^{m} z_j^{(n)} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right)^T \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right)$$

Step 1: assuming that class centers are known, to minimize cost we should classify a point based on the closest center:

$$z_i^{(n)} = \begin{cases} 1 & \text{if } i = \arg\min_j \left\lVert \mathbf{x}^{(n)} - \hat{\boldsymbol{\mu}}_j \right\rVert^2 \\ 0 & \text{otherwise} \end{cases}$$

Step 2: assuming that class memberships are known, the derivative of cost with respect to centers dictates that centers should be in the middle of the class:

$$\frac{dJ}{d\boldsymbol{\mu}_j} = -2 \sum_{n=1}^{N} z_j^{(n)} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right) = 0 \quad\Longrightarrow\quad \boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N} z_j^{(n)} \mathbf{x}^{(n)}}{\sum_{n=1}^{N} z_j^{(n)}} = \frac{1}{n_j} \sum_{n=1}^{N} z_j^{(n)} \mathbf{x}^{(n)}$$

where $n_j = \sum_{n=1}^{N} z_j^{(n)}$ is the number of points in class j.
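The cost J can be evaluated directly from the hard assignments; a one-function sketch using the `centers` and `labels` returned by the hypothetical `kmeans` helper above, so that the monotone decrease of J can be checked between iterations:

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    """J = sum_n sum_j z_j^(n) ||x^(n) - mu_j||^2; only the assigned center contributes."""
    return float(((X - centers[labels]) ** 2).sum())
```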

Page 11: Classification of unlabeled data:

Types of classification data

Complete data set (both x and the class membership z are known):

x        y    z
x(1)     2    (0, 1, 0)
x(2)     3    (0, 0, 1)
x(3)     1    (1, 0, 0)

K-means: the data set is incomplete (y is unknown), but we complete it using a specific cost function, which yields hard assignments z:

x        y    z
x(1)     ?    (0, 1, 0)
x(2)     ?    (0, 0, 1)
x(3)     ?    (1, 0, 0)

EM: the data set is incomplete, but we complete it using posterior probabilities (a "soft" class membership):

x        y    z
x(1)     ?    P(z1 = 1 | x(1)), P(z2 = 1 | x(1)), P(z3 = 1 | x(1))
x(2)     ?    P(z1 = 1 | x(2)), P(z2 = 1 | x(2)), P(z3 = 1 | x(2))
x(3)     ?    P(z1 = 1 | x(3)), P(z2 = 1 | x(3)), P(z3 = 1 | x(3))

Page 12: Classification of unlabeled data:

EM algorithm

Instead of using a "hard" class assignment, we could use a "soft" assignment that represents the posterior probability of class membership for each data point:

$$P(z_i = 1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid z_i = 1, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, P(z_i = 1)}{\sum_{j=1}^{m} p(\mathbf{x} \mid z_j = 1, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\, P(z_j = 1)}$$

This gives us a “soft” class membership. Using this classifier, we can now re-estimate the parameters of our distribution for each class:

"Soft" class membership: $P(z_j = 1 \mid \mathbf{x}^{(n)})$

"Number of times" a particular class was present (no longer an integer!):

$$n_j = \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)})$$

$$\hat{P}(z_j = 1) = \frac{n_j}{N}, \qquad \hat{\boldsymbol{\mu}}_j = \frac{1}{n_j} \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)})\, \mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\Sigma}}_j = \frac{1}{n_j} \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)}) \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_j \right) \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_j \right)^T$$
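A sketch of the soft assignment and the corresponding re-estimation in Python/NumPy/SciPy (the array shapes are assumptions: `X` is (N, d), `mixing` is (m,), `means` is (m, d), `covs` is (m, d, d)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, mixing, means, covs):
    """Posterior class memberships P(z_j = 1 | x^(n)), returned as an (N, m) array."""
    lik = np.column_stack([mixing[j] * multivariate_normal(means[j], covs[j]).pdf(X)
                           for j in range(len(mixing))])
    return lik / lik.sum(axis=1, keepdims=True)

def soft_updates(X, R):
    """Re-estimate mixing proportions, means, and covariances from soft memberships R."""
    n_j = R.sum(axis=0)                       # "number of times" each class was present
    mixing = n_j / len(X)
    means = (R.T @ X) / n_j[:, None]
    covs = np.array([(R[:, j, None] * (X - means[j])).T @ (X - means[j]) / n_j[j]
                     for j in range(R.shape[1])])
    return mixing, means, covs
```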

Page 13: Classification of unlabeled data:

EM algorithm: iteration 0

We assume that there are exactly m classes in the data. We begin with a guess about the initial setting of the parameters:

$$\boldsymbol{\theta}^{(0)} = \left\{ P^{(0)}(z_1 = 1), \boldsymbol{\mu}_1^{(0)}, \boldsymbol{\Sigma}_1^{(0)}, \ldots, P^{(0)}(z_m = 1), \boldsymbol{\mu}_m^{(0)}, \boldsymbol{\Sigma}_m^{(0)} \right\}$$

We select m random samples from the dataset and assume that these are the means of the classes. We set the variance of each class to the sample covariance of the whole data set. We could set the mixing coefficients to be equal.
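A sketch of this initialization under exactly those choices (random data points as means, the sample covariance of the whole data set for every class, equal mixing coefficients):

```python
import numpy as np

def init_params(X, m, seed=0):
    """Iteration 0: random means, shared sample covariance, equal mixing proportions."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=m, replace=False)].copy()
    covs = np.array([np.cov(X, rowvar=False) for _ in range(m)])
    mixing = np.full(m, 1.0 / m)
    return mixing, means, covs
```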

Page 14: Classification of unlabeled data:

EM algorithm: iteration k+1

We have a guess about the mixture parameters:

$$\boldsymbol{\theta}^{(k)} = \left\{ P^{(k)}(z_1 = 1), \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\Sigma}_1^{(k)}, \ldots, P^{(k)}(z_m = 1), \boldsymbol{\mu}_m^{(k)}, \boldsymbol{\Sigma}_m^{(k)} \right\}$$

The "E" step: Complete the incomplete data with the posterior probabilities. For each data point $\mathbf{x}^{(n)}$, we assign a class membership by computing the posterior probability for all m classes:

$$P(z_i = 1 \mid \mathbf{x}^{(n)}, \boldsymbol{\theta}^{(k)}) = \frac{p(\mathbf{x}^{(n)} \mid z_i = 1, \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)})\, P^{(k)}(z_i = 1)}{\sum_{j=1}^{m} p(\mathbf{x}^{(n)} \mid z_j = 1, \boldsymbol{\mu}_j^{(k)}, \boldsymbol{\Sigma}_j^{(k)})\, P^{(k)}(z_j = 1)}$$

The "M" step: estimate new mixture parameters based on the new class assignments:

$$n_j = \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{(k)})$$

$$P^{(k+1)}(z_j = 1) = \frac{n_j}{N}, \qquad \boldsymbol{\mu}_j^{(k+1)} = \frac{1}{n_j} \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{(k)})\, \mathbf{x}^{(i)}, \qquad \boldsymbol{\Sigma}_j^{(k+1)} = \frac{1}{n_j} \sum_{i=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{(k)}) \left( \mathbf{x}^{(i)} - \boldsymbol{\mu}_j^{(k+1)} \right) \left( \mathbf{x}^{(i)} - \boldsymbol{\mu}_j^{(k+1)} \right)^T$$
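Putting the two steps together, a sketch of the full iteration that reuses the hypothetical `init_params`, `responsibilities`, and `soft_updates` helpers sketched earlier (they are assumed to be in scope):

```python
def em(X, m, n_iter=50, seed=0):
    """Run EM: the E step computes posteriors, the M step re-estimates mixture parameters."""
    mixing, means, covs = init_params(X, m, seed)        # iteration 0
    for _ in range(n_iter):
        R = responsibilities(X, mixing, means, covs)     # "E" step: soft memberships
        mixing, means, covs = soft_updates(X, R)         # "M" step: new parameters
    return mixing, means, covs, R
```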

Page 15: Classification of unlabeled data:

Example: classification with EM (easy problem)

[Figure: the EM fit at the initial guess and at the 5th, 10th, and 15th iterations, shown alongside the true labeled data and the final posterior probabilities.]

To plot each distribution, I draw an ellipse centered at the mean with an area that covers the expected location of the 1st, 2nd, and 3rd quartiles of the data. Colors indicate the probability of belonging to one class or another.

Page 16: Classification of unlabeled data:

EM algorithm: log-likelihood increases with each step

$$\log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \log \prod_{i=1}^{N} p(\mathbf{x}^{(i)}) = \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}) = \sum_{i=1}^{N} \log \left[ P(z_1 = 1)\, N(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + P(z_2 = 1)\, N(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) + \cdots \right]$$

[Figure: $\log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)} \mid \boldsymbol{\theta}^{(k)})$ plotted against the iteration number k, rising from roughly -930 to -870 over the first 40 iterations.]
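One way to see this in practice is to evaluate the log-likelihood after every iteration; a sketch (the log-sum-exp form is a numerical-stability choice on my part, not something the slides prescribe):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, mixing, means, covs):
    """log p(x^(1), ..., x^(N)) = sum_n log sum_j P(z_j = 1) N(x^(n) | mu_j, Sigma_j)."""
    log_terms = np.column_stack([np.log(mixing[j])
                                 + multivariate_normal(means[j], covs[j]).logpdf(X)
                                 for j in range(len(mixing))])
    return float(logsumexp(log_terms, axis=1).sum())
```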

Page 17: Classification of unlabeled data:

Example: classification with EM (hard problem)

[Figure: the EM fit at the initial guess and at the 30th and 60th iterations, shown alongside the true labeled data and the posterior probabilities, together with a plot of $\log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)} \mid \boldsymbol{\theta}^{(k)})$ against the iteration number, rising from roughly -858 to -850 over 80 iterations.]

Page 18: Classification of unlabeled data:

Some observations about EM:

EM takes many more iterations to reach convergence than K-means, and each cycle requires significantly more computation.

It is common to run K-means first to find a reasonable initialization for EM (see the sketch below):

The covariance matrices can be set to the cluster covariances found by K-means.

The mixing proportions can be set to the fraction of data assigned to each cluster by K-means.
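A sketch of that hand-off, using the output of the hypothetical `kmeans` helper from earlier: the hard labels give per-cluster covariances and cluster fractions that can seed EM in place of the random initialization.

```python
import numpy as np

def init_from_kmeans(X, centers, labels):
    """Seed EM with K-means results: cluster means, cluster covariances, cluster fractions."""
    m = len(centers)
    mixing = np.array([(labels == j).mean() for j in range(m)])
    covs = np.array([np.cov(X[labels == j], rowvar=False) for j in range(m)])
    return mixing, centers.copy(), covs
```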

Page 19: Classification of unlabeled data:

On the "M" step, EM maximizes the log-likelihood of the data, under the assumption that the posterior probabilities are held constant. Write the mixing proportions as $\pi_j \equiv P(z_j = 1)$, so that

$$p(\mathbf{x}^{(n)}) = \sum_{j=1}^{m} \pi_j\, N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \qquad p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)})$$

$$l = \log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \sum_{n=1}^{N} \log \sum_{j=1}^{m} \pi_j\, N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

Differentiating with respect to $\boldsymbol{\mu}_i$, and using

$$\frac{d N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{d \boldsymbol{\mu}_i} = N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, \boldsymbol{\Sigma}_i^{-1} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_i \right)$$

we get

$$\frac{dl}{d\boldsymbol{\mu}_i} = \sum_{n=1}^{N} \frac{\pi_i\, N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{j=1}^{m} \pi_j\, N(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\, \boldsymbol{\Sigma}_i^{-1} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_i \right) = \sum_{n=1}^{N} P(z_i = 1 \mid \mathbf{x}^{(n)})\, \boldsymbol{\Sigma}_i^{-1} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_i \right)$$

Page 20: Classification of unlabeled data:

EM and maximizing the log-likelihood of the data

Setting the derivative to zero,

$$\frac{dl}{d\boldsymbol{\mu}_j} = \sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)})\, \boldsymbol{\Sigma}_j^{-1} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right) = 0 \quad\Longrightarrow\quad \boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)})\, \mathbf{x}^{(n)}}{\sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)})}$$

This is the same as the rule that we came up with for "soft" classification. The rules are likewise the same for the estimates of the class variances and the mixture probabilities:

$$\boldsymbol{\Sigma}_j = \frac{\sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)}) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right)^T}{\sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)})}, \qquad \pi_j = \frac{1}{N} \sum_{n=1}^{N} P(z_j = 1 \mid \mathbf{x}^{(n)})$$

Note that these equations for the parameters are not solutions, because the posterior probabilities are themselves functions of the parameters. What we have done with the algorithm is to solve for the parameters iteratively: we made a guess about the parameters, computed the posteriors, and then found the parameters that maximize the likelihood of the data. We iterated until convergence.

Page 21: Classification of unlabeled data:

EM algorithm: iteration k+1

We have a guess about the mixture parameters:

$$\boldsymbol{\theta}^{(k)} = \left\{ P^{(k)}(z_1 = 1), \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\Sigma}_1^{(k)}, \ldots, P^{(k)}(z_m = 1), \boldsymbol{\mu}_m^{(k)}, \boldsymbol{\Sigma}_m^{(k)} \right\}$$

The "E" step: Compute the posterior probabilities for all data points:

$$P(z_j = 1 \mid \mathbf{x}, \boldsymbol{\theta}^{(k)}) = \frac{p(\mathbf{x} \mid z_j = 1, \boldsymbol{\theta}^{(k)})\, P^{(k)}(z_j = 1)}{\sum_{i=1}^{m} p(\mathbf{x} \mid z_i = 1, \boldsymbol{\theta}^{(k)})\, P^{(k)}(z_i = 1)}$$

The "M" step: under the assumption that the posterior probabilities are constant, estimate new parameter values to maximize the log-likelihood:

$$\boldsymbol{\theta}^{(k+1)} = \arg\max_{\boldsymbol{\theta}} \log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)} \mid \boldsymbol{\theta})$$