Post on 19-Dec-2015
Unsupervised Learning: Clustering
Rong Jin
Outline
- Unsupervised learning
- K-means for clustering
- Expectation-Maximization algorithm for clustering
Unsupervised vs. Supervised Learning

Supervised learning
- Training data: D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
- Every training example is labeled

Unsupervised learning
- Training data: D = {x_1, x_2, ..., x_n}
- No data is labeled
- We can still discover the structure of the data

Semi-supervised learning
- Training data: D = {(x_1, y_1), ..., (x_n, y_n); x_{n+1}, ..., x_{n+m}}
- Mixture of labeled and unlabeled data

Can you think of ways to utilize the unlabeled data for improving prediction?
Unsupervised Learning
- Clustering
- Visualization
- Density estimation
- Outlier/novelty detection
- Data compression
Clustering/Density Estimation
[Figure: scatter plot of income ($$$) vs. age, showing natural clusters in the data]
Clustering for Visualization
Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
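As a rough illustration of k-means image compression (this is my own sketch, not the code behind the linked demo), pixel colors can be clustered and each pixel replaced by its cluster center, so the image needs only k distinct colors. The function name `quantize` and the toy random image are illustrative assumptions:

```python
import numpy as np

def quantize(image, k=8, n_iters=20, seed=0):
    """Compress an image by clustering its pixel colors with k-means
    and replacing each pixel by its cluster center (k colors total)."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    # Initialize centers from k random pixels.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each pixel to its nearest color center.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean color of the pixels it owns.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return centers[labels].reshape(image.shape)

# A toy 16x16 RGB "image" with random colors, compressed to 4 colors.
img = np.random.default_rng(2).integers(0, 256, (16, 16, 3))
compressed = quantize(img, k=4)
```

Storing the k colors plus one small label per pixel, instead of a full color per pixel, is what yields the compression.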
K-means for Clustering

Key for clustering
- Find cluster centers
- Determine appropriate clusters (very, very hard)
K-means
- Start with a random guess of cluster centers
- Determine the membership of each data point
- Adjust the cluster centers
K-means
1. Ask user how many clusters they'd like (e.g. k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points).
4. Each center finds the centroid of the points it owns, and moves there.

Any computational problem?
Computational complexity: every iteration compares every point with every center, so the cost per iteration grows linearly in N, the number of points — expensive when N is large.
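The four-step procedure can be sketched in a few lines of NumPy. This is a minimal illustration (assuming squared-Euclidean distance, initialization from random data points, and a convergence check of my own choosing), not an optimized implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate membership assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: randomly guess k cluster center locations (k data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it owns.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```

The distance matrix makes the per-iteration cost explicit: N points times k centers, each a d-dimensional comparison.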
Improve K-means

Group points by region
- KD-tree
- SR-tree

Key difference
- Find the closest center for each rectangle
- Assign all the points within a rectangle to one cluster
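A simplified sketch of the rectangle idea, assuming a plain uniform grid rather than a real KD-tree or SR-tree (so this only approximates the actual method, and the function name is mine): bucket the points into rectangles, find the closest center for each rectangle's midpoint, and assign all of that rectangle's points to the one cluster.

```python
import numpy as np

def kmeans_grid_assign(X, centers, n_bins=8):
    """One approximate assignment pass: points are grouped by grid
    rectangle, and each whole rectangle is assigned to the center
    closest to its midpoint (instead of testing every point)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Which rectangle (grid cell) each point falls into.
    cell = np.floor((X - lo) / (hi - lo + 1e-12) * n_bins)
    cell = cell.clip(0, n_bins - 1).astype(int)
    labels = np.empty(len(X), dtype=int)
    for c in np.unique(cell, axis=0):
        mask = np.all(cell == c, axis=1)
        mid = lo + (c + 0.5) / n_bins * (hi - lo)   # rectangle midpoint
        j = ((centers - mid) ** 2).sum(1).argmin()  # closest center for the rectangle
        labels[mask] = j                            # whole rectangle -> one cluster
    return labels
```

One center-to-rectangle distance replaces many center-to-point distances; the tree-based methods refine this by splitting only the rectangles whose ownership is ambiguous.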
[Figures: successive steps of the improved K-means on example data]
Gaussian Mixture Model for Clustering

Assume that data are generated from a mixture of Gaussian distributions.
- For each Gaussian distribution
  - Center: μ_j
  - Variance: σ_j (ignore)
- For each data point
  - Determine membership: z_ij = p(μ_j | x_i)
Learning a Gaussian Mixture (with known covariance)

Probability of a data point x_i:

  p(x = x_i) = Σ_j p(x = x_i, μ = μ_j)
             = Σ_j p(μ = μ_j) p(x = x_i | μ = μ_j)
             = Σ_j p(μ_j) · (2πσ²)^(−d/2) · exp(−‖x_i − μ_j‖² / (2σ²))

Log-likelihood of unlabeled data:

  log p(D) = Σ_i log p(x = x_i)
           = Σ_i log [ Σ_j p(μ_j) · (2πσ²)^(−d/2) · exp(−‖x_i − μ_j‖² / (2σ²)) ]

Find the optimal parameters θ = { μ_j, p(μ_j) }.
Learning a Gaussian Mixture (with known covariance)

E-step: compute the expected membership of each point in each cluster

  E[z_ij] = p(x_i | μ_j) p(μ_j) / Σ_n p(x_i | μ_n) p(μ_n)
          = exp(−‖x_i − μ_j‖² / (2σ²)) p(μ_j) / Σ_n exp(−‖x_i − μ_n‖² / (2σ²)) p(μ_n)

M-step: re-estimate the centers and priors from the expected memberships

  μ_j = Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]

  p(μ_j) = (1/m) Σ_{i=1}^{m} E[z_ij]

Iterate the E- and M-steps until convergence.
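The E- and M-steps above can be sketched as follows, assuming a spherical Gaussian mixture with a known, shared variance σ² so that only the centers μ_j and priors p(μ_j) are learned. The farthest-point initialization is my own choice for the sketch, not part of the slides:

```python
import numpy as np

def em_gmm(X, k, sigma=1.0, n_iters=50, seed=0):
    """EM for a mixture of spherical Gaussians with known variance."""
    rng = np.random.default_rng(seed)
    # Simple farthest-point initialization for the centers (assumption).
    mu = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[d.argmax()])
    mu = np.array(mu, dtype=float)
    p = np.full(k, 1.0 / k)  # uniform initial priors p(mu_j)
    for _ in range(n_iters):
        # E-step: E[z_ij] proportional to exp(-||x_i - mu_j||^2 / 2sigma^2) p(mu_j)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * sigma ** 2)) * p
        Ez = w / w.sum(axis=1, keepdims=True)
        # M-step: mu_j = weighted mean of the data; p(mu_j) = mean membership
        mu = (Ez.T @ X) / Ez.sum(axis=0)[:, None]
        p = Ez.mean(axis=0)
    return mu, p, Ez
```

With σ small relative to the cluster separation the memberships become nearly hard and the updates approach k-means; with larger σ every point softly influences every center.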
Gaussian Mixture Example
[Figures: the mixture at the start and after iterations 1, 2, 3, 4, 5, 6, and 20]