Post on 19-Dec-2015
Unsupervised Learning: Clustering
Rong Jin
Outline
- Unsupervised learning
- K-means for clustering
- Expectation-Maximization algorithm for clustering
Unsupervised vs. Supervised Learning

Supervised learning
- Training data: D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
- Every training example is labeled

Unsupervised learning
- Training data: D = {x_1, x_2, ..., x_n}
- No data is labeled
- We can still discover the structure of the data

Semi-supervised learning
- Training data: D = {(x_1, y_1), ..., (x_n, y_n); x_{n+1}, ..., x_{n+m}}
- Mixture of labeled and unlabeled data

Can you think of ways to utilize the unlabeled data for improving prediction?
Unsupervised Learning
- Clustering
- Visualization
- Density estimation
- Outlier/novelty detection
- Data compression
Clustering/Density Estimation
[Figure: scatter plot of income ($$$) vs. age, showing natural clusters in the data]
Clustering for Visualization
Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
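As a rough illustration of k-means image compression (this is my own sketch, not the code behind the linked demo), pixel colors can be clustered and each pixel replaced by its cluster center, so the image needs only k distinct colors. The function name `quantize` and the toy random image are illustrative assumptions:

```python
import numpy as np

def quantize(image, k=8, n_iters=20, seed=0):
    """Compress an image by clustering its pixel colors with k-means
    and replacing each pixel by its cluster center (k colors total)."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    # Initialize centers from k random pixels.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each pixel to its nearest color center.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean color of the pixels it owns.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return centers[labels].reshape(image.shape)

# A toy 16x16 RGB "image" with random colors, compressed to 4 colors.
img = np.random.default_rng(2).integers(0, 256, (16, 16, 3))
compressed = quantize(img, k=4)
```

Storing the k colors plus one small label per pixel, instead of a full color per pixel, is what yields the compression.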
K-means for Clustering

Key for clustering
- Find cluster centers
- Determine appropriate clusters (very, very hard)
K-means
- Start with a random guess of cluster centers
- Determine the membership of each data point
- Adjust the cluster centers
K-means
1. Ask user how many clusters they'd like (e.g. k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it's closest to (thus each center "owns" a set of data points).
4. Each center finds the centroid of the points it owns, and moves there.

Any computational problem?
Computational complexity: every iteration compares every point with every center, so the cost per iteration grows linearly in N, the number of points — expensive when N is large.
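The four-step procedure can be sketched in a few lines of NumPy. This is a minimal illustration (assuming squared-Euclidean distance, initialization from random data points, and a convergence check of my own choosing), not an optimized implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate membership assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: randomly guess k cluster center locations (k data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it owns.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```

The distance matrix makes the per-iteration cost explicit: N points times k centers, each a d-dimensional comparison.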
Improve K-means

Group points by region
- KD-tree
- SR-tree

Key difference
- Find the closest center for each rectangle
- Assign all the points within a rectangle to one cluster
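A simplified sketch of the rectangle idea, assuming a plain uniform grid rather than a real KD-tree or SR-tree (so this only approximates the actual method, and the function name is mine): bucket the points into rectangles, find the closest center for each rectangle's midpoint, and assign all of that rectangle's points to the one cluster.

```python
import numpy as np

def kmeans_grid_assign(X, centers, n_bins=8):
    """One approximate assignment pass: points are grouped by grid
    rectangle, and each whole rectangle is assigned to the center
    closest to its midpoint (instead of testing every point)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Which rectangle (grid cell) each point falls into.
    cell = np.floor((X - lo) / (hi - lo + 1e-12) * n_bins)
    cell = cell.clip(0, n_bins - 1).astype(int)
    labels = np.empty(len(X), dtype=int)
    for c in np.unique(cell, axis=0):
        mask = np.all(cell == c, axis=1)
        mid = lo + (c + 0.5) / n_bins * (hi - lo)   # rectangle midpoint
        j = ((centers - mid) ** 2).sum(1).argmin()  # closest center for the rectangle
        labels[mask] = j                            # whole rectangle -> one cluster
    return labels
```

One center-to-rectangle distance replaces many center-to-point distances; the tree-based methods refine this by splitting only the rectangles whose ownership is ambiguous.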
[Figures: successive steps of the improved K-means on example data]
Gaussian Mixture Model for Clustering

Assume that data are generated from a mixture of Gaussian distributions.
- For each Gaussian distribution
  - Center: μ_j
  - Variance: σ_j (ignore)
- For each data point
  - Determine membership: z_ij = p(μ_j | x_i)
Learning a Gaussian Mixture (with known covariance)

Probability of a data point x_i:

  p(x = x_i) = Σ_j p(x = x_i, μ = μ_j)
             = Σ_j p(μ = μ_j) p(x = x_i | μ = μ_j)
             = Σ_j p(μ_j) · (2πσ²)^(−d/2) · exp(−‖x_i − μ_j‖² / (2σ²))

Log-likelihood of unlabeled data:

  log p(D) = Σ_i log p(x = x_i)
           = Σ_i log [ Σ_j p(μ_j) · (2πσ²)^(−d/2) · exp(−‖x_i − μ_j‖² / (2σ²)) ]

Find the optimal parameters θ = { μ_j, p(μ_j) }.
Learning a Gaussian Mixture (with known covariance)

E-step: compute the expected membership of each point in each cluster

  E[z_ij] = p(x_i | μ_j) p(μ_j) / Σ_n p(x_i | μ_n) p(μ_n)
          = exp(−‖x_i − μ_j‖² / (2σ²)) p(μ_j) / Σ_n exp(−‖x_i − μ_n‖² / (2σ²)) p(μ_n)

M-step: re-estimate the centers and priors from the expected memberships

  μ_j = Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]

  p(μ_j) = (1/m) Σ_{i=1}^{m} E[z_ij]

Iterate the E- and M-steps until convergence.
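The E- and M-steps above can be sketched as follows, assuming a spherical Gaussian mixture with a known, shared variance σ² so that only the centers μ_j and priors p(μ_j) are learned. The farthest-point initialization is my own choice for the sketch, not part of the slides:

```python
import numpy as np

def em_gmm(X, k, sigma=1.0, n_iters=50, seed=0):
    """EM for a mixture of spherical Gaussians with known variance."""
    rng = np.random.default_rng(seed)
    # Simple farthest-point initialization for the centers (assumption).
    mu = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[d.argmax()])
    mu = np.array(mu, dtype=float)
    p = np.full(k, 1.0 / k)  # uniform initial priors p(mu_j)
    for _ in range(n_iters):
        # E-step: E[z_ij] proportional to exp(-||x_i - mu_j||^2 / 2sigma^2) p(mu_j)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * sigma ** 2)) * p
        Ez = w / w.sum(axis=1, keepdims=True)
        # M-step: mu_j = weighted mean of the data; p(mu_j) = mean membership
        mu = (Ez.T @ X) / Ez.sum(axis=0)[:, None]
        p = Ez.mean(axis=0)
    return mu, p, Ez
```

With σ small relative to the cluster separation the memberships become nearly hard and the updates approach k-means; with larger σ every point softly influences every center.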
Gaussian Mixture Example
[Figures: the mixture at the start and after iterations 1, 2, 3, 4, 5, 6, and 20]