TRANSCRIPT
1
Clustering: K-Means
Machine Learning 10-601, Fall 2014
Bhavana Dalvi Mishra
PhD student LTI, CMU
Slides are based on materials from Prof. Eric Xing, Prof. William Cohen and Prof. Andrew Ng
Outline
2
What is clustering?
How are similarity measures defined?
Different clustering algorithms
K-Means
Gaussian Mixture Models
Expectation Maximization
Advanced topics
How to seed clustering?
How to choose #clusters
Application: Gloss finding for a Knowledge Base
Classification vs. Clustering
4
Classification (supervision available): learning from supervised data,
where example classifications are given
Clustering (unsupervised): learning from raw (unlabeled) data
Clustering
5
The process of grouping a set of objects into clusters
high intra-cluster similarity
low inter-cluster similarity
How many clusters?
How to identify them?
Applications of Clustering
6
Google news: Clusters news stories from different sources
about the same event
….
Computational biology: Group genes that perform the
same functions
Social media analysis: Group individuals that have similar
political views
Computer graphics: Identify similar objects from pictures
What is Similarity?
10
The real meaning of similarity is a philosophical question.
It depends on the representation and the algorithm. For many representations/algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors.
Hard to define!
But we know it
when we see it
Intuitions behind desirable distance measure properties
12
D(A,B) = D(B,A) (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0 (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B (Identity of Indiscernibles)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangle Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
Distance Measures: Minkowski Metric
13
Suppose two objects x and y both have p features:
$x = (x_1, x_2, \ldots, x_p)$, $y = (y_1, y_2, \ldots, y_p)$
The Minkowski metric is defined by
$d_r(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}$
Most Common Minkowski Metrics
$r = 2$ (Euclidean distance): $d_2(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^2 \right)^{1/2}$
$r = 1$ (Manhattan distance): $d_1(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
$r = \infty$ ("sup" distance): $d_\infty(x, y) = \max_{1 \le i \le p} |x_i - y_i|$
An Example
14
[Figure: two points x and y whose feature values differ by 4 along one dimension and 3 along the other.]
1: Euclidean distance: $\sqrt{4^2 + 3^2} = 5$
2: Manhattan distance: $4 + 3 = 7$
3: "sup" distance: $\max\{4, 3\} = 4$
Hamming distance
15
Manhattan distance is called Hamming distance when all features are binary.
Gene Expression Levels Under 17 Conditions (1-High, 0-Low):
Gene A: 1 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 0
Gene B: 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 1 0
Hamming Distance: #(1→0) + #(0→1) = 4 + 1 = 5
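As a concrete companion to these definitions (my own sketch, not from the slides), the following Python computes the three Minkowski metrics and the Hamming distance for the examples above:

def minkowski(x, y, r):
    """General Minkowski distance d_r(x, y)."""
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

def euclidean(x, y):       # r = 2
    return minkowski(x, y, 2)

def manhattan(x, y):       # r = 1, equals Hamming distance on binary features
    return minkowski(x, y, 1)

def sup_distance(x, y):    # r -> infinity
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# The (4, 3) example above:
x, y = (0, 0), (4, 3)
print(euclidean(x, y), manhattan(x, y), sup_distance(x, y))   # 5.0 7.0 4

# The gene-expression example (Hamming distance = 5):
gene_a = [1,1,0,1,1,1,1,1,1,0,0,0,0,1,1,1,0]
gene_b = [1,0,0,1,1,1,0,0,1,0,0,1,0,0,1,1,0]
print(manhattan(gene_a, gene_b))                              # 5.0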
Similarity Measures: Correlation Coefficient
16
[Figure: expression level of Gene A and Gene B plotted over time in three panels: (1) positively correlated, (2) uncorrelated, (3) negatively correlated.]
Similarity Measures: Correlation Coefficient
17
Pearson correlation coefficient:
$s(x, y) = \dfrac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$, where $\bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p}\sum_{i=1}^{p} y_i$.
The corresponding distance is $d(x, y) = 1 - s(x, y)$.
Special case: cosine distance, where $s(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$ (Pearson with $\bar{x} = \bar{y} = 0$).
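A small sketch (my own, not from the slides) of computing the Pearson correlation and its cosine special case:

import math

def pearson(x, y):
    """Pearson correlation coefficient s(x, y)."""
    p = len(x)
    xbar, ybar = sum(x) / p, sum(y) / p
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - xbar) ** 2 for xi in x) *
                    sum((yi - ybar) ** 2 for yi in y))
    return num / den

def cosine_similarity(x, y):
    """Special case: Pearson on mean-zero vectors, i.e. x.y / (|x| |y|)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

def correlation_distance(x, y):
    """Distance derived from the similarity, as on the slide: d = 1 - s."""
    return 1 - pearson(x, y)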
K-Means: Algorithm
24
1. Decide on a value for k.
2. Initialize the k cluster centers randomly if necessary.
3. Repeat until no object changes its cluster assignment:
Decide the cluster memberships of the N objects by assigning
them to the nearest cluster centroid:
$\mathrm{cluster}(x_i) = \arg\min_{j} d(x_i, \bar{x}_j)$, where $\bar{x}_j$ is the centroid of cluster $j$
Re-estimate the k cluster centers, assuming the
memberships found above are correct.
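A minimal NumPy sketch of this loop (illustrative only; the function and variable names below are mine, not the lecture's):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch. X: (N, p) data array; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k and initialize the k cluster centers at random datapoints.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign every object to the nearest cluster centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # step 3: stop once no object changes its cluster
        assignments = new_assignments
        # Re-estimate each cluster center as the mean of its current members.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments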
K-Means is widely used in practice
25
Extremely fast and scalable: used in a variety of applications
Can be easily parallelized
Easy Map-Reduce implementation (sketched below)
Mapper: assigns each datapoint to the nearest cluster
Reducer: takes all points assigned to a cluster, and re-computes the centroid
Sensitive to starting points / random seed initialization (similar to neural networks)
There are extensions like K-Means++ that try to solve this problem
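A toy, in-memory illustration of one such Map-Reduce iteration (the mapper/reducer split below is my sketch, not an actual Hadoop job):

import numpy as np
from collections import defaultdict

def mapper(point, centroids):
    """Mapper: emit (nearest cluster id, point) for one datapoint."""
    return int(np.linalg.norm(centroids - point, axis=1).argmin()), point

def reducer(cluster_id, points):
    """Reducer: all points assigned to one cluster -> its re-computed centroid."""
    return cluster_id, np.mean(points, axis=0)

def one_iteration(X, centroids):
    grouped = defaultdict(list)
    for x in X:                          # "map" phase
        cid, p = mapper(x, centroids)
        grouped[cid].append(p)
    new_centroids = centroids.copy()     # "reduce" phase
    for cid, pts in grouped.items():
        _, new_centroids[cid] = reducer(cid, pts)
    return new_centroids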
Density estimation
28
An aircraft testing facility measures Heat and Vibration
parameters for every newly built aircraft.
Estimate the density function P(x) given unlabeled datapoints X_1 to X_n.
Mixture Models
30
A density model p(x) may be multi-modal.
We may be able to model it as a mixture of uni-modal
distributions (e.g., Gaussians).
Each mode may correspond to a different sub-population
(e.g., male and female).
Gaussian Mixture Models (GMMs)
31
Consider a mixture of K Gaussian components:
This model can be used for unsupervised clustering.
This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
$p(x_n) = \sum_{i=1}^{K} \pi_i \, N(x_n \mid \mu_i, \Sigma_i)$
where $\pi_i$ is the mixture proportion and $N(x_n \mid \mu_i, \Sigma_i)$ is the mixture component.
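A small sketch of evaluating this mixture density in one dimension (the parameter names pis, mus, sigma2s are mine):

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gmm_density(x, pis, mus, sigma2s):
    """p(x) = sum_i pi_i * N(x | mu_i, sigma_i^2): a mixture of K Gaussians."""
    return sum(pi * gaussian_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s))

# Example: a bimodal density built from two equally weighted components.
p = gmm_density(0.5, pis=[0.5, 0.5], mus=[0.0, 3.0], sigma2s=[1.0, 1.0])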
Learning mixture models
32
In fully observed iid settings, the log likelihood decomposes
into a sum of local terms:
$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$
With latent variables, all the parameters become coupled
together via marginalization:
$\ell_c(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
If we are doing MLE for completely observed data
Data log-likelihood:
$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
$= \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}$
$= \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{(x_n - \mu_k)^2}{2\sigma^2} + C$
MLE:
$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_{\pi} \ell(\theta; D)$, $\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_{\mu} \ell(\theta; D)$
MLE for GMM
33
$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_{\pi} \ell(\theta; D) \;\Rightarrow\; \hat{\pi}_{k,\mathrm{MLE}} = \frac{\sum_i z_i^k}{\text{Number of datapoints}}$
$\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_{\mu} \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$
$\hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_{\sigma} \ell(\theta; D)$
Gaussian Naïve Bayes
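When the cluster labels z are fully observed, these closed-form estimates are exactly the per-class estimates used in Gaussian Naïve Bayes; a short sketch of computing them (my own illustration, not from the slides) follows:

import numpy as np

def mle_fully_observed(X, z, K):
    """MLE for a Gaussian mixture when the cluster labels z are observed.
    X: (N, p) data array; z: (N,) integer labels in 0..K-1."""
    N = len(X)
    pis, mus = [], []
    for k in range(K):
        members = X[z == k]
        pis.append(len(members) / N)       # pi_k = (# datapoints in cluster k) / N
        mus.append(members.mean(axis=0))   # mu_k = mean of the datapoints in cluster k
    return np.array(pis), np.array(mus)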
Expectation-Maximization (EM)
36
Start: "Guess" the mean and covariance of each of the K gaussians
Loop
Expectation-Maximization (EM)
38
Start: "Guess" the centroid and covariance of each of the K clusters
Loop
The Expectation-Maximization (EM) Algorithm
39
E Step: Guess values of the Z's, i.e. compute the responsibilities
$w_i^{j(t)} = p(z_i = j \mid x_i, \mu^{(t)}, \Sigma^{(t)}, \pi^{(t)}) = \dfrac{p(x_i \mid z_i = j)\, P(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, P(z_i = l)}$
where $p(x_i \mid z_i = j) = N(x_i; \mu_j^{(t)}, \Sigma_j^{(t)})$ and $P(z_i = k) = \pi_k^{(t)}$.
The Expectation-Maximization (EM) Algorithm
40
M Step: Update parameter estimates
$\pi_k^{(t+1)} = P(Z = k)^{(t+1)} = \dfrac{\sum_n w_n^{k(t)}}{\#\text{datapoints}}$
$\mu_k^{(t+1)} = \dfrac{\sum_n w_n^{k(t)}\, x_n}{\sum_n w_n^{k(t)}}$
$\Sigma_k^{(t+1)} = \dfrac{\sum_n w_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n w_n^{k(t)}}$
EM Algorithm for GMM
41
E Step: Guess values of the Z's (the responsibilities), as above:
$w_n^{k(t)} = p(z_n = k \mid x_n, \mu^{(t)}, \Sigma^{(t)}, \pi^{(t)})$
M Step: Update parameter estimates:
$\pi_k^{(t+1)} = P(Z = k)^{(t+1)} = \dfrac{\sum_n w_n^{k(t)}}{N}$
$\mu_k^{(t+1)} = \dfrac{\sum_n w_n^{k(t)}\, x_n}{\sum_n w_n^{k(t)}}$
$\Sigma_k^{(t+1)} = \dfrac{\sum_n w_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n w_n^{k(t)}}$
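A compact sketch of these E and M updates for a one-dimensional GMM (my own illustration, not the lecture's reference code; full covariance matrices are omitted for brevity):

import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(X, K, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture. X: (N,) array. Returns (pi, mu, var)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(X, size=K, replace=False).astype(float)
    var = np.full(K, X.var())
    for _ in range(n_iter):
        # E step: responsibilities w[n, k] = p(z_n = k | x_n, current parameters)
        w = np.array([[pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(K)]
                      for x in X])
        w /= w.sum(axis=1, keepdims=True)
        # M step: update parameter estimates from the soft assignments
        Nk = w.sum(axis=0)
        pi = Nk / N
        mu = (w * X[:, None]).sum(axis=0) / Nk
        var = (w * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var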
K-means is a hard version of EM
42
In the K-means “E-step” we do hard assignment:
$z_n^{(t)} = \arg\min_k (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$
In the K-means “M-step” we update the means as the
weighted sum of the data, but now the weights are 0 or 1:
$\mu_k^{(t+1)} = \dfrac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$
Theory underlying EM
44
What are we doing?
Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
But we do not observe z, so computing
$\ell_c(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
is difficult!
What shall we do?
Jensen’s Inequality
46
For a convex function f(x): $f(E[x]) \le E[f(x)]$
Similarly, for a concave function f(x): $f(E[x]) \ge E[f(x)]$
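A quick worked check of the convex case: with $f(x) = x^2$ and $x$ uniform on $\{0, 2\}$, $f(E[x]) = f(1) = 1$ while $E[f(x)] = (0 + 4)/2 = 2$, so indeed $f(E[x]) \le E[f(x)]$.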
How Many Clusters?
50
Sometimes the number of clusters K is given
Partition 'n' documents into a predetermined number of topics
Otherwise, solve an optimization problem that penalizes the number of clusters
Information-theoretic approaches: AIC, BIC criteria for model
selection (a BIC-based recipe is sketched below)
Tradeoff between having clearly separable clusters and having too
many clusters
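One common concrete recipe along these lines (my example, assuming scikit-learn is available; not part of the lecture): fit a GMM for each candidate K and keep the K with the lowest BIC.

from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values):
    """Fit one GMM per candidate K and return the K with the lowest BIC score."""
    scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
              for k in k_values}
    return min(scores, key=scores.get), scores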
Seed Choice: K-Means++
51
K-Means results can vary based on random seed selection.
K-Means++
Choose one center uniformly at random among given datapoints.
For each data point x, compute D(x)
D(x) = distance(x, nearest center)
Choose one new data point at random as a new center, with probability
P(x) ∝ D(x)².
Repeat Steps 2 and 3 until k centers have been chosen.
Run standard K-Means with this centroid initialization.
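A short sketch of this seeding procedure (illustrative; the function name and arguments are mine):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: returns k initial centers chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    # Step 1: first center uniformly at random among the datapoints.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x)^2 = squared distance from x to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Step 3: pick the next center with probability proportional to D(x)^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)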
Automatic Gloss Finding
for a Knowledge Base
54
Glosses: Natural language definitions of named entities. E.g. “Microsoft” is an American multinational corporation headquartered in Redmond that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services ...
Input: Knowledge Base i.e. a set of concepts (e.g. company) and entities belonging to those concepts (e.g. Microsoft), and a set of potential glosses.
Output: Candidate glosses matched to relevant entities in the KB. “Microsoft is an American multinational corporation headquartered in Redmond …” is mapped to entity “Microsoft” of type “Company”.
[Automatic Gloss Finding for a Knowledge Base using Ontological Constraints, Bhavana Dalvi Mishra, Einat Minkov, Partha Pratim Talukdar, and William W. Cohen, 2014, Under submission]
GLOFIN on NELL Dataset
66
[Bar chart: Precision, Recall, and F1 (0-80 scale) for SVM, Label Propagation, and GLOFIN.]
275 categories, 247K candidate glosses, #train=20K, #test=227K
GLOFIN on Freebase Dataset
67
[Bar chart: Precision, Recall, and F1 (0-100 scale) for SVM, Label Propagation, and GLOFIN.]
66 categories, 285K candidate glosses, #train=25K, #test=260K
Summary
68
What is clustering?
What are similarity measures?
K-Means clustering algorithm
Mixture of Gaussians (GMM)
Expectation Maximization
Advanced Topics
How to seed clustering
How to decide #clusters
Application: Gloss finding for a Knowledge Base