Cluster Analysis
Mark Stamp

Cluster Analysis
Grouping objects in a meaningful way
o Clustered data fits together in some way
o Can help to make sense of (big) data
o Finds application in many fields
Many different clustering strategies
Overview, then details on 2 methods
o K-means: simple and can be effective
o EM clustering: not as simple

Intrinsic vs Extrinsic
Intrinsic clustering relies on unsupervised learning
o No predetermined labels on objects
o Apply analysis directly to data
Extrinsic clustering requires category labels
o Requires pre-processing of data
o Can be viewed as a form of supervised learning

Agglomerative vs Divisive
Agglomerative
o Each object starts in its own cluster
o Clustering merges existing clusters
o A “bottom up” approach
Divisive
o All objects start in one cluster
o Clustering process splits existing clusters
o A “top down” approach

Hierarchical vs Partitional
Hierarchical clustering
o “Child” and “parent” clusters
o Can be viewed as dendrograms
Partitional clustering
o Partition objects into disjoint clusters
o No hierarchical relationship
We consider K-means and EM in detail
o These are both partitional

Hierarchical Clustering
Example of a hierarchical approach...
1. start: Every point is its own cluster
2. while number of clusters exceeds 1
   o Find 2 nearest clusters and merge
3. end while
OK, but no real theoretical basis
o And some find that “disconcerting”
o Even K-means has some theory behind it
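
The loop above can be sketched in a few lines of Python. The slide does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the smallest pairwise Euclidean distance); the function names `single_linkage` and `agglomerate` are illustrative only.

```python
import math
from itertools import combinations

def single_linkage(c1, c2):
    # Assumption: cluster-to-cluster distance is the smallest
    # pairwise Euclidean distance (single linkage).
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    # 1. start: every point is its own cluster
    clusters = [[tuple(p)] for p in points]
    merges = []
    # 2. while number of clusters exceeds 1
    while len(clusters) > 1:
        # find the 2 nearest clusters...
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge them, recording the merge order
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # 3. end while
    return merges
```

The recorded merge order is the information a dendrogram would display. This brute-force search is only meant to mirror the pseudocode, not to be efficient.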

Distance
Distance between data points?
Suppose $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, where each $x_i$ and $y_i$ is a real number
Euclidean distance is
$d(x,y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2}$
Manhattan (taxicab) distance is
$d(x,y) = |x_1-y_1| + |x_2-y_2| + \cdots + |x_n-y_n|$
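
As a quick illustration, both distances can be written directly from their definitions. The function names are chosen here for illustration; points are assumed to be equal-length sequences of real numbers.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))
```

For x = (0, 0) and y = (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7.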

Distance
Euclidean distance: red line
Manhattan distance: blue or yellow line
o Or any similar right-angle-only path
[Figure: Euclidean and Manhattan paths between points a and b]

Distance
Lots and lots more distance measures
Other examples include
o Mahalanobis distance: takes mean and covariance into account
o Simple substitution distance: measure of “decryption” distance
o Chi-squared distance: statistical
o Or just about anything you can think of…

One Clustering Approach
Given data points $x_1, x_2, x_3, \ldots, x_m$
Want to partition into K clusters
o I.e., each point in exactly one cluster
A centroid specified for each cluster
o Let $c_1, c_2, \ldots, c_K$ denote current centroids
Each $x_i$ associated with one centroid
o Let centroid($x_i$) be centroid for $x_i$
o If $c_j$ = centroid($x_i$), then $x_i$ is in cluster j

Clustering
Two crucial questions
1. How to determine centroids, $c_j$?
2. How to determine clusters, that is, how to assign $x_i$ to centroids?
But first, what makes a cluster good?
o For now, focus on one individual cluster
o Relationship between clusters later…
What do you think?

Distortion
Intuitively, “compact” clusters are good
o Depends on data and K, which are given
o And depends on centroids and assignment of $x_i$ to clusters (which we can control)
How to measure this “goodness”?
Define $\text{distortion} = \sum_i d(x_i, \text{centroid}(x_i))$
o Where $d(x,y)$ is a distance measure
Given K, let’s try to minimize distortion
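
A minimal sketch of the distortion computation follows. It assumes Euclidean distance for d and represents centroid($x_i$) by a list `assignment` giving the centroid index of each point; both the function name and this representation are choices made here for illustration.

```python
import math

def distortion(points, centroids, assignment, dist=math.dist):
    # Sum over all points of the distance to the assigned centroid.
    # assignment[i] is the index j such that centroids[j] = centroid(x_i).
    return sum(dist(x, centroids[assignment[i]]) for i, x in enumerate(points))
```

Any other distance function taking two points can be passed as `dist`.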

Distortion
Consider this 2-d data
o Choose K = 3 clusters
Same data for both
o Which has smaller distortion?
How to minimize distortion?
o Good question…

Distortion
Note, distortion depends on K
o So, should probably write $\text{distortion}_K$
Typically, larger K, smaller $\text{distortion}_K$
o Want to minimize $\text{distortion}_K$ for fixed K
Best choice of K is a different issue
o Briefly considered later
o Also consider other measures of goodness
For now, assume K is given and fixed

How to Minimize Distortion?
Given m data points and K …
Min distortion via exhaustive search?
o Try all m choose K different cases?
o Too much work for realistic size data set
An approximate solution will have to do
o Exact solution is an NP-hard problem
Important Note: For minimum distortion…
o Each $x_i$ grouped with nearest centroid
o Centroid must be center of its group

K-Means
Previous slide implies that we can improve a suboptimal clustering by either…
1. Re-assigning each $x_i$ to nearest centroid
2. Re-computing centroids so they’re centered
No improvement from applying either 1 or 2 more than once in succession
But alternating might be useful
o In fact, that is the K-means algorithm

K-Means Algorithm
Given dataset…
1. Select a value for K (how?)
2. Select initial centroids (how?)
3. Group data by nearest centroid
4. Recompute centroids (cluster centers)
5. If significant change, goto 3; else stop
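
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: K is supplied by the caller, initial centroids are chosen at random from the data (one of several possible choices), Euclidean distance is used, and "significant change" is simplified to "any change at all". The `max_iter` and `seed` parameters are additions made here.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as initial centroids (assumption)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: group data by nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Because the result can depend on the initial centroids, running this with several values of `seed` and keeping the result with the smallest distortion is a common way to use it.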

K-Means Animation
Very good animation here
http://shabal.in/visuals/kmeans/2.html
Nice animations of movement of centroids in different cases here
http://www.ccs.neu.edu/home/kenb/db/examples/059.html
(near bottom of web page)
Other?

K-Means
Are we assured of optimal solution?
o Definitely not
Why not?
o For one thing, initial centroid locations are critical
o There is a (sensitive) dependence on initial conditions
o This is a common issue in iterative processes (HMM training is an example)

K-Means Initialization
Recall, K is the number of clusters
How to choose K?
No obvious “best” way to do so
But K-means is fast
o So trial and error may be OK
o That is, experiment with different K
o Similar to choosing N in HMM
Is there a better way to choose K?

Optimal K?
Even for trial and error, need a way to measure “goodness” of results
Choosing optimal K is tricky
Most intuitive measures will tend to improve for larger K
But K “too big” may overfit data
So, when is K “big enough”?
o But not too big…

Schwarz Criterion
Choose K that minimizes
$f(K) = \text{distortion}_K + \lambda d K \log m$
o Where d is the dimension, m is the number of data points, and λ is ???
Recall that distortion depends on K
o Tends to decrease as K increases
o Essentially, adding a penalty as K increases
Related to Bayesian Information Criterion (BIC)
o And some other similar things
Consider choice of K in more detail later…
[Figure: plot of f(K) versus K]
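
To make the penalty term concrete, here is a small sketch that evaluates f(K) for a table of distortion values and picks the minimizing K. The slide leaves λ unspecified, so `lam` is a free parameter here with an arbitrary default of 1.0; the natural logarithm and the example distortion values are also assumptions made only for illustration.

```python
import math

def schwarz_score(distortion_k, k, d, m, lam=1.0):
    # f(K) = distortion_K + lambda * d * K * log(m)
    # lam is NOT given in the source; 1.0 is a placeholder.
    return distortion_k + lam * d * k * math.log(m)

def best_k(distortions, d, m, lam=1.0):
    # distortions maps each candidate K to its measured distortion_K
    return min(distortions, key=lambda k: schwarz_score(distortions[k], k, d, m, lam))

# Hypothetical distortion values for K = 1..4 (made up for illustration)
example = {1: 100.0, 2: 40.0, 3: 35.0, 4: 33.0}
chosen = best_k(example, d=2, m=100)
```

With these made-up numbers the distortion keeps falling as K grows, but the penalty makes K = 2 the minimizer of f(K).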

K-Means Initialization
How to choose initial centroids?
Again, no best way to do this
o Counterexamples to any “best” approach
Often just choose at random
Or uniform/maximum spacing
o Or some variation on this idea
Other?

K-Means Initialization
In practice, often…
Try several different choices of K
o For each K, test several initial centroids
Select the result that is best
o How to measure “best”?
o We’ll look at that next
May not be very scientific
o But often works well

K-Means Variations
K-medoids
o Centroid must be an actual data point
Fuzzy K-means
o In K-means, any data point is in one cluster and not in any other
o In fuzzy case, data point can be partly in several different clusters
o “Degree of membership” vs distance
Many other variations…

Measuring Cluster Quality
How can we judge clustering results?
o In general, that is, not just for K-means
Compare to typical training/scoring…
o Suppose we test new scoring method
o E.g., score malware and benign files
o Compute ROC curves, AUC, etc.
o Many tools to measure success/accuracy
Clustering is different (Why? How?)

Clustering Quality
Clustering is a fishing expedition
o Not sure what we are looking for
o Hoping to find structure, data discovery
o If we know answer, no point to clustering
Might find something that’s not there
o Even random data can be clustered
Some things to consider on next slides
o Relative to the data to be clustered

Cluster-ability?
Clustering tendency
o How suitable is dataset for clustering?
o Which dataset below is cluster-friendly?
o We can always apply clustering…
o …but expect better results in some cases

Validation
External validation
o Compare clusters based on data labels
o Similar to usual training/scoring scenario
o Good idea if we know something about data
Internal validation
o Determine quality based only on clusters
o E.g., spacing between and within clusters
o Generally applicable

It’s All Relative
Comparing clustering results
o That is, compare one clustering result with others for same dataset
o Would be very useful in practice
o Often, lots of trial and error
o Could enable us to “hill climb” to better clustering results…
o …if we have a way to quantify things

How Many Clusters?
Optimal number of clusters?
o Already mentioned this wrt K-means
o But what about the general case?
o I.e., no reference to clustering technique
o Can the data tell us how many clusters?
o Or the topology of the clusters?
Next, we consider several relevant measures

Internal Validation
Direct measurement of clusters
o Might call it “topological” validation
We’ll consider the following
o Cluster correlation
o Similarity matrix
o Sum of squares error
o Cohesion and separation
o Silhouette coefficient

Cluster Correlation
Given data $x_1, x_2, \ldots, x_m$, and clusters, define 2 matrices
Distance matrix $D = \{d_{ij}\}$
o Where $d_{ij}$ is distance between $x_i$ and $x_j$
Adjacency matrix $A = \{a_{ij}\}$
o Where $a_{ij}$ is 1 if $x_i$ and $x_j$ are in same cluster
o And $a_{ij}$ is 0 otherwise
Now what?

Cluster Correlation
Compute correlation between D and A
$r_{AD} = \text{Corr}(A,D) = \frac{\text{cov}(A,D)}{\sigma_A \sigma_D} = \frac{\sum (a_{ij}-\mu_A)(d_{ij}-\mu_D)}{\sqrt{\sum (a_{ij}-\mu_A)^2 \sum (d_{ij}-\mu_D)^2}}$
Can show that r is between -1 and 1
o If r > 0 then positive Corr (and vice versa)
o Magnitude is strength of correlation
High (inverse) correlation implies nearby things clustered together
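
A small sketch of this computation is given below. It assumes Euclidean distance and, to avoid counting each pair twice and including the trivial diagonal, sums only over pairs with i < j; that restriction is a choice made here, not something the slide specifies. The function name is illustrative.

```python
import math

def cluster_correlation(points, labels):
    m = len(points)
    # Assumption: use each unordered pair (i < j) once, skipping the diagonal
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    d = [math.dist(points[i], points[j]) for i, j in pairs]          # d_ij
    a = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]   # a_ij
    mu_a = sum(a) / len(a)
    mu_d = sum(d) / len(d)
    num = sum((x - mu_a) * (y - mu_d) for x, y in zip(a, d))
    den = math.sqrt(sum((x - mu_a) ** 2 for x in a) * sum((y - mu_d) ** 2 for y in d))
    return num / den
```

For two tight, well-separated clusters the result is close to -1: pairs in the same cluster (a = 1) have small distances, so A and D are strongly inversely correlated.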

Correlation
Correlation examples

Similarity Matrix
Form “similarity matrix”
o Could be based on just about anything
o Typically, distance matrix $D = \{d_{ij}\}$, where $d_{ij} = d(x_i, x_j)$
Group rows and columns by cluster
Heat map for resulting matrix
o Provides visual representation of similarity within clusters (so look at it…)
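
The "group rows and columns by cluster" step can be sketched as below, assuming Euclidean distance and integer (or otherwise sortable) cluster labels. The function name is illustrative, and plotting the heat map itself is left out since it would need a plotting library.

```python
import math

def grouped_distance_matrix(points, labels):
    # Order point indices so that members of the same cluster are adjacent
    order = sorted(range(len(points)), key=lambda i: labels[i])
    # Build the distance matrix with rows and columns in that order
    matrix = [[math.dist(points[i], points[j]) for j in order] for i in order]
    return order, matrix
```

With good clusters, the reordered matrix shows blocks of small distances along the diagonal, which is what the heat map makes visible.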

Similarity Matrix
Examples
Better than just looking at clusters?
Good for higher dimensions

Residual Sum of Squares
Residual Sum of Squares (RSS)
o Aka Sum of Squared Errors (SSE)
o RSS is sum of squared “error” terms
o Definition of error depends on problem
What is “error” when clustering?
o Distance from centroid?
o Then same as distortion
o But, could use other measures instead

Cohesion and Separation
Cluster cohesion
o How tightly packed is a cluster
o The more cohesive, the better
Cluster separation
o Distance between clusters
o The more separation, the better
Can we measure these things?
o Yes, easily

Notation
Same notation as K-means
o Let $c_i$, i = 1, 2, …, K, be cluster centroids
o Let $x_1, x_2, \ldots, x_m$ be data points
o Let centroid($x_i$) be centroid of $x_i$
o Clusters determined by centroids
Following results apply generally
o Not just for K-means…

Cohesion
Lots of measures of cohesion
o Previously defined distortion is useful
o Recall, $\text{distortion} = \sum_i d(x_i, \text{centroid}(x_i))$
Can also use distance between all pairs

Separation
Again, many ways to measure this
o Here, using distances to other centroids
Or distances between all points in clusters
Or distance from centroids to a “midpoint”
Or distance between centroids, or…

Silhouette Coefficient
Essentially, combines cohesion and separation into a single number
Let $C_i$ be cluster of point $x_i$
o Let a be average of $d(x_i, y)$ for all y in $C_i$
o For $C_j \ne C_i$, let $b_j$ be average of $d(x_i, y)$ for y in $C_j$
o Let b be minimum of $b_j$
Then let $S(x_i) = (b - a) / \max(a, b)$
o What the … ?

Silhouette Coefficient
The idea...
[Figure: point $x_i$, with a = average distance within its own cluster and b = minimum of the average distances to the other clusters]
Usually, $S(x_i) = 1 - a/b$

Silhouette Coefficient
For given point $x_i$
o Let a be avg distance to points in its cluster
o Let b be dist to nearest other cluster (in a sense)
Usually, a < b and hence $S(x_i) = 1 - a/b$
If a is a lot less than b, then $S(x_i) \approx 1$
o Points inside cluster much closer together than nearest other cluster (this is good)
If a is almost same as b, then $S(x_i) \approx 0$
o Some other cluster is almost as close as things inside cluster (this is bad)
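
A minimal sketch of the per-point silhouette coefficient follows. It assumes Euclidean distance, at least two clusters, and the common convention of excluding $x_i$ itself when averaging over its own cluster; returning 0 for a point alone in its cluster is likewise a convention adopted here, not something stated in the slides.

```python
import math

def silhouette(i, points, labels):
    x = points[i]
    # a: average distance from x_i to the other points in its own cluster
    own = [math.dist(x, p) for k, p in enumerate(points)
           if labels[k] == labels[i] and k != i]
    if not own:
        return 0.0  # convention (assumed): singleton cluster scores 0
    a = sum(own) / len(own)
    # b_j: average distance from x_i to the points of each other cluster
    others = {}
    for k, p in enumerate(points):
        if labels[k] != labels[i]:
            others.setdefault(labels[k], []).append(math.dist(x, p))
    # b: the smallest of the b_j
    b = min(sum(v) / len(v) for v in others.values())
    return (b - a) / max(a, b)
```

Averaging `silhouette(i, ...)` over the points of one cluster, or over all points, gives the cluster-level and overall scores.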

Silhouette Coefficient
Silhouette coefficient is defined for each point
Avg silhouette coefficient for a cluster
o Measure of how good a cluster is
Avg silhouette coefficient for all points
o Measure of clustering “goodness”
What is a good number for coefficient?
o Rule of thumb on next slide

Silhouette Coefficient
Average coefficient (to 2 decimal places)
o 0.71 to 1.00: strong structure found
o 0.51 to 0.70: reasonable structure found
o 0.26 to 0.50: weak or artificial structure
o 0.25 or less: no significant structure
Bottom line on silhouette coefficient
o Combine cohesion, separation in one number
o One of most useful measures of quality

External Validation
“External” implies that we measure quality based on data in clusters
o Not relying on cluster topology (“shape”)
Suppose clustering data is of several different types
o Say, different malware families
We can compute statistics on clusters
o We only consider 2 stats here

Entropy and Purity
Entropy
o Standard measure of uncertainty or randomness
o High entropy implies clusters less uniform
Purity
o Another measure of uniformity
o Ideally, cluster should be more “pure”, that is, more uniform

Entropy
Suppose total of m data elements
o As usual, $x_1, x_2, \ldots, x_m$
Denote cluster j as $C_j$
o Let $m_j$ be number of elements in $C_j$
o Let $m_{ij}$ be count of type i in cluster $C_j$
Compute probabilities based on relative frequencies
o That is, $p_{ij} = m_{ij} / m_j$

Entropy
Then entropy of cluster $C_j$ is
$E_j = -\sum_i p_{ij} \log p_{ij}$, where sum is over i
Compute entropy $E_j$ for each cluster $C_j$
Overall (weighted) entropy is then
$E = \sum_{j=1}^{K} \frac{m_j}{m} E_j$, where K is number of clusters
Smaller E is better
o Implies clusters less uncertain/random
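
These formulas translate directly into code. In the sketch below each cluster is represented as a list of type labels (for example, malware family names); the base of the logarithm is not stated in the slides, so base 2 is assumed, and the function names are illustrative.

```python
import math
from collections import Counter

def cluster_entropy(types):
    # types: the type label of every element in one cluster C_j
    m_j = len(types)
    # p_ij = m_ij / m_j; types with zero count contribute nothing
    return -sum((c / m_j) * math.log2(c / m_j) for c in Counter(types).values())

def overall_entropy(clusters):
    # Weighted sum of per-cluster entropies, with weights m_j / m
    m = sum(len(c) for c in clusters)
    return sum(len(c) / m * cluster_entropy(c) for c in clusters)
```

A cluster containing a single type has entropy 0, and a cluster split evenly between two types has entropy 1 bit under the base-2 assumption.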

Purity
Ideally, each cluster is all one type
Using same notation as in entropy…
o Purity of $C_j$ defined as $U_j = \max_i p_{ij}$
o Where max is over i (i.e., different types)
If $U_j$ is 1, then $C_j$ is all one type of data
o If $U_j$ is near 0, no dominant type
Overall (weighted) purity is
$U = \sum_{j=1}^{K} \frac{m_j}{m} U_j$
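
Purity can be sketched in the same way, again representing each cluster as a list of type labels. The function names are illustrative.

```python
from collections import Counter

def cluster_purity(types):
    # U_j = max over types i of p_ij = m_ij / m_j
    return max(Counter(types).values()) / len(types)

def overall_purity(clusters):
    # Weighted sum of per-cluster purities, with weights m_j / m
    m = sum(len(c) for c in clusters)
    return sum(len(c) / m * cluster_purity(c) for c in clusters)
```

For a cluster of four elements of one type and a cluster split evenly between two types, the purities are 1 and 0.5, and the weighted overall purity is 5/6.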

Entropy and Purity
Example
o Based on K-means clustering

EM Clustering
Data might be from different probability distributions
o If so, “distance” might be poor measure
o Maybe better to use mean and variance
Cluster on probability distributions?
o But distributions are unknown…
Expectation maximization (EM)
o Technique to determine unknown parameters of probability distributions

EM Clustering Animation
Good animation on Wikipedia page
http://en.wikipedia.org/wiki/Expectation–maximization_algorithm
Another animation here
http://www.cs.cmu.edu/~alad/em/
Probably others too…

Coin Experiment
Given 2 biased coins, A and B
o Randomly select coin
o Flip selected coin 10 times
o Repeat 5 times, so 50 total coin flips
Can we determine P(H) for each coin?
Easy, if you know which coin was selected
o For each coin, just divide number of heads by number of flips of that coin

Coin Example
For example, suppose
Coin B: HTTTHHTHTH, 5 H and 5 T
Coin A: HHHHTHHHHH, 9 H and 1 T
Coin A: HTHHHHHTHH, 8 H and 2 T
Coin B: HTHTTTHHTT, 4 H and 6 T
Coin A: THHHTHHHTH, 7 H and 3 T
Then maximum likelihood estimate is
$P_A(H) = 24/30 = 0.80$ and $P_B(H) = 9/20 = 0.45$
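
When the coin labels are known, the estimate is just a ratio of counts, as this short sketch shows using the five sequences from the slide. The variable and function names are illustrative.

```python
# The five labeled experiments from the example
trials = [
    ("B", "HTTTHHTHTH"),
    ("A", "HHHHTHHHHH"),
    ("A", "HTHHHHHTHH"),
    ("B", "HTHTTTHHTT"),
    ("A", "THHHTHHHTH"),
]

def mle_heads(trials, coin):
    # Number of heads divided by number of flips, for the given coin only
    flips = "".join(seq for label, seq in trials if label == coin)
    return flips.count("H") / len(flips)

p_a = mle_heads(trials, "A")  # 24 heads in 30 flips
p_b = mle_heads(trials, "B")  # 9 heads in 20 flips
```

This reproduces the values 0.80 and 0.45 given above.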

Coin Example
Suppose we have same data, but we do not know which coin was selected
Coin ?: 5 H and 5 T
Coin ?: 9 H and 1 T
Coin ?: 8 H and 2 T
Coin ?: 4 H and 6 T
Coin ?: 7 H and 3 T
Can we estimate $P_A(H)$ and $P_B(H)$?
![Page 59: Cluster Analysis 1 Mark Stamp. Cluster Analysis Grouping objects in meaningful way o Clustered data fits together in some way o Can help to make sense](https://reader036.vdocuments.net/reader036/viewer/2022062719/56649eda5503460f94be96fd/html5/thumbnails/59.jpg)
59
Coin Example
We do not know which coin was flipped
So, there is “hidden” information
  o This should sound familiar…
Train HMM on sequence of H and T ??
  o Using 2 hidden states
  o Use resulting model to find most likely state sequence (recall, problem 2)
  o Use sequence to estimate PA(H) and PB(H)
60
Coin Example
HMM is very “heavy artillery”
  o And HMM needs a lot of data to converge (or lots of different initializations)
  o No need to work so hard here
EM algorithm
  o Alternate between following 2 steps…
  o Expectation: Recompute expected values
  o Maximization: Recompute maximum likelihood estimates
61
EM for Coin Example
Start with a guess (initialization)
  o Say, PA(H) = 0.6 and PB(H) = 0.5
Compute expectations (E-step)
First, from current PA(H) and PB(H), find the probability that each set of 10 flips came from coin A or coin B
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35
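The slide does not show how these probabilities are obtained. One way to reproduce them, assuming each coin is equally likely to be selected (an assumption, not stated on the slide), is Bayes' rule applied to a set of flips with h heads and t tails, writing P_A(H) for the slide's PA(H):

$$
P(A \mid h, t) = \frac{P_A(H)^h \,(1 - P_A(H))^t}{P_A(H)^h \,(1 - P_A(H))^t + P_B(H)^h \,(1 - P_B(H))^t},
\qquad
P(B \mid h, t) = 1 - P(A \mid h, t)
$$

For the first row (5 H, 5 T) with P_A(H) = 0.6 and P_B(H) = 0.5, the two likelihoods are 0.6^5 x 0.4^5 ≈ 0.000796 and 0.5^10 ≈ 0.000977, so P(A) ≈ 0.000796 / 0.001773 ≈ 0.45. The binomial coefficient is the same for both coins and cancels.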
62
E-step for Coin Example
So far, we have
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35
Next, compute expected (weighted) H and T
For example, in 1st line
  o For A we have 5 x .45 = 2.25 H and 5 x .45 = 2.25 T
  o For B we have 5 x .55 = 2.75 H and 5 x .55 = 2.75 T
63
E-step for Coin Example
So far, we have
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35
Compute expected (weighted) H and T
For example, in 2nd line
  o For A, we have 9 x .8 = 7.2 H and 1 x .8 = .8 T
  o For B, we have 9 x .2 = 1.8 H and 1 x .2 = .2 T
64
E-step for Coin Example
Rounded to nearest 0.1:

| Flips | P(A) | P(B) | Coin A | Coin B |
|---|---|---|---|---|
| 5 H, 5 T | .45 | .55 | 2.2 H, 2.2 T | 2.8 H, 2.8 T |
| 9 H, 1 T | .80 | .20 | 7.2 H, 0.8 T | 1.8 H, 0.2 T |
| 8 H, 2 T | .73 | .27 | 5.9 H, 1.5 T | 2.1 H, 0.5 T |
| 4 H, 6 T | .35 | .65 | 1.4 H, 2.1 T | 2.6 H, 3.9 T |
| 7 H, 3 T | .65 | .35 | 4.5 H, 1.9 T | 2.5 H, 1.1 T |
| totals | | | 21.2 H, 8.5 T | 11.8 H, 8.5 T |

This completes E-step
Note: We computed these expected numbers based on current PA(H) and PB(H)
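The E-step can be sketched in code as follows. This is an illustrative sketch rather than code from the slides: the name `e_step` is hypothetical, and it assumes both coins are equally likely to be picked, so the prior cancels. Because it sums unrounded values, its totals come out near 21.3 H / 8.6 T for A and 11.7 H / 8.4 T for B, slightly different in the last digit from the slide's totals, which add up rows already rounded to 0.1.

```python
# Observed (heads, tails) per 10-flip experiment; the coin identity is hidden.
counts = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

def e_step(counts, p_a, p_b):
    """Return per-row P(A) and expected (weighted) H/T totals for each coin.

    Assumption: each coin is equally likely to be selected.
    """
    posteriors = []
    totals = {"A": [0.0, 0.0], "B": [0.0, 0.0]}  # [expected H, expected T]
    for h, t in counts:
        like_a = p_a**h * (1 - p_a)**t  # likelihood of the row under coin A
        like_b = p_b**h * (1 - p_b)**t  # likelihood of the row under coin B
        w_a = like_a / (like_a + like_b)
        w_b = 1.0 - w_a
        posteriors.append(w_a)
        totals["A"][0] += w_a * h
        totals["A"][1] += w_a * t
        totals["B"][0] += w_b * h
        totals["B"][1] += w_b * t
    return posteriors, totals

posteriors, totals = e_step(counts, p_a=0.6, p_b=0.5)
for (h, t), w in zip(counts, posteriors):
    print(f"{h} H, {t} T  P(A) = {w:.2f}, P(B) = {1 - w:.2f}")
print({coin: [round(v, 1) for v in ht] for coin, ht in totals.items()})
```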
65
M-step for Coin Example
M-step: Re-estimate PA(H) and PB(H) using results from E-step:
  PA(H) = 21.2/(21.2+8.5) ≈ 0.71
  PB(H) = 11.8/(11.8+8.5) ≈ 0.58
Next? E-step using these probabilities
  o Then M-step, then E-step, then…
  o …until convergence (or we get tired)
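Putting the two steps in a loop gives the whole procedure. The sketch below is one possible arrangement, not the slides' own code: `em_two_coins`, `max_iter` and `tol` are hypothetical names, the stopping rule (both estimates change by less than `tol`) is an assumed choice, and the coins are again assumed equally likely to be picked. Starting from 0.6 and 0.5, one iteration gives approximately 0.71 and 0.58, in line with the M-step above.

```python
def em_two_coins(counts, p_a, p_b, max_iter=100, tol=1e-6):
    """Alternate E- and M-steps until both estimates move by less than tol.

    counts: list of (heads, tails) pairs, one per experiment.
    Returns (p_a, p_b, iterations_run).
    """
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # E-step: expected (weighted) heads and tails credited to each coin.
        h_a = t_a = h_b = t_b = 0.0
        for h, t in counts:
            like_a = p_a**h * (1 - p_a)**t
            like_b = p_b**h * (1 - p_b)**t
            w_a = like_a / (like_a + like_b)
            h_a += w_a * h
            t_a += w_a * t
            h_b += (1 - w_a) * h
            t_b += (1 - w_a) * t
        # M-step: re-estimate each P(H) from the expected counts.
        new_a = h_a / (h_a + t_a)
        new_b = h_b / (h_b + t_b)
        done = abs(new_a - p_a) < tol and abs(new_b - p_b) < tol
        p_a, p_b = new_a, new_b
        if done:
            break
    return p_a, p_b, iterations

counts = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
first_a, first_b, _ = em_two_coins(counts, 0.6, 0.5, max_iter=1)
final_a, final_b, n_iter = em_two_coins(counts, 0.6, 0.5)
print(f"after 1 iteration: PA(H) = {first_a:.2f}, PB(H) = {first_b:.2f}")
print(f"after {n_iter} iterations: PA(H) = {final_a:.2f}, PB(H) = {final_b:.2f}")
```

EM only finds a local optimum, so a different starting guess can lead to a different answer; swapping the initial values simply swaps the roles of A and B.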
66
EM for Clustering
How is EM relevant to clustering?
Can use EM to obtain parameters of K “hidden” distributions
  o That is, means and variances, μi and σi²
Then, use μi as centers of clusters
  o And σi (standard deviations) as “radii”
  o Assume Gaussian (normal) distributions
Is this better than K-means?
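To make the idea above concrete, here is a minimal one-dimensional sketch of EM for a mixture of Gaussians. It is an illustration under stated assumptions, not the slides' implementation: the toy data, the starting values and the names `gaussian_pdf` and `em_gmm_1d` are made up for the example, and the mixing weights are an extra parameter that the slide does not mention.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """Fit K 1-D Gaussians by EM; returns (means, std devs, mixing weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each distribution.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: weighted means, variances and mixing weights.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / n_j
            sigmas[j] = max(math.sqrt(var), 1e-6)  # floor avoids a zero std dev
            weights[j] = n_j / len(data)
    return mus, sigmas, weights

# Toy data (hypothetical): two clumps, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mus, sigmas, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print("centers:", [round(m, 2) for m in mus])
print("radii:  ", [round(s, 2) for s in sigmas])
```

Each point ends up with a probability of belonging to each cluster rather than a single label, which is the main practical difference from K-means.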
67
EM vs K-Means
Whether it is better or not, EM is obviously different than K-means…
  o …or is it?
Actually, K-means is special case of EM
  o Using distance instead of probabilities
E-step? Re-assign points to centroids
  o Like “E” in EM, this “re-shapes” clusters
M-step? Recompute centroids
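The correspondence above can be seen by writing K-means in the same two-step shape. This is a minimal sketch with hard assignments and squared Euclidean distance; the function name `kmeans`, the toy points and the starting centroids are illustrative assumptions.

```python
def kmeans(points, centroids, max_iter=100):
    """K-means in EM form; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # "E-step": assign each point to its nearest centroid (hard assignment).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # "M-step": recompute each centroid as the mean of its members.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(c)  # keep an empty cluster's old centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Toy 2-D points (hypothetical): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Where EM gives each point a probability for every cluster, this version gives weight 1 to the nearest centroid and 0 to all others.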
68
Conclusion
Clustering is fun, entertaining, very useful
  o Can explore mysterious data, and more…
And K-means is really simple
  o EM is powerful and not too difficult either
Measuring success is not so easy
  o Good clusters? And useful information?
  o Or just random noise? Can cluster anything…
Clustering is often a good starting point
  o Help us decide whether any “there” is there
69
References: K-Means
A.W. Moore, K-means and hierarchical clustering
P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006, Chapter 8, Cluster analysis: Basic concepts and algorithms
R. Jin, Cluster validation
M.J. Norusis, IBM SPSS Statistics 19 Statistical Procedures Companion, Chapter 17, Cluster analysis
70
References: EM Clustering
C.B. Do and S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology, 26(8):897-899, 2008
J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, ICSI Report TR-97-021, 1998