clustering… in general in vector space, clusters are vectors found within of a cluster vector,...
Post on 19-Dec-2015

Clustering… in General
In vector space, clusters are vectors found within a threshold distance of a cluster vector, with different techniques for determining the cluster vector and the threshold.
Clustering is unsupervised pattern classification.
Unsupervised: no correct answer or feedback is available.
Patterns: typically samples of feature vectors or matrices.
Classification: collecting the samples into groups of similar members.

Clustering Decisions
Pattern representation: feature selection (e.g., stop word removal, stemming), number of categories
Pattern proximity: distance measure on pairs of patterns
Grouping: characteristics of clusters (e.g., fuzzy, hierarchical)
Clustering algorithms embody different assumptions about these decisions and the form of clusters.

Formal Definitions
Feature vector x is a single datum of d measurements.
Hard clustering techniques assign a single class label to each x; cluster memberships are mutually exclusive.
Fuzzy clustering techniques assign a fractional degree of membership to each label for each x.

Proximity Measures
Generally, use Euclidean distance or mean squared distance.
In IR, use a similarity measure from retrieval (e.g., cosine measure for TFIDF).
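
The two measures named on this slide can be sketched in a few lines. This is a minimal illustration on plain Python sequences; the function names `euclidean` and `cosine` are chosen here and do not come from the slides, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    """Cosine similarity between two equal-length vectors.

    Returns 0.0 when either vector has zero length (an assumption of
    this sketch, not something the slides specify).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Distance is small for nearby points; cosine ignores vector length.
print(euclidean((21, 15), (31, 15)))
print(cosine((1, 2, 0), (2, 4, 0)))
```

Euclidean distance is a dissimilarity (smaller means closer), while cosine is a similarity (larger means closer), so an algorithm that looks for the "lowest distance" has to look for the highest cosine instead.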

[Jain, Murty & Flynn] Taxonomy of Clustering
Clustering
Hierarchical: Single Link, Complete Link, HAC
Partitional: Square Error (k-means), Graph Theoretic, Mixture Resolving (Expectation Maximization), Mode Seeking

Clustering Issues
Agglomerative: begin with each sample in its own cluster and merge
Divisive: begin with single cluster and split
Hard: mutually exclusive cluster membership
Fuzzy: degrees of membership in clusters
Deterministic vs. stochastic
Incremental: samples may be added to clusters
Batch: clusters created over entire sample space

Hierarchical Algorithms
Produce hierarchy of classes (taxonomy) from singleton clusters to just one cluster.
Select level for extracting cluster set.
Representation is a dendrogram.
[Figure: dendrogram over D1 to D4 with singleton clusters C1 to C4. C1 and C3 merge into C1,3 at 0.99; C2 joins to form C1,3,2 at 0.29; C4 joins to form C1,3,2,4 at 0.00.]

Complete-Link Revisited
Used to create statistical thesaurus
Agglomerative, hard, deterministic, batch
1. Start with 1 cluster/sample
2. Find two clusters with lowest distance
3. Merge two clusters and add to hierarchy
4. Repeat from 2 until termination criterion or until all clusters have merged

Single-Link
Like Complete-Link except that it uses the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum).
Single-link has a chaining effect with elongated clusters, but can construct more complex shapes.
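
The four agglomerative steps and the min/max difference between the two linkages can be sketched as follows. This is a naive illustration, not an efficient implementation: the function name `agglomerate` and its parameters `linkage` and `num_clusters` are chosen here, Euclidean distance is assumed, and "termination criterion" is simplified to a target number of clusters.

```python
import math
from itertools import combinations


def agglomerate(points, linkage=max, num_clusters=1):
    """Naive agglomerative clustering over a list of coordinate tuples.

    linkage=max gives complete-link (largest pairwise distance between
    two clusters); linkage=min gives single-link (smallest pairwise
    distance). Returns the remaining clusters and the list of merges as
    (distance, cluster_a, cluster_b) in the order they happened.
    """
    clusters = [[p] for p in points]  # step 1: one cluster per sample
    merges = []
    while len(clusters) > num_clusters:
        best = None
        # step 2: find the pair of clusters with the lowest distance
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(math.dist(p, q)
                        for p in clusters[a] for q in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # step 3: merge the pair and record it in the hierarchy
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters)
                    if i not in (a, b)] + [merged]
    return clusters, merges


# A chain of points with slowly growing gaps shows the chaining effect.
chain = [(0, 0), (1, 0), (2.1, 0), (3.3, 0), (4.6, 0)]
print(agglomerate(chain, linkage=min, num_clusters=2)[0])
print(agglomerate(chain, linkage=max, num_clusters=2)[0])
```

On the chain, single-link keeps absorbing the next point into one long cluster, while complete-link splits the chain into two more compact groups.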

Example: Plot
[Figure: scatter plot of the sample points; both axes run from 0 to 50.]
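
Example: Proximity Matrix

| | (21,15) | (26,25) | (29,22) | (31,15) | (21,27) | (23,32) | (29,26) | (33,21) |
|---|---|---|---|---|---|---|---|---|
| (21,15) | 0 | 11.2 | 10.6 | 10.0 | 12.0 | 17.1 | 13.6 | 13.4 |
| (26,25) | | 0 | 4.2 | 11.2 | 5.4 | 7.6 | 3.2 | 8.1 |
| (29,22) | | | 0 | 7.3 | 9.4 | 11.7 | 4.0 | 4.1 |
| (31,15) | | | | 0 | 15.6 | 18.8 | 11.2 | 6.3 |
| (21,27) | | | | | 0 | 5.4 | 8.1 | 13.4 |
| (23,32) | | | | | | 0 | 8.5 | 14.9 |
| (29,26) | | | | | | | 0 | 6.4 |
| (33,21) | | | | | | | | 0 |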
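
The entries of a proximity matrix like this one can be recomputed from the eight listed points. The sketch below assumes the entries are Euclidean distances rounded to one decimal place; the name `proximity_matrix` is chosen here for illustration.

```python
import math

# The eight sample points that label the rows and columns of the matrix.
points = [(21, 15), (26, 25), (29, 22), (31, 15),
          (21, 27), (23, 32), (29, 26), (33, 21)]


def proximity_matrix(pts):
    """Full symmetric matrix of Euclidean distances, rounded to 0.1."""
    return [[round(math.dist(p, q), 1) for q in pts] for p in pts]


matrix = proximity_matrix(points)
for p, row in zip(points, matrix):
    print(p, row)
```

The slide shows only the upper triangle because the matrix is symmetric and the diagonal is zero.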
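
Complete-Link Solution
[Figure: complete-link dendrogram over the 16 sample points (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merges labelled C1 through C15.]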
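
Single-Link Solution
[Figure: single-link dendrogram over the same 16 sample points (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merges labelled C1 through C15.]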

Hierarchical Agglomerative Clustering (HAC)
Agglomerative, hard, deterministic, batch
1. Start with 1 cluster/sample and compute a proximity matrix between pairs of clusters.
2. Merge most similar pair of clusters and update proximity matrix.
3. Repeat 2 until all clusters merged.
Difference is in how the proximity matrix is updated.
Ability to combine benefits of both single and complete link algorithms.

HAC for IR
Intra-cluster Similarity
$$c = \frac{1}{|S|} \sum_{d \in S} d, \qquad \mathrm{Sim}(X) = \sum_{d \in X} \cos(d, c)$$
where S is TFIDF vectors for documents, c is centroid of cluster X, and d is a document.
Proximity is similarity of all documents to the cluster centroid.
Select pair of clusters that produces the smallest decrease in similarity, e.g., if merge(X,Y) => Z, then max[Sim(Z) - (Sim(X) + Sim(Y))]
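
The intra-cluster similarity and the merge criterion can be sketched as below. Documents are assumed to be equal-length lists of TFIDF weights; the names `centroid`, `cos`, `intra_cluster_sim`, and `merge_gain` are chosen here for illustration and are not from the slides.

```python
import math


def centroid(docs):
    """Component-wise mean of a non-empty list of document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]


def cos(x, y):
    """Cosine similarity; 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def intra_cluster_sim(cluster):
    """Sim(X): sum of cos(d, c) over the documents d in the cluster."""
    c = centroid(cluster)
    return sum(cos(d, c) for d in cluster)


def merge_gain(x, y):
    """Sim(Z) - (Sim(X) + Sim(Y)) for Z = merge(X, Y).

    The value is at most zero; HAC picks the pair that maximizes it,
    i.e. the pair whose merge loses the least similarity.
    """
    return intra_cluster_sim(x + y) - (intra_cluster_sim(x) + intra_cluster_sim(y))


print(merge_gain([[1, 0]], [[1, 0]]))  # identical documents
print(merge_gain([[1, 0]], [[0, 1]]))  # orthogonal documents
```

Merging identical documents costs nothing, while merging orthogonal documents gives a clearly negative gain, so the first pair would be preferred.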
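
HAC for IR: Alternatives
Centroid Similarity: cosine similarity between the centroids of the two clusters
$$c = \frac{1}{|S|} \sum_{d \in S} d, \qquad \mathrm{Sim}(X, Y) = \cos(c_X, c_Y)$$
UPGMA
$$\mathrm{Sim}(X, Y) = \frac{\sum_{d_1 \in X,\, d_2 \in Y} \cos(d_1, d_2)}{|X| * |Y|}$$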
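
Both alternatives can be sketched side by side. As before, documents are assumed to be equal-length lists of TFIDF weights, and the names `centroid_sim` and `upgma_sim` are chosen here for illustration.

```python
import math


def centroid(docs):
    """Component-wise mean of a non-empty list of document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]


def cos(x, y):
    """Cosine similarity; 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def centroid_sim(x, y):
    """Cosine similarity between the centroids of clusters x and y."""
    return cos(centroid(x), centroid(y))


def upgma_sim(x, y):
    """Average pairwise cosine similarity across the two clusters."""
    total = sum(cos(d1, d2) for d1 in x for d2 in y)
    return total / (len(x) * len(y))


x = [[1, 0], [0, 1]]
y = [[1, 1]]
print(centroid_sim(x, y))
print(upgma_sim(x, y))
```

The two measures can disagree: here the centroid of x points in the same direction as the single document in y, so the centroid similarity is high even though each individual pair is less similar.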

Partitional Algorithms
Results in set of unrelated clusters.
Issues:
how many clusters is enough?
how to search space of possible partitions?
what is appropriate clustering criterion?
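
K Means
Number of clusters is set by user to be k.
Non-deterministic
Clustering criterion is squared error:
$$e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$$
where S is the document set, L is a clustering, K is the number of clusters, n_j is the number of documents in the jth cluster, x_i^(j) is the ith document in the jth cluster, and c_j is the centroid of the jth cluster.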

k-Means Clustering Algorithm
1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2.
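
The four steps can be sketched as below, together with the squared-error criterion. This is a minimal illustration under stated assumptions: Euclidean distance, "no change in cluster composition" as the convergence test, and an empty cluster keeping its previous centroid. The names `k_means`, `squared_error`, `seed`, and `max_iter` are chosen here and are not from the slides.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random samples
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each pattern to the closest centroid
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # step 4: stop when cluster composition no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids


def squared_error(points, assignment, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignment))


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(pts, 2)
print(labels, centers, squared_error(pts, labels, centers))
```

Because the initial centroids are random, different seeds can lead to different local minima of the squared error on less clearly separated data, which is why the method is described as non-deterministic.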

Example: K-Means Solutions
[Figure: k-means solutions plotted over the sample points; both axes run from 0 to 50.]
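
k-Means Sensitivity to Initialization
[Figure: seven points labelled A through G, with two alternative clusterings drawn in red and yellow.]
K=3; red started with A, D, F; yellow started with A, B, C.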

k-Means for IR
Update centroids incrementally
Calculate centroid as with hierarchical methods.
Can refine into a divisive hierarchical method (bisecting k-means) by starting with a single cluster and splitting using k-means until it forms k clusters with the highest summed similarities.
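
A rough sketch of the bisecting idea follows. Several details are assumptions of this sketch rather than statements from the slides: the cluster chosen for splitting is the one with the largest squared error, each split is tried a few times with 2-means, and the best split is the one with the lowest squared error (used here as a stand-in for "highest summed similarities"). All function and parameter names are illustrative.

```python
import math
import random


def mean(pts):
    """Component-wise mean of a non-empty list of coordinate tuples."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))


def sse(pts):
    """Squared error of a cluster around its own centroid."""
    c = mean(pts)
    return sum(math.dist(p, c) ** 2 for p in pts)


def two_means(pts, rng, max_iter=100):
    """Split pts into two lists with plain k-means, k = 2."""
    centroids = rng.sample(pts, 2)
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in pts:
            nearer_first = math.dist(p, centroids[0]) <= math.dist(p, centroids[1])
            halves[0 if nearer_first else 1].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; the caller discards it
        new_centroids = [mean(halves[0]), mean(halves[1])]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return halves


def bisecting_k_means(points, k, trials=5, seed=0):
    """Start with one cluster and bisect until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        target = max(splittable, key=sse)  # assumption: split the worst cluster
        best = None
        for _ in range(trials):
            a, b = two_means(target, rng)
            if a and b:
                cost = sse(a) + sse(b)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        if best is None:
            break  # no usable split was found
        clusters.remove(target)
        clusters += [best[1], best[2]]
    return clusters


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
       (30, 0), (30, 1), (31, 0)]
for cluster in bisecting_k_means(pts, 3):
    print(cluster)
```

Recording the order of the splits would give the divisive hierarchy; this sketch keeps only the final flat set of clusters.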

Other Types of Clustering Algorithms
Graph Theoretic: construct minimal spanning tree and delete edges with largest lengths
Expectation Maximization (EM): assume clusters are drawn from distributions, use maximum likelihood to estimate parameters of distributions.
Nearest Neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as distance is below a set threshold.
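
The graph-theoretic approach can be sketched as below. The sketch assumes Euclidean edge lengths on the complete graph, builds the minimal spanning tree with a simple form of Prim's algorithm, and deletes the k-1 longest edges so that k connected components remain; choosing the number of deleted edges through k is an assumption of this sketch, and the function names are illustrative.

```python
import math


def mst_edges(points):
    """Minimal spanning tree edges as (length, i, j), via Prim's algorithm."""
    n = len(points)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # shortest edge that leaves the current tree
        d, i, j = min((math.dist(points[i], points[j]), i, j)
                      for i in in_tree
                      for j in range(n) if j not in in_tree)
        edges.append((d, i, j))
        in_tree.add(j)
    return edges


def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and return the resulting clusters."""
    kept = sorted(mst_edges(points))[:max(len(points) - k, 0)]
    parent = list(range(len(points)))

    def find(i):
        # union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(find(idx), []).append(p)
    return list(groups.values())


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(pts, 2))
```

Cutting the longest edge separates the two groups because the only long edge in the tree is the bridge between them.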

Comparison of Clustering Algorithms [Steinbach et al.]
Implement 3 versions of HAC and 2 versions of k-Means
Compare performance on documents hand labelled as relevant to one of a set of classes.
Well known data sets (TREC)
Found that UPGMA is best of hierarchical, but bisecting k-means seems to do better if considered over many runs.
M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.

Evaluation Metrics 1
Evaluation: how to measure cluster quality?
Entropy:
$$E_j = -\sum_{i} p_{ij} \log(p_{ij}), \qquad E_{CS} = \sum_{j=1}^{m} \frac{n_j * E_j}{n}$$
where p_ij is the probability that a member of cluster j belongs to class i, n_j is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
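
The entropy measure can be sketched as below. Each cluster is represented as the list of hand-assigned class labels of its members, p_ij is estimated as the fraction of cluster j carrying label i, and the natural logarithm is used; the log base and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """E_j: entropy of the class labels found in one cluster."""
    n_j = len(labels)
    return -sum((count / n_j) * math.log(count / n_j)
                for count in Counter(labels).values())


def total_entropy(clusters):
    """E_CS: cluster entropies weighted by cluster size n_j / n."""
    n = sum(len(labels) for labels in clusters)
    return sum(len(labels) * cluster_entropy(labels)
               for labels in clusters) / n


pure = [["a", "a"], ["b", "b"]]
mixed = [["a", "b"], ["a", "b"]]
print(total_entropy(pure))
print(total_entropy(mixed))
```

Lower values are better: a solution whose clusters each contain a single class scores zero, and mixing classes within clusters raises the score.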

Comparison Measure 2
F measure: combines precision and recall
treat each cluster as the result of a query and each class as the relevant set of docs
$$\mathrm{Recall}(i, j) = n_{ij} / n_i, \qquad \mathrm{Precision}(i, j) = n_{ij} / n_j$$
$$F(i, j) = \frac{2 * \mathrm{Recall}(i, j) * \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)}$$
$$F = \sum_{i} \frac{n_i}{n} \max_{j} [F(i, j)]$$
n_ij is # of members of class i in cluster j, n_j is # in j, n_i is # in i, n is # of docs.
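
The F measure can be sketched as below. Classes and clusters are assumed to be given as lists of sets of document ids, with every document in exactly one class; the function name `f_measure` and this representation are chosen here for illustration.

```python
def f_measure(classes, clusters):
    """Overall F: size-weighted best F(i, j) for each class i."""
    n = sum(len(class_docs) for class_docs in classes)
    total = 0.0
    for class_docs in classes:
        best = 0.0
        for cluster_docs in clusters:
            n_ij = len(class_docs & cluster_docs)
            if n_ij == 0:
                continue  # F(i, j) is zero when nothing is shared
            recall = n_ij / len(class_docs)
            precision = n_ij / len(cluster_docs)
            f_ij = 2 * recall * precision / (precision + recall)
            best = max(best, f_ij)
        total += len(class_docs) / n * best
    return total


classes = [{1, 2}, {3, 4}]
print(f_measure(classes, [{1, 2}, {3, 4}]))  # clusters match the classes
print(f_measure(classes, [{1, 2, 3, 4}]))    # everything in one cluster
```

Unlike entropy, higher values are better here: a clustering that reproduces the classes exactly scores highest, and lumping everything into one cluster is penalized through precision.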