Clustering (Part One)
2004/05/03
Ku-Yaw Chang
[email protected]
Assistant Professor, Department of Computer Science and Information Engineering
Da-Yeh University

Outline
  Introduction
  Hierarchical Clustering
  Partitional Clustering

Introduction
Supervised learning
  Training set
Unsupervised learning
  Divide samples into naturally occurring groups or clusters based on measures of similarity, without any prior knowledge of class membership

Introduction
Clustering
  Grouping samples so that the samples are similar within each group.
  The groups are called clusters.
In image analysis
  Used to find groups of pixels with similar gray levels, colors, or local textures
  To discover various regions in the image

Introduction
Hierarchical Clustering
  From bottom to top
Partitional Clustering
  From top to bottom
  The number of clusters to be constructed is specified in advance.

Hierarchical Clustering
A hierarchy can be represented by a tree structure.
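[Figure: a tree of animal categories (node labels: Animals; Dogs, Cats; Large, Small; St. Bernard, Labrador; LongHair, ShortHair) beside a dendrogram of samples 1 to 5 with a vertical axis labeled Level, marked 0 to 5]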

Hierarchical Clustering
A clustering process that organizes the data into large groups, which contain smaller groups, and so on.
May be drawn as a tree or dendrogram.
The finest grouping
  At the bottom of the dendrogram
The coarsest grouping
  At the top of the dendrogram

Hierarchical Clustering
At level 0: {1}, {2}, {3}, {4}, {5}
At level 1: {1, 2}, {3}, {4}, {5}
At level 2: {1, 2}, {3}, {4, 5}
At level 3: {1, 2, 3}, {4, 5}
At level 4: {1, 2, 3, 4, 5}
[Figure: dendrogram of samples 1 to 5 with a vertical axis labeled Level, shown beside the animal-category tree]

Agglomerative Clustering Algorithm
1. Begin with n clusters, each consisting of one sample.
2. Repeat step 3 a total of n-1 times.
3. Find the most similar clusters C_i and C_j and merge C_i and C_j into one cluster.
   If there is a tie, merge the first pair found.
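
The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not code from the lecture: the function names `agglomerate` and `smallest_gap`, the 1-D data, and the choice of "smallest gap between members" as the cluster distance are all assumptions made for the example.

```python
from itertools import combinations

def agglomerate(samples, cluster_distance):
    """Return the clustering at every level, from n singletons to one cluster."""
    clusters = [[s] for s in samples]            # step 1: n clusters of one sample
    levels = [[list(c) for c in clusters]]
    for _ in range(len(samples) - 1):            # step 2: repeat step 3 n-1 times
        # step 3: pick the closest pair; min() keeps the first pair found on a tie
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge C_j into C_i (i < j)
        del clusters[j]
        levels.append([list(c) for c in clusters])
    return levels

def smallest_gap(c1, c2):
    """Hypothetical cluster distance for 1-D data: closest pair of members."""
    return min(abs(a - b) for a in c1 for b in c2)

# Hypothetical 1-D samples, not taken from the slides.
levels = agglomerate([1, 2, 6, 7, 15], smallest_gap)
for level, clustering in enumerate(levels):
    print(level, clustering)
```

Passing the cluster distance in as a function keeps the merge loop unchanged when a different similarity measure is used.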

Hierarchical Clustering Algorithm
Different methods to determine the similarity of clusters.
  Define a function that measures distance between clusters
The most popular distance measures are Euclidean distance and city block distance.

Euclidean Distance
n-dimensional feature space
  The distance between two points a = (a_1, …, a_n) and b = (b_1, …, b_n) is defined by
  $$d_e(a, b) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2}$$
To save computing time, the square root is often not actually computed, because squared distances rank pairs of points in the same order.
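
As a small illustration of the definition and of the remark about the square root, the sketch below computes both the Euclidean distance and its square. The function names and the sample points are hypothetical, chosen only for the example.

```python
import math

def euclidean(a, b):
    """d_e(a, b): square root of the summed squared feature differences."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def squared_euclidean(a, b):
    """Same sum without the square root; cheaper, and order-preserving."""
    return sum((bi - ai) ** 2 for ai, bi in zip(a, b))

origin = (0, 0)
points = [(3, 4), (1, 1), (6, 8)]   # hypothetical points

# Sorting by distance or by squared distance gives the same ranking.
by_distance = sorted(points, key=lambda p: euclidean(origin, p))
by_squared = sorted(points, key=lambda p: squared_euclidean(origin, p))

print(euclidean(origin, (3, 4)))    # 5.0
print(by_distance == by_squared)    # True
```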

City Block Distance
The sum of the absolute differences in each feature.
  $$d_{cb}(a, b) = \sum_{i=1}^{n} |b_i - a_i|$$
Also called
  Manhattan metric
  Taxicab distance
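
A minimal sketch of the city block distance, assuming points are given as equal-length tuples; the function name and the example points are hypothetical.

```python
def city_block(a, b):
    """d_cb(a, b): sum of absolute differences over all features."""
    return sum(abs(bi - ai) for ai, bi in zip(a, b))

# Hypothetical points: |4 - 1| + |6 - 2| = 3 + 4
print(city_block((1, 2), (4, 6)))   # 7
```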

Hierarchical Clustering
The Single-Linkage Algorithm
Also known as
  The minimum method
  The nearest neighbor method
The distance between two clusters
  The smallest distance between two points such that one point is in each cluster
  $$D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$$
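
The single-linkage distance can be sketched directly from its definition. The function name, the default point distance (`math.dist`, which is Euclidean and needs Python 3.8 or later), and the two clusters are assumptions made for illustration.

```python
import math

def single_linkage(ci, cj, d=math.dist):
    """D_SL(C_i, C_j): smallest point distance with one point in each cluster."""
    return min(d(a, b) for a in ci for b in cj)

# Hypothetical clusters on a line; the closest pair is (1, 0) and (4, 0).
ci = [(0, 0), (1, 0)]
cj = [(4, 0), (9, 0)]
print(single_linkage(ci, cj))   # 3.0
```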

Hierarchical Clustering
The Single-Linkage Algorithm
Use Euclidean distance
Clusters: {1}, {2}, {3}, {4}, {5}

Sample    X     Y
1         4     4
2         8     4
3        15     8
4        24     4
5        24    12

        1      2      3      4      5
1       -     4.0   11.7   20.0   21.5
2      4.0     -     8.1   16.0   17.9
3     11.7    8.1     -     9.8    9.8
4     20.0   16.0    9.8     -     8.0
5     21.5   17.9    9.8    8.0     -
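
The distance table can be recomputed from the sample coordinates. The sketch below is one way to do it, assuming Python 3.8 or later for `math.dist`; the variable names are illustrative only.

```python
import math

# Sample coordinates from the slide.
samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

# Euclidean distance for every ordered pair of distinct samples.
matrix = {
    (i, j): math.dist(samples[i], samples[j])
    for i in samples for j in samples if i != j
}

# Print one row per sample, rounded to one decimal place.
for i in samples:
    row = ["    -" if i == j else f"{matrix[i, j]:5.1f}" for j in samples]
    print(i, " ".join(row))
```

The table is symmetric, so the entry for samples 2 and 5 is the same in both rows.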

Hierarchical Clustering
The Single-Linkage Algorithm
Clusters: {1, 2}, {3}, {4}, {5}

        {1,2}    3      4      5
{1,2}     -     8.1   16.0   17.9
3        8.1     -     9.8    9.8
4       16.0    9.8     -     8.0
5       17.9    9.8    8.0     -

Hierarchical Clustering
The Single-Linkage Algorithm
Clusters: {1, 2}, {3}, {4, 5}

        {1,2}    3    {4,5}
{1,2}     -     8.1   16.0
3        8.1     -     9.8
{4,5}   16.0    9.8     -

Hierarchical Clustering
The Single-Linkage Algorithm
Clusters: {1, 2, 3}, {4, 5}

          {1,2,3}  {4,5}
{1,2,3}      -      9.8
{4,5}       9.8      -

Final merge: {1, 2, 3, 4, 5}
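
The table reductions on these slides can be reproduced with a short script. This is a sketch under stated assumptions, not the lecture's own code: clusters are represented as tuples of sample numbers, the helper `lookup` is hypothetical, and each new table row is formed as the element-wise minimum of the two rows being merged, which is how single linkage updates the table.

```python
import math
from itertools import combinations

samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

clusters = [(k,) for k in samples]                 # level 0: singletons
dist = {(a, b): math.dist(samples[a[0]], samples[b[0]])
        for a, b in combinations(clusters, 2)}

def lookup(a, b):
    """Read the symmetric table regardless of key order."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

merges = []
while len(clusters) > 1:
    # Closest pair of current clusters (first pair found wins a tie).
    a, b = min(combinations(clusters, 2), key=lambda p: lookup(*p))
    merged = a + b
    merges.append((merged, round(lookup(a, b), 1)))
    rest = [c for c in clusters if c not in (a, b)]
    for c in rest:
        # Single linkage: new distance is the smaller of the two old ones.
        dist[(merged, c)] = min(lookup(a, c), lookup(b, c))
    clusters = [merged] + rest

for merged, d in merges:
    print(merged, d)
```

The merge order is {1, 2}, then {4, 5}, then {1, 2, 3}, then all five samples, matching levels 1 to 4 of the dendrogram.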

Hierarchical Clustering
The Complete-Linkage Algorithm
Also known as
  The maximum method
  The farthest neighbor method
The distance between two clusters
  The largest distance between two points such that one point is in each cluster
  $$D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$$
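
A corresponding sketch for the complete-linkage distance, again with a hypothetical function name, hypothetical clusters, and `math.dist` (Euclidean, Python 3.8 or later) assumed as the point distance.

```python
import math

def complete_linkage(ci, cj, d=math.dist):
    """D_CL(C_i, C_j): largest point distance with one point in each cluster."""
    return max(d(a, b) for a in ci for b in cj)

# Hypothetical clusters on a line; the farthest pair is (0, 0) and (9, 0).
ci = [(0, 0), (1, 0)]
cj = [(4, 0), (9, 0)]
print(complete_linkage(ci, cj))   # 9.0
```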

Hierarchical Clustering
The Complete-Linkage Algorithm
Use Euclidean distance
Clusters: {1}, {2}, {3}, {4}, {5}

Sample    X     Y
1         4     4
2         8     4
3        15     8
4        24     4
5        24    12

        1      2      3      4      5
1       -     4.0   11.7   20.0   21.5
2      4.0     -     8.1   16.0   17.9
3     11.7    8.1     -     9.8    9.8
4     20.0   16.0    9.8     -     8.0
5     21.5   17.9    9.8    8.0     -

Hierarchical Clustering
The Complete-Linkage Algorithm
Clusters: {1, 2}, {3}, {4}, {5}

        {1,2}    3      4      5
{1,2}     -    11.7   20.0   21.5
3       11.7     -     9.8    9.8
4       20.0    9.8     -     8.0
5       21.5    9.8    8.0     -

Hierarchical Clustering
The Complete-Linkage Algorithm
Clusters: {1, 2}, {3}, {4, 5}

        {1,2}    3    {4,5}
{1,2}     -    11.7   21.5
3       11.7     -     9.8
{4,5}   21.5    9.8     -

Hierarchical Clustering
The Complete-Linkage Algorithm
Clusters: {1, 2}, {3, 4, 5}

          {1,2}  {3,4,5}
{1,2}       -     21.5
{3,4,5}   21.5      -

Final merge: {1, 2, 3, 4, 5}

Problem
A cluster contains three samples at (0,1), (0,2), and (0,3). Another cluster contains samples at (1,7), (1,8), and (1,9).
(a) What is the single-linkage distance between the clusters if city block distance is used?
(b) What is the single-linkage distance between the clusters if Euclidean distance is used?
(c) What is the complete-linkage distance between the clusters if city block distance is used?
(d) What is the complete-linkage distance between the clusters if Euclidean distance is used?

Hierarchical Clustering
The Average-Linkage Algorithm
Also known as UPGMA
  Unweighted pair-group method using arithmetic averages
The distance between two clusters
  The average distance between two points such that one point is in each cluster
  $$D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b)$$
  where n_i and n_j are the numbers of samples in C_i and C_j
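
The average-linkage distance can likewise be sketched from its definition. The function name and the two clusters are hypothetical, and `math.dist` (Euclidean, Python 3.8 or later) is assumed as the point distance.

```python
import math

def average_linkage(ci, cj, d=math.dist):
    """D_AL(C_i, C_j): mean of all n_i * n_j cross-cluster point distances."""
    total = sum(d(a, b) for a in ci for b in cj)
    return total / (len(ci) * len(cj))

# Hypothetical clusters on a line; pair distances are 4, 9, 3, and 8.
ci = [(0, 0), (1, 0)]
cj = [(4, 0), (9, 0)]
print(average_linkage(ci, cj))   # 6.0
```

For these clusters the value 6.0 lies between the single-linkage distance (3.0) and the complete-linkage distance (9.0), as an average must.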

Hierarchical Clustering
The Average-Linkage Algorithm
Use Euclidean distance
Clusters: {1}, {2}, {3}, {4}, {5}

Sample    X     Y
1         4     4
2         8     4
3        15     8
4        24     4
5        24    12

        1      2      3      4      5
1       -     4.0   11.7   20.0   21.5
2      4.0     -     8.1   16.0   17.9
3     11.7    8.1     -     9.8    9.8
4     20.0   16.0    9.8     -     8.0
5     21.5   17.9    9.8    8.0     -

Hierarchical Clustering
The Average-Linkage Algorithm
Clusters: {1, 2}, {3}, {4}, {5}

        {1,2}    3      4      5
{1,2}     -     9.9   18.0   19.7
3        9.9     -     9.8    9.8
4       18.0    9.8     -     8.0
5       19.7    9.8    8.0     -
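
As a worked check of the {1,2} row, each entry is the mean of the two point distances from samples 1 and 2, using the rounded values of the original distance table (so the last digit may differ slightly from an unrounded computation):

```latex
\begin{align*}
D_{AL}(\{1,2\},\{3\}) &= \tfrac{1}{2}\bigl(d(1,3) + d(2,3)\bigr) = \tfrac{1}{2}(11.7 + 8.1) = 9.9 \\
D_{AL}(\{1,2\},\{4\}) &= \tfrac{1}{2}\bigl(d(1,4) + d(2,4)\bigr) = \tfrac{1}{2}(20.0 + 16.0) = 18.0 \\
D_{AL}(\{1,2\},\{5\}) &= \tfrac{1}{2}\bigl(d(1,5) + d(2,5)\bigr) = \tfrac{1}{2}(21.5 + 17.9) = 19.7
\end{align*}
```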

Hierarchical Clustering
The Average-Linkage Algorithm
Clusters: {1, 2}, {3}, {4, 5}

        {1,2}    3    {4,5}
{1,2}     -     9.9   18.9
3        9.9     -     9.8
{4,5}   18.9    9.8     -

Hierarchical Clustering
The Average-Linkage Algorithm
Clusters: {1, 2}, {3, 4, 5}

          {1,2}  {3,4,5}
{1,2}       -     15.9
{3,4,5}   15.9      -

Final merge: {1, 2, 3, 4, 5}

Problem
Compute the average-linkage distance between the two clusters { (3,4), (5,6) } and { (1,1), (2,2) }
(a) Using city block distance between points.
(b) Using Euclidean distance between points.