Clustering: Introduction. Adriano Joaquim de O. Cruz, ©2002 NCE/UFRJ, [email protected]
©2001 Adriano Cruz, NCE e IM - UFRJ
What is cluster analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects.
The class label of each class is unknown.
Classification separates objects into classes when the labels are known.
What is cluster analysis? cont.
Clustering is a form of learning by observation.
Neural networks learn by examples; clustering is unsupervised learning.
Applications
In business, clustering helps to discover distinct groups of customers.
In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
It also serves as a pre-processing step for classification and for pattern recognition.
Requirements
Scalability: work with large databases.
Ability to deal with different types of attributes (not only interval-based data).
Clusters of arbitrary shape, not only spherical.
Minimal requirements for domain knowledge.
Ability to deal with noisy data.
Requirements cont.
Insensitivity to the order of input records.
Ability to work with samples of high dimensionality.
Constraint-based clustering.
Interpretability and usability: results should be easily interpretable.
Sensitivity to Input Order
Some algorithms are sensitive to the order of the input data.
The Leader algorithm is an example. (Figure: the same points clustered differently under two presentation orders. Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3.)
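The order sensitivity can be illustrated with a minimal sketch of the Leader algorithm (not code from the slides; the single distance threshold and first-fit assignment rule are assumptions for the illustration):

```python
def leader_clustering(points, threshold):
    """Leader algorithm: assign each point to the first existing leader
    within `threshold`; otherwise the point becomes a new leader."""
    leaders = []   # one representative (the leader) per cluster
    clusters = []  # list of clusters, each a list of points
    for p in points:
        for i, lead in enumerate(leaders):
            if abs(p - lead) <= threshold:
                clusters[i].append(p)
                break
        else:
            leaders.append(p)
            clusters.append([p])
    return clusters

# Same four points, two input orders, same threshold:
a = leader_clustering([0.0, 1.0, 2.0, 3.0], 1.5)
b = leader_clustering([2.0, 1.0, 0.0, 3.0], 1.5)
print(a)  # [[0.0, 1.0], [2.0, 3.0]]
print(b)  # [[2.0, 1.0, 3.0], [0.0]]
```

The two runs produce different partitions of the same data, which is exactly the sensitivity the slide describes.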
Heuristic Clustering Techniques
Incomplete or heuristic clustering: geometrical methods or projection techniques.
Dimension reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
Heuristic methods based on visualisation are then used to determine the clusters.
Deterministic Crisp Clustering
Each datum is assigned to exactly one cluster.
Each cluster partition defines an ordinary partition of the data set.
Overlapping Crisp Clustering
Each datum is assigned to at least one cluster.
Elements may belong to more than one cluster at the same time.
Probabilistic Clustering
For each element, a probability distribution over the clusters is determined.
The distribution specifies the probability with which a datum is assigned to a cluster.
If the probabilities are interpreted as degrees of membership, then these are fuzzy clustering techniques.
Possibilistic Clustering
Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.
Possibilistic cluster analysis drops the constraint that the memberships of each datum over all clusters sum to one.
Hierarchical Clustering
Descending (divisive) techniques divide the data into ever finer-grained classes.
Ascending (agglomerative) techniques combine small classes into coarser-grained ones.
Objective Function Clustering
An objective function assigns to each cluster partition a value that has to be optimised.
This is strictly an optimisation problem.
Data Types
Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
Binary variables have only two states. Ex.: smoker, fever, client, owner.
Nominal variables are a generalisation of binary variables with m states. Ex.: map colour, marital status.
Data Types cont.
Ordinal variables are ordered nominal variables. Ex.: Olympic medals, professional ranks.
Ratio-scaled variables have a non-linear scale. Ex.: growth of a bacteria population.
Interval-scaled variables
Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
Interval-scaled variables depend on the units used.
The measurement unit can affect the analysis, so standardisation should be used.
Problems

Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160

With height in centimetres, the height differences (30) dominate the age differences (5); with height in metres the opposite would hold. The grouping depends on the units.
Standardisation
Converting original measurements to unitless values.
Attempts to give all variables equal weight.
Useful when there is no prior knowledge of the data.
Standardisation algorithm
Z-scores indicate how far and in what direction an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
The transformed scores have a mean of zero and a standard deviation of one.
Z-scores are useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.
Standardisation algorithm
Consider n values of a variable x.
Calculate the mean value:
  mean(x) = (1/n) * sum_{i=1..n} x_i
Calculate the standard deviation:
  s_x = sqrt( (1/n) * sum_{i=1..n} (x_i - mean(x))^2 )
Calculate the z-score:
  z_i = (x_i - mean(x)) / s_x
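The three steps above can be sketched in Python (a minimal illustration, not code from the slides; it uses the population standard deviation, dividing by n, as in the formula above):

```python
import math

def z_scores(values):
    """Standardise values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Mean is 5 and standard deviation is 2 for this sample:
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```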
Z-scores example

Sample   Heights  Ages   z-heights  z-ages
1        137.16   10     -0.45      -0.61
2        195.58   25      1.58       0.39
3        170.18   55      0.70       2.39
4        172.73   32      0.79       0.86
5        116.84    8     -1.16      -0.74
6        162.56   11      0.43      -0.54
7        157.48    9      0.26      -0.67
8        142.24   15     -0.28      -0.27
9         96.52    7     -1.87      -0.81
Means    150.14   19.11   0.00       0.00
Std Dev   28.67   15.01   1.00       1.00
Real heights and ages charts
[Chart: "Real Heights and Ages" — heights and ages plotted per sample; vertical axis 0 to 200.]
Z-scores for heights and ages
[Chart: "Z-scores for heights and ages" — z-heights and z-ages plotted per sample; vertical axis -2 to 2.5.]
Data chart
[Scatter plot: real data — ages (0 to 60) against heights (0 to 250).]
Data chart
[Scatter plot: z-score data — z-ages against z-heights, both roughly -3 to 3.]
Data Matrices
Data matrix: represents n objects with p characteristics. Ex.: person = {age, sex, income, ...}
Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.
Dissimilarities
Dissimilarity measures some form of distance between objects.
Clustering algorithms use dissimilarities to cluster data.
How can dissimilarities be measured?
How to calculate dissimilarities?
The most popular methods are based on the distance between pairs of objects.
Minkowski distance:
  d(x_i, x_k) = ( sum_{j=1..p} |x_ij - x_kj|^q )^(1/q)
p is the number of characteristics; q is the distance type: q = 2 (Euclidean distance), q = 1 (Manhattan).
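The Minkowski distance can be transcribed directly (an illustrative sketch, not code from the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two feature vectors:
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```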
Similarities
It is also possible to work with similarities s(x_i, x_j):
  0 <= s(x_i, x_j) <= 1
  s(x_i, x_i) = 1
  s(x_i, x_j) = s(x_j, x_i)
It is possible to consider that d(x_i, x_j) = 1 - s(x_i, x_j).
Distances
Distances from each sample to the mean, on raw data and on z-scores (Euclidean q = 2, Manhattan q = 1):

Sample   Heights  Ages   Z-heights  Z-ages  Euclidean  Manhattan  Z-Euclidean  Z-Manhattan
1        137.16   10     -0.45      -0.61   15.8613    22.0944    0.7574       1.0599
2        195.58   25      1.58       0.39   45.8167    51.3256    1.6325       1.9770
3        170.18   55      0.70       2.39   41.1033    55.9256    2.4915       3.0903
4        172.73   32      0.79       0.86   26.0054    35.4756    1.1654       1.6466
5        116.84    8     -1.16      -0.74   35.1080    44.4144    1.3774       1.9018
6        162.56   11      0.43      -0.54   14.8312    20.5278    0.6926       0.9735
7        157.48    9      0.26      -0.67   12.4924    17.4478    0.7207       0.9296
8        142.24   15     -0.28      -0.27    8.9086    12.0144    0.3886       0.5496
9         96.52    7     -1.87      -0.81   54.9740    65.7344    2.0368       2.6771
Means    150.14   19.11   0.0000     0.0000
Std Dev   28.67   15.01   1.0000     1.0000
Dissimilarities
There are other ways to obtain dissimilarities, so we no longer speak of distances.
Basically, dissimilarities are nonnegative numbers d(i, j) that are small (close to 0) when i and j are similar.
Pearson
Pearson product-moment correlation between variables f and g:
  R(f, g) = sum_{i=1..n} (x_if - m_f)(x_ig - m_g) / [ sqrt( sum_{i=1..n} (x_if - m_f)^2 ) * sqrt( sum_{i=1..n} (x_ig - m_g)^2 ) ]
where m_f and m_g are the means of f and g.
Coefficients lie between -1 and +1.
Pearson - cont
A correlation of +1 means that there is a perfect positive linear relationship between the variables.
A correlation of -1 means that there is a perfect negative linear relationship between the variables.
A correlation of 0 means there is no linear relationship between the two variables.
Pearson - example
r_yz = 0.9861; r_yw = -0.9551; r_yr = 0.2770
Correlation and dissimilarities 1
  d(f, g) = (1 - R(f, g)) / 2    (1)
Variables with a high positive correlation (+1) receive a dissimilarity close to 0.
Variables with a strongly negative correlation will be considered very dissimilar.
Correlation and dissimilarities 2
  d(f, g) = 1 - |R(f, g)|    (2)
Variables with a high positive (+1) or negative (-1) correlation receive a dissimilarity close to 0.
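Both correlation-based dissimilarities can be sketched as follows (illustrative code, not from the slides; the variables y and w are invented for the example):

```python
import math

def pearson(f, g):
    """Pearson product-moment correlation between two variables."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    den = math.sqrt(sum((a - mf) ** 2 for a in f)) * \
          math.sqrt(sum((b - mg) ** 2 for b in g))
    return num / den

def diss1(f, g):
    """Formula (1): anticorrelated variables are maximally dissimilar."""
    return (1 - pearson(f, g)) / 2

def diss2(f, g):
    """Formula (2): any strong correlation means low dissimilarity."""
    return 1 - abs(pearson(f, g))

y = [1, 2, 3, 4]
w = [8, 6, 4, 2]                   # perfectly anticorrelated with y
print(round(pearson(y, w), 4))     # -1.0
print(round(diss1(y, w), 4))       # 1.0
print(round(diss2(y, w), 4))       # close to 0: formula (2) treats it as similar
```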
Numerical Example

Name   Weight  Height  Month  Year
Ilan   15      95      1      82
Jack   49      156     5      55
Kim    13      95      11     81
Lieve  45      160     7      56
Leon   85      178     6      48
Peter  66      176     6      56
Talia  12      90      12     83
Tina   10      78      1      84
Numerical Example 1

Corr      Weight  Height  Month  Year
Weight    1
Height    0.957   1
Month     -0.036  0.021   1
Year      -0.953  -0.985  0.013  1

Diss (1)  Weight  Height  Month  Year
Weight    0
Height    0.021   0
Month     0.518   0.489   0
Year      0.977   0.992   0.493  0

Diss (2)  Weight  Height  Month  Year
Weight    0
Height    0.043   0
Month     0.964   0.979   0
Year      0.047   0.015   0.987  0
Binary Variables
Binary variables have only two states. States can be symmetric or asymmetric.
Binary variables are symmetric if both states are equally valuable. Ex.: gender.
When the states are not equally important, the variable is asymmetric. Ex.: disease tests (1 = positive, 0 = negative).
Contingency tables
Consider objects i and j described by p binary variables:
  q variables are equal to 1 on both i and j
  r variables are 1 on i and 0 on j
  s variables are 0 on i and 1 on j
  t variables are 0 on both i and j

                 Object j
                 1      0      Sum
Object i    1    q      r      q+r
            0    s      t      s+t
           Sum   q+s    r+t    p
Symmetric Variables
Dissimilarity based on symmetric variables is invariant: the result does not change when the two states are interchanged.
Simple dissimilarity coefficient:
  d(x_i, x_j) = (r + s) / (q + r + s + t)
Symmetric Variables
Dissimilarity:
  d(x_i, x_j) = (r + s) / (q + r + s + t)
Similarity:
  s(x_i, x_j) = (q + t) / (q + r + s + t)
Asymmetric Variables
Similarity based on asymmetric variables is not invariant.
Two 1s are more important than two 0s.
Jaccard coefficient:
  d(x_i, x_j) = (r + s) / (q + r + s)
  s(x_i, x_j) = q / (q + r + s)
Computing dissimilarities

Name  Fever  Cough  Test1  Test2  Test3  Test4
Jack  Y      N      P      N      N      N
Mary  Y      N      P      N      P      N
Jim   Y      Y      N      N      N      N
Computing Dissimilarities

        Jack  Mary  q (1,1)  r (1,0)  s (0,1)  t (0,0)
Fever   Y     Y     1        0        0        0
Cough   N     N     0        0        0        1
Test1   P     P     1        0        0        0
Test2   N     N     0        0        0        1
Test3   N     P     0        0        1        0
Test4   N     N     0        0        0        1
Totals              2        0        1        3
Computing dissimilarities
Using the Jaccard coefficient d(x_i, x_j) = (r + s) / (q + r + s):
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.
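The computation above can be checked with a short sketch (illustrative; the 1/0 encoding of Y/P versus N follows the contingency table on the previous slide):

```python
def jaccard_dissimilarity(a, b):
    """Jaccard dissimilarity for asymmetric binary vectors (1/0):
    d = (r + s) / (q + r + s); joint absences (t) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Encode Y/P as 1 and N as 0 (fever, cough, test1..test4):
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```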
Nominal Variables
A nominal variable is a generalisation of the binary variable: it can take more than two states.
Ex.: marital status: married, single, divorced.
Each state can be represented by a number or letter.
There is no specific ordering.
Computing dissimilarities
Consider two objects i and j, described by nominal variables.
Each object has p characteristics; m is the number of matches.
  d(i, j) = (p - m) / p
Binarising nominal variables
A nominal variable can be encoded by creating a new binary variable for each state.
Example: marital status = {married, single, divorced}
  Married: 1 = yes, 0 = no
  Single: 1 = yes, 0 = no
  Divorced: 1 = yes, 0 = no
Ex.: marital status = {married} gives married = 1, single = 0, divorced = 0.
Ordinal variables
A discrete ordinal variable is similar to a nominal variable, except that the states are ordered in a meaningful sequence.
Ex.: bronze, silver and gold medals.
Ex.: assistant, associate, full member.
Computing dissimilarities
Consider n objects defined by a set of ordinal variables.
f is one of these ordinal variables and has M_f states.
These states define the ranking r_f in {1, ..., M_f}.
Steps to calculate dissimilarities
Assume that the value of f for the i-th object is x_if. Replace each x_if by its corresponding rank r_if in {1, ..., M_f}.
Since the number of states of each variable differs, it is often necessary to map the range onto [0.0, 1.0] using the equation
  z_if = (r_if - 1) / (M_f - 1)
Dissimilarity can then be computed using the distance measures for interval-scaled variables.
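The rank-to-[0, 1] mapping can be sketched as follows (illustrative; the medal example follows the earlier slide):

```python
def ordinal_to_interval(values, states):
    """Map ordinal values onto [0, 1]: a rank r in {1..M} becomes
    z = (r - 1) / (M - 1), so the mapped values can be treated as
    interval-scaled."""
    rank = {s: i + 1 for i, s in enumerate(states)}  # 1-based ranks
    M = len(states)
    return [(rank[v] - 1) / (M - 1) for v in values]

medals = ["bronze", "silver", "gold"]
print(ordinal_to_interval(["bronze", "silver", "gold"], medals))  # [0.0, 0.5, 1.0]
```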
Ratio-scaled variables
Variables on a non-linear scale, such as exponential.
To compute dissimilarities there are three methods:
• Treat them as interval-scaled. Not always good.
• Apply a transformation such as y = log(x) and treat the result as interval-scaled.
• Treat them as ordinal data and treat the ranks as interval-scaled.
Variables of mixed types
One technique is to bring all variables onto a common scale in the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity between i and j is
  d(i, j) = ( sum_{f=1..p} delta_ij^(f) * d_ij^(f) ) / ( sum_{f=1..p} delta_ij^(f) )
Variables of mixed types
The dissimilarity between i and j is
  d(i, j) = ( sum_{f=1..p} delta_ij^(f) * d_ij^(f) ) / ( sum_{f=1..p} delta_ij^(f) )
where
  delta_ij^(f) = 0 if x_if or x_jf does not exist,
  delta_ij^(f) = 0 if x_if = x_jf = 0 and f is asymmetric,
  delta_ij^(f) = 1 otherwise.
Variables of mixed types cont
The contribution d_ij^(f) of each variable depends on its type:
f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf; 1 otherwise.
f is interval-based:
  d_ij^(f) = |x_if - x_jf| / (max(x_f) - min(x_f))
f is ordinal or ratio-scaled: compute ranks and treat as interval-based.
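The mixed-type dissimilarity can be sketched as follows (an illustration, not code from the slides; the type-tagging scheme and the sample values are assumptions, and ordinal/ratio variables are assumed to be pre-converted to interval scale as the slide describes):

```python
def mixed_dissimilarity(x, y, types, ranges):
    """Mixed-type dissimilarity: average of per-variable contributions
    d_ij^(f), weighted by the indicator delta_ij^(f).
    `types[f]` is 'nominal', 'binary', 'asym-binary' or 'interval';
    `ranges[f]` is max - min for interval variables; None marks a
    missing value."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue  # delta = 0: value missing on one object
        if types[f] == "asym-binary" and a == 0 and b == 0:
            continue  # delta = 0: joint absence carries no information
        if types[f] == "interval":
            d = abs(a - b) / ranges[f]
        else:  # binary or nominal: simple mismatch test
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den

types = ["nominal", "interval"]
ranges = [None, 100.0]  # only interval variables need a range
print(mixed_dissimilarity(["single", 30.0], ["married", 80.0], types, ranges))
# (1 + 0.5) / 2 = 0.75
```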
Classification types
Clustering is an unsupervised method.
Clustering Methods
Partitioning
Hierarchical
Density-based
Grid-based
Model-based
Partitioning Methods
Given n objects, k partitions are created.
Each partition must contain at least one element.
An iterative relocation technique is used to improve the partitioning.
Distance is the usual criterion.
Partitioning Methods cont.
They work well for finding spherical-shaped clusters.
They are not efficient on very large databases.
K-means: each cluster is represented by the mean value of the objects in the cluster.
K-medoids: each cluster is represented by an object near the centre of the cluster.
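K-means can be sketched as follows (a minimal Lloyd's-algorithm illustration, not code from the slides; the random initialisation from data points, the fixed iteration count and the sample points are assumptions):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch for 2-D points: each cluster is
    represented by the mean of its members, and points are repeatedly
    reassigned to the nearest centre (relocation)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centre
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # update step: new means
            if members:
                centres[i] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return centres

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
# one centre per blob, approximately (0.33, 0.33) and (10.33, 10.33)
```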
Hierarchical Methods
Creates a hierarchical decomposition of the data set.
Agglomerative approaches start with each object forming a separate group, then merge objects or groups until all objects belong to one group or a termination condition occurs.
Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.
Hierarchical Clustering cont
Definition of cluster proximity:
Min: most similar pair (sensitive to noise).
Max: most dissimilar pair (tends to break large clusters).
Density-based methods
The method grows clusters as long as the density in the neighbourhood exceeds some threshold.
Able to find clusters of arbitrary shapes.
Grid-based methods
Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
Cells that contain more than a certain number of elements are treated as dense.
Dense cells are connected to form clusters.
Fast processing time, independent of the number of objects.
STING and CLIQUE are examples.
Model-based methods
Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
Examples: statistical models, SOM networks.
Partition methods
Given a database of n objects, a partition method organises them into k clusters (k <= n).
The methods try to minimise an objective function, such as distance.
Similar objects are "close" to each other.