
Clustering: Introduction

Adriano Joaquim de O Cruz ©2002

NCE/UFRJ

[email protected]

Introduction


What is cluster analysis?

The process of grouping a set of physical or abstract objects into classes of similar objects.

The class label of each class is unknown.

Classification separates objects into classes when the labels are known.


What is cluster analysis? cont.

Clustering is a form of learning by observation.

Neural networks learn by examples; clustering is unsupervised learning.


Applications

In business, clustering helps to discover distinct groups of customers.

In data mining it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.

It is also a pre-processing step for classification and a tool for pattern recognition.


Requirements

Scalability: work with large databases.

Ability to deal with different types of attributes (not only interval-based data).

Clusters of arbitrary shape, not only spherical.

Minimal requirements for domain knowledge.

Ability to deal with noisy data.


Requirements cont.

Insensitivity to the order of input records.

Ability to work with samples of high dimensionality.

Constraint-based clustering.

Interpretability and usability: results should be easily interpretable.


Sensitivity to Input Order

Some algorithms are sensitive to the order of the input data.

The Leader algorithm is an example: presenting the same points in different orders (Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3 in the original figure) can produce different clusters, as the sketch below illustrates.
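Below is a minimal sketch of a leader-style clustering pass (the threshold, the 1-D points, and the function name are illustrative assumptions, not from the original slides); running it on two orderings of the same data yields different partitions.

import math

def leader_clustering(points, threshold):
    """Assign each point to the first leader within `threshold`;
    otherwise the point becomes the leader of a new cluster."""
    leaders, clusters = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= threshold:
                clusters[i].append(p)
                break
        else:  # no leader close enough: open a new cluster
            leaders.append(p)
            clusters.append([p])
    return clusters

# Same three 1-D points, two presentation orders, threshold 1.0
print(leader_clustering([(1.0,), (0.0,), (2.0,)], 1.0))  # one cluster: all points within 1.0 of leader 1.0
print(leader_clustering([(0.0,), (2.0,), (1.0,)], 1.0))  # two clusters: {0.0, 1.0} and {2.0}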

Clustering Techniques


Heuristic Clustering Techniques

Incomplete or heuristic clustering: geometrical methods or projection techniques.

Dimension reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.

Heuristic methods based on visualisation are then used to determine the clusters, as in the sketch below.
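A minimal sketch of the projection step, assuming scikit-learn's PCA and random stand-in data (neither is specified on the slides):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # hypothetical data: 100 samples, 10 attributes

X2 = PCA(n_components=2).fit_transform(X)   # project onto the first two principal components
print(X2.shape)                       # (100, 2): ready to be scattered and inspected by eye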


Deterministic Crisp Clustering

Each datum is assigned to exactly one cluster.

Each cluster partition defines an ordinary partition of the data set.


Overlapping Crisp Clustering

Each datum is assigned to at least one cluster.

Elements may belong to more than one cluster, to various degrees.


Probabilistic Clustering

For each element, a probability distribution over the clusters is determined.

The distribution specifies the probability with which a datum is assigned to a cluster.

If the probabilities are interpreted as degrees of membership, then these are fuzzy clustering techniques.


Possibilistic Clustering

Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.

Possibilistic cluster analysis drops the constraint that the memberships of each datum over all clusters must sum to one.


Hierarchical Clustering

Descending techniques: they divide the data into more fine-grained classes.

Ascending techniques: they combine small classes into more coarse-grained ones.


Objective Function Clustering

An objective function assigns to each cluster partition a value that has to be optimised.

This is strictly an optimisation problem.
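As an illustration (the slides do not name a particular objective), the within-cluster sum of squared distances minimised by k-means-style algorithms is one such function, where C_k is the k-th cluster and v_k its centre:

J = \sum_{k=1}^{c} \sum_{x_i \in C_k} \lVert x_i - v_k \rVert^{2}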

Data Types


Data Types

Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.

Binary variables have only two states. Ex. smoker, fever, client, owner.

Nominal variables are a generalisation of binary variables to m states. Ex. map colour, marital status.


Data Types cont.

Ordinal variables are ordered nominal variables. Ex. Olympic medals, professional ranks.

Ratio-scaled variables have a non-linear scale. Ex. the growth of a bacteria population.


Interval-scaled variables

Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.

Interval-scaled variables are dependent on the units used.

The measurement unit can affect the analysis, so standardisation should be used.


Problems

Person Age (yr) Height (cm)

A 35 190

B 40 190

C 35 160

D 40 160
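To see the problem (an added illustration based on the table above): with height in centimetres, the Euclidean distance d(A, B) = sqrt((35 − 40)² + (190 − 190)²) = 5, while d(A, C) = sqrt((35 − 35)² + (190 − 160)²) = 30, so height dominates and A appears much closer to B than to C. If height were measured in metres instead, d(A, C) would shrink to 0.30 while d(A, B) stayed at 5, and A would suddenly appear closer to C: the grouping depends on the units chosen.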


Standardisation

Converting original measurements to unitless values.

Attempts to give all variables equal weight.

Useful when there is no prior knowledge of the data.


Standardisation algorithm

Z-scores indicate how far and in what direction an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.

The transformed scores have a mean of zero and a standard deviation of one.

Z-scores are useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.


Standardisation algorithm

Consider n values of a variable x.

Calculate the mean value: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Calculate the standard deviation: $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Calculate the z-score: $z_i = \frac{x_i - \bar{x}}{\sigma_x}$
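A minimal NumPy sketch of these three steps, using the height values from the example on the next slide (the use of NumPy is an assumption):

import numpy as np

heights = np.array([137.16, 195.58, 170.18, 172.73, 116.84,
                    162.56, 157.48, 142.24, 96.52])

mean = heights.mean()               # 150.14
std = heights.std()                 # population standard deviation (divide by n): 28.67
z = (heights - mean) / std          # z-scores: -0.45, 1.58, 0.70, ...
print(np.round(z, 2))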


Z-scores example

Sample   Heights   Ages   z-heights   z-ages
1        137.16    10     -0.45       -0.61
2        195.58    25      1.58        0.39
3        170.18    55      0.70        2.39
4        172.73    32      0.79        0.86
5        116.84     8     -1.16       -0.74
6        162.56    11      0.43       -0.54
7        157.48     9      0.26       -0.67
8        142.24    15     -0.28       -0.27
9         96.52     7     -1.87       -0.81

Means    150.14    19.11   0           0
Std Dev   28.67    15.01   1           1


Real heights and ages charts

[Bar chart of the raw heights and ages for samples 1 to 9; vertical axis: heights and ages, 0 to 200.]


Z-scores for heights and ages

[Bar chart of the z-scores of heights and ages for samples 1 to 9; vertical axis: z-scores, −2 to 2.5.]


Data chart

[Scatter plot of the real data: heights (x-axis, 0 to 250) versus ages (y-axis, 0 to 60).]


Data chart

[Scatter plot of the z-score data: z-heights (x-axis, −3 to 1) versus z-ages (y-axis, −1 to 3).]

Similarities


Data Matrices

Data matrix: represents n objects with p characteristics. Ex. person = {age, sex, income, ...}

Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.


Dissimilarities

Dissimilarity measures some form of distance between objects.

Clustering algorithms use dissimilarities to cluster data.

How can dissimilarities be measured?


How to calculate dissimilarities?

The most popular methods are based on the distance between pairs of objects.

Minkowski distance: p is the number of characteristics and q is the distance type; q = 2 gives the Euclidean distance and q = 1 the Manhattan distance.

$d(x_i, x_k) = \left( \sum_{j=1}^{p} \left| x_{ij} - x_{kj} \right|^{q} \right)^{1/q}$
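A short sketch of this distance (the function name and the test vectors are illustrative):

def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 2))   # 5.0 (Euclidean)
print(minkowski(a, b, 1))   # 7.0 (Manhattan)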


Similarities

It is also possible to work with similarities $s(x_i, x_j)$:

$0 \le s(x_i, x_j) \le 1$

$s(x_i, x_i) = 1$

$s(x_i, x_j) = s(x_j, x_i)$

It is possible to consider that $d(x_i, x_j) = 1 - s(x_i, x_j)$.


Distances

Sample   Heights   Ages   Z-heights   Z-ages   Euclidean   Manhattan   Euclidean   Manhattan
1        137.16    10     -0.45       -0.61    15.8613     22.0944     0.7574      1.0599
2        195.58    25      1.58        0.39    45.8167     51.3256     1.6325      1.9770
3        170.18    55      0.70        2.39    41.1033     55.9256     2.4915      3.0903
4        172.73    32      0.79        0.86    26.0054     35.4756     1.1654      1.6466
5        116.84     8     -1.16       -0.74    35.1080     44.4144     1.3774      1.9018
6        162.56    11      0.43       -0.54    14.8312     20.5278     0.6926      0.9735
7        157.48     9      0.26       -0.67    12.4924     17.4478     0.7207      0.9296
8        142.24    15     -0.28       -0.27     8.9086     12.0144     0.3886      0.5496
9         96.52     7     -1.87       -0.81    54.9740     65.7344     2.0368      2.6771

Means    150.14    19.11   0.0000      0.0000   2           1           2           1
Std Dev   28.67    15.01   1.0000      1.0000

(The last four columns are each sample's distance to the mean, computed on the raw data and on the z-scores; the 2 1 2 1 entries in the Means row appear to indicate the order q of each distance column, Euclidean q = 2 and Manhattan q = 1.)


Dissimilarities

There are other ways to obtain dissimilarities, so we no longer speak of distances.

Basically, dissimilarities are nonnegative numbers d(i, j) that are small (close to 0) when i and j are similar.


Pearson

Pearson product-moment correlation between variables f and g.

Coefficients lie between −1 and +1.

$R(f, g) = \frac{\sum_{i=1}^{n} (x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n} (x_{if} - m_f)^2} \; \sqrt{\sum_{i=1}^{n} (x_{ig} - m_g)^2}}$

where $m_f$ and $m_g$ are the means of variables f and g.


Pearson cont.

A correlation of +1 means that there is a perfect positive linear relationship between the variables.

A correlation of −1 means that there is a perfect negative linear relationship between the variables.

A correlation of 0 means there is no linear relationship between the two variables.


Pearson example

r_yz = 0.9861; r_yw = −0.9551; r_yr = 0.2770


Correlation and dissimilarities 1

d(f, g) = (1 − R(f, g)) / 2   (1)

Variables with a high positive correlation (+1) receive a dissimilarity close to 0.

Variables with a strongly negative correlation will be considered very dissimilar.


Correlation and dissimilarities 2

d(f, g) = 1 − |R(f, g)|   (2)

Variables with a high positive correlation (+1) and variables with a strongly negative correlation both receive a dissimilarity close to 0.


Numerical Example

Name Weight Height Month Year

Ilan 15 95 1 82

Jack 49 156 5 55

Kim 13 95 11 81

Lieve 45 160 7 56

Leon 85 178 6 48

Peter 66 176 6 56

Talia 12 90 12 83

Tina 10 78 1 84




Numerical Example 1

                   Weight   Height   Month   Year
Correlation
  Weight           1
  Height           0.957    1
  Month           -0.036    0.021    1
  Year            -0.953   -0.985    0.013   1
Dissimilarity (1)
  Weight           0
  Height           0.021    0
  Month            0.518    0.489    0
  Year             0.977    0.992    0.493   0
Dissimilarity (2)
  Weight           0
  Height           0.043    0
  Month            0.964    0.979    0
  Year             0.047    0.015    0.987   0
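A short sketch that should reproduce these matrices (the rows are typed in from the example table; np.corrcoef computes the Pearson correlation):

import numpy as np

# Columns: Weight, Height, Month, Year (data from the numerical example above)
data = np.array([
    [15,  95,  1, 82],   # Ilan
    [49, 156,  5, 55],   # Jack
    [13,  95, 11, 81],   # Kim
    [45, 160,  7, 56],   # Lieve
    [85, 178,  6, 48],   # Leon
    [66, 176,  6, 56],   # Peter
    [12,  90, 12, 83],   # Talia
    [10,  78,  1, 84],   # Tina
], dtype=float)

R = np.corrcoef(data, rowvar=False)   # 4 x 4 correlation matrix between the variables
d1 = (1 - R) / 2                      # dissimilarity (1)
d2 = 1 - np.abs(R)                    # dissimilarity (2)
print(np.round(R, 3), np.round(d1, 3), np.round(d2, 3), sep="\n")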


Binary Variables

Binary variables have only two states.

States can be symmetric or asymmetric.

Binary variables are symmetric if both states are equally valuable. Ex. gender.

When the states are not equally important the variable is asymmetric. Ex. disease tests (1 = positive; 0 = negative).


Contingency tables

Consider objects described by p binary variables.

q variables are equal to 1 on both i and j; r variables are 1 on object i and 0 on object j; s variables are 0 on object i and 1 on object j; t variables are 0 on both.

                 Object j
                 1       0       Sum
Object i   1     q       r       q+r
           0     s       t       s+t
Sum              q+s     r+t     p


Symmetric Variables

Dissimilarity based on symmetric variables is invariant: the result should not change when the two states are interchanged.

Simple dissimilarity coefficient:

$d(x_i, x_j) = \frac{r + s}{q + r + s + t}$


Symmetric Variables

Dissimilarity: $d(x_i, x_j) = \frac{r + s}{q + r + s + t}$

Similarity: $s(x_i, x_j) = \frac{q + t}{q + r + s + t}$


Asymmetric Variables

Similarity based on asymmetric variables is not invariant.

Two ones are more important than two zeros.

Jaccard coefficient:

$d(x_i, x_j) = \frac{r + s}{q + r + s}$

$s(x_i, x_j) = \frac{q}{q + r + s}$


Computing dissimilarities

Name fever cough Test1 Test2 Test3 Test4

Jack Y N P N N N

Mary Y N P N P N

Jim Y Y N N N N


Computing Dissimilarities

          Jack   Mary   q (1,1)   r (1,0)   s (0,1)   t (0,0)
Fever     Y      Y      1         0         0         0
Cough     N      N      0         0         0         1
Test1     P      P      1         0         0         0
Test2     N      N      0         0         0         1
Test3     N      P      0         0         1         0
Test4     N      N      0         0         0         1
Totals                  2         0         1         3


Computing dissimilarities

Using the Jaccard coefficient $d(x_i, x_j) = \frac{r + s}{q + r + s}$:

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.
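A small sketch that reproduces these values; the 0/1 vectors below encode the symptom table with Y/P as 1 and N as 0:

def binary_counts(x, y):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t

def simple_matching_d(x, y):            # symmetric variables
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def jaccard_d(x, y):                    # asymmetric variables
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # fever, cough, test1..test4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_d(jack, mary), 2))  # 0.33
print(round(jaccard_d(jack, jim), 2))   # 0.67
print(round(jaccard_d(jim, mary), 2))   # 0.75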


Nominal Variables

A nominal variable is a generalisation of the binary variable.

A nominal variable can take more than two states.

Ex. marital status: married, single, divorced.

Each state can be represented by a number or a letter.

There is no specific ordering.


Computing dissimilarities

Consider two objects i and j, described by nominal variables.

Each object has p characteristics.

m is the number of matches.

$d(i, j) = \frac{p - m}{p}$


Binarising nominal variables

A nominal variable can be encoded by creating a new binary variable for each state.

Example: Marital status = {married, single, divorced}
Married: 1 = yes, 0 = no
Single: 1 = yes, 0 = no
Divorced: 1 = yes, 0 = no
Ex. Marital status = married is encoded as married = 1, single = 0, divorced = 0.
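A small sketch of the nominal dissimilarity d(i, j) = (p − m)/p together with this one-hot encoding (function names and the colour attribute are illustrative):

def nominal_d(i, j):
    """(p - m) / p: fraction of nominal attributes that do not match."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

def one_hot(value, states):
    """Binarise a nominal value: one 0/1 indicator per state."""
    return [1 if value == s else 0 for s in states]

print(one_hot("married", ("married", "single", "divorced")))   # [1, 0, 0]
print(nominal_d(("married", "blue"), ("single", "blue")))      # 0.5: one of two attributes matches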


Ordinal variables

A discrete ordinal variable is similar to a nominal variable, except that the states are ordered in a meaningful sequence.

Ex. bronze, silver and gold medals.

Ex. assistant, associate, full member.


Computing dissimilarities

Consider n objects defined by a set of ordinal variables.

f is one of these ordinal variables and has M_f states.

These states define the ranking $r_f \in \{1, \ldots, M_f\}$.


Steps to calculate dissimilarities

Assume that the value of f for the i-th object is x_if. Replace each x_if by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$.

Since the number of states of each variable differs, it is often necessary to map the range onto [0.0, 1.0] using the equation below.

Dissimilarity can then be computed using the distance measures for interval-scaled variables.

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
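A tiny sketch of this mapping, using the medal example from the earlier slide (the dictionary encoding is an assumption):

ranks = {"bronze": 1, "silver": 2, "gold": 3}   # M_f = 3 ordered states
M_f = len(ranks)

def ordinal_z(value):
    """Map a rank r_if in {1, ..., M_f} onto [0.0, 1.0]."""
    return (ranks[value] - 1) / (M_f - 1)

print(ordinal_z("bronze"), ordinal_z("silver"), ordinal_z("gold"))  # 0.0 0.5 1.0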


Ratio-scaled variables

Variables on a non-linear scale, such as an exponential scale.

To compute dissimilarities there are three methods:
• Treat them as interval-scaled. Not always good.
• Apply a transformation like y = log(x) and treat the result as interval-scaled.
• Treat them as ordinal data and use the ranks as interval-scaled values.


Variables of mixed types

One technique is to bring all variables onto a common scale in the interval [0.0, 1.0].

Suppose that the data set contains p variables of mixed type. The dissimilarity between i and j is:

$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$


Variables of mixed types

The dissimilarity between i and j is:

$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

where $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ does not exist (a missing value), or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; $\delta_{ij}^{(f)} = 1$ otherwise.


Variables of mixed types cont.

The contribution of each variable depends on its type:

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.

f is interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$

f is ordinal or ratio-scaled: compute ranks and treat them as interval-based.
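A minimal sketch of this mixed-type combination for one interval, one nominal, and one asymmetric binary attribute (the data, ranges, and function name are made up for illustration):

def mixed_dissimilarity(x, y, kinds, ranges):
    """Average d_ij over the variables whose indicator delta_ij is 1."""
    num = den = 0.0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:                  # missing value: delta = 0
            continue
        if kind == "asym_binary" and a == 0 and b == 0:
            continue                                # joint absence ignored: delta = 0
        if kind == "interval":
            d = abs(a - b) / ranges[f]              # |x_if - x_jf| / (max - min)
        else:                                       # nominal or binary
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

kinds = ["interval", "nominal", "asym_binary"]
ranges = [100.0, None, None]                        # spread (max - min) of the interval variable
print(mixed_dissimilarity([70, "blue", 1], [50, "red", 0], kinds, ranges))  # (0.2 + 1 + 1) / 3 = 0.733...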

Clustering Methods


Classification types

Clustering is an unsupervised method.


Clustering Methods

Partitioning

Hierarchical

Density-based

Grid-based

Model-based


Partitioning Methods

Given n objects, k partitions are created.

Each partition must contain at least one element.

An iterative relocation technique is used to improve the partitioning.

Distance is the usual criterion.


Partitioning Methods cont.

They work well for finding spherical-shaped clusters.

They are not efficient on very large databases.

K-means: each cluster is represented by the mean value of the objects in the cluster.

K-medoids: each cluster is represented by an object near the centre of the cluster.
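A compact sketch of the k-means relocation loop described above (the random initialisation and toy data are assumptions; in practice a library implementation would normally be used):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate assigning objects to the nearest mean and recomputing the means."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                   # nearest centre for each object
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):       # relocation has converged
            break
        centres = new_centres
    return labels, centres

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centres = kmeans(X, k=2)
print(labels, centres, sep="\n")        # two spherical groups and their means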


Hierarchical Methods

They create a hierarchical decomposition of the set.

Agglomerative approaches start with each object forming a separate group, and merge objects or groups until all objects belong to one group or a termination condition occurs.

Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.


Hierarchical Clustering cont.

Definition of cluster proximity:

Min: most similar pair (sensitive to noise).

Max: most dissimilar pair (may break large clusters).
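A brief sketch contrasting the two proximity definitions with SciPy's agglomerative routines (the library choice and the random data are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),      # one tight group
               rng.normal(3, 0.3, (10, 2))])     # another tight group

single = linkage(X, method="single")       # "min": distance of the closest pair
complete = linkage(X, method="complete")   # "max": distance of the farthest pair

print(fcluster(single, t=2, criterion="maxclust"))    # cluster labels using min proximity
print(fcluster(complete, t=2, criterion="maxclust"))  # cluster labels using max proximity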


Density-based methods

The method keeps growing a cluster as long as the density in its neighbourhood exceeds some threshold.

Able to find clusters of arbitrary shapes.


Grid-based methods

Grid methods divide the object space into a finite number of cells forming a grid-like structure.

Cells that contain more than a certain number of elements are treated as dense.

Dense cells are connected to form clusters.

Fast processing time, independent of the number of objects.

STING and CLIQUE are examples.


Model-based methods

Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.

Statistical models.

SOM networks.


Partition methods

Given a database of n objects, a partition method organises them into k clusters (k <= n).

The methods try to minimise an objective function such as distance.

Similar objects are "close" to each other.