
Unsupervised Machine Learning

(Course Slides)

URL

Master in Artificial Intelligence

Javier Béjar

(2020 Spring Semester)


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License.

To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.0/ or send a letter to:

Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.


Contents

1 Introduction to Knowledge Discovery

2 Data Preprocessing

3 Data Clustering

4 Cluster Validation

5 Clustering of Large Datasets

6 Consensus Clustering

7 Clustering Structured Data

8 Semisupervised Clustering


Preface

These are the slides for the first half of the course Unsupervised and Reinforcement Learning (URL) from the Master in Artificial Intelligence of the Barcelona Computer Science School (Facultat d'Informàtica de Barcelona), Technical University of Catalonia (UPC, BarcelonaTech).

These slides are used in class to present the topics of the course and have been prepared using the papers and book references that you can find on the course website (http://www.cs.upc.edu/~bejar/URL/URL.html).

This document is a complement so you can prepare the classes and use it as a reference, but it is not a substitute for the classes or the material from the webpage of the course.

Javier Béjar
Barcelona, 2020


1 Introduction to Knowledge Discovery


Knowledge Discovery

Javier Béjar

URL - Spring 2020

CS - MAI

Knowledge Discovery (KDD)


Knowledge Discovery in Databases (KDD)

• Practical application of the methodologies from machine learning/statistics to large amounts of data

• Problem: the impossible task of manually analyzing all the data we are systematically collecting

• Useful for automating/helping the process of analysis/discovery

• Final goal: to extract (semi)automatically actionable/useful knowledge

“We are drowning in information and starving for knowledge”

URL - Spring 2020 - MAI 1/20

Knowledge Discovery in Databases

• The high point of KDD starts around the late 1990s

• Many companies show their interest in obtaining the (possibly) valuable information stored in their databases (purchase transactions, e-commerce, web data, ...)

• The area has moved/integrated/transmuted several times to include several sometimes interchangeable terms: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data ...

• The Venn Diagram Wars

URL - Spring 2020 - MAI 2/20


Venn Wars: The funny one

Venn Wars: The job description one


Venn Wars: The scientific one

Venn Wars: The multidisciplinary one


Venn Wars: The reality check one

KDD definitions

“It is the search for valuable information in great volumes of data”

“It is the exploration and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules”

“It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

URL - Spring 2020 - MAI 8/20


Elements of KDD

Pattern: Any representation formalism capable of describing the common characteristics of data

Valid: A pattern is valid if it is able to predict the behaviour of new information with a degree of certainty

Novelty: Any knowledge that is not already known with respect to the domain knowledge and any previously discovered knowledge is novel

URL - Spring 2020 - MAI 9/20

Elements of KDD

Useful: New knowledge is useful if it allows performing actions that yield some benefit given an established criterion

Understandable: The knowledge discovered must be analyzed by an expert in the domain; consequently, the interpretability of the result is important

URL - Spring 2020 - MAI 10/20


The KDD process

KDD as a process

• The actual discovery of patterns is only one part of a more complex process

• Raw data is not always ready for processing (80/20 project effort)

• Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA)

• These methodologies address KDD as an engineering process; despite being business oriented, they are general enough to be applied to any data discovery domain

URL - Spring 2020 - MAI 11/20


The KDD process

• Steps of the Knowledge Discovery in Databases process:
  1. Domain study
  2. Creating the dataset
  3. Data preprocessing
  4. Dimensionality reduction
  5. Selection of the discovery goal
  6. Selection of the adequate methodologies
  7. Data Mining
  8. Result assessment and interpretation
  9. Using the knowledge

URL - Spring 2020 - MAI 12/20

Goals of the KDD process

There are different goals that can be pursued as the result of the discovery process, among them:

Classification We need models that allow discriminating instances that belong to a previously known set of groups (the model may or may not be interpretable)

Clustering/Partitioning/Segmentation We need to discover models that cluster the data into groups with common characteristics (a characterization of the groups is desirable)

URL - Spring 2020 - MAI 13/20


Goals of the KDD process

Regression We look for models that predict the behaviour of continuous variables as a function of others

Summarization We look for a compact description that summarizes the characteristics of the data

Causal dependence We need models that reveal the causal dependence among the variables and assess the strength of this dependence

URL - Spring 2020 - MAI 14/20

Goals of the KDD process

Structure dependence We need models that reveal patterns among the relations that describe the structure of the data

Change We need models that discover patterns in data that has temporal or spatial dependence

URL - Spring 2020 - MAI 15/20


Methodologies for KDD

• Decision trees, decision rules
  • Usually interpretable models
  • Can be used for: classification, regression, and summarization
  • Trees: C4.5, CART, QUEST; rules: RIPPER, CN2, ...

• Classifiers, regression
  • Low interpretability but good accuracy
  • Can be used for: classification and regression
  • Statistical regression, function approximation, neural networks, Support Vector Machines, k-NN, locally weighted regression, ...

URL - Spring 2020 - MAI 16/20

Methodologies for KDD

• Clustering
  • Partition datasets or discover groups
  • Can be used for: clustering, summarization
  • Statistical clustering, unsupervised machine learning, unsupervised neural networks (Self-Organizing Maps)

• Dependency models
  • Obtain models of the dependence relations (structural, causal, temporal) among attributes/instances
  • Can be used for: causal dependence discovery, temporal change, substructure discovery
  • Bayesian networks, association rules, Markov models, graph algorithms, ...

URL - Spring 2020 - MAI 17/20


Applications

Applications

• Business

• Customer segmentation, customer profiling, customer transaction data, customer churn

• Fraud detection

• Control/analysis of industrial processes

• E-commerce, On-line recommendation, Financial data...

• Web mining
  • Text mining, document search/organization

• Social networks analysis

• User behavior

URL - Spring 2020 - MAI 18/20


Applications

• Scientific applications
  • Medicine (patient data, MRI scans, ECG, EEG, ...)

• Pharmacology (drug discovery, screening, in-silico testing)

• Astronomy (astronomical body identification)

• Genetics (gene identification, DNA microarrays, bioinformatics)

• Satellite data (meteorology, astronomy, geological, ...)

• Large scientific experiments (CERN LHC, ITER)

URL - Spring 2020 - MAI 19/20

Challenges


Open problems

• Scalability (More data, more attributes)

• Overfitting (Patterns with low interest)

• Temporal data/relational data/structured data

• Methods for data cleaning (Missing data and noise)

• Pattern comprehensibility

• Use of domain knowledge

• Integration with other techniques (OLAP, Data Warehousing, Intelligent Decision Support Systems)

• Privacy

URL - Spring 2020 - MAI 20/20


2 Data Preprocessing


Data Preprocessing

Javier Béjar

URL - Spring 2020

CS - MAI

Introduction


Data representation

• Unstructured datasets:

• Examples described by a flat set of attributes: attribute-value matrix

• Structured datasets:
  • Individual examples described by attributes but with relations among them: sequences (time, spatial, ...), trees, graphs

• Sets of structured examples (sequences, graphs, trees)

URL - Spring 2020 - MAI 1/77

Unstructured data

• Only one table of observations

• Each example represents an instance of the problem

• Each instance is represented by a set of attributes (discrete, continuous)

   A     B    C   ...
   1    3.1   a   ...
   1    5.7   b   ...
   0   -2.2   b   ...
   1   -9.0   c   ...
   0    0.3   d   ...
   1    2.1   a   ...
  ...   ...  ...  ...

URL - Spring 2020 - MAI 2/77


Structured data

• One sequential relation among instances (time, strings)
  • Several instances with internal structure
  • Subsequences of unstructured instances
  • One large instance

• Several relations among instances (graphs, trees)
  • Several instances with internal structure
  • One large instance

Data Streams

• Endless sequence of data
• Several streams synchronized
• Unstructured instances
• Structured instances
• Static/Dynamic model

URL - Spring 2020 - MAI 4/77


Data representation

• Most unsupervised learning algorithms are specifically fitted for unstructured data

• The data representation is equivalent to a database table (attribute-value pairs)

• Specialized algorithms have been developed for structured data: graph clustering, sequence mining, frequent substructures

• The representation of these types of data is sometimes algorithm dependent

URL - Spring 2020 - MAI 5/77

Data Preprocessing


Data preprocessing

• Usually raw data is not directly adequate for analysis

• The usual reasons:
  • The quality of the data (noise/missing values/outliers)
  • The dimensionality of the data (too many attributes/too many examples)

• The first step of any data task is to assess the quality of the data

• The techniques used for data preprocessing are usually oriented to unstructured data

URL - Spring 2020 - MAI 6/77

Outliers

• Outliers: Examples with extreme values compared to the rest of the data

• Can be considered as examples with erroneous values

• Have an important impact on some algorithms


Outliers

• The exceptional values could appear in all or only a few attributes

• The usual way to correct this problem is to eliminate the examples

• If the exceptional values are only in a few attributes, they could be treated as missing values

URL - Spring 2020 - MAI 8/77

Parametric Outliers Detection

• Assumes a probabilistic distribution for the attributes

• Univariate
  • Perform a Z-test or Student's t-test

• Multivariate
  • Deviation method: reduction in data variance when the example is eliminated
  • Angle based: variance of the angles to other examples
  • Distance based: variation of the distance from the mean of the data in different dimensions

URL - Spring 2020 - MAI 9/77


Non parametric Outliers Detection

• Histogram based: Define a multidimensional grid and discard cells with low density

• Distance based: The distances of outliers to their k-nearest neighbors are larger

• Density based: Approximate data density using Kernel Density Estimation or heuristic measures (Local Outlier Factor, LOF)

URL - Spring 2020 - MAI 10/77

Outliers: Local Outlier Factor

• LOF quantifies the outlierness of an example adjusting for variation in data density

• Uses the distance of the k-th neighbor $D_k(x)$ of an example and the set of examples that are inside this distance, $L_k(x)$

• The reachability distance between two data points, $R_k(x, y)$, is defined as the maximum between the distance $dist(x, y)$ and y's k-th neighbor distance $D_k(y)$

URL - Spring 2020 - MAI 11/77


Outliers: Local Outlier Factor

• The average reachability distance $AR_k(x)$ with respect to an example's neighborhood $L_k(x)$ is defined as the average of the reachabilities of the example to its neighbors

• The LOF of an example is computed as the mean ratio between $AR_k(x)$ and the average reachability of its k neighbors:

$LOF_k(x) = \frac{1}{k} \sum_{y \in L_k(x)} \frac{AR_k(x)}{AR_k(y)}$

• This value ranks all the examples

URL - Spring 2020 - MAI 12/77
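As a quick illustration (not part of the original slides), scikit-learn provides an LOF implementation; the toy data and the number of neighbors below are assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # dense cloud plus one far-away point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks the examples flagged as outliers
scores = -lof.negative_outlier_factor_   # larger score = more outlying (ranks the examples)
```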

Outliers: Local Outlier Factor

[Figure: LOF illustrated on a 2D dataset (two scatter plots)]


Missing values

• Missing values appear because of errors or omissions during the gathering of the data

• They can be substituted to increase the quality of the dataset (value imputation)
  • Global constant for all the values
  • Mean or mode of the attribute (global central tendency)
  • Mean or mode of the attribute but only of the k nearest examples (local central tendency)
  • Learn a model for the data (regression, bayesian) and use it to predict the values

• Problem: changes the statistical distribution of the data

URL - Spring 2020 - MAI 14/77
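A minimal sketch (not from the slides) of global and local central-tendency imputation with scikit-learn; the toy matrix is an assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # global central tendency
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # local central tendency (1 nearest neighbor)
```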

Missing values

[Figure: dataset with missing values vs. mean substitution vs. 1-neighbor substitution]

URL - Spring 2020 - MAI 15/77


Normalization

Normalizations are applied to quantitative attributes in order to eliminate the effect of having different scale measures

• Range normalization: Transform all the values of the attribute to a preestablished scale (e.g.: [0,1], [-1,1])

$\frac{x - x_{min}}{x_{max} - x_{min}}$

• Distribution normalization: Transform the data to a specific statistical distribution with preestablished parameters (e.g.: Gaussian $\mathcal{N}(0, 1)$)

$\frac{x - \mu_x}{\sigma_x}$
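A hedged sketch (added here, not in the slides) of both normalizations using scikit-learn scalers; the example matrix is made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

X_range = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # (x - min) / (max - min)
X_gauss = StandardScaler().fit_transform(X)                    # (x - mean) / std, approx. N(0, 1)
```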

Discretization

Discretization allows transforming quantitative attributes to qualitative attributes

• Equal size bins: Pick the number of values and divide the range of the data into equal sized bins

• Equal frequency bins: Pick the number of values and divide the range of the data so each bin has the same number of examples (the size of the intervals will be different)

URL - Spring 2020 - MAI 17/77


Discretization

Discretization allows transforming quantitative attributes to qualitative attributes

• Distribution approximation: Calculate a histogram of the data and fit a kernel function (KDE); the intervals are where the function has its minima

• Other techniques: Apply entropy based measures, Minimum Description Length (MDL), clustering

URL - Spring 2020 - MAI 18/77
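A small sketch (added, not in the slides) of equal-size and equal-frequency binning using pandas:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=1000))

equal_size = pd.cut(x, bins=4)   # equal-width intervals over the range of the data
equal_freq = pd.qcut(x, q=4)     # quantile-based intervals, same number of examples per bin
```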

Discretization

[Figure: discretization of the same attribute with equal-size bins, equal-frequency bins, and histogram/KDE-based intervals]

URL - Spring 2020 - MAI 19/77


Python Notebooks

These two Python Notebooks show some examples of the effect of missing values imputation and of data discretization and normalization

• Missing Values Notebook (click here to go to the url)

• Preprocessing Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 20/77

Dimensionality Reduction


The curse of dimensionality

• Problems due to the dimensionality of data

• The computational cost of processing the data

• The quality of the data

• Elements that define the dimensionality of data
  • The number of examples
  • The number of attributes

• Usually the problem of having too many examples can be solved using sampling

URL - Spring 2020 - MAI 21/77

Reducing attributes

• The number of attributes has an impact on the performance:
  • Poor scalability
  • Inability to cope with irrelevant/noisy/redundant attributes

• Methodologies to reduce the number of attributes:
  • Dimensionality reduction: transforming to a space with fewer dimensions
  • Feature subset selection: eliminating non-relevant attributes

URL - Spring 2020 - MAI 22/77


Dimensionality reduction

• New dataset that preserves most of the information of the original data but with fewer attributes

• Many techniques have been developed for this purpose
  • Projection to a space that preserves the statistical distribution of the data (PCA, ICA)
  • Projection to a space that preserves distances among the data (multidimensional scaling, random projection, nonlinear scaling)

URL - Spring 2020 - MAI 23/77

Principal Component Analysis

• Principal Component Analysis:

• Data is projected onto a set of orthogonal dimensions (components) that are a linear combination of the original attributes

• The components are uncorrelated and are ordered by the information they have

• We assume the data follows a Gaussian distribution

• Global variance is preserved

URL - Spring 2020 - MAI 24/77


Principal Component Analysis

Computes a projection matrix where the dimensions are orthogonal (linearly independent) and data variance is preserved

[Figure: 2D data in the X-Y plane projected onto two orthogonal directions, w1·Y + w2·X and w3·Y + w4·X]

URL - Spring 2020 - MAI 25/77

Principal Component Analysis

• Principal components: vectors that are the best linear approximation of the data

$f(\lambda) = \mu + V_q \lambda$

$\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix of $q$ orthogonal unit vectors and $\lambda$ is a $q$-vector of parameters

• The reconstruction error for the data is minimized:

$\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \| x_i - \mu - V_q \lambda_i \|^2$

URL - Spring 2020 - MAI 26/77


Principal Component Analysis - Computation

• Optimizing partially for $\mu$ and $\lambda_i$:

$\mu = \bar{x}$

$\lambda_i = V_q^T (x_i - \bar{x})$

• We can obtain the matrix $V_q$ by minimizing:

$\min_{V_q} \sum_{i=1}^{N} \| (x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x}) \|_2^2$

• Assuming $\bar{x} = 0$, we can obtain the projection matrix $H_q = V_q V_q^T$ by Singular Value Decomposition of the data matrix $X$:

$X = U D V^T$

URL - Spring 2020 - MAI 27/77

Principal Component Analysis - Computation

• $U$ is an $N \times p$ orthogonal matrix; its columns are the left singular vectors

• $D$ is a $p \times p$ diagonal matrix with ordered diagonal values called the singular values ($V$ is a $p \times p$ orthogonal matrix whose columns are the right singular vectors)

• The columns of $UD$ are the principal components

• The solution to the minimization problem is the first $q$ principal components

• The singular values are proportional to the reconstruction error

URL - Spring 2020 - MAI 28/77
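A compact sketch (added, not from the slides) of PCA through the SVD of the centered data matrix, using a random toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # toy data matrix (N x p)

Xc = X - X.mean(axis=0)                              # center the data (estimate of mu)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U D V^T

q = 2
components = Vt[:q]                                  # first q right singular vectors (V_q^T)
scores = U[:, :q] * d[:q]                            # columns of UD: the principal components
explained = d[:q]**2 / (d**2).sum()                  # variance ratio, from the singular values
```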


Principal Component Analysis - Intuition

[Figure: the original data in the X-Y plane]

[Figure: the first component lies along the direction of maximum variance of the data]

[Figure: the next component lies along the direction of maximum variance perpendicular to the other components]

Kernel PCA

• PCA is a linear transformation; this means that if the data is linearly separable, the reduced dataset will be linearly separable (given enough components)

• We can use the kernel trick to map the original attributes to a space where non linearly separable data is linearly separable

• Distances among examples are defined as a dot product that can be obtained using a kernel:

$d(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j)$

URL - Spring 2020 - MAI 32/77


Kernel PCA

• Different kernels can be used to perform the transformation to the feature space (polynomial, Gaussian, ...)

• The computation of the components is equivalent to PCA, but performing the eigendecomposition of the covariance matrix computed for the transformed examples

$C = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j) \Phi(x_j)^T$

• The components are linear combinations of features in the feature space

URL - Spring 2020 - MAI 33/77

Kernel PCA

• Pro: Helps to discover patterns that are non linearly separable in the original space

• Con: Does not give a weight/importance for the new components

URL - Spring 2020 - MAI 34/77
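An illustrative sketch (not in the slides) comparing linear PCA with RBF-kernel PCA on two concentric circles; the dataset and kernel parameters are assumptions:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                                 # circles remain nested
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)   # becomes linearly separable
```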


Kernel PCA

Sparse PCA

• PCA transforms data to a space of the same dimensionality (all eigenvalues are non zero)

• An alternative is to solve the minimization problem posed by the reconstruction error using regularization

• A penalization term is added to the objective function, proportional to the norm of the eigenvalues matrix

$\min_{U,V} \|X - UV\|_2^2 + \alpha \|V\|_1$

• The $\ell_1$-norm regularization will encourage sparse solutions (zero eigenvalues)

URL - Spring 2020 - MAI 36/77


Multidimensional Scaling

A transformation matrix [M×N] transforms a dataset from M dimensions to N dimensions preserving pairwise data distances

URL - Spring 2020 - MAI 37/77

Multidimensional Scaling

• Multidimensional Scaling: Projects the data to a space with fewer dimensions, preserving the pairwise distances among the data

• A projection matrix is obtained by optimizing a function of the pairwise distances (stress function)

• The actual attributes are not used in the transformation

• Different objective functions can be used (least squares, Sammon mapping, classical scaling, ...)

URL - Spring 2020 - MAI 38/77


Multidimensional Scaling

• Least Squares Multidimensional Scaling (MDS)

• The distortion is defined as the squared distance between the original distance matrix and the distance matrix of the new data

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (d_{ii'} - \|z_i - z_{i'}\|_2)^2$

• The problem is defined as:

$\arg\min_{z_1, z_2, \ldots, z_n} S_D(z_1, z_2, \ldots, z_n)$

URL - Spring 2020 - MAI 39/77

Multidimensional Scaling

• Several optimization strategies can be used

• If the distance matrix is Euclidean it can be solved using eigendecomposition, just like PCA

• In other cases gradient descent can be used, using the derivative of $S_D(z_1, z_2, \ldots, z_n)$ and a step size $\alpha$ in the following fashion:

  1. Begin with a guess for $Z$
  2. Repeat until convergence: $Z^{(k+1)} = Z^{(k)} - \alpha \nabla S_D(Z)$

URL - Spring 2020 - MAI 40/77
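A hedged scikit-learn sketch (added here): metric MDS run directly on a precomputed distance matrix, with the iris data as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data
D = pairwise_distances(X)          # only the pairwise distances are used, not the attributes

Z = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
```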


Multidimensional Scaling - Other functions

• Sammon mapping (emphasis on smaller distances)

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} \frac{(d_{ii'} - \|z_i - z_{i'}\|)^2}{d_{ii'}}$

• Classical scaling (similarity instead of distance)

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z} \rangle)^2$

• Non metric MDS (assumes a ranking among the distances, non Euclidean space)

$S_D(z_1, z_2, \ldots, z_n) = \frac{\sum_{i,i'} [\theta(\|z_i - z_{i'}\|) - d_{ii'}]^2}{\sum_{i,i'} d_{i,i'}^2}$

Random Projection

• A random transformation matrix is generated:
  • Rectangular matrix $N \times d$
  • Columns must have unit length
  • Elements are generated from a Gaussian distribution

• A matrix generated this way is almost orthogonal

• The projection will preserve the relative distances among pairs of examples

• The Johnson-Lindenstrauss lemma allows picking a number of dimensions to obtain the desired approximation

URL - Spring 2020 - MAI 42/77
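A short sketch (not in the slides) using scikit-learn; the data size and the distortion level eps are arbitrary choices:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

X = np.random.default_rng(0).normal(size=(500, 10000))        # high-dimensional toy data

d = johnson_lindenstrauss_min_dim(n_samples=500, eps=0.2)     # dimensions needed for ~20% distortion
X_proj = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X)
```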


Nonnegative Matrix Factorization (NMF)

• This formulation assumes that the data is a sum of unknown positive latent variables

• NMF performs an approximation of a matrix as the product of two matrices

$V = W \times H$

• The main difference with PCA is that the values of the matrices are constrained to be positive

• The positiveness assumption helps to interpret the result
  • E.g.: in text mining, a document is an aggregation of topics

URL - Spring 2020 - MAI 43/77
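A minimal sketch (added) of the factorization with scikit-learn on a synthetic non-negative matrix; the sizes and number of components are assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.default_rng(0).normal(size=(100, 20)))   # non-negative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(V)        # V is approximated by W x H, with W, H >= 0
H = model.components_
```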

Nonlinear scaling

• The previous methods perform a linear transformation between the original space and the final space

• For some datasets this kind of transformation is not enough to maintain the information of the original data

• Nonlinear transformation methods:
  • ISOMAP

• Local Linear Embedding

• Local MDS

• t-SNE

URL - Spring 2020 - MAI 44/77


ISOMAP

• Assumes a low dimensional dataset embedded in a larger number of dimensions

• The geodesic distance is used instead of the Euclidean distance

• The relation of an instance with its immediate neighbors is more representative of the structure of the data

• The transformation generates a new space that preserves neighborhood relationships

URL - Spring 2020 - MAI 45/77

ISOMAP

URL - Spring 2020 - MAI 46/77


ISOMAP

[Figure: Euclidean vs. geodesic distance between two points a and b]

URL - Spring 2020 - MAI 47/77

ISOMAP - Algorithm

1. For each data point find its k closest neighbors (points at minimal Euclidean distance)

2. Build a graph where each point has an edge to its closest neighbors

3. Approximate the geodesic distance for each pair of points by the shortest path in the graph

4. Apply an MDS algorithm to the distance matrix of the graph

URL - Spring 2020 - MAI 48/77
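As a sketch (not part of the slides), scikit-learn's Isomap follows these same steps; the swiss-roll dataset and parameters are illustrative choices:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# k-NN graph + shortest-path geodesic distances + MDS, in one estimator
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```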


ISOMAP - Example

[Figure: points a and b in the original space and in the transformed space]

URL - Spring 2020 - MAI 49/77

Local Linear Embedding

• Performs a transformation that preserves local structure

• Assumes that each instance can be reconstructed by a linear combination of its neighbors (weights)

• From these weights, a new set of data points that preserves the reconstruction is computed for a lower dimensional space

• Different variants of the algorithm exist

URL - Spring 2020 - MAI 50/77


Local Linear Embedding

URL - Spring 2020 - MAI 51/77

Local Linear Embedding - Algorithm

1. For each data point find the $K$ nearest neighbors in the original space of dimension $p$ ($N(i)$)

2. Approximate each point by a mixture of the neighbors:

$\min_{w_{ik}} \|x_i - \sum_{k \in N(i)} w_{ik} x_k\|^2$

with $\sum_{k \in N(i)} w_{ik} = 1$ and $K < p$

3. Find points $y_i$ in a space of dimension $d < p$ that minimize:

$\sum_{i=1}^{N} \|y_i - \sum_{k \in N(i)} w_{ik} y_k\|^2$

URL - Spring 2020 - MAI 52/77
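A hedged scikit-learn sketch (added here); the dataset and neighborhood size are assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)   # low-dimensional points that preserve the local reconstruction weights
```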


Local MDS

• Performs a transformation that preserves the locality of closer points and puts farther away the non-neighbor points

• Given a set of pairs of points $\mathcal{N}$ where a pair $(i, i')$ belongs to the set if $i$ is among the $K$ neighbors of $i'$ or vice versa

• Minimize the function:

$S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\|$

• The parameter $\tau$ controls how much the non-neighbors are scattered

URL - Spring 2020 - MAI 53/77

t-SNE

• t-distributed Stochastic Neighbor Embedding (t-SNE)

• Used as a visualization tool

• Assumes distances define a probability distribution

• Obtains a low dimensional space with the closest distribution

• Tricky to use (see this link)

URL - Spring 2020 - MAI 54/77


t-SNE

• Distances from each example to the rest are scaled to sum one (so they form a probability distribution)

• We want to project the data so that we preserve this probability distribution in a lower dimensionality space

• Examples are distributed in the new space and their distance distributions are computed

• Examples are iteratively moved to minimize the Kullback-Leibler divergence between the distributions of the neighbour distances in the original and in the projected space

URL - Spring 2020 - MAI 55/77

t-SNE

Each example has a distance probability distribution


t-SNE

A similarity distribution is obtained

t-SNE

We distribute the data in a lower dimensionality space

URL - Spring 2020 - MAI 58/77


t-SNE

Data is moved so the similarity distributions get closer

URL - Spring 2020 - MAI 59/77
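A short illustrative sketch (not in the slides) with scikit-learn; the digits dataset and perplexity value are assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data

# 2D embedding that tries to match the neighbour-distance distributions of the original space
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
```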

Application: Wheel chair control characterization

• Wheelchair with shared control (patient/computer)

• Recorded trajectories of several patients in different situations
  • Angle/distance to the goal, angle/distance to the nearest obstacle from around the chair (210 degrees)

• Characterization about how the computer helps the patients with different handicaps

• Is there any structure in the trajectory data?

URL - Spring 2020 - MAI 60/77


Application: Wheel chair control characterization

URL - Spring 2020 - MAI 61/77

Application: Wheel chair control characterization

URL - Spring 2020 - MAI 62/77


Application: Wheel chair control (PCA)

URL - Spring 2020 - MAI 63/77

Application: Wheel chair control (SparsePCA)

URL - Spring 2020 - MAI 64/77


Application: Wheel chair control (MDS)

URL - Spring 2020 - MAI 65/77

Application: Wheel chair control (ISOMAPKn=3)

URL - Spring 2020 - MAI 66/77


Application: Wheel chair control (ISOMAPKn=10)

URL - Spring 2020 - MAI 67/77

Unsupervised Attribute Selection

• To eliminate from the dataset all the redundant or irrelevant attributes

• The original attributes are preserved

• Less developed than in Supervised Attribute Selection
  • Problem: an attribute can be relevant or not depending on the goal of the discovery process

• There are mainly two techniques for attribute selection: wrapping and filtering

URL - Spring 2020 - MAI 68/77


Attribute selection - Wrappers

• A model evaluates the relevance of subsets of attributes

• In supervised learning this is easy; in unsupervised learning it is very difficult

• Results depend on the chosen model and on how well this model captures the actual structure of the data

URL - Spring 2020 - MAI 69/77

Attribute selection - Wrapper Methods

• Clustering algorithms that compute weights for the attributes based on probability distributions

• Clustering algorithms with an objective function that penalizes the size of the model

• Consensus clustering

URL - Spring 2020 - MAI 70/77


Attribute selection - Filters

• A measure evaluates the relevance of each attribute individually

• These kinds of measures are difficult to obtain for unsupervised tasks

• The idea is to obtain a measure that evaluates the capacity of each attribute to reveal the structure of the data (e.g.: class separability, similarity of instances in the same class)

URL - Spring 2020 - MAI 71/77

Attribute selection - Filter Methods

• Measures of properties of the spatial structure of the data (entropy, PCA, Laplacian matrix)

• Measures of the relevance of the attributes with respect to the inherent structure of the data

• Measures of attribute correlation

URL - Spring 2020 - MAI 72/77


Laplacian Score

• The Laplacian Score is a filter method that ranks the features with respect to their ability to preserve the natural structure of the data

• This method uses the spectral matrix of the graph computed from the near neighbors of the examples

URL - Spring 2020 - MAI 73/77

Laplacian Score

• The similarity matrix is usually computed using a Gaussian kernel (edges not present have a value of 0)

$S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\sigma}}$

• The degree matrix $D$ is a diagonal matrix where the elements are the sums of the rows of $S$

• The Laplacian matrix is computed as

$L = D - S$

URL - Spring 2020 - MAI 74/77


Laplacian Score

• The score first computes, for each attribute $r$ with values $f_r$, the transformation $\tilde{f}_r$ as:

$\tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}$

• and then the score $L_r$ is computed as:

$L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}$

• This gives a ranking for the relevance of the attributes

URL - Spring 2020 - MAI 75/77
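A rough numpy sketch of the score (added here, not from the slides); the neighborhood size, kernel width and symmetrization choice are assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, sigma=1.0):
    # k-nearest-neighbor graph with Gaussian (heat kernel) weights
    W = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False).toarray()
    S = np.where(W > 0, np.exp(-W**2 / sigma), 0.0)
    S = np.maximum(S, S.T)                      # symmetrize the graph
    D = np.diag(S.sum(axis=1))                  # degree matrix
    L = D - S                                   # graph Laplacian
    ones = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        fr = X[:, r]
        fr_t = fr - (fr @ D @ ones) / (ones @ D @ ones) * ones   # remove the weighted mean
        scores.append((fr_t @ L @ fr_t) / (fr_t @ D @ fr_t))
    return np.array(scores)                     # ranks the attributes (lower = more relevant)
```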

Python Notebooks

These two Python Notebooks show some examples of dimensionality reduction and feature selection

• Dimensionality reduction and feature selection Notebook (click here to go to the url)

• Linear and non linear dimensionality reduction Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 76/77


Python Code

• In the code from the repository, inside the subdirectory DimReduction, you have the Authors python script

• The code uses the datasets in the directory Data/authors
  • Auth1 has fragments of books that are novels or philosophy works
  • Auth2 has fragments of books written in English and books translated to English

• The code transforms the text to attribute vectors and applies different dimensionality reduction algorithms

• Modifying the code you can process one of the datasets and choose how the text is transformed into vectors

URL - Spring 2020 - MAI 77/77


3 Data Clustering


Unsupervised Learning

Javier Béjar

URL - Spring 2019

CS - MAI

Unsupervised Learning


Unsupervised Learning

• Learning can be done in a supervised or unsupervised way

• There is a strong bias in the machine learning community towards supervised learning

• But a lot of concepts are learned unsupervisedly

• The discovery of new concepts is always unsupervised

URL - Spring 2019 - MAI 1/94

Unsupervised Learning

• We assume that data is embedded in an N-dimensional space with a similarity/dissimilarity function

• Similarity defines how examples are related to each other

• Bias:
  • Examples are more related to the nearest examples than to the farthest
  • Patterns are compact groups that are maximally separated from each other

• Areas: statistics, machine learning, graph theory, fuzzy theory, physics

URL - Spring 2019 - MAI 2/94


Unsupervised Learning

• Discovery goals:
  • Summarization: to obtain representations that describe an unlabeled dataset
  • Understanding: to discover the concepts inside the data

• Difficult tasks because discovery is biased by context

• Different answers could be valid depending on the discovery goal or the domain

• There are few criteria to validate the results

• Representation of the clusters: unstructured (partitions) or relational (hierarchies)

Unsupervised Learning Algorithms - Strategies

Hierarchical algorithms

• Examples are organized as a binary tree

• Based on the relationships among examples defined by similarity/dissimilarity functions

• No explicit division into groups; it has to be chosen a posteriori

Partitional algorithms

• Only a partition of the dataset is obtained

• Based on the optimization of a criterion (assumptions about the characteristics of the cluster model)

URL - Spring 2019 - MAI 4/94


Hierarchical Algorithms

Hierarchical algorithms

• Based on graph theory

• The examples form a fully connected graph

• Similarity defines the length of the edges

• Clustering is decided using a connectivity criterion

• Based on matrix algebra

• A distance matrix is calculated from the examples

• Clustering is computed using the distance matrix

• The distance matrix is updated after each iteration (different updating criteria)

URL - Spring 2019 - MAI 5/94


Hierarchical algorithms

• Graphs
  • Single Linkage, Complete Linkage, MST
  • Divisive, Agglomerative

• Matrices
  • Johnson algorithm
  • Different update criteria (S-L, C-L, centroid, minimum variance)

Computational cost: from O(n_inst³ × n_dims) to O(n_inst² × n_dims)

URL - Spring 2019 - MAI 6/94

Agglomerative Graph Algorithm

Algorithm: Agglomerative graph algorithm

Compute the distance/similarity matrix
repeat
  Find the pair of examples with the smallest distance (highest similarity)
  Add an edge to the graph corresponding to this pair
  if the agglomeration criterion holds then
    Merge the clusters the pair belongs to
until only one cluster exists

• Single linkage = new edge is between two disconnected graphs
• Complete linkage = new edge creates a clique with all the nodes of both subgraphs

URL - Spring 2019 - MAI 7/94


Hierarchical algorithms - Graphs

      2   3   4   5
  1   6   8   2   7
  2       1   5   3
  3          10   9
  4               4

[Figure: the graphs and dendrograms produced by single link and complete link on the five examples of the distance matrix above]

URL - Spring 2019 - MAI 8/94

Agglomerative Johnson algorithm

Algorithm: Agglomerative Johnson algorithm

Compute the distance/similarity matrix
repeat
  Find the pair of groups/examples with the smallest distance (highest similarity)
  Merge the pair of groups/examples
  Delete the rows and columns corresponding to the pair
  Add a new row and column with the new distances for the new group
until the matrix has one element

• Single linkage = distance between the closest examples
• Complete linkage = distance between the farthest examples
• Average linkage = distance between group centroids

URL - Spring 2019 - MAI 9/94
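A minimal sketch (added here) of agglomerative clustering with SciPy; the toy data and the chosen linkage are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))      # toy dataset

Z = linkage(X, method="average")                       # also "single", "complete", "ward"
labels = fcluster(Z, t=3, criterion="maxclust")        # cut the tree a posteriori into 3 groups
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree with matplotlib
```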


Hierarchical algorithms - Matrices

      2   3   4   5
  1   6   8   2   7
  2       1   5   3
  3          10   9
  4               4

       2,3    4    5
  1     7     2    7
  2,3        7.5   6
  4                4

       1,4    5
  2,3  7.25   6
  1,4        5.5

       1,4,5
  2,3  6.725

URL - Spring 2019 - MAI 10/94

Hierarchical algorithms - Problems

• A partition of the data has to be decided a posteriori

• Some undesirable and strange behaviours could appear (chaining, inversions, breaking large clusters)

• Some algorithms have problems when clusters with different sizes or non-convex shapes appear in the data

• Dendrograms are not a practical representation for large amounts of data

• Computational cost is too high for large datasets
  • Time is O(n²) in the best case, O(n³) in general

URL - Spring 2019 - MAI 11/94


Hierarchical algorithms - Example

[Figure: an example dataset and the dendrograms obtained with single link, complete link, median, centroid, and Ward linkage]

Python Notebooks

This Python Notebook shows examples of using different hierarchical clustering algorithms

• Hierarchical Clustering Algorithms Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2019 - MAI 13/94


Other hierarchical algorithms

• Learning has an incremental nature (experience is acquired from continuous observation, not at once)

• Concepts are learned at the same time as their relationships (polythetic hierarchies of concepts)

• Learning is a search in the space of hierarchies

• An objective function measures the utility of the structure

• The updating of the structure is performed by a set of conceptual operators

• The result depends on the order of the examples

URL - Spring 2019 - MAI 14/94

Concept Formation - COBWEB

JH Gennari, P Langley, D Fisher, Models of incremental concept formation, Artificial Intelligence, 1989

• Based on ideas from cognitive psychology
• Learning is incremental
• Concepts are organized in a hierarchy
• Concepts are organized around a prototype and described probabilistically
• Hierarchical concept representation is modified via cognitive operators
• Builds a hierarchy top/down
• Four conceptual operators
• Heuristic measure to find the basic level (Category utility)

URL - Spring 2019 - MAI 15/94


Probabilistic hierarchy

[Figure: probabilistic concept hierarchy; each node stores P(C) and the conditional probabilities P(V|C) for the attributes Color (black, white) and Shape (triangle, square, circle)]

COBWEB - Category utility (CU)

• Category utility balances:
  • Intra-class similarity: $P(A_i = V_{ij} | C_k)$
  • Inter-class similarity: $P(C_k | A_i = V_{ij})$

• It measures the difference between a partition of the data and no partition at all

• For qualitative attributes and $K$ categories $\{C_1, \ldots, C_K\}$ it is defined as:

$CU = \frac{\sum_{k=1}^{K} P(C_k) \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i = V_{ij} | C_k)^2 - \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i = V_{ij})^2}{K}$

(see the full derivation in the paper)

URL - Spring 2019 - MAI 17/94


Operators

• Incorporate: put the example inside an existing class
• New class: create a new class at this level
• Merge: two concepts are merged and the example is incorporated inside the new class
• Divide: a concept is substituted by its children

[Figure: the merge and split operators applied between two consecutive levels of the hierarchy]

URL - Spring 2019 - MAI 18/94

COBWEB Algorithm

Procedure: Depth-first limited search COBWEB (x: example, H: hierarchy)

Update the father with the new example
if we are in a leaf then
  Create a new level with this example
else
  Compute CU of incorporating the example into each class
  Save the two best CU
  Compute CU of merging the best two classes
  Compute CU of splitting the best class
  Compute CU of creating a new class with the example
  Recursive call with the best choice


Partitional algorithms

Partitional algorithms

Finding the optimal partition of N objects into K groups is NP-hard, so we need approximate algorithms

• Model/prototype based algorithms (K-means, Gaussian Mixture Models, Fuzzy K-means, Leader algorithm, ...)

• Density based algorithms (DBSCAN, DENCLUE, ...)

• Grid based algorithms (STING, CLIQUE, ...)

• Graph theory based algorithms (Spectral Clustering, ...)

• Other approaches:
  • Affinity Clustering
  • Unsupervised Neural networks
  • SVM clustering

URL - Spring 2019 - MAI 20/94


Model/Prototype Clustering

K-means

• Our model is a set of k hyperspherical clusters

• An iterative algorithm assigns each example to one of K groups (K is a parameter)

• Optimization criterion: minimize the distance of each example to the centroid of its cluster (squared error)

$Distortion = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2$

• Optimization by a hill climbing/gradient descent search algorithm

• The algorithm converges to a local minimum

URL - Spring 2019 - MAI 21/94


K-means

Algorithm: K-means (X: examples, k: integer)

Generate k initial prototypes (e.g. the first k examples)
Assign each example to its nearest prototype
SumD = sum of squared distances examples-prototypes
repeat
  Recalculate prototypes
  Reassign each example to its nearest prototype
  SumI = SumD
  SumD = sum of squared distances examples-prototypes
until SumI - SumD < ε

URL - Spring 2019 - MAI 22/94
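A compact numpy sketch of the pseudocode above (added here; it uses random prototypes rather than the first k examples). scikit-learn's KMeans implements the same scheme with better initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, eps=1e-6, rng=np.random.default_rng(0)):
    prototypes = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = np.inf
    while True:
        d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)   # squared distances
        labels = d2.argmin(axis=1)                                     # nearest prototype
        sumd = d2[np.arange(len(X)), labels].sum()                     # distortion
        if prev - sumd < eps:
            return prototypes, labels
        prev = sumd
        for j in range(k):                                             # recalculate prototypes
            if np.any(labels == j):
                prototypes[j] = X[labels == j].mean(axis=0)
```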

K-means

[Figure: K-means iterations on a 2D dataset; the labels 1 and 2 show how the cluster assignments of the points change as the prototypes are updated]


K-means - Practical problems

• The algorithm is sensitive to initialization (running it from several random initializations is a common practice)

• Sensitive to clusters with different sizes/densities and to outliers

• Finding the value of k is not a trivial problem

• No guarantee about the quality of the solution

• A solution is found even when no hyperspherical clusters exist

• The spatial complexity makes it not suitable for large datasets

URL - Spring 2019 - MAI 24/94

K-means++ - Initialization Strategies

• K-means++ modifies the initialization strategy

• It tries to maximize distance among initial centers

• Algorithm:

1. Choose one center uniformly from among all the data

2. For each data point x, compute d(x, c), the distance between x and the nearest center already chosen

3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²

4. Repeat Steps 2 and 3 until k centers have been chosen

5. Proceed with the standard K-means algorithm
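A small numpy sketch of the seeding steps 1-4 (added here; scikit-learn's KMeans already uses init="k-means++" by default):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                     # step 1: first center chosen uniformly
    for _ in range(k - 1):
        # step 2: squared distance from each point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # step 3: sample the next center with probability proportional to d(x, c)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)                                # step 5: pass these to standard K-means
```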


Bisecting K-means

• Bisecting K-means iteratively splits one of the current clusters into two until obtaining the desired number of clusters

• Pros:
  • Reduces the effect of initialization
  • A hierarchy is obtained
  • It can be used to determine K

• Con: different criteria could be used to decide which cluster to split (the largest, the one with the largest variance, ...)


Bisecting K-means

• Algorithm:

1. Choose a number of partitions

2. Apply K-means to the dataset with k=2

3. Evaluate the quality of the current partition

4. Pick the cluster to be split using a quality criterion

5. Apply K-means to the cluster with k=2

6. If the number of clusters is less than desired repeat from step 3


Global K-means

• Minimizes initialization dependence by exploring all the clusterings that can be generated using the examples as initialization points

• To generate a partition with K clusters it explores all the alternative partitions from 1 to K clusters

• Pro: Reduces the initialization problem / obtains all the partitions from 2 to K

• Con: Computational cost (runs K-means K × N times)


Global K-means

• Algorithm:

  • Compute the centroid of the partition with 1 cluster
  • For C from 2 to K:
    • For each example e, run K-means initialized with the C − 1 centroids from the previous iteration plus e as the C-th centroid
    • Keep the clustering with the best objective function as the C-clusters solution


Other K-means variants

• Kernel K-means:

  • Distances are computed using a kernel
  • Pro: Clusters that are not linearly separable (non convex) can be discovered
  • Con: Centroids are in the feature space, with no interpretation in the original space (image problem)

• Fast K-means:

  • Uses the triangular inequality to reduce the number of distance computations when assigning examples

• K-Harmonic means:

  • Uses the harmonic mean of the squared distances instead of the distortion as objective function
  • Pro: Less sensitive to initialization

K-medoids

• K-means assumes a centroid can be computed

• In some problems a centroid makes no sense (nominal attributes, structured data)

• One or more examples of each cluster are maintained as representatives of the cluster (medoids)

• The distance from each example to the medoid of its cluster is used as optimization criterion

• Pro: It is not sensitive to outliers

• Con: For one representative the cost per iteration is O(n²), for more it is NP-hard


K-medoids - PAM

Partitioning Around Medoids (PAM):

1. Randomly select k of the n data points as the medoids

2. Associate each data point to the closest medoid

3. For each medoid m:
   • For each non-medoid o: swap m and o and compute the cost

4. Keep the best solution

5. If medoids change, repeat from step 2


Incremental algorithms: Leader Algorithm

• The previous algorithms need all the data from the beginning

• An incremental strategy is needed when data comes as a stream (Leader Algorithm):

  • A distance/similarity threshold (D) determines the extent of a cluster

  • Inside the threshold: incremental updating of the model (prototype)

• Outside the threshold: A new cluster is created

• The threshold D determines the granularity of the clusters

• The clusters are dependent on the order of the examples


Leader Algorithm

Algorithm: Leader Algorithm (X: Examples, D:double)

Generate a prototype with the first example
while there are examples do
    e = current example
    d = distance of e to the nearest prototype
    if d ≤ D then
        Add the example to the cluster
        Recompute the prototype
    else
        Create a new cluster with this example
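A one-pass sketch of this algorithm (assumed implementation; prototypes are updated as running means):

import numpy as np

def leader_clustering(stream, D):
    """Leader algorithm: one pass over the stream, D is the distance threshold."""
    prototypes, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if prototypes:
            dists = np.linalg.norm(np.array(prototypes) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= D:
                counts[j] += 1
                # incremental update of the prototype (running mean)
                prototypes[j] += (x - prototypes[j]) / counts[j]
                continue
        # first example, or too far from every prototype: create a new cluster
        prototypes.append(x)
        counts.append(1)
    return prototypes, counts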


Leader Algorithm

[Figure: Leader algorithm example — the points labelled by the cluster (1, 2 or 3) they are assigned to as the data stream is processed]

Mixture Decomposition - EM algorithm

• We assume that data are drawn from a mixture of probability distributions (usually Gaussian)

• Search the space of parameters of the distributions to obtain the mixture that best explains the data (parameter estimation)

• The model of the data is:

$$P(x|\theta) = \sum_{k=1}^{K} w_k P(x|\theta_k)$$

with K the number of clusters and $\sum_{k=1}^{K} w_k = 1$

• Each example has a probability of belonging to each cluster


Mixture Decomposition - EM algorithm

• The goal is to estimate the parameters of the distribution that describes each class (e.g. µ and σ)

• The algorithm maximizes the likelihood of the distribution with respect to the data

• It performs two steps iteratively:

  • Expectation: We calculate a function that assigns a degree of membership of each instance to each of the K probability distributions

  • Maximization: We re-estimate the parameters of the distributions to maximize the memberships


EM Algorithm (K Gaussian)

• For the Gaussian case:

$$P(x|\vec{\mu}, \Sigma) = \sum_{k=1}^{K} w_k P(x|\vec{\mu}_k, \Sigma_k)$$

where $\vec{\mu}$ are the vectors of means and $\Sigma$ the covariance matrices


EM Algorithm (K Gaussian)

The computations depend on the assumptions that we make about the attributes of the data (independent or not, same σ, ...)

• Attributes are independent: µi and σi have to be computed for each cluster (O(k) parameters) (model: hyperspheres or ellipsoids parallel to the coordinate axes)

• Attributes are not independent: µi, σi and σij have to be computed for each cluster (O(k²) parameters) (model: hyperellipsoids not parallel to the coordinate axes)


EM Algorithm (K Gaussian)

• For the case of A independent attributes:

$$P(x|\vec{\mu}_k, \Sigma_k) = \prod_{j=1}^{A} P(x|\mu_{kj}, \sigma_{kj})$$

• The model to fit is:

$$P(x|\vec{\mu}, \vec{\sigma}) = \sum_{k=1}^{K} w_k \prod_{j=1}^{A} P(x|\mu_{kj}, \sigma_{kj})$$


EM Algorithm (K Gaussian)

• K initial distributions N(µk, σk) are generated, where µk and σk correspond to the mean and the variance of the attributes

• Repeat until convergence (no log likelihood improvement):

1. Expectation: Compute the membership of each example to each mixture component

   • Each instance will have a weight (γik) depending on the components computed in the previous iteration

2. Maximization: Recompute the parameters using the weights from the previous step to obtain the new µk, σk and wk for each distribution


EM Algorithm (K Gaussian) - Expectation

• The expectation step computes the weights for each example and component

$$\gamma_{ik} = w_k P(x_i|\mu_k, \sigma_k)$$

(normalized over the K components so that $\sum_k \gamma_{ik} = 1$)

• This represents the probability that the example xi is generated by component Ck


EM Algorithm (K Gaussian) - Maximization

• The maximization step recomputes µ, σ and w for each component proportionally to the weights computed in the expectation step:

$$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}} \qquad \sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)^2}{\sum_{i=1}^{N} \gamma_{ik}} \qquad w_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}$$


EM Gaussian Mixtures - Example

Initial Assignment


EM Gaussian Mixtures - Example

Expectation + Maximization = new parameters

[Figure: updated means (m1, m2), standard deviations (s1, s2) and weights after an Expectation + Maximization step]

EM Gaussian Mixtures - Example

Expectation + Maximization = new parameters

[Figure: means, standard deviations and weights after a further EM iteration]

EM algorithm - Comments

• K-means is a particular case of this algorithm (hard partition)

• The main advantage is that we obtain the membership as a probability (soft assignments)

• Using different probability distributions we can find different kinds of structures in the data

• For each probability model we use, we need to derive the calculations for the iterative updating of its parameters
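As an illustration, a Gaussian mixture can be fitted with scikit-learn and queried for its soft assignments; the data X and the choice of three diagonal components are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))     # placeholder data

# covariance_type='diag' corresponds to the independent-attributes case
gmm = GaussianMixture(n_components=3, covariance_type='diag',
                      random_state=0).fit(X)

hard = gmm.predict(X)        # hard assignment, as K-means would give
soft = gmm.predict_proba(X)  # soft assignment: one probability per component
print(gmm.weights_, soft[0])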


Dirichlet Process Mixture Model

• One of the problems of GMM is deciding a priori the number of components

• This can be included in the model using a mixture model that has a Dirichlet Process distribution as a prior (it represents the distribution of the number of mixture components and their weights)

• The Dirichlet Process distribution assumes an unbounded number of components

• A finite weight is distributed among all the components

• The fitting of the model will decide what number of components best suits the data


Fuzzy Clustering

• Fuzzy clustering relaxes the hard partition constraint of K-means

• Each example has a continuous membership to each partition

• A new optimization function is introduced:

$$L = \sum_{i=1}^{N} \sum_{k=1}^{K} \delta(C_k, x_i)^b \, \|x_i - \mu_k\|^2$$

with $\sum_{k=1}^{K} \delta(C_k, x_i) = 1$ and b a blending factor

• When the clusters overlap this is an advantage over hard partition algorithms


Fuzzy Clustering

• C-means is the best known fuzzy clustering algorithm; it is the fuzzy version of K-means

  • Membership is computed as the normalized inverse distance to all the clusters
  • The updating of the cluster centers is computed as:

$$\mu_j = \frac{\sum_{i=1}^{N} \delta(C_j, x_i)^b\, x_i}{\sum_{i=1}^{N} \delta(C_j, x_i)^b}$$

• And the updating of the memberships:

$$\delta(C_j, x_i) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{k=1}^{K} (1/d_{ik})^{1/(b-1)}}, \qquad d_{ij} = \|x_i - \mu_j\|^2$$

Notice that this is a softmax-like normalization of the inverse distances to the centroids
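A minimal sketch of one C-means update (assumptions: Euclidean distances, a fuzzifier b > 1 and NumPy arrays; not the exact notation of the slides):

import numpy as np

def cmeans_step(X, centers, b=2.0, eps=1e-12):
    """One fuzzy C-means iteration: update memberships, then the centers."""
    # squared distances d_ij = ||x_i - mu_j||^2
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
    # memberships: normalized inverse distances raised to 1/(b-1)
    u = (1.0 / d2) ** (1.0 / (b - 1.0))
    u /= u.sum(axis=1, keepdims=True)
    # centers: weighted means with weights u^b
    ub = u ** b
    centers = (ub.T @ X) / ub.sum(axis=0)[:, None]
    return u, centers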


Fuzzy Clustering

• The C-means algorithm looks for spherical clusters; other alternatives are:

  • Gustafson-Kessel algorithm: a covariance matrix is introduced for each cluster in the objective function, which allows ellipsoidal shapes and different cluster sizes

  • Gath-Geva algorithm: adds to the objective function the size and an estimation of the density of the cluster

• Different objective functions can be used to detect specific shapes in the data (lines, rectangles, ...)


Python Notebooks

This Python Notebook shows examples of using the different K-means variants and GMM, and their problems

• Prototype Based Clustering Algorithms Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Density/Grid Clustering


Density/Grid Based Clustering

• The number of clusters is not decided beforehand

• We are looking for regions with high density of examples

• We are not limited to predefined shapes (there is no model)

• Different approaches:

• Density estimation

• Grid partitioning

• Multidimensional histograms

• Usually applied to datasets with low dimensionality


Density Estimation

[Figure: density estimation of the examples]

Grid Partitioning

[Figure: partition of the space into a regular grid]

Multidimensional Histograms

[Figure: multidimensional histogram of the examples]

DBSCAN/OPTICS

Ester, Kriegel, Sander, Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN) (1996)

Ankerst, Breunig, Kriegel, Sander. OPTICS: Ordering Points To Identify the Clustering Structure (2000)

• Used in spatial databases, but can be applied to data with higher dimensionality

• Based on finding areas of high density; it can find clusters of arbitrary shape


DBSCAN/OPTICS

• We define the ε-neighbourhood of an instance as the set of examples that are at a distance of at most ε from it

  Nε(x) = {y ∈ X | d(x, y) ≤ ε}

• We define a core point as an example that has at least a certain number of elements in Nε(x)

  Core_point ≡ |Nε(x)| ≥ MinPts

[Figure: ε-neighborhood of a point]


DBSCAN - Density

• Two examples p and q are Direct Density Reachable with respect to ε and MinPts if:

1. p ∈ Nε(q)

2. |Nε(q)| ≥ MinPts

• Two examples p and q are Density Reachable if there is a sequence of examples p = p1, p2, . . . , pn = q where each pi+1 is Direct Density Reachable from pi

• Two examples p and q are Density Connected if there is an example o such that both p and q are Density Reachable from o


[Figure: illustration of Direct Density Reachability (p DDR q) and Density Reachability (p DR q through intermediate points p1, p2)]


DBSCAN - Cluster Definition

Cluster

Given a dataset D, a cluster C with respect to ε and MinPts is any subset of D such that:

1. ∀p, q : p ∈ C ∧ density_reachable(q, p) → q ∈ C
2. ∀p, q ∈ C : density_connected(p, q)

Any example that cannot be connected using these relationships is treated as noise


DBSCAN - Algorithm

Algorithm

1. We start with an arbitrary example and compute all the density reachable examples with respect to ε and MinPts

2. If it is a core point we will obtain a group; otherwise it is a border point and we start from another unclassified instance

To decrease the computational cost, R*-trees are used to store and compute the neighborhood of instances

ε and MinPts are set from the thinnest cluster

The OPTICS algorithm defines a heuristic to find a good set of values for these parameters
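A usage sketch with scikit-learn's DBSCAN (the dataset and the ε/MinPts values are just illustrative choices):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two non-convex clusters that a prototype-based algorithm cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ ε, min_samples ~ MinPts
labels = db.labels_                          # label -1 marks noise points
print(np.unique(labels, return_counts=True))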


DBSCAN - Algorithm

[Figure: DBSCAN example — data, ε and MinPts = 5, first iteration (DDR), first iteration (DR, DC)]

Python Notebooks

This Python Notebook compares Prototype Based and Density Based Clustering algorithms

• Density Based Clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Other Approaches

Spectral Clustering

• Spectral Graph Theory defines properties that the eigenvalues and eigenvectors of the adjacency matrix or Laplacian matrix of a graph hold

• Spectral clustering uses the spectral properties of the similarity matrix

• The distance matrix represents the graph that connects the examples

  • Complete graph
  • Neighborhood graph (different definitions)

• Different clustering algorithms can be defined from the diagonalization of this matrix


Spectral Clustering (First approach)

• We start with the similarity matrix (W ) of the data

• The degree of a vertex is defined as:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

• We define the degree matrix D as the diagonal matrix with values d1, d2, . . . , dn

• We can define different Laplace matrices:

  • Unnormalized: $L = D - W$
  • Normalized: $L_{sym} = D^{-1/2} L D^{-1/2}$ or also $L_{rw} = D^{-1} L$


Spectral Clustering (First approach)

[Figure: weighted similarity graph over 5 examples with edge weights w12 = 0.8, w23 = 0.3, w14 = 0.2, w34 = 0.7, w45 = 0.7, w35 = 0.4]

$$W = \begin{pmatrix} 0 & 0.8 & 0 & 0.2 & 0 \\ 0.8 & 0 & 0.3 & 0 & 0 \\ 0 & 0.3 & 0 & 0.7 & 0.4 \\ 0.2 & 0 & 0.7 & 0 & 0.7 \\ 0 & 0 & 0.4 & 0.7 & 0 \end{pmatrix} \qquad D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1.1 & 0 & 0 & 0 \\ 0 & 0 & 1.4 & 0 & 0 \\ 0 & 0 & 0 & 1.6 & 0 \\ 0 & 0 & 0 & 0 & 1.1 \end{pmatrix}$$


Spectral Clustering (First approach)

• Algorithm:

1. Compute the Laplace matrix from the similarity matrix
2. Compute the first K eigenvalues of the Laplace matrix
3. Use the eigenvectors as new datapoints
4. Apply K-means as clustering algorithm

• We are actually embedding the dataset in a space of lower dimensionality
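A sketch of this first approach using the unnormalized Laplacian (the RBF similarity and the use of the k smallest eigenvalues are assumptions; sklearn.cluster.SpectralClustering packages the same idea):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k, gamma=1.0):
    """Unnormalized spectral clustering: embed with the Laplacian, then K-means."""
    W = rbf_kernel(X, gamma=gamma)        # similarity matrix
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    U = vecs[:, :k]                       # eigenvectors as new datapoints
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)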


Spectral Clustering (First approach)

$$L = \begin{pmatrix} 1 & -0.8 & 0 & -0.2 & 0 \\ -0.8 & 1.1 & -0.3 & 0 & 0 \\ 0 & -0.3 & 1.4 & -0.7 & -0.4 \\ -0.2 & 0 & -0.7 & 1.6 & -0.7 \\ 0 & 0 & -0.4 & -0.7 & 1.1 \end{pmatrix} \qquad \mathrm{Eigvec}(1,2) = \begin{pmatrix} 0.07 & 0.70 \\ 0.04 & 0.69 \\ 0.22 & 0.10 \\ 0.93 & -0.10 \\ 0.24 & 0.03 \end{pmatrix}$$


[Figure: the examples plotted using the first two eigenvectors as coordinates]

Spectral Clustering (Second approach)

• From the similarity matrix and its Laplacian it is possible to formulate clustering as a graph partitioning problem

• Given two disjoint sets of vertices A and B, we define:

$$\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$$

• We can partition the graph by solving the mincut problem, choosing a partition that minimizes:

$$\mathrm{cut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \mathrm{cut}(A_i, \bar{A}_i)$$


Spectral Clustering (Second approach)

• Directly using the weights of the Laplacian does not always give good results; alternative objective functions are:

$$\mathrm{RatioCut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{|A_i|}$$

$$\mathrm{Ncut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}$$

• Where $|A_i|$ is the size of the partition and $\mathrm{vol}(A_i)$ is the sum of the degrees of the vertices in $A_i$


Affinity Propagation Clustering

• Affinity clustering is a message passing algorithm related to graph partitioning and belief propagation in probabilistic graphical models

• It chooses a set of examples as the cluster prototypes and computes how the rest of the examples are attached to them

• Each pair of examples has a defined similarity s(i, k)

• Each example has a value r(k, k) that represents the preference for each point to be an exemplar

• The algorithm does not set a priori the number of clusters


Affinity Propagation Clustering - Messages

• The examples pass two kinds of messages

  • Responsibility r(i, k): a message that example i passes to candidate exemplar k. It represents the evidence of how well suited k is to be the exemplar of i

  • Availability a(i, k): sent from candidate exemplar k to point i. It represents the accumulated evidence of how appropriate it would be for point i to choose point k as its exemplar


Affinity Propagation Clustering - Updating

• All availabilities are initialized to 0

• The responsibilities are updated as:

$$r(i, k) = s(i, k) - \max_{k' \neq k}\{a(i, k') + s(i, k')\}$$

• The availabilities are updated as:

$$a(i, k) = \min\left\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max(0, r(i', k))\right\}$$


Affinity Propagation Clustering - Updating

• The self availability a(k, k) is updated as:

$$a(k, k) = \sum_{i' \neq k} \max(0, r(i', k))$$

• The exemplar for a point i is the point k that maximizes a(i, k) + r(i, k); if that point is i itself, then i is an exemplar.


Affinity Propagation Clustering - Algorithm

1. Update the responsibilities given the availabilities

2. Update the availabilities given the responsibilities

3. Compute the exemplars

4. Terminate if the exemplars do not change during a number of iterations
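In practice this message passing loop is already implemented in scikit-learn's AffinityPropagation; a short usage sketch (the damping value and the default preference are tuning choices, not prescribed by the slides):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)   # toy data

# the preference (self-similarity) controls how many exemplars appear;
# by default it is set to the median similarity
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print("exemplars:", ap.cluster_centers_indices_)
print("clusters:", len(np.unique(ap.labels_)))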



Unsupervised Neural Networks

• Self-organizing maps are unsupervised neural networks

• Can be seen as an on-line constrained version of K-means

• The data is transformed to fit in a 1-d or 2-d regular mesh

• The nodes of this mesh are the prototypes

• This algorithm can be used as a dimensionality reduction method (from N to 2 dimensions)


Self-Organizing Maps

• To build the map we have to decide the size and shape of the mesh (rectangular/hexagonal)

• Each node is a multidimensional prototype of p features

Algorithm: Self-Organizing Map algorithm

Initial prototypes are distributed regularly on the mesh
for a predefined number of iterations do
    foreach example xi do
        Find the nearest prototype (mj)
        Determine the neighborhood M of mj
        foreach prototype mk ∈ M do
            mk = mk + α(xi − mk)
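A compact sketch of this loop (assumptions: prototypes initialized from random examples instead of a regular layout, Manhattan adjacency on the mesh, linearly decreasing α and neighborhood radius):

import numpy as np

def som(X, rows=10, cols=10, iters=20, alpha0=1.0, radius0=3):
    """Minimal rectangular SOM returning the (rows, cols, d) prototype mesh."""
    rng = np.random.default_rng(0)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    protos = X[rng.choice(len(X), rows * cols)].astype(float)
    for t in range(iters):
        alpha = alpha0 * (1 - t / iters)                   # decreasing rate
        radius = max(1, round(radius0 * (1 - t / iters)))  # shrinking neighborhood
        for x in X:
            j = np.argmin(((protos - x) ** 2).sum(axis=1))   # nearest prototype
            neigh = np.abs(grid - grid[j]).sum(axis=1) <= radius
            protos[neigh] += alpha * (x - protos[neigh])     # move neighborhood
    return protos.reshape(rows, cols, -1)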


Self-Organizing Maps

• Each iteration transforms the mesh to be closer to the data, maintaining the 2D relationship between prototypes

• The neighborhood of a prototype is defined by the adjacency of the cells and the distance between the prototypes

• Performance depends on the learning rate α, which is usually decreased from 1 to 0 during the iterations

• The number of neighbors used in the update is decreased during the iterations from a predefined number to 1

• Some variations of the algorithm use the distance of the prototypes as weights for the update



Python Notebooks

This Python Notebook has examples for Spectral and Affinity Propagation Clustering

• Spectral and Affinity Propagation Clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Python Code

• In the code from the repository, inside the subdirectory Clustering you have the python programs HierarchicalAuthors, PartitionalAuthors and DensityBasedCity

• The first and second ones use the authors dataset and allow comparing hierarchical clustering and partitional clustering with this data. You can observe what happens with both datasets and using different attributes

• The third one uses data from the City datasets that represent events in different cities (Tweets and Crime), showing results for a variety of clustering algorithms. You can use data from different cities.

Applications


Barcelona Twitter/Instagram Dataset

• Goal: To analyze the geographical behavior of people living in/visiting a city

• Dataset: Tweets/posts inside a geographical area

• Attributes: geographical information / time stamp of the post

• Processes:

  • Geographical discretization for user representation

• Discovery of behavior profiles


Discovery Goals - Are there distinguishable groups of behaviors?

• Focus on the discovery of different groups of people

• The hypothesis is that users can be segmented according to where they are at different times of the day

• We could answer questions like:

  • Do people from one part of the city usually stay in that part?

  • Do people from outside the city go to the same places?

  • Is public transportation preferred by different profiles? ...


Data Attributes

• Raw geolocalization and timestamp are too fine grained

• Even when we have millions of events, the probability of having two events at the same place and at the same time is low

• Discretizing localization and time will increase the probability of finding patterns

• How do we discretize space and time? Clustering


Clustering of the events - Leader (200m/500m radius)


Finding Geographical Profiles

• We could explore the aggregated behavior of a user for a long period (month, year)

• Each example represents the places/times of the events of a user for the period

• We obtain a representation similar to the Bag of Words used in text mining

• User ⇒ Document

• Time × Place ⇒ Word


Partitioning the data

• Difficult to choose the adequate clustering algorithm

• Some choices will depend on:

  • Types of attributes: continuous or discrete values, sparsity of the data

  • Size of the dataset: aggregating the events largely reduces the size of the data (no scalability issues)

  • Assumptions about the model that represents our goals: shape of the clusters, separability of the clusters/distribution of examples

  • Interpretability/representability of the clusters


K-means - Twitter - Moving from nearby cities to Barcelona

K-means - Twitter - Using north-west freeways


4 Cluster Validation


Clustering Evaluation

Javier Béjar

URL - Spring 2020

CS - MAI

Cluster Evaluation


Model evaluation

• The evaluation of unsupervised learning is difficult

• There is no goal model to compare with

• The true result is unknown; it may depend on the context, the task to perform, ...

• Why do we want to evaluate them?

• To avoid finding patterns in noise

• To compare clustering algorithms

• To compare different models/parameters


What can be evaluated?

• Cluster tendency: are there clusters in the data?

• Compare the clusters to the true partition of the data

• Quality of the clusters without reference to external information

• Compare the results of different clustering algorithms

• Evaluate algorithm parameters

• For instance, to determine the correct number of clusters


Model evaluation - Cluster Tendency

• Before clustering a dataset we can test if there are actually clusters

• We have to test the hypothesis of the existence of patterns in the data versus a uniformly distributed dataset (homogeneous distribution)


Model evaluation - Cluster Tendency

• Hopkins Statistic:

1. Sample n points (pi) from the dataset (D) uniformly and compute the distance to their nearest neighbor (d(pi))

2. Generate n points (qi) uniformly distributed in the space of the dataset and compute their distance to their nearest neighbors in D (d(qi))

3. Compute the quotient:

$$H = \frac{\sum_{i=1}^{n} d(p_i)}{\sum_{i=1}^{n} d(p_i) + \sum_{i=1}^{n} d(q_i)}$$

4. If the data are uniformly distributed, the value of H will be around 0.5
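A sketch of this statistic (assumed implementation; the uniform points are drawn in the bounding box of the data):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, n=50, rng=None):
    """Hopkins statistic as defined above: values near 0.5 suggest no structure."""
    rng = np.random.default_rng(0) if rng is None else rng
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    # sampled real points: distance to their nearest *other* point of D
    P = D[rng.choice(len(D), n, replace=False)]
    d_p = nn.kneighbors(P, n_neighbors=2)[0][:, 1]
    # uniform points in the bounding box of D: distance to the nearest point of D
    Q = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    d_q = nn.kneighbors(Q, n_neighbors=1)[0][:, 0]
    return d_p.sum() / (d_p.sum() + d_q.sum())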


Hopkins Statistic - Example

Cluster Quality criteria

• We can use different methodologies/criteria to evaluate the quality of a clustering:

  • External criteria: comparison with a model partition/labeled data

  • Internal criteria: quality measures based on the examples/quality of the partition

• Relative criteria: Comparison with other clusterings


Internal criteria


• Measure properties expected in a good clustering

  • Compact groups
  • Well separated groups

• The indices are based on the model of the groups

• We can use indices based on the attribute values measuring the properties of a good clustering

• These indices are based on statistical properties of the attributes of the model

  • Values distribution
  • Distances distribution


Internal criteria - Indices

• Some of the indices correspond directly to the objective function optimized:

  • Quadratic error/Distortion (K-means)

$$SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$$

  • Log likelihood (Mixture of Gaussians/EM)


Internal criteria - Indices

• For prototype based algorithms several measures can be used to compute quality indices

• Scatter matrices: interclass distance, intraclass distance, separation

$$S_{W_k} = \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T$$

$$S_{B_k} = |C_k| (\mu_k - \mu)(\mu_k - \mu)^T$$

$$S_{M_{k,l}} = \sum_{i \in C_k} \sum_{j \in C_l} (x_i - x_j)(x_i - x_j)^T$$


Internal criteria - Indices

• Trace criteria (lower overall intracluster distance / higher overall intercluster distance)

$$Tr(S_W) = \frac{1}{K} \sum_{k=1}^{K} S_{W_k} \qquad Tr(S_B) = \frac{1}{K} \sum_{k=1}^{K} S_{B_k}$$

• Calinski-Harabasz index (interclass-intraclass distance ratio)

$$CH = \frac{\sum_{i=1}^{K} |C_i| \, \|\mu_i - \mu\|^2 / (K-1)}{\sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2 / (N-K)}$$


Internal criteria - Indices

• Davies-Bouldin criterion (maximum interclass-intraclass distance ratio)

$$R = \frac{1}{K} \sum_{i=1}^{K} R_i$$

where

$$R_{ij} = \frac{S_{W_i} + S_{W_j}}{S_{M_{ij}}} \qquad R_i = \max_{j \neq i} R_{ij}$$


Internal criteria - Indices

• Silhouette index (maximum class spread/variance)

$$S = \frac{1}{N} \sum_{i=1}^{N} \frac{b_i - a_i}{\max(a_i, b_i)}$$

where

$$a_i = \frac{1}{|C_j| - 1} \sum_{y \in C_j, y \neq x_i} \|y - x_i\| \qquad b_i = \min_{l \in H, l \neq j} \frac{1}{|C_l|} \sum_{y \in C_l} \|y - x_i\|$$

with $x_i \in C_j$, $H = \{h : 1 \leq h \leq K\}$


Internal criteria - Indices

• More than 30 indices can be found in the literature

• Several studies and comparisons have been performed

• Recent studies (Arbelaitz et al., 2013) have exhaustively tested these indices; some have a significantly better performance than others

• Some of the indices show a similar performance (not statistically different)

• The study concludes that Silhouette, Davies-Bouldin and Calinski-Harabasz perform well in a wide range of situations
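These three indices are available in scikit-learn; a small comparison over candidate numbers of clusters (toy data, K-means as the clustering algorithm):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),         # higher is better
          round(calinski_harabasz_score(X, labels), 1),  # higher is better
          round(davies_bouldin_score(X, labels), 3))     # lower is better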


Internal criteria - 5 clusters different variance

Internal criteria - 5 clusters different variance - Scores


External criteria


• These indices measure the similarity of a clustering to a model partition P

• Without a model they can be used to compare the results of using different parameters or different algorithms

  • For instance, they can be used to assess the sensitivity to initialization

• The main advantage is that these indices are independent of the example/cluster descriptions

• That means that they can be used to assess any clustering algorithm


External criteria - Indices

• All the indices are based on the coincidence of each pair of examples in the groups of two clusterings

• The computations are based on four values:

  • The two examples are in the same cluster in both partitions (a)

  • The two examples are in the same cluster in C, but not in P (b)

  • The two examples are in the same cluster in P, but not in C (c)

  • The two examples are in different clusters in both partitions (d)


External criteria - Indices

• Rand/Adjusted Rand statistic:

$$Rand = \frac{a + d}{a + b + c + d} \qquad ARand = \frac{a - \frac{(a+c)(a+b)}{a+b+c+d}}{\frac{(a+c)+(a+b)}{2} - \frac{(a+c)(a+b)}{a+b+c+d}}$$

• Jaccard Coefficient:

$$J = \frac{a}{a + b + c}$$

• Fowlkes and Mallows index:

$$FM = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$$


External criteria - Indices - Information Theory

• Defining the Mutual Information between two partitions as:

$$MI(Y_i, Y_k) = \sum_{X_c^i \in Y_i} \sum_{X_{c'}^k \in Y_k} \frac{|X_c^i \cap X_{c'}^k|}{N} \log_2\!\left(\frac{N\, |X_c^i \cap X_{c'}^k|}{|X_c^i|\, |X_{c'}^k|}\right)$$

• and the Entropy of a partition as:

$$H(Y_i) = -\sum_{X_c^i \in Y_i} \frac{|X_c^i|}{N} \log_2\!\left(\frac{|X_c^i|}{N}\right)$$

where $|X_c^i \cap X_{c'}^k|$ is the number of objects that are in the intersection of the two groups


External criteria - Indices - Information Theory

• Normalized Mutual Information:

$$NMI(Y_i, Y_k) = \frac{MI(Y_i, Y_k)}{\sqrt{H(Y_i) H(Y_k)}}$$

• Variation of Information:

$$VI(C, C') = H(C) + H(C') - 2\, MI(C, C')$$

• Adjusted Mutual Information:

$$AMI(U, V) = \frac{MI(U, V) - E(MI(U, V))}{\max(H(U), H(V)) - E(MI(U, V))}$$
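These pair-counting and information-theoretic indices are also available in scikit-learn; a minimal example comparing a clustering with a reference labelling (toy labels):

from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score)

reference = [0, 0, 0, 1, 1, 1]    # model partition P (toy example)
clustering = [1, 1, 0, 2, 2, 2]   # clustering C to evaluate

print(adjusted_rand_score(reference, clustering))          # ARand
print(normalized_mutual_info_score(reference, clustering)) # NMI
print(adjusted_mutual_info_score(reference, clustering))   # AMI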


External criteria - ARI/AMI Scores

Number of clusters


Number of clusters

• A topic related to cluster validation is deciding if the number of clusters obtained is the correct one

• This point is especially important for the algorithms that need this value as a parameter

• The usual procedure is to compare the characteristics of clusterings of different sizes

• Usually internal criteria indices are used in this comparison

• A plot of these indices for different numbers of clusters can show which number of clusters is more probable


Number of clusters - Indices

• Some of the internal validity indices can be used for this purpose: Calinski-Harabasz index, Silhouette index

• Using the within-class scatter matrix ($S_W$) other criteria can be defined:

  • Hartigan index:

$$H(k) = \left[\frac{S_W(k)}{S_W(k+1)} - 1\right](n - k - 1)$$

  • Krzanowski-Lai index:

$$KL(k) = \left|\frac{DIFF(k)}{DIFF(k+1)}\right|$$

    being $DIFF(k) = (k-1)^{2/p} S_W(k-1) - k^{2/p} S_W(k)$


The Gap Statistic

• Assesses the number of clusters by comparing a clustering with the expected distribution of the data given the null hypothesis (no clusters)

• Computes different clusterings of the data, increasing the number of clusters, and compares them to clusterings of data generated with a uniform distribution (B samples)

• The within-cluster scatter SW is computed for both and compared

• The correct number of clusters is where the widest gap appears between the SW of the data and of the uniform data


The Gap Statistic

• The Gap statistic:

$$Gap(k) = \frac{1}{B} \sum_{b} \log(S_W(k)_b) - \log(S_W(k))$$

The first term is the mean of $\log(S_W(k)_b)$ for the clusterings obtained from the uniformly distributed data

• From the standard deviation ($sd_k$) of the $\log(S_W(k)_b)$ values, $s_k$ is defined as:

$$s_k = sd_k \sqrt{1 + 1/B}$$

• The probable number of clusters is the smallest number that holds:

$$Gap(k) \geq Gap(k+1) - s_{k+1}$$


The Gap Statistic

[Figure: log(Sw) as a function of the number of clusters (2 to 8) for the data and for the uniform reference]

Cluster Stability

• The idea is that if the model chosen for clustering a dataset is correct, it should be stable for different samplings of the data

• The procedure is to obtain different subsamples of the data, cluster them and test their stability


Cluster Stability

• Using disjoint samples:

• Dataset divided into two disjoint samples that are clustered separately

• Indices can be defined to assess stability, for example using the distribution of the number of neighbors that belong to the complementary sample

• Using non disjoint samples:

  • Dataset divided into three disjoint samples (S1, S2, S3)
  • Two clusterings are obtained from S1 ∪ S3 and S2 ∪ S3
  • Indices can be defined about the coincidence of the common examples in both partitions


Python Notebooks

This Python Notebook has examples for Measures of Clustering Validation

• Clustering Validation Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Python Code

• In the code from the repository, inside the subdirectory Validation you have the python program ValidationAuthors

• The authors dataset is clustered with different algorithms (K-means, GMM, Spectral) and different validity indices are plotted for the number of clusters



5 Clustering of Large Datasets


Clustering in KDD

Javier Béjar

URL - Spring 2020

CS - MIA

Introduction


Clustering in KDD

• One of the main tasks in the KDD process is the analysis of data when we do not know its structure

• This task is very different from the task of prediction, where we know the goal and we try to approximate it

• A great part of the KDD tasks are unsupervised problems (KDnuggets poll: 2nd-3rd most frequent task)

• Problems: scalability, arbitrary cluster shapes, limited types of data, finding the correct parameters, ...

• There are some new algorithms that deal with these kinds of problems


Scalability Strategies


Strategies for cluster scalability

• One-pass

  • Process data as a stream

• Summarization/Data compression

  • Compress examples to fit more data in memory

• Sampling/Batch algorithms

  • Process a subset of the data and maintain/compute a global model

• Approximation

  • Avoid expensive computations by approximate estimation

• Parallelization/Distribution

  • Divide the task into several parts and merge the models


One pass

• This strategy is based on incremental clustering algorithms

• They are cheap, but the order of processing greatly affects their quality

• Although they can be used as a preprocessing step

• Two-step algorithms:

1. A large number of clusters is generated using the one-pass algorithm

2. A more accurate algorithm clusters the preprocessed data


Data Compression/Summarization

• Not all the data is necessary to discover the clusters

• Discard sets of examples and summarize by:

• Sufficient statistics

• Density approximations

• Discard data irrelevant for the model (it does not affect the result)


Approximation

• Not using all the information available to make decisions

• Using K-neighbours (data structures for computing k-neighbours)

• Preprocessing the data using a cheaper algorithm

• Generate batches using approximate distances (e.g. canopy clustering)

• Use approximate data structures

• Use of hashing or approximate counts for distance and frequency computation


Batches/Sampling

• Process only data that fits in memory

• Obtain from the data set:

• Samples (process only a subset of the dataset)

  • Determine the size of the sample so that all the clusters are represented

• Batches (process all the dataset)


Parallelization/Distribution/Divide & Conquer

• Parallelization usually depends on the specific algorithm

• Some are not easy to parallelize (e.g. hierarchical clustering)

• Some have specific parts that can be solved in parallel or by Divide & Conquer

• Distance computations in k-means

• Parameter estimation in EM algorithms

• Grid density estimations

• Space partitioning

• Batches and sampling are more general approaches

• The problem is how to merge all the different partitions


Scalable Algorithms

Scalable Hierarchical Clustering

Patra, Nandi, Viswanath. Distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognition, 2011, 44, 2862-2870

• Strategy: One pass + Summarization
• The Leader algorithm is used as a one-pass summarization (many clusters)
• Single link is used to cluster the summaries
• Guarantees the equivalence to SL at the top levels
• Summarization makes the algorithm independent of the dataset size (it depends on the radius used in the Leader algorithm and the volume of the data)
• Complexity O(c²)


One pass + Single Link

[Figure: 1st phase — summarization with the Leader algorithm; 2nd phase — hierarchical clustering of the summaries]

BIRCH

Zhang, Ramakrishnan, Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996)

• Strategy: One-pass + Summarization

• Hierarchical clustering with limited memory

• Incremental algorithm

• Based on probabilistic prototypes and distances

• We need two passes over the database

• Based on a specialized data structure named CF-tree (Clustering Feature Tree)


BIRCH (CF-tree)

• Balanced n-ary tree containing clusters represented by probabilistic prototypes

• Leaves have a capacity of L prototypes and the cluster radius cannot be more than T

• Non-terminal nodes have a fixed branching factor (B); each element summarizes its subtree

• The choice of parameters is crucial because the available space could be filled during the process

• This is solved by changing the parameters (basically T) and recompressing the tree (T determines the granularity of the final groups)


BIRCH - Insertion algorithm

1. Traverse the tree until reaching a leaf and choose the nearest prototype

2. On this leaf we can introduce the instance into an existing group or create a new prototype, depending on whether the distance is larger than the parameter T

3. If the current leaf has no space for the new prototype, then create a new terminal node and distribute the prototypes among the current node and the new node


BIRCH - Insertion algorithm (cont.)

4. The distribution is performed by choosing the two most different prototypes and dividing the rest using their proximity to these two prototypes

5. This creates a new node in the parent node; if the new node exceeds the capacity of the parent then it is split, and the process is continued until the root of the tree is reached if necessary

6. Additionally we could perform merge operations to compact the tree and reduce space


BIRCH - Insertion Algorithm

Insertion + division


BIRCH - Clustering algorithm

1. Phase 1: Construction of the CF-tree; we obtain a hierarchy that summarizes the database as a set of groups whose granularity is defined by T

2. Phase 2: Optionally we modify the CF-tree in order to reduce its size by merging nearby groups and deleting outliers

3. Phase 3: We use the prototypes inside the leaves of the tree as new instances and we run a clustering algorithm with them (for instance K-means)

4. Phase 4: We refine the groups by assigning the instances from the original database to the prototypes obtained in the previous phase
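A usage sketch with scikit-learn's Birch (threshold and branching_factor play the roles of T and B; the values and the dataset are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

# n_clusters triggers the global clustering phase over the CF-tree leaves
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=8).fit(X)
labels = birch.predict(X)
print(len(birch.subcluster_centers_), "subclusters in the CF-tree leaves")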


One pass + CFTREE (BIRCH)

[Figure: 1st phase — CF-tree construction; 2nd phase — K-means on the leaf prototypes]

Scalable K-means clustering

Bradley, Fayyad, Reina. Scaling Clustering Algorithms to Large Databases. Knowledge Discovery and Data Mining (1998)

• Strategy: Sampling + Summarization

• Clustering algorithms need to have all the data in main memory to perform their task

• We try to obtain scalability by looking for an algorithm that:

  • Only looks at the data one time
  • Is anytime (a result is always available)
  • Is incremental (more data does not mean starting from scratch)
  • Is suspendable (it can continue from the current solution)
  • Uses limited memory


Scalable K-means clustering (Algorithm)

• Obtain a sample that fits in memory

• Update the current model

• Classify new instances as:

  • Necessary

• Discardable (We keep their information as sufficient statistics)

• Summarizable using data compression

• Decide if the model is stable or we keep clustering more data


Canopy Clustering

McCallum, Nigam, Ungar. Efficient clustering of high-dimensional data sets with application to reference matching (2002)

• Strategy: Divide & Conquer + Approximation

• The approach is based on a two-stage clustering

• The first stage can be seen as a preprocess to determine the neighborhood of the densities, reducing the number of distances to compute in the second stage

• This first stage is called canopy clustering; it relies on a cheap distance and two parameters T1 > T2

• These parameters are used as the radii of two concentric spheres that determine how to classify the examples

Canopy Clustering - Algorithm

• Algorithm:

1. One example is picked at random and the cheap distance from this example to the rest is computed
2. All the examples that are at less than T2 are deleted from the list and included in the canopy
3. The points at less than T1 are added to the canopy of this example without deleting them
4. The procedure is repeated until the example list is empty
5. Canopies can share examples

• Afterwards, the data can be clustered with different algorithms

• For agglomerative clustering, only the distances among the examples that share a canopy have to be computed


Canopy Clustering - Algorithm

[Figure: construction of the 1st, 2nd and 3rd canopies]

Mini-batch K-means

Sculley. Web-scale k-means clustering. Proceedings of the 19th international conference on World Wide Web, 2010, 1177-1178

• Strategy: Sampling

• Apply K-means to a sequence of bootstrap samples of the data

• Each iteration the samples are assigned to prototypes and the prototypes are updated with the new sample

• Each iteration the weight of the samples is reduced (learning rate)

• The quality of the results depends on the size of the batches

• Convergence is detected when prototypes are stable


Mini-batch K-means (algorithm)

Given: k, mini-batch size b, iterations t, data set X
Initialize each c ∈ C with an x picked randomly from X
v ← 0
for i ← 1 to t do
    M ← b examples picked randomly from X
    for x ∈ M do
        d[x] ← f(C, x)          // cache the center nearest to x
    for x ∈ M do
        c ← d[x]
        v[c] ← v[c] + 1         // per-center update counter
        η ← 1 / v[c]            // per-center learning rate
        c ← (1 − η)c + ηx       // move the center towards x
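The same idea is available as scikit-learn's MiniBatchKMeans; a usage sketch (the batch size and the dataset are illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=10, random_state=0)

# batch_size plays the role of b; larger batches approach full K-means
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1000, n_init=3,
                      random_state=0).fit(X)
print(mbk.inertia_)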

CURE

Guha, Rastogi, Shim. CURE: An efficient clustering algorithm for large databases (1998)

• Strategy: Sampling + Divide & Conquer
• Hierarchical agglomerative clustering
• Scalability is obtained by using sampling techniques and partitioning the dataset
• Uses a set of representatives (c) per cluster instead of centroids (non-spherical groups)
• Distance is computed as the nearest pair of representatives among groups
• The clustering algorithm is agglomerative and merges pairs of groups until k groups are obtained


CURE - Algorithm

1. Draws a random sample from the dataset

2. Partitions the sample in p groups

3. Executes the clustering algorithm on each partition

4. Deletes outliers

5. Runs the clustering algorithm on the union of all groups until it obtains k groups

6. Label the data according to the similarity to the k groups


CURE - Algorithm

[Figure: CURE phases — sampling + partition of the data, clustering of each partition, joining of the partitions, labelling of the data]

Rough-DBSCAN

Viswanath, Babu. Rough-DBSCAN: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 2009, 30, 1477-1488

• Strategy: One-pass + Summarization

• Two-stage algorithm:

1. Preprocess using the Leader algorithm
   • Determine the instances that belong to the higher densities and their neighbours
2. Apply the DBSCAN algorithm
   • Determine the densities for the selected instances
   • Approximate the values of the densities from their distances and the sizes of the neighborhoods
   • Assign the neighbors according to the found densities


MapReduce Clustering

Zhao, W., Ma, H., He, Q. Parallel K-Means Clustering Based on MapReduce Cloud Computing, LNCS 5931, 674-679, Springer 2009

• Strategy: Distribution/Divide & Conquer

• Applied to K-means and GMM

• Mappers have a copy of the current centroids and assign the closest one to each example

• Reducers compute the new centroids according to the assignments

URL - Spring 2020 - MAI 28/31


MapReduce Clustering

[Figure: MapReduce K-means — the data is split among N mappers, each assigning its examples to the closest of the current prototypes (i); K reducers aggregate the assignments and compute the new prototypes (i+1)]

URL - Spring 2020 - MAI 29/31
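A minimal single-machine sketch of one MapReduce-style K-means iteration; plain Python functions stand in for the actual MapReduce framework, and the hypothetical chunks argument plays the role of the data splits handed to the mappers.

import numpy as np
from collections import defaultdict

def mapper(chunk, centroids):
    # each mapper holds a copy of the current centroids and emits (cluster, (example, 1)) pairs
    for x in chunk:
        c = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        yield c, (x, 1)

def reduce_step(mapped):
    # reducers aggregate the partial sums per cluster and output the new centroids
    acc = defaultdict(lambda: [0.0, 0])
    for c, (x, n) in mapped:
        acc[c][0] = acc[c][0] + x
        acc[c][1] += n
    return {c: s / n for c, (s, n) in acc.items()}

def kmeans_mapreduce(chunks, centroids, iterations=10):
    # chunks: list of numpy arrays, one per mapper
    for _ in range(iterations):
        mapped = (pair for chunk in chunks for pair in mapper(chunk, centroids))
        new = reduce_step(mapped)
        centroids = np.array([new.get(c, centroids[c]) for c in range(len(centroids))])
    return centroids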

Peer2Peer clustering

• Use Peer2Peer networks as a divide and conquer strategy

• Each member of the network processes a chunk of the data

• Peers interchange messages with data or intermediate results

• Strategies:
  • Synchronous: all peers interchange information at timed intervals
  • Asynchronous: all peers work independently and interchange information randomly

• Messages:
  • Messages consist of prototypes that are integrated (possibly with a weighting strategy)
  • Messages consist of examples or summarized examples


Python Notebooks

This Python Notebook has examples comparing the K-means algorithm with two scalable algorithms, Mini Batch K-means and BIRCH

• Clustering DM Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 31/31


6 Consensus Clustering

149


Consensus Clustering

Javier Béjar

URL - Spring 2020

CS - MAI

Consensus Clustering

• The ensemble of classifiers is a well-established strategy in supervised learning

• Unsupervised learning pursues the same goal: consensus clustering or clustering ensembles

• The idea is to merge complementary perspectives of the data into a more stable partition

URL - Spring 2020 - MAI 1/22


Consensus Clustering

• Given a set of partitions of the same data X:

  P = {P^1, P^2, ..., P^n}

  with:

  P^1 = {C^1_1, C^1_2, ..., C^1_{k_1}}
  ...
  P^n = {C^n_1, C^n_2, ..., C^n_{k_n}}

  the goal is to obtain a new partition that uses the information of all n partitions

URL - Spring 2020 - MAI 2/22

Goals

• Robustness: the combination has a better performance than each individual partition in some sense

• Consistency: the combination is similar to the individual partitions

• Stability: the resulting partition is less sensitive to outliers and noise

• Novelty: the combination is able to obtain partitions that cannot be obtained by the clustering methods that generated the individual partitions

URL - Spring 2020 - MAI 3/22


Advantages

• Knowledge reuse: the consensus can be computed from the partition assignments, so previous partitions using the same or different attributes can be used

• Distributed computing: the individual partitions can be obtained independently

• Privacy: only the assignments of the individual partitions are needed for the consensus

URL - Spring 2020 - MAI 4/22

Consensus Process


Consensus Process

• Consensus clustering is generally based on a two-step process:

1. Generate the individual partitions to be combined

2. Combine the partitions to generate the final partition

URL - Spring 2020 - MAI 5/22

Partition Generation

• Different example representations: diversity by generating partitions with different subsets of attributes

• Different clustering algorithms: take advantage of the fact that clustering algorithms have different biases

• Different parameter initialization: use clustering algorithms able to produce different partitions using different parameters

• Subspace projection: use dimensionality reduction techniques

• Subsets of examples: use random subsamples of the dataset (bootstrapping)


Consensus Generation

• Co-occurrence based methods: use the labels obtained from each individual clustering and the coincidence of the labels for the examples
  • Relabeling and voting, co-association matrix, graph and hypergraph partitioning, information theory measures, finite mixture models

• Median partition based methods: given a set of partitions (P) and a similarity function (Γ(Pi, Pj)), find the partition (Pc) that maximizes the similarity to the set:

  Pc = arg max_{P ∈ P_X} Σ_{Pi ∈ P} Γ(P, Pi)

URL - Spring 2020 - MAI 7/22

Co-occurrence based methods


Relabeling and voting

• First, solve the labeling correspondence problem

• Then, determine the consensus using different voting strategies

Dimitriadou, Weingessel, Hornik Voting-Merging: An Ensemble Method for Clustering Lecture Notes in Computer Science, 2001, 2130

1. Generate a clustering

2. Determine the correspondence with the current consensus

3. Each example gets a vote from their cluster assignment

4. Update the consensus

URL - Spring 2020 - MAI 8/22

Co-Association matrix

• Co-Association matrix: count how many times a pair of examples appears in the same cluster

• Use the matrix as a similarity or as a new set of characteristics

• Apply a clustering algorithm to the information from the co-association matrix

[Figure: toy example — 12 points grouped by three partitions and the resulting 12×12 co-association matrix, where entry (i, j) counts in how many of the three partitions examples i and j share a cluster]


Co-Association matrix

Fred Finding Consistent Clusters in Data Partitions Multiple Classifier Systems, 2001, 309-318

Fred, Jain Combining Multiple Clusterings Using Evidence Accumulation IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, 835-850

1. Compute the co-association matrix

2. Apply hierarchical clustering (different criteria)

3. Use a heuristic to cut the resulting dendrogram

URL - Spring 2020 - MAI 10/22
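A minimal sketch of evidence accumulation, assuming scikit-learn is available: the co-association matrix is built from a list of label vectors and clustered with average-linkage agglomerative clustering on 1 − co-association (in scikit-learn versions before 1.2 the metric argument is called affinity).

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def co_association(labelings):
    # labelings: list of 1-D integer label arrays, one per partition, all of length n
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(labelings)       # fraction of partitions in which each pair co-occurs

def evidence_accumulation(labelings, k):
    D = 1.0 - co_association(labelings)     # turn co-association into a dissimilarity
    model = AgglomerativeClustering(n_clusters=k, metric="precomputed", linkage="average")
    return model.fit_predict(D)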

Graph and hypergraph partitioning

• Define consensus as a graph partitioning problem

• Different methods to build a graph or hypergraph from the partitions

Strehl, Ghosh Cluster ensembles - A knowledge reuse framework for combining multiple partitions Journal of Machine Learning Research, MIT Press, 2003, 3, 583-617

• Cluster based Similarity Partitioning Algorithm (CSPA)

• HyperGraph-Partitioning Algorithm (HGPA)

• Meta-CLustering Algorithm (MCLA)

URL - Spring 2020 - MAI 11/22


CSPA

• Compute a similarity matrix from the clusterings
• Hyperedges matrix: for all clusterings, compute an indicator matrix (H) that represents the links among examples and clusters (hypergraph)
• Compute the similarity matrix as:

  S = (1/r) · H·Hᵀ

  where r is the number of clusterings
• Apply a graph partitioning algorithm to the similarity matrix (METIS)
• Drawback: quadratic cost in the number of examples, O(n²kr)

URL - Spring 2020 - MAI 12/22

CSPA: example

Cluster labels:

       C1  C2  C3
  x1    1   2   1
  x2    1   2   1
  x3    1   1   2
  x4    2   1   2
  x5    2   3   2

⇒ Indicator (hyperedge) matrix H:

       C1,1 C1,2 C2,1 C2,2 C2,3 C3,1 C3,2
  x1     1    0    0    1    0    1    0
  x2     1    0    0    1    0    1    0
  x3     1    0    1    0    0    0    1
  x4     0    1    1    0    0    0    1
  x5     0    1    0    0    1    0    1

S = (1/3)·H·Hᵀ:

        x1   x2   x3   x4   x5
  x1     1    1  1/3    0    0
  x2     1    1  1/3    0    0
  x3   1/3  1/3    1  2/3  1/3
  x4     0    0  2/3    1  2/3
  x5     0    0  1/3  2/3    1

URL - Spring 2020 - MAI 13/22
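The indicator matrix and the similarity S of the example can be reproduced with a few lines of NumPy (the final graph partitioning step with METIS is not shown):

import numpy as np

def indicator_matrix(labelings):
    # stack the one-hot encoding of every clustering: rows are examples, columns clusters
    blocks = []
    for labels in labelings:
        values = np.unique(labels)
        blocks.append((labels[:, None] == values[None, :]).astype(float))
    return np.hstack(blocks)

labelings = [np.array([1, 1, 1, 2, 2]),      # C1
             np.array([2, 2, 1, 1, 3]),      # C2
             np.array([1, 1, 2, 2, 2])]      # C3
H = indicator_matrix(labelings)
S = H @ H.T / len(labelings)
print(S)    # S[0, 2] == 1/3 and S[2, 3] == 2/3, as in the tables above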


HGPA

• Partitions the hypergraph generated by the examples and their clusterings
• The indicator matrix is partitioned into k clusters of approximately the same size
• The HMETIS hypergraph partitioning algorithm is used
• Linear in the number of examples, O(nkr)

MCLA

• Group and collapse hyperedges and assign the objects to the hyperedge in which they participate the most

• Algorithm:
  1. Build a meta-graph with the hyperedges as vertices (edges have the vertices' similarities as weights, Jaccard)

2. Partition the hyperdeges into k metaclusters

3. Collapse the hyperedges of each metacluster

4. Assign examples to their most associated metacluster

• Linear in the number of examples, O(nk²r²)

URL - Spring 2020 - MAI 15/22


MCLA: Metagraph

[Figure: meta-graph built from the indicator matrix of the example — one vertex per cluster (C1,1, C1,2, C2,1, C2,2, C2,3, C3,1, C3,2), with edges weighted by the Jaccard similarity of the clusters]

URL - Spring 2020 - MAI 16/22

Information Theory

• Information theory measures are used to assess the similarity among the clusters of the partitions

• For instance:
  • Normalized Mutual Information
  • Category Utility

• The labels can be transformed into a new set of features for each example (measuring example coincidence)

• The new features can be used to partition the examples using a clustering algorithm

URL - Spring 2020 - MAI 17/22


Finite Mixture Models

• The problem is transformed into the estimation of the probability of assignment

• The mixture is composed of a product of multinomial distributions, one for each clustering

• Each example is described by the set of assignments of each clustering

• An EM algorithm is used to find the probability distribution that maximizes the agreement

URL - Spring 2020 - MAI 18/22

Median partition based methods


Median Partition Methods

• Given a set of partitions (P) and a similarity function among partitions Γ(Pi, Pj), the Median Partition Pc is the one that maximizes the similarity to the set

  Pc = arg max_{P ∈ P_X} Σ_{Pi ∈ P} Γ(P, Pi)

• It has been proven to be an NP-hard problem for some similarity functions Γ

URL - Spring 2020 - MAI 19/22

Similarity functions

• Based on the agreements and disagreements of pairs of examples between two partitions
  • Rand index, Jaccard coefficient, Mirkin distance (and their randomness-adjusted versions)

• Based on set matching
  • Purity, F-measure

• Based on information theory measures (how much information two partitions share)
  • NMI, Variation of Information, V-measure

URL - Spring 2020 - MAI 20/22


Strategies

• Best of k (the partition of the set that minimizes the distance)

• Optimization using local search: Hill Climbing, Simulated Annealing, Genetic Algorithms
  • Perform a movement of examples between two clusters of the current solution to improve the partition

• Non-Negative Matrix Factorization
  • Find the partition matrix closest to the averaged association matrix of a set of partitions

URL - Spring 2020 - MAI 21/22

Python Notebooks

This Python Notebook has examples of consensus clustering

• Consensus clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 22/22


7 Clustering Structured Data

163


Clustering of structured data

Javier Béjar

URL - Spring 2020

CS - MAI

Introduction


Clustering of structured data

• There are some domains where patterns are more complex

• In these domains examples are related to each other

• Mining these relationships is more interesting than obtaining patterns from the examples individually

• For instance:

• Temporal domains

• Relational databases

• Structured instances (trees, graphs)

• Usually these domains need specific methods

URL - Spring 2020 - MAI 1/41

Sequences


Clustering of sequences

• Data have a sequential relationship among examples

• We can have a unique sequence or a set of sequences

• The classical techniques from time series analysis do not apply (AR, ARIMA, GARCH, Kalman filter, ...)

• What makes these data different?

• Usually qualitative data

• Very short series or long series that have to be segmented

• Interest in the relationships among series

• Interest only in a part of the series (episodes, anomalies, novelty, ...)

URL - Spring 2020 - MAI 2/41

Clustering Sequences

• Clustering of temporal series: clustering algorithms applied to a set of short series

• How to segment a unique series into a set of series? What parts are interesting?

• Representation of the series, representation of the groups

• New distance/similarity measures (scale invariant, shape distances, ...)

URL - Spring 2020 - MAI 3/41


Clustering Sequences - Segmentation

• A unique series is provided and we must divide it into a set of subseries (series segmentation)

• Extract subseries using a sliding window
  • Width of the window
  • Overlapping/non-overlapping windows

• Only some parts of the series are the target (episodes)

• Anomaly Detection

• Change Detection

• Be careful with unbalanced datasets

URL - Spring 2020 - MAI 4/41

Clustering Sequences - Segmentation

[Figure: a sliding window W1 ... Wn moving over a series T, and episodes Ep1 ... Ep6 marked on a second series]

URL - Spring 2020 - MAI 5/41


Clustering Sequences - Feature Extraction

• Raw time series are not always the best input

• Feature extraction: Generate informative features

• Frequency/Time domain features (Fourier, Wavelets, ...)

• Extreme points (maximum, minima, inflection points)

• Probabilistic models (Hidden Markov Models, ARIMA)

• Symbolic representation: SAX, SFA

URL - Spring 2020 - MAI 6/41

Clustering Sequences - Feature Extraction

[Figure: an original series and three derived representations — discrete Fourier coefficients, discrete wavelet coefficients, 1-lag series]

URL - Spring 2020 - MAI 7/41


Symbolic Aggregate approXimation (SAX)

• Transforms a time series into a set of discrete symbols

• Data are discretized to strings of length M with a vocabulary of size N

• Algorithm:
  • Standardize the series (N(0, 1))
  • For each of the M subwindows compute its mean
  • Map each mean to a Gaussian distribution discretized into N segments of equal frequency

• Transformed data can be used as a string or as an integer-valued series

URL - Spring 2020 - MAI 8/41

SAX

[Figure: a standardized series divided into six windows and discretized with a three-symbol alphabet {a, b, c}, producing the string "abcaab"]

URL - Spring 2020 - MAI 9/41
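A minimal sketch of the SAX transformation, assuming a 1-D NumPy series whose length is divisible by M and using SciPy for the Gaussian breakpoints:

import numpy as np
from scipy.stats import norm

def sax(series, M, N):
    # discretize a series into a string of M symbols over an alphabet of size N
    x = (series - series.mean()) / series.std()             # standardize to N(0, 1)
    paa = x.reshape(M, -1).mean(axis=1)                     # mean of each of the M subwindows
    breakpoints = norm.ppf(np.linspace(0, 1, N + 1)[1:-1])  # equal-frequency Gaussian segments
    symbols = np.searchsorted(breakpoints, paa)             # segment index of each mean
    return "".join(chr(ord("a") + s) for s in symbols)

print(sax(np.sin(np.linspace(0, 6, 60)), M=6, N=3))         # prints a 6-symbol string over {a, b, c}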


Clustering Sequences - Distance Functions

• Usual distance functions ignore the time dynamics
  • Euclidean, Hamming, ...

• Patterns in series contain noise, time/amplitude scaling, translations
  • Dynamic Time Warping (DTW)
  • Longest Common Subsequence (LCSS)
  • Edit Distance with Real Penalty (ERP)
  • Edit Distance on Real Sequence (EDR)
  • Spatial Assembly Distance (SpADe)

Dynamic Time Warping (DTW)

• Matches the dynamic of the series

• Series can be of different lengths

• The cost of matching two points is their distance (e.g. Euclidean)

• A point from one series can be matched to multiple points of the other

• DTW is the minimum cost of the possible matchings

• Computing the distance is O(n²), but it can be reduced by limiting the number of points that can match a point

URL - Spring 2020 - MAI 11/41


Clustering Sequences - Distance Functions - DTW

URL - Spring 2020 - MAI 12/41
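A minimal O(n·m) dynamic programming sketch of DTW between two 1-D series, using the absolute difference as local cost and no warping-window constraint:

import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))   # 0.0: the second series is a time-warped copy of the first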

Clustering of Data Streams


Clustering of Data Streams

Data streams: Modeling an on-line continuous series of data

• Each item of the series is an example (one value, a vector of values, structured data)
  • For instance, sensory data (one or multiple synchronized signals), streams of documents (twitter/news)

• Data are generated from a set of clusters (stable or changing over time)
  • For instance, states from a process or semantic topics

URL - Spring 2020 - MAI 13/41

Clustering of Data Streams

• Data are processed incrementally (the model changes with time)
  • Only the current model
  • Periodic snapshots

• Different goals:

• Model the domain

• Detect anomalies/novelty/bursts

• Detect change (Concept drift)

URL - Spring 2020 - MAI 14/41


Clustering of Data Streams - Elements

• Clustering has an on-line and an off-line phase

• Elements:
  • The data structure used to summarize the data
  • The window model used to decide the influence of the current and past data

• The mechanism for identifying outliers

• The clustering algorithm used to obtain the partition of thedata

URL - Spring 2020 - MAI 15/41

Clustering of Data Streams - Summary data structure

• Involved in the on-line phase

• Data are summarized using sufficient statistics (number of examples, sum of values, sum of squared products of values, ...)

• Usually a hierarchical data structure (different levels of granularity)

• Indexing structure that can be updated incrementally

• Stores raw data or prototypes depending on space constraints

URL - Spring 2020 - MAI 16/41


Clustering of Data Streams - Window model

• Sliding window model
  • Fixed time window
  • Only data inside the window update the structure

• Damped window model
  • A weight is associated to examples and clusters
  • The influence of data depends on time; old data fade away or are discarded

• Landmark window model
  • Defines points of interest in time or amount of data
  • Data before the landmark are discarded

URL - Spring 2020 - MAI 17/41

Clustering of Data Streams - Outliers

• Difficult task because data evolve with time

• Most methods work around the idea of microclusters

• A microcluster represents a dense area in the space of examples

• The indexing structure tracks the evolution of the microclusters

• Different thresholds determine if a microcluster is kept or discarded

URL - Spring 2020 - MAI 18/41


CluStream - Prototype Based

Aggarwal et al. On Clustering Massive Data Streams: A Summarization Paradigm Data Streams - Models and Algorithms, Springer, 2007, 31, 9-38

• On-line phase:
  • Maintains microclusters (more than the final number of clusters)
  • New data are incorporated to a microcluster or generate new microclusters
  • The number of microclusters is fixed; they are merged to maintain the number
  • Periodically the microclusters are stored

• Off-line phase:
  • Given a time window, the stored microclusters are used to compute the microclusters inside the time frame
  • K-means is used to compute the clusters for the time window

DenStream - Density Based

Cao, Ester, Qian, Zhou Density-Based Clustering over an Evolving Data Stream with Noise Proceedings of the Sixth SIAM International Conference on Data Mining, 2006

• On-line phase:
  • Core-micro-clusters (a weighted sum of close points)
  • The weight of a point fades exponentially with time (damped window model)
  • New examples are merged and micro-clusters are classified as:
    • core-mc, sets of points with weight over a threshold
    • potential-mc
    • outlier-mc, sets of points with weight below a threshold
  • outlier-mc disappear with time

• Off-line phase: modified version of DBSCAN

URL - Spring 2020 - MAI 20/41


Python Notebooks

This Python Notebook has examples of time series clustering

• Time Series clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 21/41

Graph mining


Mining of Structures

• A lot of information has a relational structure

• Methods and models used for unstructured data are not expressive enough

• Sometimes the structure can be flattened, but lots of interesting information is lost

• Relational database ⇒ unique merged table

• Attributes representing relations ⇒ inapplicable attributes

• Graph data ⇒ strings based on graph traversal

• Documents ⇒ bag of words

URL - Spring 2020 - MAI 22/41

Mining of Structures (WWW/Social networks)

URL - Spring 2020 - MAI 23/41


Mining of Structures (XML documents/Text)

URL - Spring 2020 - MAI 24/41

Mining of Structures (Chemical compounds/Gene interactions)

URL - Spring 2020 - MAI 25/41


Mining of Structures

• All these types of data have in common that they can be represented using graphs and trees

• Historically we can find different approaches to the discovery of patterns in graphs/trees:

• Inductive logic programming: structure is represented using logic formulas

• Graph algorithms

• Classic algorithms for detecting dense subgraphs (cliques)

• Graph isomorphism algorithms

• Graph partitioning algorithms

URL - Spring 2020 - MAI 26/41

Mining of Structures: Computational issues

• Most of the problems used to discover structures in graphs are NP-Hard

• Graph partitioning (Not for bi-partitioning)

• Graph isomorphism

• Two different problems:

• Clustering large graphs (only one structure) ⇒ Partitioning

• Clustering sets of graphs ⇒ common substructures

URL - Spring 2020 - MAI 27/41


Clustering Large Graphs

• Some information can be described as a large graph (several instances connected by different relations)

• For instance: social networks, web pages

• We want to discover interesting substructures by:

• Dividing the graph into subgraphs (k-way partitioning, node clustering)

• Extracting dense substructures

URL - Spring 2020 - MAI 28/41

Graph partitioning (2-way)

• The simplest partitioning of a graph is to divide it into two subgraphs

• We assume that edges have values as labels (similarity, distance, ...)

• This problem is the minimum-cut problem:

  "Given a graph, divide the set of nodes in two groups so that the cost of the edges connecting the nodes between the groups is minimum"

• This problem is related to the maximum flow problem, which can be solved in polynomial time

URL - Spring 2020 - MAI 29/41


Graph partitioning (2-way) - Karger’s Algorithm

• Randomized algorithm that approximates the min cut of an undirected graph

• Computational cost O(|E|)
  • Has to be repeated O(|V| log |E|) times to have a high probability of finding the global minimum

• Algorithm:
  1. Pick an edge at random and join its vertices, reconnecting the remaining edges to the new vertex
  2. Repeat until only two vertices remain

URL - Spring 2020 - MAI 30/41
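A minimal sketch of Karger's contraction on an unweighted, undirected graph given as an edge list (the toy graph and the repetition count are only illustrative; repeating the run and keeping the smallest cut is what gives the probabilistic guarantee):

import random

def karger_cut(edges, n_vertices, rng=None):
    rng = rng or random.Random()
    parent = list(range(n_vertices))          # union-find over contracted super-vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = n_vertices
    while remaining > 2:
        u, v = edges[rng.randrange(len(edges))]   # pick a random edge
        ru, rv = find(u), find(v)
        if ru != rv:                              # contract it (self-loops are ignored)
            parent[ru] = rv
            remaining -= 1
    # the cut is the set of original edges whose endpoints ended in different super-vertices
    return sum(1 for u, v in edges if find(u) != find(v))

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
print(min(karger_cut(edges, 5) for _ in range(50)))       # the minimum cut of this graph is 2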

Graph partitioning (k-way)

• The general problem is NP-hard

• It can be solved approximately by local search algorithms (hill climbing, simulated annealing)

• Kernighan-Lin Algorithm:

1. Start with a random cut of the graph (k clusters)

2. Interchange a pair of nodes from different partitions that reduces the cut

3. Iterate until no improvement

• Different variations of this algorithm change the strategy for selecting the pair of nodes to interchange

URL - Spring 2020 - MAI 31/41
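networkx ships a two-way (bisection) variant of this scheme; a minimal usage sketch on a graph made of two cliques joined by a single edge:

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.barbell_graph(5, 0)                # two 5-cliques joined by one edge
part_a, part_b = kernighan_lin_bisection(G, seed=42)
print(sorted(part_a), sorted(part_b))     # the two cliques should end up in different halves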


Classical Clustering algorithms

Classical clustering algorithms can be adapted to obtain a graph partition

• K-means and K-medoids variations
  • Nodes of the graph as prototypes
  • Objective functions to define node membership to clusters (geodesic distance)
  • Network structure indices

• Spectral Clustering
  • Define the Laplacian matrix from the graph
  • Perform the eigendecomposition
  • The largest eigenvalues determine the number of clusters

URL - Spring 2020 - MAI 32/41
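A minimal sketch of spectral clustering applied to a graph through its adjacency matrix, assuming scikit-learn and networkx are available; affinity='precomputed' tells the algorithm to treat the matrix entries as similarities.

import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.planted_partition_graph(3, 20, p_in=0.5, p_out=0.02, seed=1)   # 3 planted communities
A = nx.to_numpy_array(G)
labels = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0).fit_predict(A)
print(labels)     # nodes of the same planted block should share a label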

Social Networks - Community Discovery

• Graph partitioning corresponds to the problem of Community Discovery in the area of Social Network Analysis

• Based on graph measures detecting dense connected areas

• Edge betweenness centrality:

  B(e) = NumConstrainedPaths(e, i, j) / NumShortPaths(i, j)

  (the fraction of shortest paths between i and j that pass through the edge e)

• Random Walk Betweenness: compute how often a random walk starting on node i passes through node j

• Modularity: percentage of edges within communities compared with the expected number if they were not a community


Social Networks - Girvan Newman

• Girvan-Newman Algorithm (Betweenness)

1. Rank edges by B(e)

2. Delete edge with the highest score

3. Iterate until a specific criterion holds (e.g. number of components)

URL - Spring 2020 - MAI 34/41

Social Networks - Girvan Newman

URL - Spring 2020 - MAI 35/41
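networkx exposes Girvan-Newman as a generator of successively finer partitions; a minimal usage sketch on Zachary's karate club graph:

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
partitions = girvan_newman(G)             # repeatedly removes the highest-betweenness edge
for k, partition in zip(range(2, 5), partitions):
    print(k, [len(c) for c in partition]) # community sizes for 2, 3 and 4 components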


Social Networks - Louvain

• Louvain Algorithm (Modularity)

1. Begin with a community for each node of the graph

2. Repeat until no change:
   • For each node i in the graph and for all neighbors j of i, consider the effect on the modularity of changing the community of i to the community of j. Change the node if the modularity increases

3. Build a new graph where the new nodes are the communities and the weights of the edges connecting communities are the sum of the edges among the nodes in the original graph

4. Repeat from 2 until no changes

URL - Spring 2020 - MAI 36/41

Social Networks - Louvain

[Figure: Louvain iterations — modularity optimization followed by community aggregation, repeated over two steps; edge weights between aggregated communities are the sums of the original edge weights]

URL - Spring 2020 - MAI 37/41
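Recent versions of networkx (2.8 and later) include a Louvain implementation; a minimal usage sketch:

import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()
communities = louvain_communities(G, seed=0)     # list of sets of nodes
print(len(communities), modularity(G, communities))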


Mining Sets of Graphs

• Some information can be described as a collection of graphs

• For example: XML documents, chemical molecules

• We look at a graph as a complex object

• We have to adapt the elements of clustering algorithms to these objects:

• Distance measures to compare graphs

• Summarization of graphs as prototypes

URL - Spring 2020 - MAI 38/41

Graph Edit distance

• Edit distance can be adapted to graphs

• Define add/delete/substitute costs for edges and vertices

• Different costs lead to different functions (which sometimes do not hold the distance properties)

• The distance is the minimum cost sequence of edit operations that transforms one graph into another

[Figure: edit path from G1 to G2 — delete vertex, substitute vertex, substitute vertex]
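networkx provides an (exponential-time) graph edit distance with unit insertion/deletion costs by default; a minimal usage sketch on two small graphs:

import networkx as nx

G1 = nx.cycle_graph(4)                  # a 4-cycle
G2 = nx.path_graph(4)                   # a path on 4 nodes
print(nx.graph_edit_distance(G1, G2))   # 1.0: deleting one edge of the cycle yields the path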


Graph Kernels

• Specific graph kernels can be used to embed the data in a metric space

• A base similarity or distance function can be used (like graph edit distance)

• Diffusion kernels (extend similarity to closest neighbors)

• Walk kernels (computing the similarity of traversal paths)

• RBF kernels

URL - Spring 2020 - MAI 40/41

Python Notebooks

This Python Notebook has examples of community discovery using geolocation information from Twitter for London, Paris and Barcelona

• Dense Subgraphs Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 41/41


8 Semisupervised Clustering

187


Semi-supervised Clustering

Javier Béjar

URL - Spring 2020

CS - MAI

Semisupervised Clustering


Semi-supervised Clustering

• Sometimes we have some information available about the dataset we are analyzing in an unsupervised fashion

• It could be interesting to incorporate this information into the clustering process in order to:
  • Bias the search of the algorithm toward the solutions more consistent with our knowledge
  • Improve the quality of the result, reducing the algorithm's natural bias (predictivity/stability)

URL - Spring 2020 - MAI 1/17

Semi-supervised Clustering

• The information that we have available can be of different kinds:
  • Sets of labeled instances
  • Constraints among certain instances: instances that have to be in the same group / instances that cannot belong to the same group
  • General information about the properties that the instances of a group must hold

URL - Spring 2020 - MAI 2/17


How to use supervised information

• It will depend on the model we can obtain:

1. Begin with a prior model that changes how the search is performed

2. Bias the search, pruning the models that are not consistent with the semisupervised knowledge

3. Modify the similarity among instances to match the constraints imposed by the prior knowledge

URL - Spring 2020 - MAI 3/17

Semisupervised Clustering / Labeled Examples


Semi supervised clustering using labeled examples

• Assuming that we have some labeled examples, these can be used to obtain an initial model

• We only need to know which cluster each labeled example belongs to; the actual cluster descriptions are not needed

• The clustering process can begin from this model, using it as the starting point of the search

• This initial model changes the search and biases the final model

URL - Spring 2020 - MAI 4/17

Semi supervised clustering using labeled examples

Basu, Banerjee, Mooney Semi-supervised clustering by seeding ICML 2002

• Algorithm based on K-means
• The usual initialization of K-means is to randomly select the initial prototypes
• Two alternatives:
  • Use the labeled examples to build the initial prototypes (seeding)
  • Use the labeled examples to build the initial prototypes and constrain the model so the labeled examples always stay in their initial clusters (seed and constrain)
• The initial prototypes give an initial probability distribution for the clustering

URL - Spring 2020 - MAI 5/17


Semi supervised clustering using labeled examples

URL - Spring 2020 - MAI 6/17

Seeded-KMeans

Algorithm: Seeded-KMeans

Input: The dataset X, the number of clusters K, a set S of labeled instances (k groups)
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the labeled instances
    repeat
        Assign each example from X to its nearest prototype µi
        Recompute the prototype µi with the examples assigned
    until Convergence
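A minimal NumPy sketch of Seeded-KMeans, assuming the seed labels are integers 0..K−1 and that every cluster has at least one seed; the seeds are only used to initialize the prototypes. The constrained variant of the next slide would additionally reset the labels of the seed examples to their given classes before recomputing the prototypes.

import numpy as np

def seeded_kmeans(X, seeds_X, seeds_y, k, iterations=100):
    # initial prototypes: mean of the labeled (seed) examples of each class
    centers = np.vstack([seeds_X[seeds_y == c].mean(axis=0) for c in range(k)])
    for _ in range(iterations):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.vstack([X[labels == c].mean(axis=0) if np.any(labels == c)
                                 else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):     # convergence: prototypes stop moving
            break
        centers = new_centers
    return labels, centers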


Constrained-KMeans

Algorithm: Constrained-KMeans

Input: The dataset X, the number of clusters K, a set S of labeled instances (k groups)
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the labeled instances
    repeat
        Maintain the examples from S in their initial classes
        Assign each example from X to its nearest prototype µi
        Recompute the prototype µi with the examples assigned
    until Convergence

Semisupervised Clustering / Constraints


Semi supervised clustering using constraints

• Having labeled examples means that the number of clusters and something about the characteristics of the data are known

• Sometimes it is easier to have information about whether two examples belong to the same or to different clusters

• This information can be expressed by means of constraints among examples: must-links and cannot-links

• This information can be used to bias the search and only look for models that maintain these constraints

URL - Spring 2020 - MAI 9/17

Semi supervised clustering using constraints

URL - Spring 2020 - MAI 10/17


Semi supervised clustering using constraints

Basu, Bilenko, Mooney A probabilistic framework for semi-supervised clustering ICML 2002

• Algorithm based on K-means (spherical clusters based on prototypes)

• A set of must-link and cannot-link constraints is defined over a subset of examples

• The quality function of the K-means algorithm is modified to bias the search

• A hidden Markov random field is defined using the constraints

URL - Spring 2020 - MAI 11/17

Semi supervised clustering using constraints

• The labels of the examples can be used to define a Markov random field

• The must-links and cannot-links define the dependence among the variables

• The clustering of the examples has to maximize the probability of the hidden Markov random field

[Figure: data points and the hidden MRF induced by the must-link and cannot-link constraints]

URL - Spring 2020 - MAI 12/17


Semi supervised clustering using constraints

• A new objective function for the K-Means is defined

• The main idea is to introduce a penalty term in the objective function that:
  • Penalizes clusterings that put examples with must-links in different clusters
  • Penalizes clusterings that put examples with cannot-links in the same cluster

• This penalty has to be proportional to the distance among the instances

URL - Spring 2020 - MAI 13/17

HMRF-KMeans

Algorithm: HMRF-KMeans

Input: The data X, the number of clusters K, the must and cannot links, a distance function D and weights for violating the constraints
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the constraints
    repeat
        E-step: Reassign the labels of the examples using the prototypes (µi) to minimize Jobj
        M-step: Given the cluster labels, recalculate the cluster centroids to minimize Jobj
    until Convergence


Semisupervised Clustering / Distance Learning

Semi supervised clustering with Distance Learning

• Another approach consists of learning a more adequate distance function to fulfill the constraints

• The constraints are used as a guide to find a distance matrix that represents the relations among the examples

• The problem can be defined as an optimization problem over the distances among examples with respect to the constraints

• These methods are related to kernel methods: the goal is to learn a kernel matrix that represents a new space where the instances have appropriate distances

URL - Spring 2020 - MAI 15/17


Semi supervised clustering with Distance Learning

• Relevant Component Analysis [Yeung, Chang (2006)] (optimization of a linear combination of distance kernels generated by must and cannot links)

• Optimization with the Graph Spectral Matrix, maintaining the constraints in the new space and the structure of the original space

• Learning of Mahalanobis distances: separate or bring closer the different dimensions to match the constraints

• Kernel Clustering (Kernel K-means) with kernel matrix learning via regularization

URL - Spring 2020 - MAI 16/17

Semi supervised clustering with Distance Learning

[Figure: effect of distance learning on the clustering — the original data and three successive iterations]

URL - Spring 2020 - MAI 17/17