
Unsupervised Machine Learning

(Course Slides)

URL

Master in Artificial Intelligence

Javier Béjar

(2020 Spring Semester)


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License.

To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.0/ or send a letter to:

Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.


Contents

1 Introduction to Knowledge Discovery

2 Data Preprocessing

3 Data Clustering

4 Cluster Validation

5 Clustering of Large Datasets

6 Consensus Clustering

7 Clustering Structured Data

8 Semisupervised Clustering


Preface

These are the slides for the first half of the course Unsupervised and Reinforcement Learning (URL) from the Master in Artificial Intelligence of the Barcelona Computer Science School (Facultat d'Informàtica de Barcelona), Technical University of Catalonia (UPC, BarcelonaTech).

These slides are used in class to present the topics of the course and have been prepared using the papers and book references that you can find on the course website (http://www.cs.upc.edu/~bejar/URL/URL.html).

This document is a complement so you can prepare the classes and use it as a reference, but it is not a substitute for the classes or the material from the webpage of the course.

Javier Béjar
Barcelona, 2020


1 Introduction to Knowledge Discovery


Knowledge Discovery

Javier Béjar

URL - Spring 2020

CS - MAI

Knowledge Discovery (KDD)


Knowledge Discovery in Databases (KDD)

• Practical application of the methodologies from machine learning/statistics to large amounts of data

• Problem: the impossible task of manually analyzing all the data we are systematically collecting

• Useful for automating/helping the process of analysis/discovery

• Final goal: to extract (semi)automatically actionable/useful knowledge

“We are drowning in information and starving for knowledge”

URL - Spring 2020 - MAI 1/20

Knowledge Discovery in Databases

• The high point of KDD starts around the late 1990s

• Many companies show their interest in obtaining the (possibly) valuable information stored in their databases (purchase transactions, e-commerce, web data, ...)

• The area has moved/integrated/transmuted several times to include several sometimes interchangeable terms: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data ...

• The Venn Diagram Wars

URL - Spring 2020 - MAI 2/20


Venn Wars: The funny one

Venn Wars: The job description one


Venn Wars: The scientific one

Venn Wars: The multidisciplinary one


Venn Wars: The reality check one

KDD definitions

“It is the search for valuable information in great volumes of data”

“It is the exploration and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules”

“It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

URL - Spring 2020 - MAI 8/20


Elements of KDD

Pattern: Any representation formalism capable of describing the common characteristics of data

Valid: A pattern is valid if it is able to predict the behaviour of new information with a degree of certainty

Novelty: Any knowledge that is not already known with respect to the domain knowledge and any previously discovered knowledge is novel

URL - Spring 2020 - MAI 9/20

Elements of KDD

Useful: New knowledge is useful if it allows performing actions that yield some benefit given an established criterion

Understandable: The knowledge discovered must be analyzed by an expert in the domain; consequently, the interpretability of the result is important

URL - Spring 2020 - MAI 10/20


The KDD process

KDD as a process

• The actual discovery of patterns is only one part of a more complex process

• Raw data is not always ready for processing (80/20 project effort)

• Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA)

• These methodologies address KDD as an engineering process; despite being business oriented, they are general enough to be applied to any data discovery domain

URL - Spring 2020 - MAI 11/20


The KDD process

• Steps of the Knowledge Discovery in Databases process:
  1. Domain study
  2. Creating the dataset
  3. Data preprocessing
  4. Dimensionality reduction
  5. Selection of the discovery goal
  6. Selection of the adequate methodologies
  7. Data Mining
  8. Result assessment and interpretation
  9. Using the knowledge

URL - Spring 2020 - MAI 12/20

Goals of the KDD process

There are different goals that can be pursued as the result of the discovery process, among them:

Classification We need models that allow discriminating instances that belong to a previously known set of groups (the model may or may not be interpretable)

Clustering/Partitioning/Segmentation We need to discover models that cluster the data into groups with common characteristics (a characterization of the groups is desirable)

URL - Spring 2020 - MAI 13/20


Goals of the KDD process

Regression We look for models that predict the behaviour of continuous variables as a function of others

Summarization We look for a compact description that summarizes the characteristics of the data

Causal dependence We need models that reveal the causal dependence among the variables and assess the strength of this dependence

URL - Spring 2020 - MAI 14/20

Goals of the KDD process

Structure dependence We need models that reveal patterns among the relations that describe the structure of the data

Change We need models that discover patterns in data that has temporal or spatial dependence

URL - Spring 2020 - MAI 15/20


Methodologies for KDD

• Decision trees, decision rules
  • Usually interpretable models
  • Can be used for: classification, regression, and summarization
  • Trees: C4.5, CART, QUEST; rules: RIPPER, CN2, ...

• Classifiers, regression
  • Low interpretability but good accuracy
  • Can be used for: classification and regression
  • Statistical regression, function approximation, neural networks, Support Vector Machines, k-NN, locally weighted regression, ...

URL - Spring 2020 - MAI 16/20

Methodologies for KDD

• Clustering
  • Partition datasets or discover groups
  • Can be used for: clustering, summarization
  • Statistical clustering, unsupervised machine learning, unsupervised neural networks (Self-Organizing Maps)

• Dependency models
  • Obtain models of the dependence relations (structural, causal, temporal) among attributes/instances
  • Can be used for: causal dependence discovery, temporal change, substructure discovery
  • Bayesian networks, association rules, Markov models, graph algorithms, ...

URL - Spring 2020 - MAI 17/20


Applications

Applications

• Business

• Customer segmentation, customer profiling, customer transaction data, customer churn

• Fraud detection

• Control/analysis of industrial processes

• E-commerce, On-line recommendation, Financial data...

• Web mining
  • Text mining, document search/organization

• Social networks analysis

• User behavior

URL - Spring 2020 - MAI 18/20


Applications

• Scientific applications
  • Medicine (patient data, MRI scans, ECG, EEG, ...)

• Pharmacology (drug discovery, screening, in-silico testing)

• Astronomy (astronomical body identification)

• Genetics (gene identification, DNA microarrays, bioinformatics)

• Satellite data (meteorology, astronomy, geological, ...)

• Large scientific experiments (CERN LHC, ITER)

URL - Spring 2020 - MAI 19/20

Challenges


Open problems

• Scalability (More data, more attributes)

• Overfitting (Patterns with low interest)

• Temporal data/relational data/structured data

• Methods for data cleaning (Missing data and noise)

• Pattern comprehensibility

• Use of domain knowledge

• Integration with other techniques (OLAP, Data Warehousing, Intelligent Decision Support Systems)

• Privacy

URL - Spring 2020 - MAI 20/20


2 Data Preprocessing


Data Preprocessing

Javier Béjar

URL - Spring 2020

CS - MAI

Introduction


Data representation

• Unstructured datasets:

• Examples described by a flat set of attributes: attribute-value matrix

• Structured datasets:
  • Individual examples described by attributes but with relations among them: sequences (time, spatial, ...), trees, graphs

• Sets of structured examples (sequences, graphs, trees)

URL - Spring 2020 - MAI 1/77

Unstructured data

• Only one table of observations

• Each example represents an instance of the problem

• Each instance is represented by a set of attributes (discrete, continuous)

   A     B    C   ...
   1    3.1   a   ...
   1    5.7   b   ...
   0   -2.2   b   ...
   1   -9.0   c   ...
   0    0.3   d   ...
   1    2.1   a   ...
  ...   ...  ...  ...

URL - Spring 2020 - MAI 2/77


Structured data

• One sequential relation among instances (time, strings)
  • Several instances with internal structure
  • Subsequences of unstructured instances
  • One large instance

• Several relations among instances (graphs, trees)
  • Several instances with internal structure
  • One large instance

Data Streams

• Endless sequence of data
• Several streams synchronized
• Unstructured instances
• Structured instances
• Static/Dynamic model

URL - Spring 2020 - MAI 4/77


Data representation

• Most unsupervised learning algorithms are specifically fitted for unstructured data

• The data representation is equivalent to a database table (attribute-value pairs)

• Specialized algorithms have been developed for structured data: graph clustering, sequence mining, frequent substructures

• The representation of these types of data is sometimes algorithm dependent

URL - Spring 2020 - MAI 5/77

Data Preprocessing


Data preprocessing

• Usually raw data is not directly adequate for analysis

• The usual reasons:
  • The quality of the data (noise/missing values/outliers)
  • The dimensionality of the data (too many attributes/too many examples)

• The first step of any data task is to assess the quality of the data

• The techniques used for data preprocessing are usually oriented to unstructured data

URL - Spring 2020 - MAI 6/77

Outliers

• Outliers: Examples with extreme values compared to the rest of the data

• Can be considered as examples with erroneous values

• Have an important impact on some algorithms


Outliers

• The exceptional values could appear in all or only a few attributes

• The usual way to correct this problem is to eliminate the examples

• If the exceptional values are only in a few attributes, they could be treated as missing values

URL - Spring 2020 - MAI 8/77

Parametric Outliers Detection

• Assumes a probabilistic distribution for the attributes

• Univariate
  • Perform a Z-test or Student's t-test

• Multivariate
  • Deviation method: reduction in data variance when the example is eliminated
  • Angle based: variance of the angles to other examples
  • Distance based: variation of the distance from the mean of the data in different dimensions

URL - Spring 2020 - MAI 9/77


Non parametric Outliers Detection

• Histogram based: Define a multidimensional grid and discard cells with low density

• Distance based: The distances of outliers to their k-nearest neighbors are larger

• Density based: Approximate data density using Kernel Density Estimation or heuristic measures (Local Outlier Factor, LOF)

URL - Spring 2020 - MAI 10/77

Outliers: Local Outlier Factor

• LOF quantifies the outlierness of an example adjusting for variation in data density

• Uses the distance of the k-th neighbor $D_k(x)$ of an example and the set of examples that are inside this distance, $L_k(x)$

• The reachability distance between two data points, $R_k(x, y)$, is defined as the maximum between the distance $dist(x, y)$ and y's k-th neighbor distance $D_k(y)$

URL - Spring 2020 - MAI 11/77


Outliers: Local Outlier Factor

• The average reachability distance $AR_k(x)$ with respect to an example's neighborhood $L_k(x)$ is defined as the average of the reachabilities of the example to its neighbors

• The LOF of an example is computed as the mean ratio between $AR_k(x)$ and the average reachability of its k neighbors:

$LOF_k(x) = \frac{1}{k} \sum_{y \in L_k(x)} \frac{AR_k(x)}{AR_k(y)}$

• This value ranks all the examples

URL - Spring 2020 - MAI 12/77
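As a quick illustration (not part of the original slides), scikit-learn provides an LOF implementation; the toy data and the number of neighbors below are assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # dense cloud plus one far-away point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks the examples flagged as outliers
scores = -lof.negative_outlier_factor_   # larger score = more outlying (ranks the examples)
```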

Outliers: Local Outlier Factor

[Figure: LOF illustrated on a 2D dataset (two scatter plots)]


Missing values

• Missing values appear because of errors or omissions during the gathering of the data

• They can be substituted to increase the quality of the dataset (value imputation)
  • Global constant for all the values
  • Mean or mode of the attribute (global central tendency)
  • Mean or mode of the attribute but only of the k nearest examples (local central tendency)
  • Learn a model for the data (regression, bayesian) and use it to predict the values

• Problem: changes the statistical distribution of the data

URL - Spring 2020 - MAI 14/77
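A minimal sketch (not from the slides) of global and local central-tendency imputation with scikit-learn; the toy matrix is an assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # global central tendency
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # local central tendency (1 nearest neighbor)
```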

Missing values

[Figure: dataset with missing values vs. mean substitution vs. 1-neighbor substitution]

URL - Spring 2020 - MAI 15/77


Normalization

Normalizations are applied to quantitative attributes in order to eliminate the effect of having different scale measures

• Range normalization: Transform all the values of the attribute to a preestablished scale (e.g.: [0,1], [-1,1])

$\frac{x - x_{min}}{x_{max} - x_{min}}$

• Distribution normalization: Transform the data to a specific statistical distribution with preestablished parameters (e.g.: Gaussian $\mathcal{N}(0, 1)$)

$\frac{x - \mu_x}{\sigma_x}$
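A hedged sketch (added here, not in the slides) of both normalizations using scikit-learn scalers; the example matrix is made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

X_range = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # (x - min) / (max - min)
X_gauss = StandardScaler().fit_transform(X)                    # (x - mean) / std, approx. N(0, 1)
```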

Discretization

Discretization allows transforming quantitative attributes to qualitative attributes

• Equal size bins: Pick the number of values and divide the range of the data into equal sized bins

• Equal frequency bins: Pick the number of values and divide the range of the data so each bin has the same number of examples (the size of the intervals will be different)

URL - Spring 2020 - MAI 17/77


Discretization

Discretization allows transforming quantitative attributes to qualitative attributes

• Distribution approximation: Calculate a histogram of the data and fit a kernel function (KDE); the intervals are where the function has its minima

• Other techniques: Apply entropy based measures, Minimum Description Length (MDL), clustering

URL - Spring 2020 - MAI 18/77
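A small sketch (added, not in the slides) of equal-size and equal-frequency binning using pandas:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=1000))

equal_size = pd.cut(x, bins=4)   # equal-width intervals over the range of the data
equal_freq = pd.qcut(x, q=4)     # quantile-based intervals, same number of examples per bin
```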

Discretization

[Figure: discretization of the same attribute with equal-size bins, equal-frequency bins, and histogram/KDE-based intervals]

URL - Spring 2020 - MAI 19/77


Python Notebooks

These two Python Notebooks show some examples of the effect of missing values imputation and of data discretization and normalization

• Missing Values Notebook (click here to go to the url)

• Preprocessing Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 20/77

Dimensionality Reduction


The curse of dimensionality

• Problems due to the dimensionality of data

• The computational cost of processing the data

• The quality of the data

• Elements that define the dimensionality of data
  • The number of examples
  • The number of attributes

• Usually the problem of having too many examples can be solved using sampling

URL - Spring 2020 - MAI 21/77

Reducing attributes

• The number of attributes has an impact on the performance:
  • Poor scalability
  • Inability to cope with irrelevant/noisy/redundant attributes

• Methodologies to reduce the number of attributes:
  • Dimensionality reduction: transforming to a space with fewer dimensions
  • Feature subset selection: eliminating non-relevant attributes

URL - Spring 2020 - MAI 22/77


Dimensionality reduction

• New dataset that preserves most of the information of the original data but with fewer attributes

• Many techniques have been developed for this purpose
  • Projection to a space that preserves the statistical distribution of the data (PCA, ICA)
  • Projection to a space that preserves distances among the data (multidimensional scaling, random projection, nonlinear scaling)

URL - Spring 2020 - MAI 23/77

Principal Component Analysis

• Principal Component Analysis:

• Data is projected onto a set of orthogonal dimensions (components) that are a linear combination of the original attributes

• The components are uncorrelated and are ordered by the information they have

• We assume the data follows a Gaussian distribution

• Global variance is preserved

URL - Spring 2020 - MAI 24/77


Principal Component Analysis

Computes a projection matrix where the dimensions are orthogonal (linearly independent) and data variance is preserved

[Figure: 2D data in the X-Y plane projected onto two orthogonal directions, w1·Y + w2·X and w3·Y + w4·X]

URL - Spring 2020 - MAI 25/77

Principal Component Analysis

• Principal components: vectors that are the best linear approximation of the data

$f(\lambda) = \mu + V_q \lambda$

$\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix of $q$ orthogonal unit vectors and $\lambda$ is a $q$-vector of parameters

• The reconstruction error for the data is minimized:

$\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \| x_i - \mu - V_q \lambda_i \|^2$

URL - Spring 2020 - MAI 26/77


Principal Component Analysis - Computation

• Optimizing partially for $\mu$ and $\lambda_i$:

$\mu = \bar{x}$

$\lambda_i = V_q^T (x_i - \bar{x})$

• We can obtain the matrix $V_q$ by minimizing:

$\min_{V_q} \sum_{i=1}^{N} \| (x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x}) \|_2^2$

• Assuming $\bar{x} = 0$, we can obtain the projection matrix $H_q = V_q V_q^T$ by Singular Value Decomposition of the data matrix $X$:

$X = U D V^T$

URL - Spring 2020 - MAI 27/77

Principal Component Analysis - Computation

• $U$ is an $N \times p$ orthogonal matrix; its columns are the left singular vectors

• $D$ is a $p \times p$ diagonal matrix with ordered diagonal values called the singular values ($V$ is a $p \times p$ orthogonal matrix whose columns are the right singular vectors)

• The columns of $UD$ are the principal components

• The solution to the minimization problem is the first $q$ principal components

• The singular values are proportional to the reconstruction error

URL - Spring 2020 - MAI 28/77
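A compact sketch (added, not from the slides) of PCA through the SVD of the centered data matrix, using a random toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # toy data matrix (N x p)

Xc = X - X.mean(axis=0)                              # center the data (estimate of mu)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U D V^T

q = 2
components = Vt[:q]                                  # first q right singular vectors (V_q^T)
scores = U[:, :q] * d[:q]                            # columns of UD: the principal components
explained = d[:q]**2 / (d**2).sum()                  # variance ratio, from the singular values
```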


Principal Component Analysis - Intuition

[Figure: the original data in the X-Y plane]

[Figure: the first component lies along the direction of maximum variance of the data]

[Figure: the next component lies along the direction of maximum variance perpendicular to the other components]

Kernel PCA

• PCA is a linear transformation; this means that if the data is linearly separable, the reduced dataset will be linearly separable (given enough components)

• We can use the kernel trick to map the original attributes to a space where non linearly separable data is linearly separable

• Distances among examples are defined as a dot product that can be obtained using a kernel:

$d(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j)$

URL - Spring 2020 - MAI 32/77


Kernel PCA

• Different kernels can be used to perform the transformation to the feature space (polynomial, Gaussian, ...)

• The computation of the components is equivalent to PCA, but performing the eigendecomposition of the covariance matrix computed for the transformed examples

$C = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j) \Phi(x_j)^T$

• The components are linear combinations of features in the feature space

URL - Spring 2020 - MAI 33/77

Kernel PCA

• Pro: Helps to discover patterns that are non linearly separable in the original space

• Con: Does not give a weight/importance for the new components

URL - Spring 2020 - MAI 34/77
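An illustrative sketch (not in the slides) comparing linear PCA with RBF-kernel PCA on two concentric circles; the dataset and kernel parameters are assumptions:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                                 # circles remain nested
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)   # becomes linearly separable
```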


Kernel PCA

Sparse PCA

• PCA transforms data to a space of the same dimensionality (all eigenvalues are non zero)

• An alternative is to solve the minimization problem posed by the reconstruction error using regularization

• A penalization term is added to the objective function, proportional to the norm of the eigenvalues matrix

$\min_{U,V} \|X - UV\|_2^2 + \alpha \|V\|_1$

• The $\ell_1$-norm regularization will encourage sparse solutions (zero eigenvalues)

URL - Spring 2020 - MAI 36/77


Multidimensional Scaling

A transformation matrix [M×N] transforms a dataset from M dimensions to N dimensions preserving pairwise data distances

URL - Spring 2020 - MAI 37/77

Multidimensional Scaling

• Multidimensional Scaling: Projects the data to a space with fewer dimensions, preserving the pairwise distances among the data

• A projection matrix is obtained by optimizing a function of the pairwise distances (stress function)

• The actual attributes are not used in the transformation

• Different objective functions can be used (least squares, Sammon mapping, classical scaling, ...)

URL - Spring 2020 - MAI 38/77


Multidimensional Scaling

• Least Squares Multidimensional Scaling (MDS)

• The distortion is defined as the squared distance between the original distance matrix and the distance matrix of the new data

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (d_{ii'} - \|z_i - z_{i'}\|_2)^2$

• The problem is defined as:

$\arg\min_{z_1, z_2, \ldots, z_n} S_D(z_1, z_2, \ldots, z_n)$

URL - Spring 2020 - MAI 39/77

Multidimensional Scaling

• Several optimization strategies can be used

• If the distance matrix is Euclidean it can be solved using eigendecomposition, just like PCA

• In other cases gradient descent can be used, using the derivative of $S_D(z_1, z_2, \ldots, z_n)$ and a step size $\alpha$ in the following fashion:

  1. Begin with a guess for $Z$
  2. Repeat until convergence: $Z^{(k+1)} = Z^{(k)} - \alpha \nabla S_D(Z)$

URL - Spring 2020 - MAI 40/77
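A hedged scikit-learn sketch (added here): metric MDS run directly on a precomputed distance matrix, with the iris data as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data
D = pairwise_distances(X)          # only the pairwise distances are used, not the attributes

Z = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
```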


Multidimensional Scaling - Other functions

• Sammon mapping (emphasis on smaller distances)

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} \frac{(d_{ii'} - \|z_i - z_{i'}\|)^2}{d_{ii'}}$

• Classical scaling (similarity instead of distance)

$S_D(z_1, z_2, \ldots, z_n) = \sum_{i \neq i'} (s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z} \rangle)^2$

• Non metric MDS (assumes a ranking among the distances, non Euclidean space)

$S_D(z_1, z_2, \ldots, z_n) = \frac{\sum_{i,i'} [\theta(\|z_i - z_{i'}\|) - d_{ii'}]^2}{\sum_{i,i'} d_{i,i'}^2}$

Random Projection

• A random transformation matrix is generated:
  • Rectangular matrix $N \times d$
  • Columns must have unit length
  • Elements are generated from a Gaussian distribution

• A matrix generated this way is almost orthogonal

• The projection will preserve the relative distances among pairs of examples

• The Johnson-Lindenstrauss lemma allows picking a number of dimensions to obtain the desired approximation

URL - Spring 2020 - MAI 42/77
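A short sketch (not in the slides) using scikit-learn; the data size and the distortion level eps are arbitrary choices:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

X = np.random.default_rng(0).normal(size=(500, 10000))        # high-dimensional toy data

d = johnson_lindenstrauss_min_dim(n_samples=500, eps=0.2)     # dimensions needed for ~20% distortion
X_proj = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X)
```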


Nonnegative Matrix Factorization (NMF)

• This formulation assumes that the data is a sum of unknown positive latent variables

• NMF performs an approximation of a matrix as the product of two matrices

$V = W \times H$

• The main difference with PCA is that the values of the matrices are constrained to be positive

• The positiveness assumption helps to interpret the result
  • E.g.: in text mining, a document is an aggregation of topics

URL - Spring 2020 - MAI 43/77
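A minimal sketch (added) of the factorization with scikit-learn on a synthetic non-negative matrix; the sizes and number of components are assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.default_rng(0).normal(size=(100, 20)))   # non-negative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(V)        # V is approximated by W x H, with W, H >= 0
H = model.components_
```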

Nonlinear scaling

• The previous methods perform a linear transformation between the original space and the final space

• For some datasets this kind of transformation is not enough to maintain the information of the original data

• Nonlinear transformation methods:
  • ISOMAP

• Local Linear Embedding

• Local MDS

• t-SNE

URL - Spring 2020 - MAI 44/77


ISOMAP

• Assumes a low dimensional dataset embedded in a larger number of dimensions

• The geodesic distance is used instead of the Euclidean distance

• The relation of an instance with its immediate neighbors is more representative of the structure of the data

• The transformation generates a new space that preserves neighborhood relationships

URL - Spring 2020 - MAI 45/77

ISOMAP

URL - Spring 2020 - MAI 46/77


ISOMAP

[Figure: Euclidean vs. geodesic distance between two points a and b]

URL - Spring 2020 - MAI 47/77

ISOMAP - Algorithm

1. For each data point find its k closest neighbors (points at minimal Euclidean distance)

2. Build a graph where each point has an edge to its closest neighbors

3. Approximate the geodesic distance for each pair of points by the shortest path in the graph

4. Apply an MDS algorithm to the distance matrix of the graph

URL - Spring 2020 - MAI 48/77
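As a sketch (not part of the slides), scikit-learn's Isomap follows these same steps; the swiss-roll dataset and parameters are illustrative choices:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# k-NN graph + shortest-path geodesic distances + MDS, in one estimator
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```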


ISOMAP - Example

[Figure: points a and b in the original space and in the transformed space]

URL - Spring 2020 - MAI 49/77

Local Linear Embedding

• Performs a transformation that preserves local structure

• Assumes that each instance can be reconstructed by a linear combination of its neighbors (weights)

• From these weights, a new set of data points that preserves the reconstruction is computed for a lower dimensional space

• Different variants of the algorithm exist

URL - Spring 2020 - MAI 50/77


Local Linear Embedding

URL - Spring 2020 - MAI 51/77

Local Linear Embedding - Algorithm

1. For each data point find the $K$ nearest neighbors in the original space of dimension $p$ ($N(i)$)

2. Approximate each point by a mixture of the neighbors:

$\min_{w_{ik}} \|x_i - \sum_{k \in N(i)} w_{ik} x_k\|^2$

with $\sum_{k \in N(i)} w_{ik} = 1$ and $K < p$

3. Find points $y_i$ in a space of dimension $d < p$ that minimize:

$\sum_{i=1}^{N} \|y_i - \sum_{k \in N(i)} w_{ik} y_k\|^2$

URL - Spring 2020 - MAI 52/77
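A hedged scikit-learn sketch (added here); the dataset and neighborhood size are assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)   # low-dimensional points that preserve the local reconstruction weights
```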


Local MDS

• Performs a transformation that preserves the locality of closer points and puts farther away the non-neighbor points

• Given a set of pairs of points $\mathcal{N}$ where a pair $(i, i')$ belongs to the set if $i$ is among the $K$ neighbors of $i'$ or vice versa

• Minimize the function:

$S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\|$

• The parameter $\tau$ controls how much the non-neighbors are scattered

URL - Spring 2020 - MAI 53/77

t-SNE

• t-distributed Stochastic Neighbor Embedding (t-SNE)

• Used as a visualization tool

• Assumes distances define a probability distribution

• Obtains a low dimensional space with the closest distribution

• Tricky to use (see this link)

URL - Spring 2020 - MAI 54/77


t-SNE

• Distances from each example to the rest are scaled to sum one (so they form a probability distribution)

• We want to project the data so that we preserve this probability distribution in a lower dimensionality space

• Examples are distributed in the new space and their distance distributions are computed

• Examples are iteratively moved to minimize the Kullback-Leibler divergence between the distributions of the neighbour distances in the original and in the projected space

URL - Spring 2020 - MAI 55/77

t-SNE

Each example has a distance probability distribution


t-SNE

A similarity distribution is obtained

t-SNE

We distribute the data in a lower dimensionality space

URL - Spring 2020 - MAI 58/77


t-SNE

Data is moved so the similarity distributions get closer

URL - Spring 2020 - MAI 59/77
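A short illustrative sketch (not in the slides) with scikit-learn; the digits dataset and perplexity value are assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data

# 2D embedding that tries to match the neighbour-distance distributions of the original space
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
```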

Application: Wheel chair control characterization

• Wheelchair with shared control (patient/computer)

• Recorded trajectories of several patients in different situations
  • Angle/distance to the goal, angle/distance to the nearest obstacle from around the chair (210 degrees)

• Characterization about how the computer helps the patients with different handicaps

• Is there any structure in the trajectory data?

URL - Spring 2020 - MAI 60/77


Application: Wheel chair control characterization

URL - Spring 2020 - MAI 61/77

Application: Wheel chair control characterization

URL - Spring 2020 - MAI 62/77


Application: Wheel chair control (PCA)

URL - Spring 2020 - MAI 63/77

Application: Wheel chair control (SparsePCA)

URL - Spring 2020 - MAI 64/77


Application: Wheel chair control (MDS)

URL - Spring 2020 - MAI 65/77

Application: Wheel chair control (ISOMAPKn=3)

URL - Spring 2020 - MAI 66/77


Application: Wheel chair control (ISOMAPKn=10)

URL - Spring 2020 - MAI 67/77

Unsupervised Attribute Selection

• To eliminate from the dataset all the redundant or irrelevant attributes

• The original attributes are preserved

• Less developed than in Supervised Attribute Selection
  • Problem: an attribute can be relevant or not depending on the goal of the discovery process

• There are mainly two techniques for attribute selection: wrapping and filtering

URL - Spring 2020 - MAI 68/77


Attribute selection - Wrappers

• A model evaluates the relevance of subsets of attributes

• In supervised learning this is easy; in unsupervised learning it is very difficult

• Results depend on the chosen model and on how well this model captures the actual structure of the data

URL - Spring 2020 - MAI 69/77

Attribute selection - Wrapper Methods

• Clustering algorithms that compute weights for the attributes based on probability distributions

• Clustering algorithms with an objective function that penalizes the size of the model

• Consensus clustering

URL - Spring 2020 - MAI 70/77


Attribute selection - Filters

• A measure evaluates the relevance of each attribute individually

• These kinds of measures are difficult to obtain for unsupervised tasks

• The idea is to obtain a measure that evaluates the capacity of each attribute to reveal the structure of the data (e.g.: class separability, similarity of instances in the same class)

URL - Spring 2020 - MAI 71/77

Attribute selection - Filter Methods

• Measures of properties of the spatial structure of the data (entropy, PCA, Laplacian matrix)

• Measures of the relevance of the attributes with respect to the inherent structure of the data

• Measures of attribute correlation

URL - Spring 2020 - MAI 72/77


Laplacian Score

• The Laplacian Score is a filter method that ranks the features with respect to their ability to preserve the natural structure of the data

• This method uses the spectral matrix of the graph computed from the near neighbors of the examples

URL - Spring 2020 - MAI 73/77

Laplacian Score

• The similarity matrix is usually computed using a Gaussian kernel (edges not present have a value of 0)

$S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\sigma}}$

• The degree matrix $D$ is a diagonal matrix where the elements are the sums of the rows of $S$

• The Laplacian matrix is computed as

$L = D - S$

URL - Spring 2020 - MAI 74/77


Laplacian Score

• The score first computes, for each attribute $r$ with values $f_r$, the transformation $\tilde{f}_r$ as:

$\tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}$

• and then the score $L_r$ is computed as:

$L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}$

• This gives a ranking for the relevance of the attributes

URL - Spring 2020 - MAI 75/77
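A rough numpy sketch of the score (added here, not from the slides); the neighborhood size, kernel width and symmetrization choice are assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, sigma=1.0):
    # k-nearest-neighbor graph with Gaussian (heat kernel) weights
    W = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False).toarray()
    S = np.where(W > 0, np.exp(-W**2 / sigma), 0.0)
    S = np.maximum(S, S.T)                      # symmetrize the graph
    D = np.diag(S.sum(axis=1))                  # degree matrix
    L = D - S                                   # graph Laplacian
    ones = np.ones(X.shape[0])
    scores = []
    for r in range(X.shape[1]):
        fr = X[:, r]
        fr_t = fr - (fr @ D @ ones) / (ones @ D @ ones) * ones   # remove the weighted mean
        scores.append((fr_t @ L @ fr_t) / (fr_t @ D @ fr_t))
    return np.array(scores)                     # ranks the attributes (lower = more relevant)
```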

Python Notebooks

These two Python Notebooks show some examples of dimensionality reduction and feature selection

• Dimensionality reduction and feature selection Notebook (click here to go to the url)

• Linear and non linear dimensionality reduction Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 76/77


Python Code

• In the code from the repository, inside the subdirectory DimReduction, you have the Authors python script

• The code uses the datasets in the directory Data/authors
  • Auth1 has fragments of books that are novels or philosophy works
  • Auth2 has fragments of books written in English and books translated to English

• The code transforms the text to attribute vectors and applies different dimensionality reduction algorithms

• Modifying the code you can process one of the datasets and choose how the text is transformed into vectors

URL - Spring 2020 - MAI 77/77


3 Data Clustering


Unsupervised Learning

Javier Béjar

URL - Spring 2019

CS - MAI

Unsupervised Learning


Unsupervised Learning

• Learning can be done in a supervised or unsupervised way

• There is a strong bias in the machine learning community towards supervised learning

• But a lot of concepts are learned unsupervisedly

• The discovery of new concepts is always unsupervised

URL - Spring 2019 - MAI 1/94

Unsupervised Learning

• We assume that data is embedded in an N-dimensional space with a similarity/dissimilarity function

• Similarity defines how examples are related to each other

• Bias:
  • Examples are more related to the nearest examples than to the farthest
  • Patterns are compact groups that are maximally separated from each other

• Areas: statistics, machine learning, graph theory, fuzzy theory, physics

URL - Spring 2019 - MAI 2/94


Unsupervised Learning

• Discovery goals:
  • Summarization: to obtain representations that describe an unlabeled dataset
  • Understanding: to discover the concepts inside the data

• Difficult tasks because discovery is biased by context

• Different answers could be valid depending on the discovery goal or the domain

• There are few criteria to validate the results

• Representation of the clusters: unstructured (partitions) or relational (hierarchies)

Unsupervised Learning Algorithms - Strategies

Hierarchical algorithms

• Examples are organized as a binary tree

• Based on the relationships among examples defined by similarity/dissimilarity functions

• No explicit division into groups; it has to be chosen a posteriori

Partitional algorithms

• Only a partition of the dataset is obtained

• Based on the optimization of a criterion (assumptions about the characteristics of the cluster model)

URL - Spring 2019 - MAI 4/94


Hierarchical Algorithms

Hierarchical algorithms

• Based on graph theory

• The examples form a fully connected graph

• Similarity defines the length of the edges

• Clustering is decided using a connectivity criterion

• Based on matrix algebra

• A distance matrix is calculated from the examples

• Clustering is computed using the distance matrix

• The distance matrix is updated after each iteration (different updating criteria)

URL - Spring 2019 - MAI 5/94


Hierarchical algorithms

• Graphs
  • Single Linkage, Complete Linkage, MST
  • Divisive, Agglomerative

• Matrices
  • Johnson algorithm
  • Different update criteria (S-L, C-L, centroid, minimum variance)

Computational cost: from O(n_inst³ × n_dims) to O(n_inst² × n_dims)

URL - Spring 2019 - MAI 6/94

Agglomerative Graph Algorithm

Algorithm: Agglomerative graph algorithm

Compute the distance/similarity matrix
repeat
  Find the pair of examples with the smallest distance (highest similarity)
  Add an edge to the graph corresponding to this pair
  if the agglomeration criterion holds then
    Merge the clusters the pair belongs to
until only one cluster exists

• Single linkage = new edge is between two disconnected graphs
• Complete linkage = new edge creates a clique with all the nodes of both subgraphs

URL - Spring 2019 - MAI 7/94


Hierarchical algorithms - Graphs

      2   3   4   5
  1   6   8   2   7
  2       1   5   3
  3          10   9
  4               4

[Figure: the graphs and dendrograms produced by single link and complete link on the five examples of the distance matrix above]

URL - Spring 2019 - MAI 8/94

Agglomerative Johnson algorithm

Algorithm: Agglomerative Johnson algorithm

Compute the distance/similarity matrix
repeat
  Find the pair of groups/examples with the smallest distance (highest similarity)
  Merge the pair of groups/examples
  Delete the rows and columns corresponding to the pair
  Add a new row and column with the new distances for the new group
until the matrix has one element

• Single linkage = distance between the closest examples
• Complete linkage = distance between the farthest examples
• Average linkage = distance between group centroids

URL - Spring 2019 - MAI 9/94
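A minimal sketch (added here) of agglomerative clustering with SciPy; the toy data and the chosen linkage are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))      # toy dataset

Z = linkage(X, method="average")                       # also "single", "complete", "ward"
labels = fcluster(Z, t=3, criterion="maxclust")        # cut the tree a posteriori into 3 groups
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree with matplotlib
```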


Hierarchical algorithms - Matrices

      2   3   4   5
  1   6   8   2   7
  2       1   5   3
  3          10   9
  4               4

       2,3    4    5
  1     7     2    7
  2,3        7.5   6
  4                4

       1,4    5
  2,3  7.25   6
  1,4        5.5

       1,4,5
  2,3  6.725

URL - Spring 2019 - MAI 10/94

Hierarchical algorithms - Problems

• A partition of the data has to be decided a posteriori

• Some undesirable and strange behaviours could appear (chaining, inversions, breaking large clusters)

• Some algorithms have problems when clusters with different sizes or non-convex shapes appear in the data

• Dendrograms are not a practical representation for large amounts of data

• Computational cost is too high for large datasets
  • Time is O(n²) in the best case, O(n³) in general

URL - Spring 2019 - MAI 11/94


Hierarchical algorithms - Example

[Figure: an example dataset and the dendrograms obtained with single link, complete link, median, centroid, and Ward linkage]

Python Notebooks

This Python Notebook shows examples of using different hierarchical clustering algorithms

• Hierarchical Clustering Algorithms Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2019 - MAI 13/94


Other hierarchical algorithms

• Learning has an incremental nature (experience is acquired from continuous observation, not at once)

• Concepts are learned at the same time as their relationships (polythetic hierarchies of concepts)

• Learning is a search in the space of hierarchies

• An objective function measures the utility of the structure

• The updating of the structure is performed by a set of conceptual operators

• The result depends on the order of the examples

URL - Spring 2019 - MAI 14/94

Concept Formation - COBWEB

JH Gennari, P Langley, D Fisher, Models of incremental concept formation, Artificial Intelligence, 1989

• Based on ideas from cognitive psychology
• Learning is incremental
• Concepts are organized in a hierarchy
• Concepts are organized around a prototype and described probabilistically
• Hierarchical concept representation is modified via cognitive operators
• Builds a hierarchy top/down
• Four conceptual operators
• Heuristic measure to find the basic level (Category utility)

URL - Spring 2019 - MAI 15/94


Probabilistic hierarchy

[Figure: probabilistic concept hierarchy; each node stores P(C) and the conditional probabilities P(V|C) for the attributes Color (black, white) and Shape (triangle, square, circle)]

COBWEB - Category utility (CU)

• Category utility balances:
  • Intra-class similarity: $P(A_i = V_{ij} | C_k)$
  • Inter-class similarity: $P(C_k | A_i = V_{ij})$

• It measures the difference between a partition of the data and no partition at all

• For qualitative attributes and $K$ categories $\{C_1, \ldots, C_K\}$ it is defined as:

$CU = \frac{\sum_{k=1}^{K} P(C_k) \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i = V_{ij} | C_k)^2 - \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i = V_{ij})^2}{K}$

(see the full derivation in the paper)

URL - Spring 2019 - MAI 17/94


Operators

• Incorporate: put the example inside an existing class
• New class: create a new class at this level
• Merge: two concepts are merged and the example is incorporated inside the new class
• Divide: a concept is substituted by its children

[Figure: the merge and split operators applied between two consecutive levels of the hierarchy]

URL - Spring 2019 - MAI 18/94

COBWEB Algorithm

Procedure: Depth-first limited search COBWEB (x: example, H: hierarchy)

Update the father with the new example
if we are in a leaf then
  Create a new level with this example
else
  Compute CU of incorporating the example into each class
  Save the two best CU
  Compute CU of merging the best two classes
  Compute CU of splitting the best class
  Compute CU of creating a new class with the example
  Recursive call with the best choice


Partitional algorithms

Partitional algorithms

Finding the optimal partition of N objects into K groups is NP-hard, so we need approximate algorithms

• Model/prototype based algorithms (K-means, Gaussian Mixture Models, Fuzzy K-means, Leader algorithm, ...)

• Density based algorithms (DBSCAN, DENCLUE, ...)

• Grid based algorithms (STING, CLIQUE, ...)

• Graph theory based algorithms (Spectral Clustering, ...)

• Other approaches:
  • Affinity Clustering
  • Unsupervised Neural networks
  • SVM clustering

URL - Spring 2019 - MAI 20/94


Model/Prototype Clustering

K-means

• Our model is a set of k hyperspherical clusters

• An iterative algorithm assigns each example to one of K groups (K is a parameter)

• Optimization criterion: minimize the distance of each example to the centroid of its cluster (squared error)

$Distortion = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2$

• Optimization by a hill climbing/gradient descent search algorithm

• The algorithm converges to a local minimum

URL - Spring 2019 - MAI 21/94


K-means

Algorithm: K-means (X: examples, k: integer)

Generate k initial prototypes (e.g. the first k examples)
Assign each example to its nearest prototype
SumD = sum of squared distances examples-prototypes
repeat
  Recalculate prototypes
  Reassign each example to its nearest prototype
  SumI = SumD
  SumD = sum of squared distances examples-prototypes
until SumI - SumD < ε

URL - Spring 2019 - MAI 22/94
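A compact numpy sketch of the pseudocode above (added here; it uses random prototypes rather than the first k examples). scikit-learn's KMeans implements the same scheme with better initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, eps=1e-6, rng=np.random.default_rng(0)):
    prototypes = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = np.inf
    while True:
        d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)   # squared distances
        labels = d2.argmin(axis=1)                                     # nearest prototype
        sumd = d2[np.arange(len(X)), labels].sum()                     # distortion
        if prev - sumd < eps:
            return prototypes, labels
        prev = sumd
        for j in range(k):                                             # recalculate prototypes
            if np.any(labels == j):
                prototypes[j] = X[labels == j].mean(axis=0)
```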

K-means

[Figure: K-means iterations on a 2D dataset; the labels 1 and 2 show how the cluster assignments of the points change as the prototypes are updated]


K-means - Practical problems

• The algorithm is sensitive to initialization (running it from several random initializations is a common practice)

• Sensitive to clusters with different sizes/densities and to outliers

• Finding the value of k is not a trivial problem

• No guarantee about the quality of the solution

• A solution is found even when no hyperspherical clusters exist

• The spatial complexity makes it not suitable for large datasets

URL - Spring 2019 - MAI 24/94

K-means++ - Initialization Strategies

• K-means++ modifies the initialization strategy

• It tries to maximize distance among initial centers

• Algorithm:

1. Choose one center uniformly from among all the data

2. For each data point x, compute d(x, c), the distance between x and the nearest center already chosen

3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²

4. Repeat Steps 2 and 3 until k centers have been chosen

5. Proceed with the standard K-means algorithm
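A small numpy sketch of the seeding steps 1-4 (added here; scikit-learn's KMeans already uses init="k-means++" by default):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                     # step 1: first center chosen uniformly
    for _ in range(k - 1):
        # step 2: squared distance from each point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # step 3: sample the next center with probability proportional to d(x, c)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)                                # step 5: pass these to standard K-means
```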


Bisecting K-means

• Bisecting K-means iteratively splits one of the current clusters into two until obtaining the desired number of clusters

• Pros:
  • Reduces the effect of initialization
  • A hierarchy is obtained
  • It can be used to determine K

• Con: different criteria could be used to decide which cluster to split (the largest, the one with the largest variance, ...)


Bisecting K-means

• Algorithm:

1. Choose a number of partitions

2. Apply K-means to the dataset with k=2

3. Evaluate the quality of the current partition

4. Pick the cluster to be split using a quality criterion

5. Apply K-means to the cluster with k=2

6. If the number of clusters is less than desired repeat from step 3


Global K-means

• Minimizes initialization dependence by exploring all the clusterings that can be generated using the examples as initialization points

• To generate a partition with K clusters it explores all the alternative partitions from 1 to K clusters

• Pro: Reduces the initialization problem / obtains all the partitions from 2 to K

• Con: Computational cost (runs K-means K × N times)


Global K-means

• Algorithm:

  • Compute the centroid of the partition with 1 cluster
  • For C from 2 to K:
    • For each example e, run K-means initialized with the C − 1 centroids from the previous iteration plus e as the C-th centroid
    • Keep the clustering with the best objective function as the C-clusters solution


Other K-means variants

• Kernel K-means:

  • Distances are computed using a kernel
  • Pro: Clusters that are not linearly separable (non convex) can be discovered
  • Con: Centroids are in the feature space, with no interpretation in the original space (image problem)

• Fast K-means:

  • Uses the triangular inequality to reduce the number of distance computations when assigning examples

• K-Harmonic means:

  • Uses the harmonic mean of the squared distances instead of the distortion as objective function
  • Pro: Less sensitive to initialization

K-medoids

• K-means assumes a centroid can be computed

• In some problems a centroid makes no sense (nominal attributes, structured data)

• One or more examples of each cluster are maintained as representatives of the cluster (medoids)

• The distance from each example to the medoid of its cluster is used as optimization criterion

• Pro: It is not sensitive to outliers

• Con: For one representative the cost per iteration is O(n²), for more it is NP-hard


K-medoids - PAM

Partitioning Around Medoids (PAM):

1. Randomly select k of the n data points as the medoids

2. Associate each data point to the closest medoid

3. For each medoid m:
   • For each non-medoid o: swap m and o and compute the cost

4. Keep the best solution

5. If medoids change, repeat from step 2


Incremental algorithms: Leader Algorithm

• The previous algorithms need all the data from the beginning

• An incremental strategy is needed when data comes as a stream (Leader Algorithm):

  • A distance/similarity threshold (D) determines the extent of a cluster

  • Inside the threshold: incremental updating of the model (prototype)

• Outside the threshold: A new cluster is created

• The threshold D determines the granularity of the clusters

• The clusters are dependent on the order of the examples


Leader Algorithm

Algorithm: Leader Algorithm (X: Examples, D:double)

Generate a prototype with the first example
while there are examples do
    e = current example
    d = distance of e to the nearest prototype
    if d ≤ D then
        Add the example to the cluster
        Recompute the prototype
    else
        Create a new cluster with this example
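A one-pass sketch of this algorithm (assumed implementation; prototypes are updated as running means):

import numpy as np

def leader_clustering(stream, D):
    """Leader algorithm: one pass over the stream, D is the distance threshold."""
    prototypes, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if prototypes:
            dists = np.linalg.norm(np.array(prototypes) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= D:
                counts[j] += 1
                # incremental update of the prototype (running mean)
                prototypes[j] += (x - prototypes[j]) / counts[j]
                continue
        # first example, or too far from every prototype: create a new cluster
        prototypes.append(x)
        counts.append(1)
    return prototypes, counts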


Leader Algorithm

[Figure: Leader algorithm example — the points labelled by the cluster (1, 2 or 3) they are assigned to as the data stream is processed]

Mixture Decomposition - EM algorithm

• We assume that data are drawn from a mixture of probability distributions (usually Gaussian)

• Search the space of parameters of the distributions to obtain the mixture that best explains the data (parameter estimation)

• The model of the data is:

$$P(x|\theta) = \sum_{k=1}^{K} w_k P(x|\theta_k)$$

with K the number of clusters and $\sum_{k=1}^{K} w_k = 1$

• Each example has a probability of belonging to each cluster


Mixture Decomposition - EM algorithm

• The goal is to estimate the parameters of the distribution that describes each class (e.g. µ and σ)

• The algorithm maximizes the likelihood of the distribution with respect to the data

• It performs two steps iteratively:

  • Expectation: We calculate a function that assigns a degree of membership of each instance to each of the K probability distributions

  • Maximization: We re-estimate the parameters of the distributions to maximize the memberships


EM Algorithm (K Gaussian)

• For the Gaussian case:

$$P(x|\vec{\mu}, \Sigma) = \sum_{k=1}^{K} w_k P(x|\vec{\mu}_k, \Sigma_k)$$

where $\vec{\mu}$ are the vectors of means and $\Sigma$ the covariance matrices


EM Algorithm (K Gaussian)

The computations depend on the assumptions that we make about the attributes of the data (independent or not, same σ, ...)

• Attributes are independent: µi and σi have to be computed for each cluster (O(k) parameters) (model: hyperspheres or ellipsoids parallel to the coordinate axes)

• Attributes are not independent: µi, σi and σij have to be computed for each cluster (O(k²) parameters) (model: hyperellipsoids not parallel to the coordinate axes)


EM Algorithm (K Gaussian)

• For the case of A independent attributes:

$$P(x|\vec{\mu}_k, \Sigma_k) = \prod_{j=1}^{A} P(x|\mu_{kj}, \sigma_{kj})$$

• The model to fit is:

$$P(x|\vec{\mu}, \vec{\sigma}) = \sum_{k=1}^{K} w_k \prod_{j=1}^{A} P(x|\mu_{kj}, \sigma_{kj})$$


EM Algorithm (K Gaussian)

• K initial distributions N(µk, σk) are generated, where µk and σk correspond to the mean and the variance of the attributes

• Repeat until convergence (no log likelihood improvement):

1. Expectation: Compute the membership of each example to each mixture component

   • Each instance will have a weight (γik) depending on the components computed in the previous iteration

2. Maximization: Recompute the parameters using the weights from the previous step to obtain the new µk, σk and wk for each distribution


EM Algorithm (K Gaussian) - Expectation

• The expectation step computes the weights for each example and component

$$\gamma_{ik} = w_k P(x_i|\mu_k, \sigma_k)$$

(normalized over the K components so that $\sum_k \gamma_{ik} = 1$)

• This represents the probability that the example xi is generated by component Ck


EM Algorithm (K Gaussian) - Maximization

• The maximization step recomputes µ, σ and w for each component proportionally to the weights computed in the expectation step:

$$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}} \qquad \sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)^2}{\sum_{i=1}^{N} \gamma_{ik}} \qquad w_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}$$


EM Gaussian Mixtures - Example

Initial Assignment


EM Gaussian Mixtures - Example

Expectation + Maximization = new parameters

[Figure: updated means (m1, m2), standard deviations (s1, s2) and weights after an Expectation + Maximization step]

EM Gaussian Mixtures - Example

Expectation + Maximization = new parameters

[Figure: means, standard deviations and weights after a further EM iteration]

EM algorithm - Comments

• K-means is a particular case of this algorithm (hard partition)

• The main advantage is that we obtain the membership as a probability (soft assignments)

• Using different probability distributions we can find different kinds of structures in the data

• For each probability model we use, we need to derive the calculations for the iterative updating of its parameters
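As an illustration, a Gaussian mixture can be fitted with scikit-learn and queried for its soft assignments; the data X and the choice of three diagonal components are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))     # placeholder data

# covariance_type='diag' corresponds to the independent-attributes case
gmm = GaussianMixture(n_components=3, covariance_type='diag',
                      random_state=0).fit(X)

hard = gmm.predict(X)        # hard assignment, as K-means would give
soft = gmm.predict_proba(X)  # soft assignment: one probability per component
print(gmm.weights_, soft[0])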


Dirichlet Process Mixture Model

• One of the problems of GMM is deciding a priori the number of components

• This can be included in the model using a mixture model that has a Dirichlet Process distribution as a prior (it represents the distribution of the number of mixture components and their weights)

• The Dirichlet Process distribution assumes an unbounded number of components

• A finite weight is distributed among all the components

• The fitting of the model will decide what number of components best suits the data


Fuzzy Clustering

• Fuzzy clustering relaxes the hard partition constraint of K-means

• Each example has a continuous membership to each partition

• A new optimization function is introduced:

$$L = \sum_{i=1}^{N} \sum_{k=1}^{K} \delta(C_k, x_i)^b \, \|x_i - \mu_k\|^2$$

with $\sum_{k=1}^{K} \delta(C_k, x_i) = 1$ and b a blending factor

• When the clusters overlap this is an advantage over hard partition algorithms


Fuzzy Clustering

• C-means is the best known fuzzy clustering algorithm; it is the fuzzy version of K-means

  • Membership is computed as the normalized inverse distance to all the clusters
  • The updating of the cluster centers is computed as:

$$\mu_j = \frac{\sum_{i=1}^{N} \delta(C_j, x_i)^b\, x_i}{\sum_{i=1}^{N} \delta(C_j, x_i)^b}$$

• And the updating of the memberships:

$$\delta(C_j, x_i) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{k=1}^{K} (1/d_{ik})^{1/(b-1)}}, \qquad d_{ij} = \|x_i - \mu_j\|^2$$

Notice that this is a softmax-like normalization of the inverse distances to the centroids
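A minimal sketch of one C-means update (assumptions: Euclidean distances, a fuzzifier b > 1 and NumPy arrays; not the exact notation of the slides):

import numpy as np

def cmeans_step(X, centers, b=2.0, eps=1e-12):
    """One fuzzy C-means iteration: update memberships, then the centers."""
    # squared distances d_ij = ||x_i - mu_j||^2
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
    # memberships: normalized inverse distances raised to 1/(b-1)
    u = (1.0 / d2) ** (1.0 / (b - 1.0))
    u /= u.sum(axis=1, keepdims=True)
    # centers: weighted means with weights u^b
    ub = u ** b
    centers = (ub.T @ X) / ub.sum(axis=0)[:, None]
    return u, centers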


Fuzzy Clustering

• The C-means algorithm looks for spherical clusters; other alternatives are:

  • Gustafson-Kessel algorithm: a covariance matrix is introduced for each cluster in the objective function, which allows ellipsoidal shapes and different cluster sizes

  • Gath-Geva algorithm: adds to the objective function the size and an estimation of the density of the cluster

• Different objective functions can be used to detect specific shapes in the data (lines, rectangles, ...)


Python Notebooks

This Python Notebook shows examples of using the different K-means variants and GMM, and their problems

• Prototype Based Clustering Algorithms Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Density/Grid Clustering


Density/Grid Based Clustering

• The number of clusters is not decided beforehand

• We are looking for regions with high density of examples

• We are not limited to predefined shapes (there is no model)

• Different approaches:

• Density estimation

• Grid partitioning

• Multidimensional histograms

• Usually applied to datasets with low dimensionality


Density Estimation

[Figure: density estimation of the examples]

Grid Partitioning

[Figure: partition of the space into a regular grid]

Multidimensional Histograms

[Figure: multidimensional histogram of the examples]

DBSCAN/OPTICS

Ester, Kriegel, Sander, Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN) (1996)

Ankerst, Breunig, Kriegel, Sander. OPTICS: Ordering Points To Identify the Clustering Structure (2000)

• Used in spatial databases, but can be applied to data with higher dimensionality

• Based on finding areas of high density; it can find clusters of arbitrary shape


DBSCAN/OPTICS

• We define the ε-neighbourhood of an instance as the set of examples that are at a distance of at most ε from it

  Nε(x) = {y ∈ X | d(x, y) ≤ ε}

• We define a core point as an example that has at least a certain number of elements in Nε(x)

  Core_point ≡ |Nε(x)| ≥ MinPts

[Figure: ε-neighborhood of a point]


DBSCAN - Density

• Two examples p and q are Direct Density Reachable with respect to ε and MinPts if:

1. p ∈ Nε(q)

2. |Nε(q)| ≥ MinPts

• Two examples p and q are Density Reachable if there is a sequence of examples p = p1, p2, . . . , pn = q where each pi+1 is Direct Density Reachable from pi

• Two examples p and q are Density Connected if there is an example o such that both p and q are Density Reachable from o


[Figure: illustration of Direct Density Reachability (p DDR q) and Density Reachability (p DR q through intermediate points p1, p2)]


DBSCAN - Cluster Definition

Cluster

Given a dataset D, a cluster C with respect to ε and MinPts is any subset of D such that:

1. ∀p, q : p ∈ C ∧ density_reachable(q, p) → q ∈ C
2. ∀p, q ∈ C : density_connected(p, q)

Any example that cannot be connected using these relationships is treated as noise


DBSCAN - Algorithm

Algorithm

1. We start with an arbitrary example and compute all the density reachable examples with respect to ε and MinPts

2. If it is a core point we will obtain a group; otherwise it is a border point and we start from another unclassified instance

To decrease the computational cost, R*-trees are used to store and compute the neighborhood of instances

ε and MinPts are set from the thinnest cluster

The OPTICS algorithm defines a heuristic to find a good set of values for these parameters
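A usage sketch with scikit-learn's DBSCAN (the dataset and the ε/MinPts values are just illustrative choices):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two non-convex clusters that a prototype-based algorithm cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ ε, min_samples ~ MinPts
labels = db.labels_                          # label -1 marks noise points
print(np.unique(labels, return_counts=True))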


DBSCAN - Algorithm

[Figure: DBSCAN example — data, ε and MinPts = 5, first iteration (DDR), first iteration (DR, DC)]

Python Notebooks

This Python Notebook compares Prototype Based and Density Based Clustering algorithms

• Density Based Clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Other Approaches

Spectral Clustering

• Spectral Graph Theory defines properties that the eigenvalues and eigenvectors of the adjacency matrix or Laplacian matrix of a graph hold

• Spectral clustering uses the spectral properties of the similarity matrix

• The distance matrix represents the graph that connects the examples

  • Complete graph
  • Neighborhood graph (different definitions)

• Different clustering algorithms can be defined from the diagonalization of this matrix


Spectral Clustering (First approach)

• We start with the similarity matrix (W ) of the data

• The degree of a vertex is defined as:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

• We define the degree matrix D as the diagonal matrix with values d1, d2, . . . , dn

• We can define different Laplace matrices:

  • Unnormalized: $L = D - W$
  • Normalized: $L_{sym} = D^{-1/2} L D^{-1/2}$ or also $L_{rw} = D^{-1} L$


Spectral Clustering (First approach)

[Figure: weighted similarity graph over 5 examples with edge weights w12 = 0.8, w23 = 0.3, w14 = 0.2, w34 = 0.7, w45 = 0.7, w35 = 0.4]

$$W = \begin{pmatrix} 0 & 0.8 & 0 & 0.2 & 0 \\ 0.8 & 0 & 0.3 & 0 & 0 \\ 0 & 0.3 & 0 & 0.7 & 0.4 \\ 0.2 & 0 & 0.7 & 0 & 0.7 \\ 0 & 0 & 0.4 & 0.7 & 0 \end{pmatrix} \qquad D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1.1 & 0 & 0 & 0 \\ 0 & 0 & 1.4 & 0 & 0 \\ 0 & 0 & 0 & 1.6 & 0 \\ 0 & 0 & 0 & 0 & 1.1 \end{pmatrix}$$


Spectral Clustering (First approach)

• Algorithm:

1. Compute the Laplace matrix from the similarity matrix
2. Compute the first K eigenvalues of the Laplace matrix
3. Use the eigenvectors as new datapoints
4. Apply K-means as clustering algorithm

• We are actually embedding the dataset in a space of lower dimensionality
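A sketch of this first approach using the unnormalized Laplacian (the RBF similarity and the use of the k smallest eigenvalues are assumptions; sklearn.cluster.SpectralClustering packages the same idea):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k, gamma=1.0):
    """Unnormalized spectral clustering: embed with the Laplacian, then K-means."""
    W = rbf_kernel(X, gamma=gamma)        # similarity matrix
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    U = vecs[:, :k]                       # eigenvectors as new datapoints
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)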


Spectral Clustering (First approach)

$$L = \begin{pmatrix} 1 & -0.8 & 0 & -0.2 & 0 \\ -0.8 & 1.1 & -0.3 & 0 & 0 \\ 0 & -0.3 & 1.4 & -0.7 & -0.4 \\ -0.2 & 0 & -0.7 & 1.6 & -0.7 \\ 0 & 0 & -0.4 & -0.7 & 1.1 \end{pmatrix} \qquad \mathrm{Eigvec}(1,2) = \begin{pmatrix} 0.07 & 0.70 \\ 0.04 & 0.69 \\ 0.22 & 0.10 \\ 0.93 & -0.10 \\ 0.24 & 0.03 \end{pmatrix}$$


[Figure: the examples plotted using the first two eigenvectors as coordinates]

Spectral Clustering (Second approach)

• From the similarity matrix and its Laplacian it is possible to formulate clustering as a graph partitioning problem

• Given two disjoint sets of vertices A and B, we define:

$$\mathrm{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}$$

• We can partition the graph by solving the mincut problem, choosing a partition that minimizes:

$$\mathrm{cut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \mathrm{cut}(A_i, \bar{A}_i)$$


Spectral Clustering (Second approach)

• Directly using the weights of the Laplacian does not always give good results; alternative objective functions are:

$$\mathrm{RatioCut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{|A_i|}$$

$$\mathrm{Ncut}(A_1, \ldots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}$$

• Where $|A_i|$ is the size of the partition and $\mathrm{vol}(A_i)$ is the sum of the degrees of the vertices in $A_i$


Affinity Propagation Clustering

• Affinity clustering is a message passing algorithm related to graph partitioning and belief propagation in probabilistic graphical models

• It chooses a set of examples as the cluster prototypes and computes how the rest of the examples are attached to them

• Each pair of examples has a defined similarity s(i, k)

• Each example has a value r(k, k) that represents the preference for each point to be an exemplar

• The algorithm does not set a priori the number of clusters


Affinity Propagation Clustering - Messages

• The examples pass two kinds of messages

  • Responsibility r(i, k): a message that example i passes to candidate exemplar k. It represents the evidence of how well suited k is to be the exemplar of i

  • Availability a(i, k): sent from candidate exemplar k to point i. It represents the accumulated evidence of how appropriate it would be for point i to choose point k as its exemplar


Affinity Propagation Clustering - Updating

• All availabilities are initialized to 0

• The responsibilities are updated as:

$$r(i, k) = s(i, k) - \max_{k' \neq k}\{a(i, k') + s(i, k')\}$$

• The availabilities are updated as:

$$a(i, k) = \min\left\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max(0, r(i', k))\right\}$$


Affinity Propagation Clustering - Updating

• The self availability a(k, k) is updated as:

$$a(k, k) = \sum_{i' \neq k} \max(0, r(i', k))$$

• The exemplar for a point i is the point k that maximizes a(i, k) + r(i, k); if that point is i itself, then i is an exemplar.


Affinity Propagation Clustering - Algorithm

1. Update the responsibilities given the availabilities

2. Update the availabilities given the responsibilities

3. Compute the exemplars

4. Terminate if the exemplars do not change during a number of iterations
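In practice this message passing loop is already implemented in scikit-learn's AffinityPropagation; a short usage sketch (the damping value and the default preference are tuning choices, not prescribed by the slides):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)   # toy data

# the preference (self-similarity) controls how many exemplars appear;
# by default it is set to the median similarity
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print("exemplars:", ap.cluster_centers_indices_)
print("clusters:", len(np.unique(ap.labels_)))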



Unsupervised Neural Networks

• Self-organizing maps are unsupervised neural networks

• Can be seen as an on-line constrained version of K-means

• The data is transformed to fit in a 1-d or 2-d regular mesh

• The nodes of this mesh are the prototypes

• This algorithm can be used as a dimensionality reduction method (from N to 2 dimensions)


Self-Organizing Maps

• To build the map we have to decide the size and shape of the mesh (rectangular/hexagonal)

• Each node is a multidimensional prototype of p features

Algorithm: Self-Organizing Map algorithm

Initial prototypes are distributed regularly on the mesh
for a predefined number of iterations do
    foreach example xi do
        Find the nearest prototype (mj)
        Determine the neighborhood M of mj
        foreach prototype mk ∈ M do
            mk = mk + α(xi − mk)
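A compact sketch of this loop (assumptions: prototypes initialized from random examples instead of a regular layout, Manhattan adjacency on the mesh, linearly decreasing α and neighborhood radius):

import numpy as np

def som(X, rows=10, cols=10, iters=20, alpha0=1.0, radius0=3):
    """Minimal rectangular SOM returning the (rows, cols, d) prototype mesh."""
    rng = np.random.default_rng(0)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    protos = X[rng.choice(len(X), rows * cols)].astype(float)
    for t in range(iters):
        alpha = alpha0 * (1 - t / iters)                   # decreasing rate
        radius = max(1, round(radius0 * (1 - t / iters)))  # shrinking neighborhood
        for x in X:
            j = np.argmin(((protos - x) ** 2).sum(axis=1))   # nearest prototype
            neigh = np.abs(grid - grid[j]).sum(axis=1) <= radius
            protos[neigh] += alpha * (x - protos[neigh])     # move neighborhood
    return protos.reshape(rows, cols, -1)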


Self-Organizing Maps

• Each iteration transforms the mesh to be closer to the data, maintaining the 2D relationship between prototypes

• The neighborhood of a prototype is defined by the adjacency of the cells and the distance between the prototypes

• Performance depends on the learning rate α, which is usually decreased from 1 to 0 during the iterations

• The number of neighbors used in the update is decreased during the iterations from a predefined number to 1

• Some variations of the algorithm use the distance of the prototypes as weights for the update



Python Notebooks

This Python Notebook has examples for Spectral and Affinity Propagation Clustering

• Spectral and Affinity Propagation Clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Python Code

• In the code from the repository, inside the subdirectory Clustering you have the python programs HierarchicalAuthors, PartitionalAuthors and DensityBasedCity

• The first and second ones use the authors dataset and allow comparing hierarchical clustering and partitional clustering with this data. You can observe what happens with both datasets and using different attributes

• The third one uses data from the City datasets that represent events in different cities (Tweets and Crime), showing results for a variety of clustering algorithms. You can use data from different cities.

Applications


Barcelona Twitter/Instagram Dataset

• Goal: To analyze the geographical behavior of people living in/visiting a city

• Dataset: Tweets/posts inside a geographical area

• Attributes: geographical information / time stamp of the post

• Processes:

  • Geographical discretization for user representation

• Discovery of behavior profiles


Discovery Goals - Are there distinguishable groups of behaviors?

• Focus on the discovery of different groups of people

• The hypothesis is that users can be segmented according to where they are at different times of the day

• We could answer questions like:

  • Do people from one part of the city usually stay in that part?

  • Do people from outside the city go to the same places?

  • Is public transportation preferred by different profiles? ...


Data Attributes

• Raw geolocalization and timestamp are too fine grained

• Even when we have millions of events, the probability of having two events at the same place and at the same time is low

• Discretizing localization and time will increase the probability of finding patterns

• How do we discretize space and time? Clustering


Clustering of the events - Leader (200m/500m radius)


Finding Geographical Profiles

• We could explore the aggregated behavior of a user for a long period (month, year)

• Each example represents the places/times of the events of a user for the period

• We obtain a representation similar to the Bag of Words used in text mining

• User ⇒ Document

• Time × Place ⇒ Word


Partitioning the data

• Difficult to choose the adequate clustering algorithm

• Some choices will depend on:

  • Types of attributes: continuous or discrete values, sparsity of the data

  • Size of the dataset: aggregating the events largely reduces the size of the data (no scalability issues)

  • Assumptions about the model that represents our goals: shape of the clusters, separability of the clusters/distribution of examples

  • Interpretability/representability of the clusters


K-means - Twitter - Moving from nearby cities to Barcelona

K-means - Twitter - Using north-west freeways


4 Cluster Validation


Clustering Evaluation

Javier Béjar

URL - Spring 2020

CS - MAI

Cluster Evaluation


Model evaluation

• The evaluation of unsupervised learning is difficult

• There is no goal model to compare with

• The true result is unknown; it may depend on the context, the task to perform, ...

• Why do we want to evaluate them?

• To avoid finding patterns in noise

• To compare clustering algorithms

• To compare different models/parameters


What can be evaluated?

• Cluster tendency: are there clusters in the data?

• Compare the clusters to the true partition of the data

• Quality of the clusters without reference to external information

• Compare the results of different clustering algorithms

• Evaluate algorithm parameters

• For instance, to determine the correct number of clusters


Model evaluation - Cluster Tendency

• Before clustering a dataset we can test if there are actually clusters

• We have to test the hypothesis of the existence of patterns in the data versus a uniformly distributed dataset (homogeneous distribution)


Model evaluation - Cluster Tendency

• Hopkins Statistic:

1. Sample n points (pi) from the dataset (D) uniformly and compute the distance to their nearest neighbor (d(pi))

2. Generate n points (qi) uniformly distributed in the space of the dataset and compute their distance to their nearest neighbors in D (d(qi))

3. Compute the quotient:

$$H = \frac{\sum_{i=1}^{n} d(p_i)}{\sum_{i=1}^{n} d(p_i) + \sum_{i=1}^{n} d(q_i)}$$

4. If the data are uniformly distributed, the value of H will be around 0.5
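A sketch of this statistic (assumed implementation; the uniform points are drawn in the bounding box of the data):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, n=50, rng=None):
    """Hopkins statistic as defined above: values near 0.5 suggest no structure."""
    rng = np.random.default_rng(0) if rng is None else rng
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    # sampled real points: distance to their nearest *other* point of D
    P = D[rng.choice(len(D), n, replace=False)]
    d_p = nn.kneighbors(P, n_neighbors=2)[0][:, 1]
    # uniform points in the bounding box of D: distance to the nearest point of D
    Q = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    d_q = nn.kneighbors(Q, n_neighbors=1)[0][:, 0]
    return d_p.sum() / (d_p.sum() + d_q.sum())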


Hopkins Statistic - Example

Cluster Quality criteria

• We can use different methodologies/criteria to evaluate the quality of a clustering:

  • External criteria: comparison with a model partition/labeled data

  • Internal criteria: quality measures based on the examples/quality of the partition

• Relative criteria: Comparison with other clusterings


Internal criteria


• Measure properties expected in a good clustering

  • Compact groups
  • Well separated groups

• The indices are based on the model of the groups

• We can use indices based on the attribute values measuring the properties of a good clustering

• These indices are based on statistical properties of the attributes of the model

  • Values distribution
  • Distances distribution


Internal criteria - Indices

• Some of the indices correspond directly to the objective function optimized:

  • Quadratic error/Distortion (K-means)

$$SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$$

  • Log likelihood (Mixture of Gaussians/EM)


Internal criteria - Indices

• For prototype based algorithms several measures can be used to compute quality indices

• Scatter matrices: interclass distance, intraclass distance, separation

$$S_{W_k} = \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T$$

$$S_{B_k} = |C_k| (\mu_k - \mu)(\mu_k - \mu)^T$$

$$S_{M_{k,l}} = \sum_{i \in C_k} \sum_{j \in C_l} (x_i - x_j)(x_i - x_j)^T$$


Internal criteria - Indices

• Trace criteria (lower overall intracluster distance / higher overall intercluster distance)

$$Tr(S_W) = \frac{1}{K} \sum_{k=1}^{K} S_{W_k} \qquad Tr(S_B) = \frac{1}{K} \sum_{k=1}^{K} S_{B_k}$$

• Calinski-Harabasz index (interclass-intraclass distance ratio)

$$CH = \frac{\sum_{i=1}^{K} |C_i| \, \|\mu_i - \mu\|^2 / (K-1)}{\sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2 / (N-K)}$$


Internal criteria - Indices

• Davies-Bouldin criterion (maximum interclass-intraclass distance ratio)

$$R = \frac{1}{K} \sum_{i=1}^{K} R_i$$

where

$$R_{ij} = \frac{S_{W_i} + S_{W_j}}{S_{M_{ij}}} \qquad R_i = \max_{j \neq i} R_{ij}$$


Internal criteria - Indices

• Silhouette index (maximum class spread/variance)

$$S = \frac{1}{N} \sum_{i=1}^{N} \frac{b_i - a_i}{\max(a_i, b_i)}$$

where

$$a_i = \frac{1}{|C_j| - 1} \sum_{y \in C_j, y \neq x_i} \|y - x_i\| \qquad b_i = \min_{l \in H, l \neq j} \frac{1}{|C_l|} \sum_{y \in C_l} \|y - x_i\|$$

with $x_i \in C_j$, $H = \{h : 1 \leq h \leq K\}$


Internal criteria - Indices

• More than 30 indices can be found in the literature

• Several studies and comparisons have been performed

• Recent studies (Arbelaitz et al., 2013) have exhaustively tested these indices; some have a significantly better performance than others

• Some of the indices show a similar performance (not statistically different)

• The study concludes that Silhouette, Davies-Bouldin and Calinski-Harabasz perform well in a wide range of situations
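These three indices are available in scikit-learn; a small comparison over candidate numbers of clusters (toy data, K-means as the clustering algorithm):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),         # higher is better
          round(calinski_harabasz_score(X, labels), 1),  # higher is better
          round(davies_bouldin_score(X, labels), 3))     # lower is better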


Internal criteria - 5 clusters different variance

Internal criteria - 5 clusters different variance - Scores


External criteria


• These indices measure the similarity of a clustering to a model partition P

• Without a model they can be used to compare the results of using different parameters or different algorithms

  • For instance, they can be used to assess the sensitivity to initialization

• The main advantage is that these indices are independent of the example/cluster descriptions

• That means that they can be used to assess any clustering algorithm


External criteria - Indices

• All the indices are based on the coincidence of each pair of examples in the groups of two clusterings

• The computations are based on four values:

  • The two examples are in the same cluster in both partitions (a)

  • The two examples are in the same cluster in C, but not in P (b)

  • The two examples are in the same cluster in P, but not in C (c)

  • The two examples are in different clusters in both partitions (d)


External criteria - Indices

• Rand/Adjusted Rand statistic:

$$Rand = \frac{a + d}{a + b + c + d} \qquad ARand = \frac{a - \frac{(a+c)(a+b)}{a+b+c+d}}{\frac{(a+c)+(a+b)}{2} - \frac{(a+c)(a+b)}{a+b+c+d}}$$

• Jaccard Coefficient:

$$J = \frac{a}{a + b + c}$$

• Fowlkes and Mallows index:

$$FM = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$$


External criteria - Indices - Information Theory

• Defining the Mutual Information between two partitions as:

$$MI(Y_i, Y_k) = \sum_{X_c^i \in Y_i} \sum_{X_{c'}^k \in Y_k} \frac{|X_c^i \cap X_{c'}^k|}{N} \log_2\!\left(\frac{N\, |X_c^i \cap X_{c'}^k|}{|X_c^i|\, |X_{c'}^k|}\right)$$

• and the Entropy of a partition as:

$$H(Y_i) = -\sum_{X_c^i \in Y_i} \frac{|X_c^i|}{N} \log_2\!\left(\frac{|X_c^i|}{N}\right)$$

where $|X_c^i \cap X_{c'}^k|$ is the number of objects that are in the intersection of the two groups


External criteria - Indices - Information Theory

• Normalized Mutual Information:

$$NMI(Y_i, Y_k) = \frac{MI(Y_i, Y_k)}{\sqrt{H(Y_i) H(Y_k)}}$$

• Variation of Information:

$$VI(C, C') = H(C) + H(C') - 2\, MI(C, C')$$

• Adjusted Mutual Information:

$$AMI(U, V) = \frac{MI(U, V) - E(MI(U, V))}{\max(H(U), H(V)) - E(MI(U, V))}$$
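These pair-counting and information-theoretic indices are also available in scikit-learn; a minimal example comparing a clustering with a reference labelling (toy labels):

from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score)

reference = [0, 0, 0, 1, 1, 1]    # model partition P (toy example)
clustering = [1, 1, 0, 2, 2, 2]   # clustering C to evaluate

print(adjusted_rand_score(reference, clustering))          # ARand
print(normalized_mutual_info_score(reference, clustering)) # NMI
print(adjusted_mutual_info_score(reference, clustering))   # AMI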


External criteria - ARI/AMI Scores

Number of clusters


Number of clusters

• A topic related to cluster validation is deciding if the number of clusters obtained is the correct one

• This point is especially important for the algorithms that need this value as a parameter

• The usual procedure is to compare the characteristics of clusterings of different sizes

• Usually internal criteria indices are used in this comparison

• A plot of these indices for different numbers of clusters can show which number of clusters is more probable


Number of clusters - Indices

• Some of the internal validity indices can be used for this purpose: Calinski-Harabasz index, Silhouette index

• Using the within-class scatter matrix ($S_W$) other criteria can be defined:

  • Hartigan index:

$$H(k) = \left[\frac{S_W(k)}{S_W(k+1)} - 1\right](n - k - 1)$$

  • Krzanowski-Lai index:

$$KL(k) = \left|\frac{DIFF(k)}{DIFF(k+1)}\right|$$

    being $DIFF(k) = (k-1)^{2/p} S_W(k-1) - k^{2/p} S_W(k)$


The Gap Statistic

• Assesses the number of clusters by comparing a clustering with the expected distribution of the data given the null hypothesis (no clusters)

• Computes different clusterings of the data, increasing the number of clusters, and compares them to clusterings of data generated with a uniform distribution (B samples)

• The within-cluster scatter SW is computed for both and compared

• The correct number of clusters is where the widest gap appears between the SW of the data and of the uniform data


The Gap Statistic

• The Gap statistic:

$$Gap(k) = \frac{1}{B} \sum_{b} \log(S_W(k)_b) - \log(S_W(k))$$

The first term is the mean of $\log(S_W(k)_b)$ for the clusterings obtained from the uniformly distributed data

• From the standard deviation ($sd_k$) of the $\log(S_W(k)_b)$ values, $s_k$ is defined as:

$$s_k = sd_k \sqrt{1 + 1/B}$$

• The probable number of clusters is the smallest number that holds:

$$Gap(k) \geq Gap(k+1) - s_{k+1}$$


The Gap Statistic

[Figure: log(Sw) as a function of the number of clusters (2 to 8) for the data and for the uniform reference]

Cluster Stability

• The idea is that if the model chosen for clustering a dataset is correct, it should be stable for different samplings of the data

• The procedure is to obtain different subsamples of the data, cluster them and test their stability


Cluster Stability

• Using disjoint samples:

• Dataset divided into two disjoint samples that are clustered separately

• Indices can be defined to assess stability, for example using the distribution of the number of neighbors that belong to the complementary sample

• Using non disjoint samples:

  • Dataset divided into three disjoint samples (S1, S2, S3)
  • Two clusterings are obtained from S1 ∪ S3 and S2 ∪ S3
  • Indices can be defined about the coincidence of the common examples in both partitions


Python Notebooks

This Python Notebook has examples for Measures of Clustering Validation

• Clustering Validation Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)


Python Code

• In the code from the repository, inside the subdirectory Validation you have the python program ValidationAuthors

• The authors dataset is clustered with different algorithms (K-means, GMM, Spectral) and different validity indices are plotted for the number of clusters



5 Clustering of Large Datasets


Clustering in KDD

Javier Béjar

URL - Spring 2020

CS - MIA

Introduction


Clustering in KDD

• One of the main tasks in the KDD process is the analysis of data when we do not know its structure

• This task is very different from the task of prediction, where we know the goal and we try to approximate it

• A great part of the KDD tasks are unsupervised problems (KDnuggets poll: 2nd-3rd most frequent task)

• Problems: scalability, arbitrary cluster shapes, limited types of data, finding the correct parameters, ...

• There are some new algorithms that deal with these kinds of problems


Scalability Strategies


Strategies for cluster scalability

• One-pass

  • Process data as a stream

• Summarization/Data compression

  • Compress examples to fit more data in memory

• Sampling/Batch algorithms

  • Process a subset of the data and maintain/compute a global model

• Approximation

  • Avoid expensive computations by approximate estimation

• Parallelization/Distribution

  • Divide the task into several parts and merge the models


One pass

• This strategy is based on incremental clustering algorithms

• They are cheap, but the order of processing greatly affects their quality

• Although they can be used as a preprocessing step

• Two-step algorithms:

1. A large number of clusters is generated using the one-pass algorithm

2. A more accurate algorithm clusters the preprocessed data


Data Compression/Summarization

• Not all the data is necessary to discover the clusters

• Discard sets of examples and summarize by:

• Sufficient statistics

• Density approximations

• Discard data irrelevant for the model (it does not affect the result)


Approximation

• Not using all the information available to make decisions

• Using K-neighbours (data structures for computing k-neighbours)

• Preprocessing the data using a cheaper algorithm

• Generate batches using approximate distances (e.g. canopy clustering)

• Use approximate data structures

• Use of hashing or approximate counts for distance and frequency computation


Batches/Sampling

• Process only data that fits in memory

• Obtain from the data set:

• Samples (process only a subset of the dataset)

  • Determine the size of the sample so that all the clusters are represented

• Batches (process all the dataset)


Parallelization/Distribution/Divide & Conquer

• Parallelization usually depends on the specific algorithm

• Some are not easy to parallelize (e.g. hierarchical clustering)

• Some have specific parts that can be solved in parallel or by Divide & Conquer

• Distance computations in k-means

• Parameter estimation in EM algorithms

• Grid density estimations

• Space partitioning

• Batches and sampling are more general approaches

• The problem is how to merge all the different partitions


Scalable Algorithms

Scalable Hierarchical Clustering

Patra, Nandi, Viswanath. Distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognition, 2011, 44, 2862-2870

• Strategy: One pass + Summarization
• The Leader algorithm is used as a one-pass summarization (many clusters)
• Single link is used to cluster the summaries
• Guarantees the equivalence to SL at the top levels
• Summarization makes the algorithm independent of the dataset size (it depends on the radius used in the Leader algorithm and the volume of the data)
• Complexity O(c²)


One pass + Single Link

[Figure: 1st phase — summarization with the Leader algorithm; 2nd phase — hierarchical clustering of the summaries]

BIRCH

Zhang, Ramakrishnan, Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996)

• Strategy: One-pass + Summarization

• Hierarchical clustering with limited memory

• Incremental algorithm

• Based on probabilistic prototypes and distances

• We need two passes over the database

• Based on a specialized data structure named CF-tree (Clustering Feature Tree)


BIRCH (CF-tree)

• Balanced n-ary tree containing clusters represented by probabilistic prototypes

• Leaves have a capacity of L prototypes and the cluster radius cannot be more than T

• Non-terminal nodes have a fixed branching factor (B); each element summarizes its subtree

• The choice of parameters is crucial because the available space could be filled during the process

• This is solved by changing the parameters (basically T) and recompressing the tree (T determines the granularity of the final groups)


BIRCH - Insertion algorithm

1. Traverse the tree until reaching a leaf and choose the nearest prototype

2. On this leaf we can introduce the instance into an existing group or create a new prototype, depending on whether the distance is larger than the parameter T

3. If the current leaf has no space for the new prototype, then create a new terminal node and distribute the prototypes among the current node and the new node


BIRCH - Insertion algorithm (cont.)

4. The distribution is performed by choosing the two most different prototypes and dividing the rest using their proximity to these two prototypes

5. This creates a new node in the parent node; if the new node exceeds the capacity of the parent then it is split, and the process is continued until the root of the tree is reached if necessary

6. Additionally we could perform merge operations to compact the tree and reduce space


BIRCH - Insertion Algorithm

Insertion + division


BIRCH - Clustering algorithm

1. Phase 1: Construction of the CF-tree; we obtain a hierarchy that summarizes the database as a set of groups whose granularity is defined by T

2. Phase 2: Optionally we modify the CF-tree in order to reduce its size by merging nearby groups and deleting outliers

3. Phase 3: We use the prototypes inside the leaves of the tree as new instances and we run a clustering algorithm with them (for instance K-means)

4. Phase 4: We refine the groups by assigning the instances from the original database to the prototypes obtained in the previous phase
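A usage sketch with scikit-learn's Birch (threshold and branching_factor play the roles of T and B; the values and the dataset are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

# n_clusters triggers the global clustering phase over the CF-tree leaves
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=8).fit(X)
labels = birch.predict(X)
print(len(birch.subcluster_centers_), "subclusters in the CF-tree leaves")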


One pass + CFTREE (BIRCH)

[Figure: 1st phase — CF-tree construction; 2nd phase — K-means on the leaf prototypes]

Scalable K-means clustering

Bradley, Fayyad, Reina. Scaling Clustering Algorithms to Large Databases. Knowledge Discovery and Data Mining (1998)

• Strategy: Sampling + Summarization

• Clustering algorithms need to have all the data in main memory to perform their task

• We try to obtain scalability by looking for an algorithm that:

  • Only looks at the data one time
  • Is anytime (a result is always available)
  • Is incremental (more data does not mean starting from scratch)
  • Is suspendable (it can continue from the current solution)
  • Uses limited memory


Scalable K-means clustering (Algorithm)

• Obtain a sample that fits in memory

• Update the current model

• Classify new instances as:

  • Necessary

• Discardable (We keep their information as sufficient statistics)

• Summarizable using data compression

• Decide if the model is stable or we keep clustering more data


Canopy Clustering

McCallum, Nigam, Ungar. Efficient clustering of high-dimensional data sets with application to reference matching (2002)

• Strategy: Divide & Conquer + Approximation

• The approach is based on a two-stage clustering

• The first stage can be seen as a preprocess to determine the neighborhood of the densities, reducing the number of distances to compute in the second stage

• This first stage is called canopy clustering; it relies on a cheap distance and two parameters T1 > T2

• These parameters are used as the radii of two concentric spheres that determine how to classify the examples

Canopy Clustering - Algorithm

• Algorithm:

1. One example is picked at random and the cheap distance from this example to the rest is computed
2. All the examples that are at less than T2 are deleted from the list and included in the canopy
3. The points at less than T1 are added to the canopy of this example without deleting them
4. The procedure is repeated until the example list is empty
5. Canopies can share examples

• Afterwards, the data can be clustered with different algorithms

• For agglomerative clustering, only the distances among the examples that share a canopy have to be computed


Canopy Clustering - Algorithm

[Figure: construction of the 1st, 2nd and 3rd canopies]

Mini-batch K-means

Sculley. Web-scale k-means clustering. Proceedings of the 19th international conference on World Wide Web, 2010, 1177-1178

• Strategy: Sampling

• Apply K-means to a sequence of bootstrap samples of the data

• Each iteration the samples are assigned to prototypes and the prototypes are updated with the new sample

• Each iteration the weight of the samples is reduced (learning rate)

• The quality of the results depends on the size of the batches

• Convergence is detected when prototypes are stable


Mini-batch K-means (algorithm)

Given: k, mini-batch size b, iterations t, data set X
Initialize each c ∈ C with an x picked randomly from X
v ← 0
for i ← 1 to t do
    M ← b examples picked randomly from X
    for x ∈ M do
        d[x] ← f(C, x)          // cache the center nearest to x
    for x ∈ M do
        c ← d[x]
        v[c] ← v[c] + 1         // per-center update counter
        η ← 1 / v[c]            // per-center learning rate
        c ← (1 − η)c + ηx       // move the center towards x
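The same idea is available as scikit-learn's MiniBatchKMeans; a usage sketch (the batch size and the dataset are illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=10, random_state=0)

# batch_size plays the role of b; larger batches approach full K-means
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1000, n_init=3,
                      random_state=0).fit(X)
print(mbk.inertia_)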

CURE

Guha, Rastogi, Shim. CURE: An efficient clustering algorithm for large databases (1998)

• Strategy: Sampling + Divide & Conquer
• Hierarchical agglomerative clustering
• Scalability is obtained by using sampling techniques and partitioning the dataset
• Uses a set of representatives (c) per cluster instead of centroids (non-spherical groups)
• Distance is computed as the nearest pair of representatives among groups
• The clustering algorithm is agglomerative and merges pairs of groups until k groups are obtained


CURE - Algorithm

1. Draws a random sample from the dataset

2. Partitions the sample in p groups

3. Executes the clustering algorithm on each partition

4. Deletes outliers

5. Runs the clustering algorithm on the union of all groups until it obtains k groups

6. Label the data according to the similarity to the k groups


CURE - Algorithm

[Figure: CURE phases — sampling + partition of the data, clustering of each partition, joining of the partitions, labelling of the data]

Rough-DBSCAN

Viswanath, Babu. Rough-DBSCAN: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 2009, 30, 1477-1488

• Strategy: One-pass + Summarization

• Two-stage algorithm:

1. Preprocess using the Leader algorithm
   • Determine the instances that belong to the higher densities and their neighbours
2. Apply the DBSCAN algorithm
   • Determine the densities for the selected instances
   • Approximate the values of the densities from their distances and the sizes of the neighborhoods
   • Assign the neighbors according to the found densities


MapReduce Clustering

Zhao, W., Ma, H., He, Q. Parallel K-Means Clustering Based on MapReduce Cloud Computing, LNCS 5931, 674-679, Springer 2009

• Strategy: Distribution/Divide & Conquer

• Applied to K-means and GMM

• Mappers have a copy of the current centroids and assign the closest one to each example

• Reducers compute the new centroids according to the assignments

URL - Spring 2020 - MAI 28/31


MapReduce Clustering

[Figure: MapReduce K-means — the data is split among N mappers, each assigning its examples to the closest of the current prototypes (i); K reducers aggregate the assignments and compute the new prototypes (i+1)]

URL - Spring 2020 - MAI 29/31
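A minimal single-machine sketch of one MapReduce-style K-means iteration; plain Python functions stand in for the actual MapReduce framework, and the hypothetical chunks argument plays the role of the data splits handed to the mappers.

import numpy as np
from collections import defaultdict

def mapper(chunk, centroids):
    # each mapper holds a copy of the current centroids and emits (cluster, (example, 1)) pairs
    for x in chunk:
        c = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        yield c, (x, 1)

def reduce_step(mapped):
    # reducers aggregate the partial sums per cluster and output the new centroids
    acc = defaultdict(lambda: [0.0, 0])
    for c, (x, n) in mapped:
        acc[c][0] = acc[c][0] + x
        acc[c][1] += n
    return {c: s / n for c, (s, n) in acc.items()}

def kmeans_mapreduce(chunks, centroids, iterations=10):
    # chunks: list of numpy arrays, one per mapper
    for _ in range(iterations):
        mapped = (pair for chunk in chunks for pair in mapper(chunk, centroids))
        new = reduce_step(mapped)
        centroids = np.array([new.get(c, centroids[c]) for c in range(len(centroids))])
    return centroids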

Peer2Peer clustering

• Use Peer2Peer networks as a divide and conquer strategy

• Each member of the network processes a chunk of the data

• Peers interchange messages with data or intermediate results

• Strategies:
  • Synchronous: all peers interchange information at timed intervals
  • Asynchronous: all peers work independently and interchange information randomly

• Messages:
  • Messages consist of prototypes that are integrated (possibly with a weighting strategy)
  • Messages consist of examples or summarized examples


Python Notebooks

This Python Notebook has examples comparing the K-means algorithm with two scalable algorithms, Mini Batch K-means and BIRCH

• Clustering DM Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 31/31


6 Consensus Clustering

149


Consensus Clustering

Javier Béjar

URL - Spring 2020

CS - MAI

Consensus Clustering

• The ensemble of classifiers is a well-established strategy in supervised learning

• Unsupervised learning pursues the same goal: consensus clustering or clustering ensembles

• The idea is to merge complementary perspectives of the data into a more stable partition

URL - Spring 2020 - MAI 1/22


Consensus Clustering

• Given a set of partitions of the same data X:

  P = {P^1, P^2, ..., P^n}

  with:

  P^1 = {C^1_1, C^1_2, ..., C^1_{k_1}}
  ...
  P^n = {C^n_1, C^n_2, ..., C^n_{k_n}}

  the goal is to obtain a new partition that uses the information of all n partitions

URL - Spring 2020 - MAI 2/22

Goals

• Robustness: the combination has a better performance than each individual partition in some sense

• Consistency: the combination is similar to the individual partitions

• Stability: the resulting partition is less sensitive to outliers and noise

• Novelty: the combination is able to obtain partitions that cannot be obtained by the clustering methods that generated the individual partitions

URL - Spring 2020 - MAI 3/22


Advantages

• Knowledge reuse: the consensus can be computed from the partition assignments, so previous partitions using the same or different attributes can be used

• Distributed computing: the individual partitions can be obtained independently

• Privacy: only the assignments of the individual partitions are needed for the consensus

URL - Spring 2020 - MAI 4/22

Consensus Process


Consensus Process

• Consensus clustering is generally based on a two-step process:

1. Generate the individual partitions to be combined

2. Combine the partitions to generate the final partition

URL - Spring 2020 - MAI 5/22

Partition Generation

• Different example representations: diversity by generating partitions with different subsets of attributes

• Different clustering algorithms: take advantage of the fact that clustering algorithms have different biases

• Different parameter initialization: use clustering algorithms able to produce different partitions using different parameters

• Subspace projection: use dimensionality reduction techniques

• Subsets of examples: use random subsamples of the dataset (bootstrapping)


Consensus Generation

• Co-occurrence based methods: use the labels obtained from each individual clustering and the coincidence of the labels for the examples
  • Relabeling and voting, co-association matrix, graph and hypergraph partitioning, information theory measures, finite mixture models

• Median partition based methods: given a set of partitions (P) and a similarity function (Γ(Pi, Pj)), find the partition (Pc) that maximizes the similarity to the set:

  Pc = arg max_{P ∈ P_X} Σ_{Pi ∈ P} Γ(P, Pi)

URL - Spring 2020 - MAI 7/22

Co-occurrence based methods


Relabeling and voting

• First, solve the labeling correspondence problem

• Then, determine the consensus using different voting strategies

Dimitriadou, Weingessel, Hornik Voting-Merging: An Ensemble Method for Clustering Lecture Notes in Computer Science, 2001, 2130

1. Generate a clustering

2. Determine the correspondence with the current consensus

3. Each example gets a vote from their cluster assignment

4. Update the consensus

URL - Spring 2020 - MAI 8/22

Co-Association matrix

• Co-Association matrix: count how many times a pair of examples appears in the same cluster

• Use the matrix as a similarity or as a new set of characteristics

• Apply a clustering algorithm to the information from the co-association matrix

[Figure: toy example — 12 points grouped by three partitions and the resulting 12×12 co-association matrix, where entry (i, j) counts in how many of the three partitions examples i and j share a cluster]


Co-Association matrix

Fred Finding Consistent Clusters in Data Partitions Multiple Classifier Systems, 2001, 309-318

Fred, Jain Combining Multiple Clusterings Using Evidence Accumulation IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, 835-850

1. Compute the co-association matrix

2. Apply hierarchical clustering (different criteria)

3. Use a heuristic to cut the resulting dendrogram

URL - Spring 2020 - MAI 10/22
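A minimal sketch of evidence accumulation, assuming scikit-learn is available: the co-association matrix is built from a list of label vectors and clustered with average-linkage agglomerative clustering on 1 − co-association (in scikit-learn versions before 1.2 the metric argument is called affinity).

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def co_association(labelings):
    # labelings: list of 1-D integer label arrays, one per partition, all of length n
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(labelings)       # fraction of partitions in which each pair co-occurs

def evidence_accumulation(labelings, k):
    D = 1.0 - co_association(labelings)     # turn co-association into a dissimilarity
    model = AgglomerativeClustering(n_clusters=k, metric="precomputed", linkage="average")
    return model.fit_predict(D)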

Graph and hypergraph partitioning

• Define consensus as a graph partitioning problem

• Different methods to build a graph or hypergraph from the partitions

Strehl, Ghosh Cluster ensembles - A knowledge reuse framework for combining multiple partitions Journal of Machine Learning Research, MIT Press, 2003, 3, 583-617

• Cluster based Similarity Partitioning Algorithm (CSPA)

• HyperGraph-Partitioning Algorithm (HGPA)

• Meta-CLustering Algorithm (MCLA)

URL - Spring 2020 - MAI 11/22


CSPA

• Compute a similarity matrix from the clusterings
• Hyperedges matrix: for all clusterings, compute an indicator matrix (H) that represents the links among examples and clusters (hypergraph)
• Compute the similarity matrix as:

  S = (1/r) · H·Hᵀ

  where r is the number of clusterings
• Apply a graph partitioning algorithm to the similarity matrix (METIS)
• Drawback: quadratic cost in the number of examples, O(n²kr)

URL - Spring 2020 - MAI 12/22

CSPA: example

Cluster labels:

       C1  C2  C3
  x1    1   2   1
  x2    1   2   1
  x3    1   1   2
  x4    2   1   2
  x5    2   3   2

⇒ Indicator (hyperedge) matrix H:

       C1,1 C1,2 C2,1 C2,2 C2,3 C3,1 C3,2
  x1     1    0    0    1    0    1    0
  x2     1    0    0    1    0    1    0
  x3     1    0    1    0    0    0    1
  x4     0    1    1    0    0    0    1
  x5     0    1    0    0    1    0    1

S = (1/3)·H·Hᵀ:

        x1   x2   x3   x4   x5
  x1     1    1  1/3    0    0
  x2     1    1  1/3    0    0
  x3   1/3  1/3    1  2/3  1/3
  x4     0    0  2/3    1  2/3
  x5     0    0  1/3  2/3    1

URL - Spring 2020 - MAI 13/22
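The indicator matrix and the similarity S of the example can be reproduced with a few lines of NumPy (the final graph partitioning step with METIS is not shown):

import numpy as np

def indicator_matrix(labelings):
    # stack the one-hot encoding of every clustering: rows are examples, columns clusters
    blocks = []
    for labels in labelings:
        values = np.unique(labels)
        blocks.append((labels[:, None] == values[None, :]).astype(float))
    return np.hstack(blocks)

labelings = [np.array([1, 1, 1, 2, 2]),      # C1
             np.array([2, 2, 1, 1, 3]),      # C2
             np.array([1, 1, 2, 2, 2])]      # C3
H = indicator_matrix(labelings)
S = H @ H.T / len(labelings)
print(S)    # S[0, 2] == 1/3 and S[2, 3] == 2/3, as in the tables above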


HGPA

• Partitions the hypergraph generated by the examples and their clusterings
• The indicator matrix is partitioned into k clusters of approximately the same size
• The HMETIS hypergraph partitioning algorithm is used
• Linear in the number of examples, O(nkr)

MCLA

• Group and collapse hyperedges and assign the objects to the hyperedge in which they participate the most

• Algorithm:
  1. Build a meta-graph with the hyperedges as vertices (edges have the vertices' similarities as weights, Jaccard)

2. Partition the hyperdeges into k metaclusters

3. Collapse the hyperedges of each metacluster

4. Assign examples to their most associated metacluster

• Linear in the number of examples, O(nk²r²)

URL - Spring 2020 - MAI 15/22


MCLA: Metagraph

[Figure: meta-graph built from the indicator matrix of the example — one vertex per cluster (C1,1, C1,2, C2,1, C2,2, C2,3, C3,1, C3,2), with edges weighted by the Jaccard similarity of the clusters]

URL - Spring 2020 - MAI 16/22

Information Theory

• Information theory measures are used to assess the similarity among the clusters of the partitions

• For instance:
  • Normalized Mutual Information
  • Category Utility

• The labels can be transformed into a new set of features for each example (measuring example coincidence)

• The new features can be used to partition the examples using a clustering algorithm

URL - Spring 2020 - MAI 17/22


Finite Mixture Models

• The problem is transformed into the estimation of the probability of assignment

• The mixture is composed of a product of multinomial distributions, one for each clustering

• Each example is described by the set of assignments of each clustering

• An EM algorithm is used to find the probability distribution that maximizes the agreement

URL - Spring 2020 - MAI 18/22

Median partition based methods


Median Partition Methods

• Given a set of partitions (P) and a similarity function among partitions Γ(Pi, Pj), the Median Partition Pc is the one that maximizes the similarity to the set

  Pc = arg max_{P ∈ P_X} Σ_{Pi ∈ P} Γ(P, Pi)

• It has been proven to be an NP-hard problem for some similarity functions Γ

URL - Spring 2020 - MAI 19/22

Similarity functions

• Based on the agreements and disagreements of pairs of examples between two partitions
  • Rand index, Jaccard coefficient, Mirkin distance (and their randomness-adjusted versions)

• Based on set matching
  • Purity, F-measure

• Based on information theory measures (how much information two partitions share)
  • NMI, Variation of Information, V-measure

URL - Spring 2020 - MAI 20/22


Strategies

• Best of k (the partition of the set that minimizes the distance)

• Optimization using local search: Hill Climbing, Simulated Annealing, Genetic Algorithms
  • Perform a movement of examples between two clusters of the current solution to improve the partition

• Non-Negative Matrix Factorization
  • Find the partition matrix closest to the averaged association matrix of a set of partitions

URL - Spring 2020 - MAI 21/22

Python Notebooks

This Python Notebook has examples of consensus clustering

• Consensus clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 22/22


7 Clustering Structured Data

163


Clustering of structured data

Javier Béjar

URL - Spring 2020

CS - MAI

Introduction


Clustering of structured data

• There are some domains where patterns are more complex

• In these domains examples are related to each other

• Mining these relationships is more interesting than obtaining patterns from the examples individually

• For instance:

• Temporal domains

• Relational databases

• Structured instances (trees, graphs)

• Usually these domains need specific methods

URL - Spring 2020 - MAI 1/41

Sequences


Clustering of sequences

• Data have a sequential relationship among examples

• We can have a unique sequence or a set of sequences

• The classical techniques from time series analysis do not apply (AR, ARIMA, GARCH, Kalman filter, ...)

• What makes these data different?

• Usually qualitative data

• Very short series or long series that have to be segmented

• Interest in the relationships among series

• Interest only in a part of the series (episodes, anomalies, novelty, ...)

URL - Spring 2020 - MAI 2/41

Clustering Sequences

• Clustering of temporal series: clustering algorithms applied to a set of short series

• How to segment a unique series into a set of series? What parts are interesting?

• Representation of the series, representation of the groups

• New distance/similarity measures (scale invariant, shape distances, ...)

URL - Spring 2020 - MAI 3/41


Clustering Sequences - Segmentation

• A unique series is provided and we must divide it into a set of subseries (series segmentation)

• Extract subseries using a sliding window
  • Width of the window
  • Overlapping/non-overlapping windows

• Only some parts of the series are the target (episodes)

• Anomaly Detection

• Change Detection

• Be careful with unbalanced datasets

URL - Spring 2020 - MAI 4/41

Clustering Sequences - Segmentation

[Figure: a sliding window W1 ... Wn moving over a series T, and episodes Ep1 ... Ep6 marked on a second series]

URL - Spring 2020 - MAI 5/41


Clustering Sequences - Feature Extraction

• Raw time series are not always the best input

• Feature extraction: Generate informative features

• Frequency/Time domain features (Fourier, Wavelets, ...)

• Extreme points (maximum, minima, inflection points)

• Probabilistic models (Hidden Markov Models, ARIMA)

• Symbolic representation: SAX, SFA

URL - Spring 2020 - MAI 6/41

Clustering Sequences - Feature Extraction

[Figure: an original series and three derived representations — discrete Fourier coefficients, discrete wavelet coefficients, 1-lag series]

URL - Spring 2020 - MAI 7/41


Symbolic Aggregate approXimation (SAX)

• Transforms a time series into a set of discrete symbols

• Data are discretized to strings of length M with a vocabulary of size N

• Algorithm:
  • Standardize the series (N(0, 1))
  • For each of the M subwindows compute its mean
  • Map each mean to a Gaussian distribution discretized into N segments of equal frequency

• Transformed data can be used as a string or as an integer-valued series

URL - Spring 2020 - MAI 8/41

SAX

[Figure: a standardized series divided into six windows and discretized with a three-symbol alphabet {a, b, c}, producing the string "abcaab"]

URL - Spring 2020 - MAI 9/41
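A minimal sketch of the SAX transformation, assuming a 1-D NumPy series whose length is divisible by M and using SciPy for the Gaussian breakpoints:

import numpy as np
from scipy.stats import norm

def sax(series, M, N):
    # discretize a series into a string of M symbols over an alphabet of size N
    x = (series - series.mean()) / series.std()             # standardize to N(0, 1)
    paa = x.reshape(M, -1).mean(axis=1)                     # mean of each of the M subwindows
    breakpoints = norm.ppf(np.linspace(0, 1, N + 1)[1:-1])  # equal-frequency Gaussian segments
    symbols = np.searchsorted(breakpoints, paa)             # segment index of each mean
    return "".join(chr(ord("a") + s) for s in symbols)

print(sax(np.sin(np.linspace(0, 6, 60)), M=6, N=3))         # prints a 6-symbol string over {a, b, c}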


Clustering Sequences - Distance Functions

• Usual distance functions ignore the time dynamics
  • Euclidean, Hamming, ...

• Patterns in series contain noise, time/amplitude scaling, translations
  • Dynamic Time Warping (DTW)
  • Longest Common Subsequence (LCSS)
  • Edit Distance with Real Penalty (ERP)
  • Edit Distance on Real Sequence (EDR)
  • Spatial Assembly Distance (SpADe)

Dynamic Time Warping (DTW)

• Matches the dynamic of the series

• Series can be of different lengths

• The cost of matching two points is their distance (e.g. Euclidean)

• A point from one series can be matched to multiple points of the other

• DTW is the minimum cost of the possible matchings

• Computing the distance is O(n²), but it can be reduced by limiting the number of points that can match a point

URL - Spring 2020 - MAI 11/41


Clustering Sequences - Distance Functions - DTW

URL - Spring 2020 - MAI 12/41
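A minimal O(n·m) dynamic programming sketch of DTW between two 1-D series, using the absolute difference as local cost and no warping-window constraint:

import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))   # 0.0: the second series is a time-warped copy of the first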

Clustering of Data Streams


Clustering of Data Streams

Data streams: Modeling an on-line continuous series of data

• Each item of the series is an example (one value, a vector of values, structured data)
  • For instance, sensory data (one or multiple synchronized signals), streams of documents (twitter/news)

• Data are generated from a set of clusters (stable or changing over time)
  • For instance, states from a process or semantic topics

URL - Spring 2020 - MAI 13/41

Clustering of Data Streams

• Data are processed incrementally (the model changes with time)
  • Only the current model
  • Periodic snapshots

• Different goals:

• Model the domain

• Detect anomalies/novelty/bursts

• Detect change (Concept drift)

URL - Spring 2020 - MAI 14/41


Clustering of Data Streams - Elements

• Clustering has an on-line and an off-line phase

• Elements:
  • The data structure used to summarize the data
  • The window model used to decide the influence of the current and past data

• The mechanism for identifying outliers

• The clustering algorithm used to obtain the partition of thedata

URL - Spring 2020 - MAI 15/41

Clustering of Data Streams - Summary data structure

• Involved in the on-line phase

• Data are summarized using sufficient statistics (number of examples, sum of values, sum of squared products of values, ...)

• Usually a hierarchical data structure (different levels of granularity)

• Indexing structure that can be updated incrementally

• Stores raw data or prototypes depending on space constraints

URL - Spring 2020 - MAI 16/41


Clustering of Data Streams - Window model

• Sliding window model
  • Fixed time window
  • Only data inside the window update the structure

• Damped window model
  • A weight is associated to examples and clusters
  • The influence of data depends on time; old data fade away or are discarded

• Landmark window model
  • Defines points of interest in time or amount of data
  • Data before the landmark are discarded

URL - Spring 2020 - MAI 17/41

Clustering of Data Streams - Outliers

• Difficult task because data evolve with time

• Most methods work around the idea of microclusters

• A microcluster represents a dense area in the space of examples

• The indexing structure tracks the evolution of the microclusters

• Different thresholds determine if a microcluster is kept or discarded

URL - Spring 2020 - MAI 18/41


CluStream - Prototype Based

Aggarwal et al. On Clustering Massive Data Streams: A Summarization Paradigm Data Streams - Models and Algorithms, Springer, 2007, 31, 9-38

• On-line phase:
  • Maintains microclusters (more than the final number of clusters)
  • New data are incorporated to a microcluster or generate new microclusters
  • The number of microclusters is fixed; they are merged to maintain the number
  • Periodically the microclusters are stored

• Off-line phase:
  • Given a time window, the stored microclusters are used to compute the microclusters inside the time frame
  • K-means is used to compute the clusters for the time window

DenStream - Density Based

Cao, Ester, Qian, Zhou Density-Based Clustering over an Evolving Data Stream with Noise Proceedings of the Sixth SIAM International Conference on Data Mining, 2006

• On-line phase:
  • Core-micro-clusters (a weighted sum of close points)
  • The weight of a point fades exponentially with time (damped window model)
  • New examples are merged and micro-clusters are classified as:
    • core-mc, sets of points with weight over a threshold
    • potential-mc
    • outlier-mc, sets of points with weight below a threshold
  • outlier-mc disappear with time

• Off-line phase: modified version of DBSCAN

URL - Spring 2020 - MAI 20/41


Python Notebooks

This Python Notebook has examples of time series clustering

• Time Series clustering Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 21/41

Graph mining


Mining of Structures

• A lot of information has a relational structure

• Methods and models used for unstructured data are not expressive enough

• Sometimes the structure can be flattened, but lots of interesting information is lost

• Relational database ⇒ unique merged table

• Attributes representing relations ⇒ inapplicable attributes

• Graph data ⇒ strings based on graph traversal

• Documents ⇒ bag of words

URL - Spring 2020 - MAI 22/41

Mining of Structures (WWW/Social networks)

URL - Spring 2020 - MAI 23/41


Mining of Structures (XML documents/Text)

URL - Spring 2020 - MAI 24/41

Mining of Structures (Chemical compounds/Gene interactions)

URL - Spring 2020 - MAI 25/41


Mining of Structures

• All these types of data have in common that they can be represented using graphs and trees

• Historically we can find different approaches to the discovery of patterns in graphs/trees:

• Inductive logic programming: structure is represented using logic formulas

• Graph algorithms

• Classic algorithms for detecting dense subgraphs (cliques)

• Graph isomorphism algorithms

• Graph partitioning algorithms

URL - Spring 2020 - MAI 26/41

Mining of Structures: Computational issues

• Most of the problems used to discover structures in graphs are NP-Hard

• Graph partitioning (Not for bi-partitioning)

• Graph isomorphism

• Two different problems:

• Clustering large graphs (only one structure) ⇒ Partitioning

• Clustering sets of graphs ⇒ common substructures

URL - Spring 2020 - MAI 27/41


Clustering Large Graphs

• Some information can be described as a large graph (several instances connected by different relations)

• For instance: social networks, web pages

• We want to discover interesting substructures by:

• Dividing the graph into subgraphs (k-way partitioning, node clustering)

• Extracting dense substructures

URL - Spring 2020 - MAI 28/41

Graph partitioning (2-way)

• The simplest partitioning of a graph is to divide it into two subgraphs

• We assume that edges have values as labels (similarity, distance, ...)

• This problem is the minimum-cut problem:

  "Given a graph, divide the set of nodes in two groups so that the cost of the edges connecting the nodes between the groups is minimum"

• This problem is related to the maximum flow problem, which can be solved in polynomial time

URL - Spring 2020 - MAI 29/41


Graph partitioning (2-way) - Karger’s Algorithm

• Randomized algorithm that approximates the min cut of an undirected graph

• Computational cost O(|E|)
  • Has to be repeated O(|V| log |E|) times to have a high probability of finding the global minimum

• Algorithm:
  1. Pick an edge at random and join its vertices, reconnecting the remaining edges to the new vertex
  2. Repeat until only two vertices remain

URL - Spring 2020 - MAI 30/41
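A minimal sketch of Karger's contraction on an unweighted, undirected graph given as an edge list (the toy graph and the repetition count are only illustrative; repeating the run and keeping the smallest cut is what gives the probabilistic guarantee):

import random

def karger_cut(edges, n_vertices, rng=None):
    rng = rng or random.Random()
    parent = list(range(n_vertices))          # union-find over contracted super-vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = n_vertices
    while remaining > 2:
        u, v = edges[rng.randrange(len(edges))]   # pick a random edge
        ru, rv = find(u), find(v)
        if ru != rv:                              # contract it (self-loops are ignored)
            parent[ru] = rv
            remaining -= 1
    # the cut is the set of original edges whose endpoints ended in different super-vertices
    return sum(1 for u, v in edges if find(u) != find(v))

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
print(min(karger_cut(edges, 5) for _ in range(50)))       # the minimum cut of this graph is 2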

Graph partitioning (k-way)

• The general problem is NP-hard

• It can be solved approximately by local search algorithms (hill climbing, simulated annealing)

• Kernighan-Lin Algorithm:

1. Start with a random cut of the graph (k clusters)

2. Interchange a pair of nodes from different partitions that reduces the cut

3. Iterate until no improvement

• Different variations of this algorithm change the strategy for selecting the pair of nodes to interchange

URL - Spring 2020 - MAI 31/41
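networkx ships a two-way (bisection) variant of this scheme; a minimal usage sketch on a graph made of two cliques joined by a single edge:

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.barbell_graph(5, 0)                # two 5-cliques joined by one edge
part_a, part_b = kernighan_lin_bisection(G, seed=42)
print(sorted(part_a), sorted(part_b))     # the two cliques should end up in different halves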


Classical Clustering algorithms

Classical clustering algorithms can be adapted to obtain a graph partition

• K-means and K-medoids variations
  • Nodes of the graph as prototypes
  • Objective functions to define node membership to clusters (geodesic distance)
  • Network structure indices

• Spectral Clustering
  • Define the Laplacian matrix from the graph
  • Perform the eigendecomposition
  • The largest eigenvalues determine the number of clusters

URL - Spring 2020 - MAI 32/41
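A minimal sketch of spectral clustering applied to a graph through its adjacency matrix, assuming scikit-learn and networkx are available; affinity='precomputed' tells the algorithm to treat the matrix entries as similarities.

import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.planted_partition_graph(3, 20, p_in=0.5, p_out=0.02, seed=1)   # 3 planted communities
A = nx.to_numpy_array(G)
labels = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0).fit_predict(A)
print(labels)     # nodes of the same planted block should share a label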

Social Networks - Community Discovery

• Graph partitioning corresponds to the problem of Community Discovery in the area of Social Network Analysis

• Based on graph measures detecting dense connected areas

• Edge betweenness centrality:

  B(e) = NumConstrainedPaths(e, i, j) / NumShortPaths(i, j)

  (the fraction of shortest paths between i and j that pass through the edge e)

• Random Walk Betweenness: compute how often a random walk starting on node i passes through node j

• Modularity: percentage of edges within communities compared with the expected number if they were not a community


Social Networks - Girvan Newman

• Girvan-Newman Algorithm (Betweenness)

1. Rank edges by B(e)

2. Delete edge with the highest score

3. Iterate until a specific criterion holds (e.g. number of components)

URL - Spring 2020 - MAI 34/41

Social Networks - Girvan Newman

URL - Spring 2020 - MAI 35/41
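networkx exposes Girvan-Newman as a generator of successively finer partitions; a minimal usage sketch on Zachary's karate club graph:

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
partitions = girvan_newman(G)             # repeatedly removes the highest-betweenness edge
for k, partition in zip(range(2, 5), partitions):
    print(k, [len(c) for c in partition]) # community sizes for 2, 3 and 4 components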


Social Networks - Louvain

• Louvain Algorithm (Modularity)

1. Begin with a community for each node of the graph

2. Repeat until no change:
   • For each node i in the graph and for all neighbors j of i, consider the effect on the modularity of changing the community of i to the community of j. Change the node if the modularity increases

3. Build a new graph where the new nodes are the communities and the weights of the edges connecting communities are the sum of the edges among the nodes in the original graph

4. Repeat from 2 until no changes

URL - Spring 2020 - MAI 36/41

Social Networks - Louvain

[Figure: Louvain iterations — modularity optimization followed by community aggregation, repeated over two steps; edge weights between aggregated communities are the sums of the original edge weights]

URL - Spring 2020 - MAI 37/41
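Recent versions of networkx (2.8 and later) include a Louvain implementation; a minimal usage sketch:

import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()
communities = louvain_communities(G, seed=0)     # list of sets of nodes
print(len(communities), modularity(G, communities))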


Mining Sets of Graphs

• Some information can be described as a collection of graphs

• For example: XML documents, chemical molecules

• We look at a graph as a complex object

• We have to adapt the elements of clustering algorithms to these objects:

• Distance measures to compare graphs

• Summarization of graphs as prototypes

URL - Spring 2020 - MAI 38/41

Graph Edit distance

• Edit distance can be adapted to graphs

• Define add/delete/substitute costs for edges and vertices

• Different costs lead to different functions (which sometimes do not hold the distance properties)

• The distance is the minimum cost sequence of edit operations that transforms one graph into another

[Figure: edit path from G1 to G2 — delete vertex, substitute vertex, substitute vertex]
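networkx provides an (exponential-time) graph edit distance with unit insertion/deletion costs by default; a minimal usage sketch on two small graphs:

import networkx as nx

G1 = nx.cycle_graph(4)                  # a 4-cycle
G2 = nx.path_graph(4)                   # a path on 4 nodes
print(nx.graph_edit_distance(G1, G2))   # 1.0: deleting one edge of the cycle yields the path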


Graph Kernels

• Specific graph kernels can be used to embed the data in a metric space

• A base similarity or distance function can be used (like graph edit distance)

• Diffusion kernels (extend similarity to closest neighbors)

• Walk kernels (computing the similarity of traversal paths)

• RBF kernels

URL - Spring 2020 - MAI 40/41

Python Notebooks

This Python Notebook has examples of community discovery using geolocation information from Twitter for London, Paris and Barcelona

• Dense Subgraphs Notebook (click here to go to the url)

If you have downloaded the code from the repository you will be able to play with the notebooks (run jupyter notebook to open the notebooks)

URL - Spring 2020 - MAI 41/41


8 Semisupervised Clustering

187


Semi-supervised Clustering

Javier Béjar

URL - Spring 2020

CS - MAI

Semisupervised Clustering


Semi-supervised Clustering

• Sometimes we have some information available about the dataset we are analyzing in an unsupervised fashion

• It could be interesting to incorporate this information into the clustering process in order to:
  • Bias the search of the algorithm toward the solutions more consistent with our knowledge
  • Improve the quality of the result, reducing the algorithm's natural bias (predictivity/stability)

URL - Spring 2020 - MAI 1/17

Semi-supervised Clustering

• The information that we have available can be of different kinds:
  • Sets of labeled instances
  • Constraints among certain instances: instances that have to be in the same group / instances that cannot belong to the same group
  • General information about the properties that the instances of a group must hold

URL - Spring 2020 - MAI 2/17


How to use supervised information

• It will depend on the model we can obtain:

1. Begin with a prior model that changes how the search is performed

2. Bias the search, pruning the models that are not consistent with the semisupervised knowledge

3. Modify the similarity among instances to match the constraints imposed by the prior knowledge

URL - Spring 2020 - MAI 3/17

Semisupervised Clustering / Labeled Examples


Semi supervised clustering using labeled examples

• Assuming that we have some labeled examples, these can be used to obtain an initial model

• We only need to know which cluster each labeled example belongs to; the actual cluster descriptions are not needed

• The clustering process can begin from this model, using it as the starting point of the search

• This initial model changes the search and biases the final model

URL - Spring 2020 - MAI 4/17

Semi supervised clustering using labeled examples

Basu, Banerjee, Mooney Semi-supervised clustering by seeding ICML 2002

• Algorithm based on K-means
• The usual initialization of K-means is to randomly select the initial prototypes
• Two alternatives:
  • Use the labeled examples to build the initial prototypes (seeding)
  • Use the labeled examples to build the initial prototypes and constrain the model so the labeled examples always stay in their initial clusters (seed and constrain)
• The initial prototypes give an initial probability distribution for the clustering

URL - Spring 2020 - MAI 5/17


Semi supervised clustering using labeled examples

URL - Spring 2020 - MAI 6/17

Seeded-KMeans

Algorithm: Seeded-KMeans

Input: The dataset X, the number of clusters K, a set S of labeled instances (k groups)
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the labeled instances
    repeat
        Assign each example from X to its nearest prototype µi
        Recompute the prototype µi with the examples assigned
    until Convergence
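A minimal NumPy sketch of Seeded-KMeans, assuming the seed labels are integers 0..K−1 and that every cluster has at least one seed; the seeds are only used to initialize the prototypes. The constrained variant of the next slide would additionally reset the labels of the seed examples to their given classes before recomputing the prototypes.

import numpy as np

def seeded_kmeans(X, seeds_X, seeds_y, k, iterations=100):
    # initial prototypes: mean of the labeled (seed) examples of each class
    centers = np.vstack([seeds_X[seeds_y == c].mean(axis=0) for c in range(k)])
    for _ in range(iterations):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.vstack([X[labels == c].mean(axis=0) if np.any(labels == c)
                                 else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):     # convergence: prototypes stop moving
            break
        centers = new_centers
    return labels, centers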


Constrained-KMeans

Algorithm: Constrained-KMeans

Input: The dataset X, the number of clusters K, a set S of labeled instances (k groups)
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the labeled instances
    repeat
        Maintain the examples from S in their initial classes
        Assign each example from X to its nearest prototype µi
        Recompute the prototype µi with the examples assigned
    until Convergence

Semisupervised Clustering / Constraints


Semi supervised clustering using constraints

• Having labeled examples means that the number of clusters and something about the characteristics of the data are known

• Sometimes it is easier to have information about whether two examples belong to the same or to different clusters

• This information can be expressed by means of constraints among examples: must-links and cannot-links

• This information can be used to bias the search and only look for models that maintain these constraints

URL - Spring 2020 - MAI 9/17

Semi supervised clustering using constraints

URL - Spring 2020 - MAI 10/17


Semi supervised clustering using constraints

Basu, Bilenko, Mooney A probabilistic framework for semi-supervised clustering ICML 2002

• Algorithm based on K-means (spherical clusters based on prototypes)

• A set of must-link and cannot-link constraints is defined over a subset of examples

• The quality function of the K-means algorithm is modified to bias the search

• A hidden Markov random field is defined using the constraints

URL - Spring 2020 - MAI 11/17

Semi supervised clustering using constraints

• The labels of the examples can be used to define a Markov random field

• The must-links and cannot-links define the dependence among the variables

• The clustering of the examples has to maximize the probability of the hidden Markov random field

[Figure: data points and the hidden MRF induced by the must-link and cannot-link constraints]

URL - Spring 2020 - MAI 12/17


Semi supervised clustering using constraints

• A new objective function for the K-Means is defined

• The main idea is to introduce a penalty term in the objective function that:
  • Penalizes clusterings that put examples with must-links in different clusters
  • Penalizes clusterings that put examples with cannot-links in the same cluster

• This penalty has to be proportional to the distance among the instances

URL - Spring 2020 - MAI 13/17

HMRF-KMeans

Algorithm: HMRF-KMeans

Input: The data X, the number of clusters K, the must and cannot links, a distance function D and weights for violating the constraints
Output: A partition of X in K groups
begin
    Compute K initial prototypes (µi) using the constraints
    repeat
        E-step: Reassign the labels of the examples using the prototypes (µi) to minimize Jobj
        M-step: Given the cluster labels, recalculate the cluster centroids to minimize Jobj
    until Convergence


Semisupervised Clustering / Distance Learning

Semi supervised clustering with Distance Learning

• Another approach consists of learning a more adequate distance function to fulfill the constraints

• The constraints are used as a guide to find a distance matrix that represents the relations among the examples

• The problem can be defined as an optimization problem over the distances among examples with respect to the constraints

• These methods are related to kernel methods: the goal is to learn a kernel matrix that represents a new space where the instances have appropriate distances

URL - Spring 2020 - MAI 15/17


Semi supervised clustering with Distance Learning

• Relevant Component Analysis [Yeung, Chang (2006)] (optimization of a linear combination of distance kernels generated by must and cannot links)

• Optimization with the Graph Spectral Matrix, maintaining the constraints in the new space and the structure of the original space

• Learning of Mahalanobis distances: separate or bring closer the different dimensions to match the constraints

• Kernel Clustering (Kernel K-means) with kernel matrix learning via regularization

URL - Spring 2020 - MAI 16/17

Semi supervised clustering with Distance Learning

[Figure: effect of distance learning on the clustering — the original data and three successive iterations]

URL - Spring 2020 - MAI 17/17