Applied Machine Learning
Lecture 10: Unsupervised learning
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides.
February 25, 2020
Overview
Different learning approaches
Unsupervised learning
Clustering
modeling distributions
dimensionality reduction
Different learning approaches based on supervision
Overview
Different learning approaches
Unsupervised learning
Clustering
modeling distributions
dimensionality reduction
Unsupervised learning
- Why do this?
  - hand-labeled data might not be available, or might be expensive and time-consuming to produce
- Goal:
  - to discover structure in the data, in order to summarise, explore, visualise, and understand it
Applications of unsupervised learning
Main types of unsupervised approaches
- grouping the data: clustering
- modeling the statistical distribution of the data
- dimensionality reduction: projecting a high-dimensional dataset to a lower dimensionality
Overview
Different learning approaches
Unsupervised learning
Clustering
modeling distributions
dimensionality reduction
clustering
- clustering describes the data by forming groups (partitions) or hierarchies
[Figure: left, a scatter plot of clustered data points; right, a dendrogram grouping European languages (lv, et, fi, lt, mt, pl, cs, sk, sl, ro, en, fr, it, es, pt, hu, de, nl, da, sv)]
What do we need to be able to cluster data?
- a distance measure between data points, for example (see the sketch below):
  - Euclidean distance
  - Manhattan distance
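A minimal sketch of computing both distances with SciPy (the two toy points are assumptions for illustration):

import numpy as np
from scipy.spatial.distance import euclidean, cityblock

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

print(euclidean(x, y))   # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(cityblock(x, y))   # |1-4| + |2-6| = 7.0 (Manhattan distance)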
Different clustering methods
- Partitional
  - K-means
  - Graph-based
  - EM
- Hierarchical
  - Single linkage
  - Average linkage
  - Median linkage
  - Complete linkage
  - Centroid linkage
  - Ward's method
- ...
- Neural network
in scikit-learn
http://scikit-learn.org/stable/modules/clustering.html
Flat clustering
[Figure: two scatter plots of example data to be clustered]
Flat clustering methods: overview
K-means (1)
- each cluster is represented by its centroid $\mu_i$
- minimize the within-cluster sum of squares:

  $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2$
Randomly initialise k cluster centroids
repeat until converged:
    assign each training point to a cluster, based on distance to the centroids
    move each centroid to the centre of its cluster
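A hedged sketch of this loop via scikit-learn's KMeans (the toy matrix X is an assumption):

import numpy as np
from sklearn.cluster import KMeans

# two obvious groups, around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the learned centroids mu_i
print(km.inertia_)          # the within-cluster sum of squares being minimized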
K-means (2)
- How to determine the number of clusters?
  - Elbow method (see the sketch below), trial and error, use domain knowledge
- Advantages
  - simple, easy to understand, fast, robust enough, good for distinct clusters
- Disadvantages
  - relies on k, can have local minima, depends on initialisation, it clusters even uniform data, cannot handle outliers, sensitive to scaling
- See Demo!
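A sketch of the elbow method: fit k-means for a range of k and look for the bend in the inertia curve (the synthetic blob data is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# synthetic data with 4 "true" clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# the curve drops sharply until k=4, then flattens: the "elbow"
plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('within-cluster sum of squares')
plt.show()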
DBSCAN
[source]
Parameters: eps and min_samples
1) Randomly select a point P
2) Retrieve all points directly density-reachable from P w.r.t. eps (the radius to search for neighbours)
   - If P is a core point, a cluster is formed. Recursively find all its density-connected points and assign them to the same cluster as P.
   - If P is not a core point, iterate through the remaining unvisited points in the dataset.
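A minimal sketch with scikit-learn's DBSCAN (the toy points and the eps and min_samples values are assumptions):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # [0 0 0 1 1 -1]; label -1 marks points treated as noise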
DBSCAN example
[Figure: scatter plot of an example DBSCAN clustering]
See demo!
Discussion
- why is clustering hard?
  - Difficult to interpret and evaluate the results
Evaluating clustering algorithms
- how could we evaluate the result produced by a clustering algorithm?
  - Check the silhouette score
  - Compare with gold-standard data (if any)
Evaluating flat clustering (1): internal consistency
[Figure: two scatter plots of example clusterings of the same data]
Cluster evaluation: the silhouette score
- Measures how close each object is to its own cluster compared to other clusters.
- For each data point $x_i$, the silhouette score is defined as

  $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$

  where
  - $a_i$ is the average distance to the other members of the same cluster
  - $b_i$ is the minimal average distance to another cluster
- then we take the average over all the data
- in scikit-learn: sklearn.metrics.silhouette_score
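A minimal sketch of that call (the synthetic blob data and choice of k are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# average of s_i over all points; closer to 1 means tighter, better-separated clusters
print(silhouette_score(X, labels))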
example
[Figure: two clusterings of the same data; left: silhouette = 0.68, right: silhouette = 0.42]
evaluating flat clustering (2): comparing to a gold standard
- scikit-learn contains several evaluation metrics for this scenario
  - http://scikit-learn.org/stable/modules/classes.html#clustering-metrics
  - http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- for instance, the adjusted Rand score:
[source]
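A small sketch of the adjusted Rand score (both label vectors are assumed toy data):

from sklearn.metrics import adjusted_rand_score

gold = [0, 0, 1, 1, 2, 2]  # assumed gold-standard classes
pred = [1, 1, 0, 0, 2, 2]  # cluster IDs from some algorithm

# 1.0: the partitions agree perfectly, even though the IDs are named differently
print(adjusted_rand_score(gold, pred))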
hierarchical clustering
See demo!
Single linkage clustering
[source]
Complete linkage clustering
[source]
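A hedged sketch of hierarchical clustering with SciPy, using the toy matrix X as an assumption; swapping the method argument switches between the linkage criteria above:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# method='single' gives single linkage; 'complete', 'average' and 'ward' also work
Z = linkage(X, method='single')
dendrogram(Z)
plt.show()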
Overview
Different learning approaches
Unsupervised learning
Clustering
modeling distributions
dimensionality reduction
modeling distributions
- we are given a dataset and now we need to
  - visualize the distribution
  - find the most probable region in the data
  - randomly generate new synthetic data from the same distribution
  - are there any exceptional data points?
  - for some new data point x, does it seem plausible that it comes from the same distribution?

[source]

- we need to model the statistical distribution of the data
  - how would we solve this using an approach we've seen in a basic stats class?
  - ... and what are the limitations of that method?
kernel density estimation
[source]
in scikit-learn
sklearn.neighbors.KernelDensity
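A minimal sketch of that class (the Gaussian sample, bandwidth, and query points are assumptions):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(1000, 1))  # assumed 1-D sample

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

# score_samples returns log densities: high near 0, tiny near 5
print(np.exp(kde.score_samples([[0.0], [5.0]])))

# draw new synthetic points from the estimated distribution
print(kde.sample(5, random_state=0))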
novelty detection, anomaly detection, . . .
[source]
Applications: fraud detection, cyber security, manufacturing processes, machine monitoring.
Anomaly detection vs supervised
- Anomaly detection: very small number of positive examples, large number of negative examples; future anomalies are unlikely to be similar to the training set
- Supervised: large number of positives and negatives; future examples are likely to be similar to the training set
in scikit-learn
http://scikit-learn.org/stable/modules/outlier_detection.html
one-class SVM
- the one-class SVM tries to find a decision boundary that encloses the training data (except a fraction ν)

[source]

- it can be used for novelty and outlier detection
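A minimal sketch with scikit-learn's OneClassSVM (the training sample and the value of nu are assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))  # assumed "normal" training data

# nu is roughly the fraction of training points allowed outside the boundary
oc = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(X_train)

print(oc.predict([[0.0, 0.0], [6.0, 6.0]]))  # +1 = inlier, -1 = outlier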
one-class SVM (formally)
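A sketch of the standard formulation (after Schölkopf et al.): separate the mapped data from the origin with maximum margin, letting roughly a fraction $\nu$ of the $n$ points fall outside:

$\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad \text{s.t.} \quad \langle w, \phi(x_i)\rangle \ge \rho - \xi_i, \quad \xi_i \ge 0$

A new point $x$ is then predicted as an inlier when $\operatorname{sgn}(\langle w, \phi(x)\rangle - \rho)$ is positive.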
Anomaly detection - choosing features
- Plot a histogram of the data and see if it looks Gaussian. If not, apply a transformation to make the data look Gaussian (e.g., take log(x) and plot the data again).
- log(x) can be replaced with another function; any function that makes the data more Gaussian can be used to construct the feature.
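A small sketch of this check, assuming a skewed (log-normal) toy feature, whose log is Gaussian by construction:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed toy feature

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x, bins=30)
ax1.set_title('raw: skewed')
ax2.hist(np.log(x), bins=30)  # log of a log-normal sample is Gaussian
ax2.set_title('log(x): roughly Gaussian')
plt.show()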
evaluation of anomaly detection systems
- how do you think it should be done?
Overview
Different learning approaches
Unsupervised learning
Clustering
modeling distributions
dimensionality reduction
dimensionality reduction
- reducing a high-dimensional dataset to a low-dimensional one
- why?
  - visualizing, understanding
  - reducing the need for storage
  - making supervised/unsupervised algorithms run faster
  - making supervised/unsupervised learning easier
Principal Component Analysis
[source]
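A minimal sketch with scikit-learn's PCA (the random 10-dimensional data is an assumption):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # assumed 10-dimensional data

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                # project onto the top 2 components
print(X_2d.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance each component keeps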
autoencoders: dimensionality reduction using a NN
- for several Keras implementations, see https://blog.keras.io/building-autoencoders-in-keras.html
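As a hedged sketch of the idea, a tiny dense autoencoder in Keras (the 784-dimensional input and 32-unit bottleneck are assumptions chosen for illustration):

from tensorflow import keras
from tensorflow.keras import layers

# encoder compresses 784 inputs (e.g. flattened 28x28 images) to 32 dimensions
inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation='relu')(inputs)
# decoder tries to reconstruct the original 784 values from the 32-d code
decoded = layers.Dense(784, activation='sigmoid')(encoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# trained to reproduce its own input: autoencoder.fit(X, X, epochs=10)
# the 32-dimensional code is the dimensionality-reduced representation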
Review from today's lecture
- Characterise essential differences between supervised, unsupervised, and semi-supervised learning approaches
- Discuss several unsupervised learning methods
- Compare and evaluate different clustering methods
- Compare and evaluate different distance measures
Next two lectures
- 28 Feb: Ethics in Machine Learning
  - Will be given by Vilhelm Verendel (https://www.chalmers.se/en/staff/Pages/vilhelm-gustav-verendel.aspx)
- 3 March: Dealing with time series data