Applied Machine Learning, Lecture 10: Unsupervised learning (richajo/dit866/lectures/l10/l10.pdf)


Page 1

Applied Machine Learning
Lecture 10: Unsupervised learning

Selpi ([email protected])

These slides are a further development of Richard Johansson's slides.

February 25, 2020

Page 2

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Page 3

Different learning approaches based on supervision

Page 6

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Page 7

Unsupervised learning

- Why do this?
  - Hand-labeled data might not be available, or might be expensive and time-consuming to produce
- Goal:
  - Discover structure in the data in order to summarise, explore, visualise, and understand it

Page 8

Applications of unsupervised learning

Page 9

Main types of unsupervised approaches

- Grouping the data: clustering
- Modeling the statistical distribution of the data
- Dimensionality reduction: projecting a high-dimensional dataset to a lower dimensionality

Page 10

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Page 11

clustering

- Clustering describes the data by forming groups (partitions) or hierarchies

[Figure: scatter plot of clustered points, and a dendrogram over European languages (lv, et, fi, lt, mt, pl, cs, sk, sl, ro, en, fr, it, es, pt, hu, de, nl, da, sv)]

Page 12

What do we need to be able to cluster data?

- A distance measure, for example:
  - Euclidean distance
  - Manhattan distance
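The two distance measures above can be computed directly with NumPy; the vectors here are made-up example values:

```python
import numpy as np

# Two made-up feature vectors
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(x - y))
```

scikit-learn's clustering estimators and `sklearn.metrics.pairwise_distances` offer the same measures via a `metric` argument.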

Page 13

Different clustering methods

- Partitional
  - K-means
  - Graph-based
  - EM
- Hierarchical
  - Single linkage
  - Average linkage
  - Median linkage
  - Complete linkage
  - Centroid linkage
  - Ward's method
- ...
- Neural network

Page 14

in scikit-learn

http://scikit-learn.org/stable/modules/clustering.html

Page 15

flat clustering

[Figure: two scatter plots of data partitioned into flat clusters]

Page 16

flat clustering methods: overview

Page 17

K-means (1)

- Each cluster is represented by its centroid μ_i
- Minimize the within-cluster sum of squares:

  argmin_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖²

Randomly initialise k cluster centroids
repeat until converged:
    assign each training point to the cluster with the nearest centroid
    move each centroid to the centre (mean) of its cluster
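The loop above can be sketched in plain NumPy. This is a minimal illustration of the algorithm, not a production implementation; the function name `kmeans` and the empty-cluster guard are our own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch following the pseudocode above."""
    rng = np.random.default_rng(seed)
    # randomly initialise k cluster centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        # (if a cluster becomes empty, keep its old centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```

In practice, `sklearn.cluster.KMeans` adds smarter initialisation (k-means++) and multiple restarts.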

Page 18

K-means (2)

- How to determine the number of clusters?
  - Elbow method, trial and error, domain knowledge
- Advantages
  - Simple, easy to understand, fast, reasonably robust, good for distinct clusters
- Disadvantages
  - Relies on the choice of k, can get stuck in local minima, depends on the initialisation, clusters even uniform data, cannot handle outliers, sensitive to scaling
- See demo!
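The elbow method mentioned above can be tried with scikit-learn's `KMeans` by watching how the inertia (within-cluster sum of squares) drops as k grows; the data here are synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

# Inertia drops sharply until k reaches the true number of clusters,
# then flattens out: the "elbow"
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

Plotting `inertias` against k and looking for the bend gives a rough choice of k; domain knowledge should still be consulted.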


Page 22

DBSCAN

[source]

Parameters: eps and min_samples

1) Randomly select a point P.

2) Retrieve all points directly density-reachable from P w.r.t. eps (the radius to search for neighbours).
   - If P is a core point, a cluster is formed: recursively find all points density-connected to P and assign them to the same cluster as P.
   - If P is not a core point, move on to the remaining unvisited points in the dataset.
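The procedure above is implemented in scikit-learn as `DBSCAN`; the point coordinates below are made up so that two dense groups and one isolated outlier emerge:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away outlier (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

# eps: neighbourhood radius; min_samples: points (incl. the point itself)
# needed in the neighbourhood for a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# the two dense groups get cluster ids; the outlier gets the noise label -1
```

Unlike K-means, DBSCAN discovers the number of clusters itself and marks outliers explicitly.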

Page 23

DBSCAN example

[Figure: scatter plot of a DBSCAN clustering, with noise points marked]

See demo!

Page 24

Discussion

- Why is clustering hard?
  - Difficult to interpret and evaluate the results


Page 26

Evaluating clustering algorithms

- How could we evaluate the result produced by a clustering algorithm?
  - Check the silhouette score
  - Compare with gold-standard data (if any)

Page 27

Evaluating flat clustering (1): internal consistency

[Figure: two scatter plots illustrating clusterings with different internal consistency]

Page 28

Cluster evaluation: the silhouette score

- Measures how close each object is to its own cluster compared to other clusters
- For each data point x_i, the silhouette score is defined as

  s_i = (b_i − a_i) / max(a_i, b_i)

  where
  - a_i is the average distance to the other members of the same cluster
  - b_i is the minimal average distance to another cluster
- Then we take the average over all the data points
- In scikit-learn: sklearn.metrics.silhouette_score
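A quick check of the formula with scikit-learn; the points below are made up as two well-separated clusters, so the score should be close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two clearly separated groups and their cluster assignments (made-up data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# average of s_i over all points; near 1 for compact, well-separated clusters
score = silhouette_score(X, labels)
```

Scores near 0 indicate overlapping clusters; negative scores suggest points assigned to the wrong cluster.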

Page 29

example

[Figure: two clusterings of the same data; left: silhouette = 0.68, right: silhouette = 0.42]


Page 31

evaluating flat clustering (2): comparing to a gold standard

- scikit-learn contains several evaluation metrics for this scenario
  - http://scikit-learn.org/stable/modules/classes.html#clustering-metrics
  - http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- For instance, the adjusted Rand score:

[source]
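With a gold standard available, the adjusted Rand score compares two partitions while ignoring the arbitrary cluster ids; the labels here are made up:

```python
from sklearn.metrics import adjusted_rand_score

gold = [0, 0, 1, 1, 2, 2]       # hypothetical gold-standard classes
predicted = [1, 1, 0, 0, 2, 2]  # the same partition, with permuted cluster ids

# identical partitions score 1.0 regardless of how clusters are numbered;
# random labelings score close to 0
score = adjusted_rand_score(gold, predicted)
```

This id-invariance is exactly what makes the adjusted Rand score suitable for clustering, where cluster numbers carry no meaning.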

Page 32

hierarchical clustering

See demo!
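A minimal hierarchical-clustering sketch with scikit-learn's `AgglomerativeClustering`; the linkage criteria listed earlier (single, average, complete, Ward) are selected via the `linkage` parameter:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two made-up groups of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# build the hierarchy bottom-up and cut it at 2 clusters, using Ward's method
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```

For the dendrogram itself, `scipy.cluster.hierarchy.linkage` and `dendrogram` are commonly used.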

Page 35

Co-clustering

[source]

Page 36

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Page 37

modeling distributions

- We are given a dataset, and now we want to:
  - visualize the distribution
  - find the most probable region in the data
  - randomly generate new synthetic data from the same distribution
  - check whether there are any exceptional data points
  - judge, for some new data point x, whether it plausibly comes from the same distribution

[source]

- We need to model the statistical distribution of the data
  - How would we solve this using an approach from a basic stats class?
  - ... and what are the limitations of that method?


Page 40

kernel density estimation (2)

[source]
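Kernel density estimation is available in scikit-learn as `KernelDensity`; a sketch on a 1-D synthetic sample, with the bandwidth chosen by hand:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# 1-D sample drawn around 0 (synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 1))

# fit a Gaussian-kernel density estimate to the sample
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns the log-density; it is higher near the data
log_dens = kde.score_samples(np.array([[0.0], [5.0]]))
```

Thresholding such densities is one simple route to the anomaly detection discussed on the following slides.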

Page 43

novelty detection, anomaly detection, . . .

[source]

Applications: fraud detection, cyber security, manufacturing processes, machine monitoring.

Page 44

Anomaly detection vs supervised

- Anomaly detection: very small number of positive examples and a large number of negative examples; future anomalies are unlikely to be similar to the training set
- Supervised: large numbers of positives and negatives; future examples are likely to be similar to the training set

Page 45

in scikit-learn

http://scikit-learn.org/stable/modules/outlier_detection.html

Page 46

one-class SVM

- The one-class SVM tries to find a decision boundary that encloses the training data (except for a fraction ν)

[source]

- It can be used for novelty and outlier detection
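A sketch with scikit-learn's `OneClassSVM`; the training data are synthetic "normal" points, and the `nu` parameter bounds the fraction of training points treated as outliers:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# synthetic "normal" training data around the origin
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))

# nu=0.05: allow about 5% of training points outside the boundary
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)

# predict returns +1 for inliers and -1 for outliers
preds = clf.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
```

A point near the training cloud is accepted; a point far outside it is flagged as an outlier.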

Page 47

one-class SVM (formally)

Page 48

Anomaly detection - choosing features

- Plot a histogram of the data and check whether it looks Gaussian. If not, apply a transformation to make it look more Gaussian (e.g., take log(x) and plot the data again).
- log(x) can be replaced by another function; a function that makes the data look more Gaussian can be used to define the feature.
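The log-transform trick above can be checked numerically: skewness (a simple measure of non-Gaussianity) shrinks after taking logs of log-normal data. A NumPy-only sketch; the `skewness` helper is our own:

```python
import numpy as np

def skewness(v):
    """Sample skewness: the third standardized moment (zero for a Gaussian)."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

# heavily right-skewed synthetic data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_before = skewness(x)          # large and positive
skew_after = skewness(np.log(x))   # close to 0: log(x) is Gaussian here
```

In practice one would also plot histograms before and after the transformation, as the slide suggests.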

Page 49

evaluation of anomaly detection systems

- How do you think it should be done?

Page 50

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Page 51

dimensionality reduction

- Reducing a high-dimensional dataset to a low-dimensional one
- Why?
  - visualizing, understanding
  - reducing the need for storage
  - making supervised/unsupervised algorithms run faster
  - making supervised/unsupervised learning easier

Page 52

Principal Component Analysis

[source]
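A minimal PCA sketch with scikit-learn, on synthetic 2-D data whose variance is concentrated along one axis:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic 2-D data: large variance along the first axis, small along the second
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

# keep only the first principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)  # project onto the direction of most variance
```

`pca.explained_variance_ratio_` tells how much of the total variance the kept components retain, which guides the choice of dimensionality.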

Page 53

autoencoders: dimensionality reduction using a NN

- For several Keras implementations, see https://blog.keras.io/building-autoencoders-in-keras.html

Page 54

Review from today's lecture

- Characterise the essential differences between supervised, unsupervised, and semi-supervised learning approaches
- Discuss several unsupervised learning methods
- Compare and evaluate different clustering methods
- Compare and evaluate different distance measures

Page 55

Next two lectures

- 28 Feb: Ethics in Machine Learning
  - Will be given by Vilhelm Verendel (https://www.chalmers.se/en/staff/Pages/vilhelm-gustav-verendel.aspx)
- 3 March: Dealing with time series data