introduction to machine learning - amazon s3 · 2016-05-04 · introduction to machine learning 1....

50
INTRODUCTION TO MACHINE LEARNING Clustering with k-means

Upload: others

Post on 29-Jul-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Clustering with k-means

Page 2: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Clustering, what?

● Similar within cluster

● Dissimilar between clusters

● Clustering: grouping objects in clusters

● No labels: unsupervised classification

● Plenty possible clusterings

● Cluster: collection of objects

Page 3: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Clustering, why?● Pa!ern Analysis

● Visualise Data

● pre-Processing Step

● Outlier Detection

● …

● Targeted Marketing Programs

● Student Segmentations

● Data Mining

● …

Page 4: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Clustering, how?● Measure of Similarity: d( …, …)

• Numerical variables Metrics: Euclidean, Manhattan, …

• Categorical variables Construct your own distance

• Clustering Methods

• k-means

• Hierarchical

• …

Many variations

Page 5: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Cluster CentroidObjectCluster

#Clusters

Compactness and Separation● Within Cluster Sums of Squares (WSS):

Measure of compactness

● Between Cluster Sums of Squares (BSS):

Measure of separation

Minimise WSS

Maximise BSS

Cluster Centroid#Clusters

#Objects in ClusterSample Mean

Page 6: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Goal: Partition data in k disjoint subsets

k-Means Algorithm

0 5 10−5

05

x

y

Let’s take k = 3

Page 7: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

1. Randomly assign k centroids

0 5 10−5

05

x

y

k-Means Algorithmk = 3Goal: Partition data in k disjoint subsets

Page 8: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

1. Randomly assign k centroids

2. Assign data to closest centroid

0 5 10−5

05

x

y

k-Means Algorithmk = 3Goal: Partition data in k disjoint subsets

Page 9: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

1. Randomly assign k centroids

2. Assign data to closest centroid

3. Moves centroids to average location

0 5 10−5

05

x

y

k-Means Algorithmk = 3Goal: Partition data in k disjoint subsets

Page 10: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

1. Randomly assign k centroids

2. Assign data to closest centroid

3. Moves centroids to average location

4. Repeat step 2 and 3

0 5 10−5

05

x

y

k-Means Algorithmk = 3Goal: Partition data in k disjoint subsets

Page 11: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

1. Randomly assign k centroids

2. Assign data to closest centroid

3. Moves centroids to average location

4. Repeat step 2 and 3

0 5 10−5

05

x

y

k-Means Algorithm

The algorithm has converged!

k = 3Goal: Partition data in k disjoint subsets

Page 12: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

● Problem: WSS keeps decreasing as k increases!

● Solution: WSS starts decreasing slowly

Choosing k● Goal: Find k that minimizes WSS

WSS / TSS < 0.2 Fix k}

Page 13: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Choosing k: Scree PlotScree Plot: Visualizing the ratio WSS / TSS as function of k

1 2 3 4 5 6 7

0.2

0.4

0.6

0.8

1.0

k

WSS

/ TS

S

Look for the elbow in the plot

Choose k = 3

Page 14: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

k-Means in R

● centers: Starting centroid or #clusters

● nstart: #times R restarts with different centroids

> my_km <- kmeans(data, centers, nstart)

Distance: Euclidean metric

> my_km$tot.withinss

> my_km$betweenss

WSS

BSS

Page 15: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

Page 16: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Performance and Scaling

Page 17: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Cluster EvaluationNot trivial! There is no truth

● No true labels

● No true response

Evaluation methods? Depends on the goal

Goal: Compact and Separated Measurable!

Page 18: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Cluster MeasuresWSS and BSS:

Underlying idea:

● Variance within clusters

● Separation between clusters } Compare

Alternative:

● Diameter

● Intercluster Distance

Good indication

Page 19: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Diameter

0 5 10 15−5

05

x

y

: Objects : Cluster

: Distance (objects)

Measure of Compactness

Page 20: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

0 5 10 15

−50

5

x

y

Intercluster Distance

: Objects : Clusters: Distance (objects)

Measure of Separation

Page 21: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

0 5 10 15

−50

5

x

y

Dunn’s Index

Page 22: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Notes:

● High computational cost

● Worst case indicator

Dunn’s IndexHigher Dunn Be!er separated / more compact

Page 23: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Alternative measures● Internal Validation: based on intrinsic knowledge

● BIC Index

● Silhoue!e’s Index

● External Validation: based on previous knowledge

● Hulbert’s Correlation

● Jaccard’s Coefficient

Page 24: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Evaluating in RLibraries: cluster and clValid

> dunn(clusters = my_km, Data = ...)

Dunn’s Index:

● clusters: cluster partitioning vector

● Data: original dataset

Page 25: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Scale IssuesMetrics are o!en scale dependent!

Which pair is most similar? ( Age, Income, IQ )

● X1 = (28, 72000, 120)

● X2 = (56, 73000, 80)

● X3 = (29, 74500, 118)

● Intuition: (X1, X3)

● Euclidean: (X1, X2)

Solution: Rescale income / 1000$

Page 26: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

StandardizingProblem: Multiple variables on different scales

Solution: Standardize your data

1. Subtract the mean

2. Divide by the standard deviation

> scale(data)

Note: Standardizing Different interpretation

Page 27: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Let’s practice!

Page 28: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Hierarchical Clustering

Page 29: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Hierarchical ClusteringHierarchy:

● Which objects cluster first?

● Which cluster pairs merge? When?

Bo!om-up:

● Starts from the objects

● Builds a hierarchy of clusters

Page 30: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: AlgorithmPre: Calculate distances between objects

Objects

Page 31: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: AlgorithmPre: Calculate distances between objects

Distance

Objects

Page 32: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm1. Put every object in its own cluster

Page 33: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm2. Find the closest pair of clusters Merge them

Page 34: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm3. Compute distances between new cluster and old ones

Page 35: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm4. Repeat steps two and three

Page 36: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm4. Repeat steps two and three

Page 37: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm4. Repeat steps two and three

Page 38: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Bo!om-Up: Algorithm4. Repeat steps two and three One cluster

Page 39: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Linkage-Methods● Simple-Linkage: minimal distance between clusters

● Complete-Linkage: maximal distance between clusters

● Average-Linkage: average distance between clusters

Different Clusterings

Page 40: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Simple-LinkageMinimal distance between objects in each clusters

Page 41: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Complete-LinkageMaximal distance between objects in each cluster

Page 42: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Single-Linkage: Chaining

● O!en undesired

Page 43: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Single-Linkage: Chaining

● O!en undesired

Page 44: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Single-Linkage: Chaining

● O!en undesired

Page 45: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Single-Linkage: Chaining

● Can be great outlier detector

● O!en undesired

Page 46: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

DendrogramHe

ight

Leaves / Objects

Cut

Merge

Merge

Page 47: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Hierarchical Clustering in R

> dist(x, method)

Library: stats

> hclust(d, method)

● x: dataset ● method: distance

● d: distance matrix ● method: linkage

Page 48: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

Hierarchical: Pro and Cons● Pros

● In-depth analysis

● Linkage-methods

● Cons

● High computational cost

● Can never undo merges

Different pa"ern

Page 49: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

Introduction to Machine Learning

k-Means: Pro and Cons● Pros

● Can undo merges

● Fast computations

● Cons

● Fixed #Clusters

● Dependent on starting centroids

Page 50: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning 1. Randomly assign k centroids 2. Assign data to closest centroid 3. Moves centroids

INTRODUCTION TO MACHINE LEARNING

Let’s practice!