Data Clustering with R


Page 1: Data Clustering with R

Data Clustering with R

Yanchang Zhao

http://www.RDataMining.com

30 September 2014


Page 2: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 3: Data Clustering with R

Data Clustering with R [1]

- k-means clustering with kmeans()

- k-medoids clustering with pam() and pamk()

- hierarchical clustering

- density-based clustering with DBSCAN

[1] Chapter 6: Clustering, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf


Page 4: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 5: Data Clustering with R

k-means clustering

set.seed(8953)

iris2 <- iris

iris2$Species <- NULL

(kmeans.result <- kmeans(iris2, 3))

## K-means clustering with 3 clusters of sizes 38, 50, 62

##

## Cluster means:

## Sepal.Length Sepal.Width Petal.Length Petal.Width

## 1 6.850 3.074 5.742 2.071

## 2 5.006 3.428 1.462 0.246

## 3 5.902 2.748 4.394 1.434

##

## Clustering vector:

## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...

## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3...

## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3...

## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1...

## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3...

##

## Within cluster sum of squares by cluster:

## [1] 23.88 15.15 39.82

## (between_SS / total_SS = 88.4 %)

##

## Available components:

##

## [1] "cluster" "centers" "totss" "withinss"...

## [5] "tot.withinss" "betweenss" "size" "iter" ...

## [9] "ifault"
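A side note, not on the original slide: kmeans() is based on Euclidean distance, so variables on larger scales dominate the result. A minimal sketch of the same clustering on standardized data (results may differ from the run above):

set.seed(8953)
# scale each variable to mean 0 and standard deviation 1,
# so that no single variable dominates the distance
kmeans.scaled <- kmeans(scale(iris2), 3)
table(iris$Species, kmeans.scaled$cluster)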


Page 6: Data Clustering with R

Results of k-Means Clustering

Check clustering result against class labels (Species)

table(iris$Species, kmeans.result$cluster)

##

## 1 2 3

## setosa 0 50 0

## versicolor 2 0 48

## virginica 36 0 14

- Class “setosa” can be easily separated from the other clusters.

- Classes “versicolor” and “virginica” overlap with each other to a small degree. When the number of clusters is not known in advance, a simple way to choose it is sketched below.
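A minimal sketch of the common “elbow” heuristic, using the tot.withinss component returned by kmeans(); the range of k and the nstart setting are illustrative choices, not from the slides:

# total within-cluster sum of squares for k = 1..8;
# look for the k where the curve stops dropping sharply
wss <- sapply(1:8, function(k) kmeans(iris2, k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")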


Page 7: Data Clustering with R

plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)

points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2) # plot cluster centers

[Scatter plot of Sepal.Length (x) against Sepal.Width (y), with points colored by cluster and the three cluster centers marked by stars.]

Page 8: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 9: Data Clustering with R

The k-Medoids Clustering

- Difference from k-means: in the k-means algorithm a cluster is represented with its center, while in k-medoids clustering it is represented with the object closest to the center of the cluster.

- More robust than k-means in the presence of outliers.

- PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering.

- The CLARA algorithm enhances PAM by drawing multiple samples of the data, applying PAM to each sample and then returning the best clustering. It performs better than PAM on larger data.

- Functions pam() and clara() are in package cluster; a minimal clara() sketch follows this list.

- Function pamk() in package fpc does not require the user to choose k.
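Since clara() is mentioned above but not demonstrated later in the slides, here is a minimal sketch, assuming iris2 as prepared in the k-means section (the samples value is an illustrative choice):

library(cluster)
# CLARA: apply PAM to multiple samples of the data and keep the best result
clara.result <- clara(iris2, 3, samples = 50)
table(clara.result$clustering, iris$Species)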


Page 10: Data Clustering with R

Clustering with pamk()

library(fpc)

pamk.result <- pamk(iris2)

# number of clusters

pamk.result$nc

## [1] 2

# check clustering against actual species

table(pamk.result$pamobject$clustering, iris$Species)

##

## setosa versicolor virginica

## 1 50 1 0

## 2 0 49 50

Two clusters:

- “setosa”

- a mixture of “versicolor” and “virginica”


Page 11: Data Clustering with R

layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page

plot(pamk.result$pamobject)

[Left: clusplot of pam(x = sdata, k = k, diss = diss), showing the two clusters in the first two principal components, which explain 95.81 % of the point variability, with lines between the clusters. Right: silhouette plot of the same result (n = 150, 2 clusters); average silhouette width 0.69; per-cluster sizes and average widths: 1: 51 | 0.81, 2: 99 | 0.62.]

layout(matrix(1)) # change back to one graph per page

Page 12: Data Clustering with R

- The left chart is a two-dimensional “clusplot” (clustering plot) of the two clusters, and the lines show the distance between the clusters.

- The right chart shows their silhouettes. A large si (almost 1) suggests that the corresponding observations are very well clustered, a small si (around 0) means that the observation lies between two clusters, and observations with a negative si are probably placed in the wrong cluster.

- Since the average si values are 0.81 and 0.62 respectively in the above silhouette plot, the two identified clusters are well clustered. (The silhouette widths can also be computed directly; see the sketch below.)
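Silhouette widths can be computed directly with silhouette() from package cluster; a minimal sketch, assuming the pamk.result object from the previous slide:

library(cluster)
# pamk() keeps the underlying pam object in $pamobject;
# silhouette() returns the silhouette width of every observation
sil <- silhouette(pamk.result$pamobject)
summary(sil)$avg.width # average silhouette width, the 0.69 reported above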


Page 13: Data Clustering with R

Clustering with pam()

# group into 3 clusters

pam.result <- pam(iris2, 3)

table(pam.result$clustering, iris$Species)

##

## setosa versicolor virginica

## 1 50 0 0

## 2 0 48 14

## 3 0 2 36

Three clusters:

- Cluster 1 is species “setosa” and is well separated from the other two.

- Cluster 2 is mainly composed of “versicolor”, plus some cases from “virginica”.

- The majority of cluster 3 are “virginica”, with two cases from “versicolor”.


Page 14: Data Clustering with R

layout(matrix(c(1, 2), 1, 2)) # 2 graphs per page

plot(pam.result)

[Left: clusplot of pam(x = iris2, k = 3), showing the three clusters in the first two principal components, which explain 95.81 % of the point variability. Right: silhouette plot of the same result (n = 150, 3 clusters); average silhouette width 0.55; per-cluster sizes and average widths: 1: 50 | 0.80, 2: 62 | 0.42, 3: 38 | 0.45.]

layout(matrix(1)) # change back to one graph per page

Page 15: Data Clustering with R

Results of Clustering

- In this example, the result of pam() seems better, because it identifies three clusters, corresponding to the three species.

- Note that we cheated by setting k = 3 when using pam(), since the number of species was already known to us. A sketch of choosing k by average silhouette width follows.
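Instead of fixing k in advance, one can compare the average silhouette width of pam() across several values of k (pamk() does something similar internally); a minimal sketch, with the range of k an illustrative choice:

# average silhouette width of pam() for k = 2..6;
# larger values suggest better-separated clusters
avg.width <- sapply(2:6, function(k) pam(iris2, k)$silinfo$avg.width)
names(avg.width) <- 2:6
avg.width # k = 2 scores highest here (0.69, versus 0.55 for k = 3)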


Page 16: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 17: Data Clustering with R

Hierarchical Clustering of the iris Data

set.seed(2835)

# draw a sample of 40 records from the iris data, so that the

# clustering plot will not be over crowded

idx <- sample(1:dim(iris)[1], 40)

irisSample <- iris[idx, ]

# remove class label

irisSample$Species <- NULL

# hierarchical clustering

hc <- hclust(dist(irisSample), method = "ave")

# plot clusters

plot(hc, hang = -1, labels = iris$Species[idx])

# cut tree into 3 clusters

rect.hclust(hc, k = 3)

# get cluster IDs

groups <- cutree(hc, k = 3)
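To check the cut against the known species, the cluster IDs can be tabulated; a small addition, not on the original slide:

# compare the 3 clusters from cutree() with the actual species of the sample
table(groups, iris$Species[idx])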


Page 18: Data Clustering with R

[Cluster dendrogram of the 40 sampled iris records, labeled by species; produced by hclust (*, "average") on dist(irisSample), with height ranging from 0 to about 4 and rectangles marking the cut into 3 clusters.]

Page 19: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 20: Data Clustering with R

Density-based Clustering

- Group objects into one cluster if they are connected to one another by a densely populated area.

- The DBSCAN algorithm from package fpc provides density-based clustering for numeric data.

- Two key parameters in DBSCAN: eps, the reachability distance, which defines the size of the neighborhood (a heuristic for choosing it is sketched after this list); and MinPts, the minimum number of points.

- If the number of points in the neighborhood of point α is no less than MinPts, then α is a dense point. All the points in its neighborhood are density-reachable from α and are put into the same cluster as α.

- Can discover clusters of various shapes and sizes.

- Insensitive to noise.

- In contrast, the k-means algorithm tends to find clusters of spherical shape and similar sizes.
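A common heuristic for choosing eps is to plot the distance from each point to its MinPts-th nearest neighbor and look for a knee; a minimal base-R sketch, not from the slides, assuming iris2 without the Species column:

# distance from each point to its 5th nearest neighbor (MinPts = 5)
d <- as.matrix(dist(iris2))
knn.dist <- apply(d, 1, function(row) sort(row)[6]) # sort(row)[1] is the point itself
plot(sort(knn.dist), type = "l",
     xlab = "points sorted by distance", ylab = "5-NN distance")
abline(h = 0.42, lty = 2) # the eps value used on the next slide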


Page 21: Data Clustering with R

Density-based Clustering of the iris data

library(fpc)

iris2 <- iris[-5] # remove class tags

ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

# compare clusters with original class labels

table(ds$cluster, iris$Species)

##

## setosa versicolor virginica

## 0 2 10 17

## 1 48 0 0

## 2 0 37 0

## 3 0 3 33

- 1 to 3: identified clusters

- 0: noise or outliers, i.e., objects that are not assigned to any cluster


Page 22: Data Clustering with R

plot(ds, iris2)

[Scatter-plot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with points marked by the clusters found by DBSCAN.]

Page 23: Data Clustering with R

plot(ds, iris2[c(1, 4)])

[Scatter plot of Sepal.Length (x) against Petal.Width (y), with points marked by DBSCAN cluster.]

Page 24: Data Clustering with R

plotcluster(iris2, ds$cluster)

[Discriminant-coordinates plot of the DBSCAN clusters (dc 1 against dc 2): each point is plotted as its cluster number, 1 to 3, with 0 marking noise.]

Page 25: Data Clustering with R

Prediction with Clustering Model

- Label new data based on their similarity to the clusters.

- Draw a sample of 10 objects from iris and add small noise to them to make a new dataset for labeling.

- Random noise is generated with a uniform distribution using function runif().

# create a new dataset for labeling

set.seed(435)

idx <- sample(1:nrow(iris), 10)

# remove class labels

new.data <- iris[idx,-5]

# add random noise

new.data <- new.data + matrix(runif(10 * 4, min = 0, max = 0.2), nrow = 10, ncol = 4)

# label new data

pred <- predict(ds, iris2, new.data)


Page 26: Data Clustering with R

Results of Prediction

table(pred, iris$Species[idx]) # check cluster labels

##

## pred setosa versicolor virginica

## 0 0 0 1

## 1 3 0 0

## 2 0 3 0

## 3 0 1 2

Eight (= 3 + 3 + 2) out of the 10 objects are assigned correct class labels.



Page 28: Data Clustering with R

plot(iris2[c(1, 4)], col = 1 + ds$cluster)

points(new.data[c(1, 4)], pch = "+", col = 1 + pred, cex = 3)

[Scatter plot of Sepal.Length (x) against Petal.Width (y): points colored by DBSCAN cluster, with the 10 new observations drawn as large “+” signs colored by their predicted cluster.]

Page 29: Data Clustering with R

Outline

Introduction

The k-Means Clustering

The k-Medoids Clustering

Hierarchical Clustering

Density-based Clustering

Online Resources


Page 30: Data Clustering with R

Online Resources

- Chapter 6: Clustering, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf

- R Reference Card for Data Mining. http://www.rdatamining.com/docs/R-refcard-data-mining.pdf

- Free online courses and documents. http://www.rdatamining.com/resources/

- RDataMining Group on LinkedIn (7,000+ members). http://group.rdatamining.com

- RDataMining on Twitter (1,700+ followers): @RDataMining


Page 31: Data Clustering with R

The End

Thanks!

Email: yanchang(at)rdatamining.com
