Lecture 19: Clustering Analysis (ST7901 lecture notes, NC State University)



Lecture 19: Clustering Analysis

Wenbin Lu

Department of StatisticsNorth Carolina State University

Fall 2019

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 33


Outline

Unsupervised Learning

Introduction
Various Learning Problems

Clustering Analysis

Introduction
Optimal Clustering

Various Clustering Algorithms

K-means Algorithm
K-medoids Algorithm
Hierarchical Clustering

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 2 / 33


Unsupervised Learning Introduction

What Is Unsupervised Learning

Learning patterns from data without a teacher.

Data: \{x_i\}_{i=1}^{n}.

No y_i's are available.

Various unsupervised learning problems:

cluster analysis

dimension reduction

density estimation problems (useful for low-dimensional problems)

In contrast to supervised learning, there is no clear measure of success for unsupervised learning.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 3 / 33


Unsupervised Learning Introduction

About Supervised Learning

Learning with a teacher: given a training set \{(x_i, y_i)\}_{i=1}^{n}, learn the relationship between x and y.

regression, classification, ...

Once a model \hat{f}(x) is obtained, one can use it for prediction: \hat{y} = \hat{f}(x).

Measure of success: accuracy of prediction. For example,

L(y, \hat{y}) = (y - \hat{y})^2.

"student": the prediction value \hat{y}_i.
"teacher": the correct answer and/or an error associated with the student's answer.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 4 / 33


Unsupervised Learning Introduction

Cluster Analysis

Goal: to find multiple regions of the X-space that contain modes of P(x).

Can we represent P(x) by a mixture of simpler densities representing distinct types or classes of observations?

[Figure: simulated data plotted in the (x1, x2) plane (both axes from −1.5 to 1.5), with observations concentrated around several modes.]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 5 / 33


Unsupervised Learning Introduction

Dimension Reduction

Goal: to describe the association among variables as a function of a smaller set of "latent" variables.

Principal component analysis (PCA)

Independent component analysis (ICA)

used for blind source separation. Assume X = AS, where S is a p-vector containing independent components. The goal is to identify A and S.

multidimensional scaling (MDS)

self-organizing maps (SOM)

principal curves

....

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 6 / 33


Unsupervised Learning Introduction

Density Estimation

Goal: estimate the joint density of X_1, \dots, X_p.

histogram

Gaussian mixture provides a crude model

kernel density estimation, smoothing splines

Challenges:

Density estimation can be infeasible for high dimensional problems.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 7 / 33


Cluster Analysis Introduction

Cluster Analysis Introduction

Goal: group or segment the dataset (a collection of objects) into subsets, such that those within each subset are more closely related to one another than those assigned to different subsets.

Each subset is called a cluster.

Two types of clustering: flat and hierarchical:

flat clustering divides the dataset into K clusters

hierarchical clustering arranges the clusters into a natural hierarchy

It involves successively grouping the clusters themselves. At each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 8 / 33


Cluster Analysis Introduction

Proximity and Dissimilarity Matrices

Clustering results are crucially dependent on the measure of similarity or distance between the "points" to be clustered.

One can use either a similarity or a dissimilarity matrix.

Similarities can be converted to dissimilarities using a monotone-decreasing function.

Given two points (objects), x_i and x_{i'}, there are two levels of dissimilarity:

Attribute dissimilarity: d_j(x_{ij}, x_{i'j}) quantifies the dissimilarity in their j-th attribute.

Object dissimilarity: d(x_i, x_{i'}) quantifies the overall dissimilarity between the two points (objects).

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 9 / 33


Cluster Analysis Introduction

Object Dissimilarities

A common choice of attribute dissimilarity is the squared distance:

d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2.

The object dissimilarity can be quantified by the sum of the d_j's,

D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j}),

or a weighted sum,

D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j d_j(x_{ij}, x_{i'j}), where \sum_{j=1}^{p} w_j = 1.

General distance-based dissimilarity:

d(x_i, x_{i'}) = L(\|x_i - x_{i'}\|).
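As a quick sketch of these formulas (the function name is my own, and numpy is assumed), the unweighted and weighted object dissimilarities with squared-distance attribute dissimilarities can be computed as:

```python
import numpy as np

def object_dissimilarity(xi, xip, w=None):
    """D(x_i, x_i') built from squared-distance attribute dissimilarities.
    If given, w holds attribute weights w_j summing to 1."""
    d = (np.asarray(xi, float) - np.asarray(xip, float)) ** 2  # the d_j terms
    if w is None:
        return d.sum()                           # unweighted sum over attributes
    return (np.asarray(w, float) * d).sum()      # weighted sum

print(object_dissimilarity([0, 0], [3, 4]))                # 25.0
print(object_dissimilarity([0, 0], [3, 4], w=[0.5, 0.5]))  # 12.5
```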

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 10 / 33


Cluster Analysis Introduction

Correlation-Based Dissimilarity

Alternatively, the following correlation can be used to measure the dissimilarity between subjects:

\rho(x_i, x_{i'}) = \frac{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'})}{\sqrt{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)^2 \sum_{j=1}^{p} (x_{i'j} - \bar{x}_{i'})^2}},

where \bar{x}_i = \sum_{j=1}^{p} x_{ij}/p is the average over variables (not observations).

If inputs are standardized, then

\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \propto 2(1 - \rho(x_i, x_{i'})).

In this case, clustering based on correlation (similarity) is equivalent to clustering based on squared distance (dissimilarity).
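The proportionality can be checked numerically; a small sketch (numpy assumed), where each observation is standardized across its own p attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
x, y = rng.normal(size=p), rng.normal(size=p)

# Standardize each observation over its p attributes (mean 0, variance 1).
sx = (x - x.mean()) / x.std()
sy = (y - y.mean()) / y.std()

rho = np.corrcoef(sx, sy)[0, 1]           # correlation across attributes
d2 = np.sum((sx - sy) ** 2)               # squared distance
print(np.isclose(d2, 2 * p * (1 - rho)))  # True: d2 = 2p(1 - rho)
```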

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 11 / 33


Cluster Analysis Introduction

Proximity Matrix

The proximity (or alikeness, or affinity) matrix D = (d_{ik}):

The ik-th element d_{ik} measures the proximity between the i-th and the k-th objects (or observations).

D is of size n × n.

D is typically symmetric.

If D is not symmetric, (D + D')/2 can be used.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 12 / 33


Cluster Analysis Optimal Clustering and Combinatorial Algorithm

Criterion for Optimal Clustering

Each observation is uniquely labeled by an integer i ∈ {1, 2, . . . , n}.
K clusters, indexed by k ∈ {1, . . . , K}.
Define C(i) = k: the i-th observation is assigned to the k-th cluster.

The optimal clustering C^* should satisfy:

The total dissimilarity within clusters is minimized, i.e., we want to minimize

W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'}).

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 13 / 33


Cluster Analysis Optimal Clustering and Combinatorial Algorithm

Within-Class and Between-Class Dissimilarities

Total dissimilarity:

T = \frac{1}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{ii'} = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \Big( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \neq k} d_{ii'} \Big).

Between-cluster dissimilarity:

B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} d_{ii'}.

Hence, W(C) = T - B(C).

Minimizing W (C ) is equivalent to maximizing B(C ).
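The decomposition T = W(C) + B(C) holds for any pairwise dissimilarity and any assignment; a small numeric check (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))               # 12 observations, 3 features
labels = rng.integers(0, 3, size=12)       # an arbitrary assignment C

# Pairwise dissimilarities d_ii' (squared Euclidean here, but any d works).
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
same = labels[:, None] == labels[None, :]  # indicator of C(i) == C(i')

T = D.sum() / 2            # total dissimilarity
W = D[same].sum() / 2      # within-cluster part
B = D[~same].sum() / 2     # between-cluster part
print(np.isclose(T, W + B))  # True
```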

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 14 / 33


Cluster Analysis Optimal Clustering and Combinatorial Algorithm

Combinatorial Algorithms

One needs to minimize W over all possible assignments of n points toK clusters.

The number of distinct assignments is

S(n, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^{n}.

Complete enumeration is not feasible for large n and K.

This calls for more efficient algorithms: they may not find the optimal partition, but a reasonably good suboptimal one.
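S(n, K) is the Stirling number of the second kind; a short sketch (function name my own) shows how quickly the count explodes:

```python
from math import comb, factorial

def num_assignments(n, K):
    """S(n, K): ways to partition n observations into K nonempty clusters."""
    total = sum((-1) ** (K - k) * comb(K, k) * k ** n for k in range(1, K + 1))
    return total // factorial(K)   # the alternating sum is divisible by K!

print(num_assignments(10, 4))  # 34105
print(num_assignments(19, 4))  # already around 10^10: enumeration is hopeless
```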

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 15 / 33


Various Clustering Algorithms K-means Method

K-means Method

One of the most popular iterative descent clustering methods.

Works for the case when all variables are quantitative.

Dissimilarity measure: the squared distance

d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2.

The within-class scatter can be reduced to

W(C) = \sum_{k=1}^{K} n_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,

where \bar{x}_k is the mean vector of cluster k and n_k is the size of cluster k.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 16 / 33


Various Clustering Algorithms K-means Method

Optimization for K-means Clustering

Optimization Problem:

C^* = \arg\min_{C} \sum_{k=1}^{K} n_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,

where

\bar{x}_S = \arg\min_{m} \sum_{i \in S} \|x_i - m\|^2.

This can be solved using an iterative descent algorithm.

We actually need to solve the enlarged optimization problem:

\min_{C, \{m_k\}_{1}^{K}} \sum_{k=1}^{K} n_k \sum_{C(i)=k} \|x_i - m_k\|^2.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 17 / 33


Various Clustering Algorithms K-means Method

K-means Algorithm

Start with initial guesses for the K cluster centers:

1 (Clustering Step) Given K centroids {m_1, . . . , m_K}, solve for C by assigning each observation to the closest (current) cluster mean:

C(i) = \arg\min_{k} \|x_i - m_k\|^2.

2 (Minimization Step) For a given C, update the centers {m_1, . . . , m_K} as the means of the K current clusters.

3 Iterate steps 1 and 2 until the assignments do not change.

This can be regarded as a top-down procedure for clustering analysis. (see http://en.wikipedia.org/wiki/K-means_clustering)
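The two steps above can be sketched in a few lines (a minimal illustration, numpy assumed; there is no guard against empty clusters, and the initial centroids can be fixed for reproducibility):

```python
import numpy as np

def kmeans(X, K, centers=None, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    if centers is None:
        centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Clustering step: assign each point to the closest current centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Minimization step: move each centroid to its cluster mean.
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, 2, centers=np.array([[0.0, 0.0], [5.0, 5.0]]))
```

In practice one would rerun with many random starts and keep the solution with the smallest W(C), as the "About K-means Algorithm" slide notes.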

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 18 / 33


Various Clustering Algorithms K-means Method

[Figure, from Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Chapter 14. Figure 14.6: Successive iterations of the K-means clustering algorithm for the simulated data of Figure 14.4. Panels: Initial Centroids, Initial Partition, Iteration Number 2, Iteration Number 20.]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 19 / 33


Various Clustering Algorithms K-means Method

About K-means Algorithm

K-means often produces good results.

The K-means algorithm is guaranteed to converge, since each of steps 1 and 2 reduces W(C).

The time complexity of K-means is O(tKn), where t is the number of iterations.

K-means finds a local optimum and may miss the global optimum.

Different starting values lead to different clustering results.

One should start the algorithm with many different random choices of starting means, and choose the solution having the smallest value of W(C).

K-means does not work on categorical data and does not handle outliers well.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 20 / 33


Various Clustering Algorithms K-means Method

How to Choose K

In the K-means algorithm, one needs to specify K in advance.

K is a tuning parameter. An inappropriate choice of K may yield poor results.

Intuitively, one can try different K values and evaluate W(C) on a test set.

However, W(C) generally decreases with increasing K, since a larger number of centers tends to fill the feature space densely and thus will be close to all data points. So cross-validation or a test set does not work here.

A heuristic approach: locate a kink in the plot of W(C) versus K to choose the optimal K^*.

Use the Gap statistic (Tibshirani et al. 2001)
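On tiny data the kink is easy to see, since W(C) can even be minimized exactly by brute force over all K^n assignments (a sketch; the n_k factor matches the within-class scatter formula for squared distance):

```python
import numpy as np
from itertools import product

def best_W(X, K):
    """Minimum within-cluster scatter over ALL assignments (tiny n only)."""
    best = np.inf
    for assign in product(range(K), repeat=len(X)):
        a = np.array(assign)
        W = 0.0
        for k in range(K):
            pts = X[a == k]
            if len(pts):
                W += len(pts) * ((pts - pts.mean(axis=0)) ** 2).sum()
        best = min(best, W)
    return best

X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2]).reshape(-1, 1)  # two tight groups
Ws = [best_W(X, K) for K in (1, 2, 3)]
# W(C) drops sharply up to the true K = 2, then flattens: the "kink".
```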

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 21 / 33


Various Clustering Algorithms K-means Method

[Figure, from Elements of Statistical Learning, Chapter 14. Figure 14.8: Total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data, plotted against the number of clusters K (K = 2, . . . , 10).]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 22 / 33


Various Clustering Algorithms K-medoids Algorithm

Motivations of K-medoids Algorithm

The K-means algorithm:

assumes squared Euclidean distance in the minimization step

requires all the inputs to be quantitative type

The K-medoids algorithm generalizes the K-means algorithm in the following ways:

it can use arbitrarily defined dissimilarities D(x_i, x_{i'})

it can be applied to any data described only by proximity matrices

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 23 / 33


Various Clustering Algorithms K-medoids Algorithm

Features of K-medoids Algorithm

Main ideas:

Centers for each cluster are restricted to be one of the observations assigned to the cluster.

There is no need to explicitly compute cluster centers.
The cost of updating the center of cluster k is O(n_k^2). (For K-means, it is O(n_k).)

Given a set of cluster centers {i_1, . . . , i_K}, obtain the new assignments

C(i) = \arg\min_{1 \le k \le K} d_{i i_k}.

This requires computation proportional to Kn.

K-medoids is far more computationally intensive than K-means.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 24 / 33


Various Clustering Algorithms K-medoids Algorithm

K-medoids Algorithm

Start with initial guesses for the K cluster centers:

1 (Clustering Step) Given centers {m_1, . . . , m_K}, minimize the total error by assigning each observation to the closest center:

C(i) = \arg\min_{1 \le k \le K} D(x_i, m_k).

2 (Minimization Step) For a given C, find the observation in cluster k that minimizes the total distance to the other points in cluster k:

i_k^* = \arg\min_{i: C(i)=k} \sum_{C(i')=k} D(x_i, x_{i'}).

3 Iterate steps 1 and 2 until the assignments do not change.

This is a heuristic search strategy to solve \min_{C, \{i_k\}_{1}^{K}} \sum_{k=1}^{K} \sum_{C(i)=k} d_{i i_k}.
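The two steps can be sketched directly on a dissimilarity matrix (a minimal illustration, numpy assumed; the function name is my own):

```python
import numpy as np

def kmedoids(D, K, n_iter=100, seed=0):
    """K-medoids sketch; D is an (n, n) dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=K, replace=False)
    for _ in range(n_iter):
        # Clustering step: assign each object to its closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if len(members):
                # Minimization step: the member with the smallest total
                # dissimilarity to its cluster becomes the new medoid.
                new[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return labels, medoids

x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = (x[:, None] - x[None, :]) ** 2        # any proximity matrix works here
labels, medoids = kmedoids(D, 2)
```

Note that only D is touched, so the same code applies to data described only by a proximity matrix.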

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 25 / 33


Various Clustering Algorithms Hierarchical Clustering

Hierarchical Clustering

K-means does not give a linear ordering of objects within a cluster. This motivates the idea of hierarchical clustering:

Produce hierarchical representations: the clusters at each level of the hierarchy are created by merging clusters at the next lower level.

At the lowest level, each cluster contains a single observation.

At the highest level, there is only one cluster containing all observations.

The dendrogram is a very informative and highly interpretable display of the clustering result.

Cutting the dendrogram horizontally at a particular height partitions the data into disjoint clusters, represented by the vertical lines that intersect the cut.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 26 / 33


Various Clustering Algorithms Hierarchical Clustering

[Figure, from Elements of Statistical Learning, Chapter 14. Figure 14.12: Dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data. Leaf labels are tumor types: CNS, RENAL, BREAST, NSCLC, OVARIAN, PROSTATE, MELANOMA, LEUKEMIA, COLON, K562, MCF7, UNKNOWN.]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 27 / 33


Various Clustering Algorithms Hierarchical Clustering

Agglomerative Clustering

Two paradigms:

agglomerative (bottom-up)

divisive (top-down).

Key steps in the agglomerative clustering:

Begin with every observation representing a singleton cluster.

At each step, merge the two "closest" clusters into one cluster and reduce the number of clusters by one.

Need a measure of dissimilarity between two clusters.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 28 / 33


Various Clustering Algorithms Hierarchical Clustering

Dissimilarity between Two Groups

In hierarchical clustering, it is important to define the dissimilarity between two clusters (groups) G and H:

d(G, H) is a function of the set of pairwise dissimilarities d_{ii'}, where x_i ∈ G and x_{i'} ∈ H.

Single linkage: the closest pair

d_{SL}(G, H) = \min_{i \in G, i' \in H} d_{ii'}.

Complete linkage: the furthest pair

d_{CL}(G, H) = \max_{i \in G, i' \in H} d_{ii'}.

Group average: average dissimilarity

d_{GA}(G, H) = \frac{1}{n_G n_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}.

Group average clustering has a statistical consistency property that is violated by single and complete linkage.
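The three linkages are simply different reductions over the block of pairwise dissimilarities between the two groups; a minimal sketch (numpy assumed; names my own):

```python
import numpy as np

def group_dissimilarity(D, G, H, linkage="average"):
    """d(G, H) from a pairwise dissimilarity matrix D; G, H are index lists."""
    block = D[np.ix_(G, H)]          # d_ii' for i in G, i' in H
    if linkage == "single":
        return block.min()           # closest pair
    if linkage == "complete":
        return block.max()           # furthest pair
    return block.mean()              # group average

x = np.array([0.0, 1.0, 5.0])
D = np.abs(x[:, None] - x[None, :])
G, H = [0, 1], [2]
print(group_dissimilarity(D, G, H, "single"),    # 4.0
      group_dissimilarity(D, G, H, "complete"),  # 5.0
      group_dissimilarity(D, G, H, "average"))   # 4.5
```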

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 29 / 33


Various Clustering Algorithms Hierarchical Clustering

[Figure, from Elements of Statistical Learning, Chapter 14. Figure 14.13: Dendrograms from agglomerative hierarchical clustering of the human tumor microarray data, with average, complete, and single linkage.]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 30 / 33


Various Clustering Algorithms Hierarchical Clustering

Human Tumor Microarray Data

An example of high-dimensional clustering

The data form a 6,830 × 64 matrix of real numbers.

Row: the expression measurements for a gene.

Column: a sample (labeled, e.g., "breast cancer" or "melanoma").

In the next plot,

we have arranged the genes (rows) and samples (columns) in orderings derived from hierarchical clustering;

the subtree with the tighter cluster is placed to the left (or bottom).

This two-way rearrangement is more informative than displaying the data with randomly ordered rows and columns.

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 31 / 33


Various Clustering Algorithms Hierarchical Clustering

[Figure, from Elements of Statistical Learning, Chapter 14. Figure 14.14: DNA microarray data: average linkage hierarchical clustering has been applied independently to the rows (genes) and columns (samples), determining the ordering of the rows and columns (see text). The colors range from bright green (negative, underexpressed) to bright red (positive, overexpressed).]

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 32 / 33


Various Clustering Algorithms Hierarchical Clustering

R functions

K-means:

package stats, function kmeans

Hierarchical Clustering:

package stats, function hclust

Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 33 / 33