week 5 lecture 10 cluster analysis - juniata...


Post on 07-Jul-2020


Page 1: Week 5 Lecture 10 Cluster Analysis - Juniata Collegejcsites.juniata.edu/faculty/merovich/QuantEcol_files/ClusterAnalysis... · Agglomerative Nesting Cluster Analysis Measures of Fit—Assessing

Week 5 Lecture 10

Cluster Analysis

Searching for group structure / pattern among observations

Class Prep

library(vegan); library(cluster)

bufochem.csv; boreal_birds.csv

read.table('http://jcsites.juniata.edu/faculty/merovich/QuantEcol_files/bufochem.csv', header=TRUE, sep=",")

HW

Borcard et al. 2011

Thought for today: "No matter what you do, there will be other people who do it better. Do everything for the fun of it, and never mind the score." -- Warren Miller, filmmaker (1924-2018)

Page 2:

Cluster Analysis

Learning outcomes

Explain the purpose of cluster analysis

Appropriately use cluster analysis to address research questions and complete data analyses

Interpret output from cluster analysis and use this information in further analyses

Page 3:

CA – purpose

Can I group my observations into various categories given their attributes?

No a priori groups

Unconstrained

Un-supervised

Classification technique

Page 4:

Cluster Analysis

The essence is to find groups of observations

Within- vs. between-group distance

Site    Spp1  Spp2  Spp3  Spp4
Lake1     13    56    32    19
Lake2     12     1    59    53
Lake3      3     3    87    63
Lake4     43     3    12     7
Lake5     10    11    14     5

Page 5:

Cluster Analysis

Visualize a distance matrix

– [X] continuous n x p data frame (n observations in rows, p variables in columns)

– n x n dissimilarity matrix

– i.e., usually want to cluster ROWS (not variables)
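
The step from an n x p data frame to an n x n dissimilarity matrix can be sketched in base R. This is only an illustration: it reuses the lake-by-species table from the earlier slide, and the object names (lakes, d, d_mat) and the choice of Euclidean distance are assumptions, not part of the lecture.

```r
# Lake-by-species abundances (n = 5 rows, p = 4 columns), copied from the slide table.
lakes <- data.frame(
  Spp1 = c(13, 12, 3, 43, 10),
  Spp2 = c(56, 1, 3, 3, 11),
  Spp3 = c(32, 59, 87, 12, 14),
  Spp4 = c(19, 53, 63, 7, 5),
  row.names = paste0("Lake", 1:5)
)

# dist() compares ROWS, so it returns one value per pair of lakes.
d <- dist(lakes, method = "euclidean")

# Expanding to a full matrix shows the n x n shape with zeros on the diagonal.
d_mat <- as.matrix(d)
dim(d_mat)
round(d_mat, 1)
```

To cluster variables instead of rows, one would transpose the table first, e.g. dist(t(lakes)).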

Page 6:

Kinds

Partitioning—non hierarchical

Hierarchical

Page 7:

Hierarchical Cluster Analysis

Finds nested sequence of clusters

Kinds

– Divisive

– Agglomerative (the opposite of divisive; the common choice)

– Trace the splits or additions (respectively) to produce a dendrogram

Page 8:

Agglomerative Nested (Hierarchical) Cluster Analysis: An Example

Site-by-species data matrix (n = 4 sites x p = 3 species)

Site     Spp1  Spp2  Spp3
Stream1    23     2    20
Stream2    42    17    21
Stream3    46    34     5
Stream4    12     5    15

Page 9:

Example

Calculate the square (n x n) dissimilarity matrix: Euclidean distance

         Stream1  Stream2  Stream3  Stream4
Stream1     0
Stream2    24.2      0
Stream3    42.2     23.7      0
Stream4    12.4     32.9     45.8      0
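
As a check on the matrix above, here is a minimal base R sketch that recomputes the Euclidean distances from the stream-by-species table. The object names (streams, d, by_hand) are made up for illustration.

```r
# Stream-by-species abundances from the example slide (4 sites x 3 species).
streams <- data.frame(
  Spp1 = c(23, 42, 46, 12),
  Spp2 = c(2, 17, 34, 5),
  Spp3 = c(20, 21, 5, 15),
  row.names = paste0("Stream", 1:4)
)

# dist() defaults to Euclidean distance between rows and prints the lower triangle.
d <- dist(streams)
round(d, 1)

# The same calculation by hand for one pair (Stream1 vs. Stream2):
# square root of the summed squared differences across species.
x <- as.matrix(streams)
by_hand <- sqrt(sum((x["Stream1", ] - x["Stream2", ])^2))
round(by_hand, 1)
```

For Stream1 vs. Stream2 the differences are 19, 15 and 1, so the distance is sqrt(361 + 225 + 1) = sqrt(587), about 24.2.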

Page 10:

Example

Summarize the dissimilarity matrix in a dendrogram

– First, look for the smallest distance and plot it

Page 11:

Example

Calculate the square (n x n) dissimilarity matrix: Euclidean distance

[Figure: points S1, S2, S3 and S4 with the distances 24.2, 12.4 and 23.7 marked]

         Stream1  Stream2  Stream3  Stream4
Stream1     0
Stream2    24.2      0
Stream3    42.2     23.7      0
Stream4    12.4     32.9     45.8      0

Page 12:

First Step in Building Dendrogram

[Figure: dendrogram, y-axis Euclidean distance (10 to 50); STR1 and STR4 joined at 12.4]

Page 13:

Recalculate Euclidean distance matrix with Stream 1 and 4 combined as a group

Look for the smallest distance and plot

[Figure: S1 & S4, S2 and S3 with the distances 24.2 and 23.7 marked]

Linkage Methods:

Single (Closest)

Complete (Furthest)

Average

                Stream 1 and 4   Stream 2          Stream 3
Stream 1 and 4  0
Stream 2        24.2, 32.9       0
Stream 3        42.2, 45.8       23.7 (unchanged)  0

Page 14:

Second Step in Building Dendrogram

[Figure: dendrogram, y-axis Euclidean distance (10 to 50); STR1 and STR4 joined at 12.4, STR2 and STR3 joined at 23.7]

Option: complete or furthest neighbor: largest dissimilarity between a point in the first cluster and a point in the second cluster (furthest neighbor method)

Page 15:

Recalculate Euclidean distance matrix with Stream 1 and 4 combined as a group, and with Stream 2 and 3 combined as a group

Look for the smallest distance and plot

[Figure: S1 & S4 and S2 & S3 with the distance 24.2 marked]

                Stream 1 and 4   Stream 2 and 3
Stream 1 and 4  0                24.2, 32.9, 42.2, 45.8
Stream 2 and 3                   0

Page 16:

Third Step in Building Dendrogram

[Figure: agglomerative dendrogram, y-axis Euclidean distance (10 to 50); STR1 and STR4 joined at 12.4, STR2 and STR3 joined at 23.7, the two groups joined at 24.2]

Page 17:

Agglomerative Nested (Hierarchical) Cluster Analysis

How should one calculate the new distances between newly merged observations (groups) and the remaining objects?

Page 18:

Agglomerative Nested (Hierarchical) Cluster Analysis

How should one calculate the new distances between newly merged observations (groups) and the remaining objects?

– Stream 1 and 4, and Stream 2 or Stream 3

– Use: Stream 1? Stream 4? Combination?

                Stream 1 and 4    Stream 2          Stream 3
Stream 1 and 4  0
Stream 2        32.9 (2 vs. 4)    0
Stream 3        45.8 (3 vs. 4)    23.7 (unchanged)  0

Page 19:

Agglomerative Methods

Three common linkage methods used:

– Single Linkage (minimum; nearest neighbor) Method

– Complete Linkage (maximum; furthest neighbor) Method

– Average Linkage Method
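
A minimal base R sketch of how the three linkage methods differ on the four-stream example. hclust() from the stats package is used here as one possible implementation; the object names (streams, linkages, heights) are illustrative assumptions.

```r
# Stream-by-species abundances from the worked example.
streams <- data.frame(
  Spp1 = c(23, 42, 46, 12),
  Spp2 = c(2, 17, 34, 5),
  Spp3 = c(20, 21, 5, 15),
  row.names = paste0("Stream", 1:4)
)
d <- dist(streams)  # Euclidean distances between streams

# Build one tree per linkage method and keep the height of each merge.
linkages <- c(single = "single", complete = "complete", average = "average")
heights <- sapply(linkages, function(m) hclust(d, method = m)$height)

# Rows are merge steps 1 to 3, columns are linkage methods.
round(heights, 1)

# Dendrogram for one of the methods.
plot(hclust(d, method = "single"), main = "Single linkage", ylab = "Euclidean distance")
```

All three methods make the same first two merges (Stream1 with Stream4 at about 12.4, Stream2 with Stream3 at about 23.7). They differ in the height of the final merge: about 24.2 for single, 45.8 for complete, and 36.3 for average linkage (the mean of 24.2, 42.2, 32.9 and 45.8).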

Page 20:

Agglomerative Methods

Single Linkage: the new distance from a newly formed group to each other observation is the minimum of its members' distances to that observation

Page 21:

Agglomerative Methods

Complete Linkage: uses the maximum

– Often shows too much structure

Average Linkage: uses the average

– Good compromise

– Emphasizes central tendency

– To weight (WPGMA) or not to weight (UPGMA; divide by # samples in the clusters compared)

Page 22:

Agglomerative Methods

Other (commonly seen) linkage methods

– Centroid

– Ward's Minimum Variance

– Flexible beta (B = -0.25)

Page 23:

Boreal Bird Example

Ned Johnson 1975

Great Basin, North America

Goal

– Primary – classify bird assemblages

– Secondary – are the bird assemblage types (if they exist) related to type of mountain in some way?

Page 24:

Agglomerative Nesting Cluster Analysis

Important Choices

– The distance metric

– The linkage method

Which to use??

Page 25:

Agglomerative Nesting Cluster Analysis

Measures of Fit: assessing how well the newly derived clustering summarizes the structure of the data set objects

– Have 6 distance measures between all objects in the distance matrix

– But the cluster tree dendrogram only has 3 distances (it summarizes the original distance matrix)

– How well do these 3 of the dendrogram estimate the distances in the original distance matrix??

– How much information does the cluster dendrogram capture / explain from the original information??

Page 26:

Agglomerative Nesting Cluster Analysis

Measure of Fit

– Calculate derived distances from the tree for each distance measure (BC, C, O, etc.) and each linkage method (single, complete, etc.)

– Compare to original true distances in the full matrix (Euclidean)

One method is cophenetic correlation (matrix comparison)

Strategy (i.e., tree) with the highest cophenetic correlation is the winner (the one that explains most of the variation in the original distance matrix)

Measures how faithfully a summarization technique preserves the original info / pairwise distances
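
The cophenetic comparison can be sketched in base R as follows, using the four-stream example and three linkage methods. This is an illustration under assumptions: the object names (streams, coph_r) are made up, and only Euclidean distance is tried.

```r
# Stream-by-species abundances from the worked example.
streams <- data.frame(
  Spp1 = c(23, 42, 46, 12),
  Spp2 = c(2, 17, 34, 5),
  Spp3 = c(20, 21, 5, 15),
  row.names = paste0("Stream", 1:4)
)
d <- dist(streams)  # the 6 original pairwise distances

# For each linkage method: build the tree, read the tree-implied (cophenetic)
# distances back off it, and correlate them with the original distances.
linkages <- c("single", "complete", "average")
coph_r <- sapply(linkages, function(m) {
  tree <- hclust(d, method = m)
  cor(d, cophenetic(tree))
})

round(coph_r, 3)

# The linkage method whose tree best preserves the original distances.
names(which.max(coph_r))
```

The same loop could be repeated over several distance measures to compare complete strategies (distance measure plus linkage method).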

Page 27:

Important Points

Not statistical; merely numerical

No assumptions, but be smart….

Summarizes data so you lose something

Page 28:

Cluster analysis Interpretation

Flip around node

The Height axis

Page 29:

Cluster Analysis

Considerations

Once you have the dendrogram, now what?

[Figure: dendrogram, y-axis Euclidean distance (10 to 50); STR1 and STR4 joined at 12.4, STR2 and STR3 joined at 23.7, the two groups joined at 45.8]

Page 30:

Now look for groups, often a primary goal of CA

You interpret the value in it

Page 31:

Cluster Analysis

Considerations

1. Determine the groups

– Subjective

– Look for “natural groups” defined by long stems

– Choose some predefined level of similarity

– Cut at 2, 3, 4, 5 groups?

– Can test it

– Objective procedures have been developed

– Tradeoffs between number of groups and the similarity of their elements

What happens as you move up to fewer groups?

What happens as you move down to more groups?

2. NAME THE GROUPS based on relationships to internal or external variables

3. Quantify relationships among groups

Page 32:

Cluster Analysis

Considerations

Once you have the dendrogram, now what?

4. You may “PRUNE” the tree to hide detail

– If there are too many samples to show

– Pruning: Tradeoffs

What happens as you move up

What happens as you move down
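
Cutting a tree into groups can be sketched with cutree() in base R. The example reuses the lake-by-species table from the earlier slide. The linkage method (average), the numbers of groups, and the cut height of 40 are arbitrary choices for illustration, not recommendations from the lecture.

```r
# Lake-by-species abundances from the earlier slide.
lakes <- data.frame(
  Spp1 = c(13, 12, 3, 43, 10),
  Spp2 = c(56, 1, 3, 3, 11),
  Spp3 = c(32, 59, 87, 12, 14),
  Spp4 = c(19, 53, 63, 7, 5),
  row.names = paste0("Lake", 1:5)
)
tree <- hclust(dist(lakes), method = "average")

# Cut by number of groups: moving "up" the tree gives fewer, looser groups;
# moving "down" gives more, tighter groups.
groups_k2 <- cutree(tree, k = 2)
groups_k3 <- cutree(tree, k = 3)
groups_k2
groups_k3

# Alternatively, cut at a chosen height (dissimilarity level) on the dendrogram.
groups_h <- cutree(tree, h = 40)
table(groups_h)

# Draw the tree and outline the 2-group solution.
plot(tree, ylab = "Euclidean distance")
rect.hclust(tree, k = 2)
```

The group membership vector returned by cutree() is what gets carried into later steps such as summary statistics per group.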

[Figure: two pruned dendrograms. One plots group membership (A, B) against cluster height (0 to 8). The other, titled "Cluster Analysis--Wards Linkage and Bray-Curtis Distance Metric", plots species (labels BA, BF) against agglomeration level (2 to 9).]

Page 33:

Once you have the groups…

5. Describe clusters

– Summary stats on the variables

– Correlate, examine independent variables not used in clustering

– Try to explain reason for grouping

– Verify groups using other tests

ANOVA

ANOSIM

MeanSim

DFA

PCA

Page 34:

Example

Goal: group toads based on bufadienolide chemicals

Is the classification related to some external grouping scheme? That would be cool

Is this classification related to species?

Is this classification related to habitat / presence of predators?

Is it related to geography?

Page 35:

Considerations

Standardize data if sample variances do not reflect the importance of variables

– Example: one variable with large variance will completely determine distances

– Same reason why we ran PCA on the correlation matrix
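
The effect of standardizing can be sketched in base R. The habitat table below is entirely hypothetical (made-up conductivity and pH values for four sites), chosen only so that one variable has a much larger variance than the other.

```r
# Hypothetical data: conductivity varies by hundreds of units, pH by about two.
habitat <- data.frame(
  conductivity = c(120, 480, 150, 510),
  pH = c(6.1, 6.2, 7.9, 8.0),
  row.names = paste0("Site", 1:4)
)

# Raw distances are driven almost entirely by conductivity.
d_raw <- dist(habitat)
round(as.matrix(d_raw), 1)

# scale() centers each column to mean 0 and rescales it to SD 1,
# so both variables contribute on the same footing.
habitat_z <- scale(habitat)
d_std <- dist(habitat_z)
round(as.matrix(d_std), 2)
```

On the raw data, Site1 and Site3 look close because their conductivities are similar. After standardizing, their large pH difference counts as much as a large conductivity difference would.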

Page 36:

Considerations

What distance metric to use

– Any

– Euclidean good for data showing linear patterns

Chemistry

Habitat

– “Ecological indexes” for variables with modal patterns / relationships

BC, Jaccard, etc.

Can be sensitive, so compare to an ordination like PCA
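
To make the choice of metric concrete, here is a base R sketch that computes Bray-Curtis dissimilarity by hand next to Euclidean distance for the lake-by-species table. The vegan package loaded in class prep provides vegdist() for this; the hand calculation is used only so the sketch runs without extra packages, and the object names are illustrative.

```r
# Lake-by-species abundances from the earlier slide.
lakes <- data.frame(
  Spp1 = c(13, 12, 3, 43, 10),
  Spp2 = c(56, 1, 3, 3, 11),
  Spp3 = c(32, 59, 87, 12, 14),
  Spp4 = c(19, 53, 63, 7, 5),
  row.names = paste0("Lake", 1:5)
)

# Bray-Curtis for a pair of sites: summed absolute differences in abundance
# divided by the summed total abundance of both sites (ranges from 0 to 1).
abs_diff <- as.matrix(dist(lakes, method = "manhattan"))
totals <- rowSums(lakes)
bray <- as.dist(abs_diff / outer(totals, totals, "+"))

# Euclidean distance on the same table, for comparison.
euclid <- dist(lakes)

round(bray, 2)
round(euclid, 1)
```

For Lake1 vs. Lake2 the absolute differences sum to 117 and the total abundances to 245, so Bray-Curtis is 117 / 245, about 0.48. Trees built from the two matrices can then be compared, for example by their cophenetic correlations.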

Page 37:

Considerations

Be careful with dendrograms that don’t have long stems with segregated groups

Gleasonian gradients prevail in ecological data

Page 38:

What to report and do?

Distance measure and justification

Linkage method and justification

Dendrogram?

Pruning level and justification

Group membership of observations that emerge from pruning

Overlay a priori group membership

Summary stats on groups

Verification of groups

Software