week 5 lecture 10 cluster analysis - juniata...
Post on 07-Jul-2020
Week 5 Lecture 10
Cluster Analysis
Searching for group structure / pattern
among observations
Class Prep
library(vegan); library(cluster)
bufochem.csv; boreal_birds.csv
read.table('http://jcsites.juniata.edu/faculty/merovich/QuantEcol_files/bufochem.csv', header=TRUE, sep=",")
HW: Borcard et al. 2011
Thought for today: "No matter what you do, there will be other people who do it better. Do everything for the fun of it, and never mind the score." -- Warren Miller, filmmaker (1924-)
Cluster Analysis
Learning outcomes
Explain the purpose of cluster analysis
Appropriately use cluster analysis to
address research questions and complete
data analyses
Interpret output from cluster analysis and
use this information in further analyses
CA – purpose
Can I group my observations into various
categories given their attributes?
No a priori groups
Unconstrained
Un-supervised
Classification technique
Cluster Analysis
The essence is to find groups of observations
Within- vs. between-group distance
Site Spp1 Spp2 Spp3 Spp4
Lake1 13 56 32 19
Lake2 12 1 59 53
Lake3 3 3 87 63
Lake4 43 3 12 7
Lake5 10 11 14 5
Cluster Analysis
Visualize a distance matrix
– [X] continuous n x p data frame (n observations in rows, p variables in columns)
– n x n dissimilarity matrix
– i.e., usually want to cluster ROWS (not variables)
Kinds
Partitioning—non hierarchical
Hierarchical
Hierarchical Cluster Analysis
Finds nested sequence of clusters
Kinds
– Divisive
– Agglomerative (the opposite of divisive; the common choice)
– Trace the splits or additions (respectively) to produce a dendrogram
Agglomerative Nested (Hierarchical) Cluster
Analysis: An Example
Site by species (4n x 3p) data matrix
Site Spp1 Spp2 Spp3
Stream1 23 2 20
Stream2 42 17 21
Stream3 46 34 5
Stream4 12 5 15
Example
Calculate the square (n x n) dissimilarity matrix: Euclidean distance
          Stream1  Stream2  Stream3  Stream4
Stream1   0
Stream2   24.2     0
Stream3   42.2     23.7     0
Stream4   12.4     32.9     45.8     0
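The distances in this matrix can be reproduced from the site-by-species table. The following is a minimal sketch in Python (the class itself uses R), assuming the four streams and three species shown above; the variable names are illustrative only.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Site-by-species abundances from the example table.
streams = {
    "Stream1": (23, 2, 20),
    "Stream2": (42, 17, 21),
    "Stream3": (46, 34, 5),
    "Stream4": (12, 5, 15),
}

names = list(streams)
# Lower triangle of the n x n dissimilarity matrix, rounded to one decimal.
distances = {
    (a, b): round(dist(streams[a], streams[b]), 1)
    for i, a in enumerate(names)
    for b in names[:i]
}

for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d}")
```

With n = 4 sites there are n(n-1)/2 = 6 pairwise distances, and the smallest one (Stream1 vs. Stream4) is where the dendrogram starts.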
Example
Summarize the dissimilarity matrix in a
dendrogram
– First, look for the smallest distance and plot it
Example
[Figure: sketch of S1, S2, S3, and S4 with the distances 12.4, 23.7, and 24.2 labeled, shown beside the Euclidean distance matrix above]
First Step in Building Dendrogram
[Dendrogram: y-axis Euclidean Distance (10 to 50); STR1 and STR4 joined at 12.4]
Recalculate Euclidean distance matrix with Stream 1 and 4 combined as a group
Look for the smallest distance and plot
[Figure: sketch of S2, S3, and the group S1 & S4 with the distances 24.2 and 23.7 labeled]
Linkage Methods:
Single (Closest)
Complete (Furthest)
Average
                Stream1 and 4   Stream 2           Stream 3
Stream1 and 4   0
Stream 2        24.2, 32.9      0
Stream 3        42.2, 45.8      23.7 (unchanged)   0
Second Step in Building Dendrogram
[Dendrogram: y-axis Euclidean Distance (10 to 50); STR1 and STR4 joined at 12.4; STR2 and STR3 joined at 23.7]
Option: complete (furthest neighbor) linkage uses the largest dissimilarity between a point in the first cluster and a point in the second cluster
Recalculate Euclidean distance matrix with Stream 1 and 4 combined as a group, and with Stream 2 and 3 combined as a group
Look for the smallest distance and plot
[Figure: sketch of the groups S1 & S4 and S2 & S3 with the distance 24.2 labeled]
                 Stream1 and 4   Stream 2 and 3
Stream1 and 4    0               24.2, 32.9, 42.2, 45.8
Stream 2 and 3                   0
Third Step in Building Dendrogram
[Dendrogram, labeled Agglomerative: y-axis Euclidean Distance (10 to 50); STR1 and STR4 joined at 12.4; STR2 and STR3 joined at 23.7; the two groups joined at 24.2]
Agglomerative Nested (Hierarchical)
Cluster Analysis
How should one calculate the new distances
between newly merged observations (groups)
and the remaining objects?
– Between the group Stream 1 and 4, and
    Stream 2
    Stream 3
– Use the distance from:
    Stream 1?
    Stream 4?
    Combination?
                Stream1 and 4       Stream 2           Stream 3
Stream1 and 4   0
Stream 2        32.9 (2 vs. 4)      0
Stream 3        45.8 (3 vs. 4)      23.7 (unchanged)   0
Agglomerative Methods
Three common linkage methods used:
– Single Linkage (minimum; nearest neighbor) Method
– Complete Linkage (maximum; furthest neighbor) Method
– Average Linkage Method
Agglomerative Methods
Single Linkage: uses the minimum distance of the newly formed group to find new distances to the other observations
Agglomerative Methods
Complete Linkage: uses the maximum
– Often shows too much structure
Average Linkage: uses the average
– Good compromise
– Emphasizes central tendency
– To weight (WPGMA) or not to weight (UPGMA; divide by # samples in the clusters compared)
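The three linkage rules can be compared directly on the stream example. This is a minimal sketch in Python (not the R functions used in class), assuming the six rounded distances from the worked example. The names `agglomerate`, `LINKAGE`, and `between` are made up for illustration, and "average" here is the unweighted (UPGMA) version.

```python
from itertools import combinations

# Pairwise Euclidean distances from the worked example.
D = {
    frozenset({"S1", "S2"}): 24.2,
    frozenset({"S1", "S3"}): 42.2,
    frozenset({"S1", "S4"}): 12.4,
    frozenset({"S2", "S3"}): 23.7,
    frozenset({"S2", "S4"}): 32.9,
    frozenset({"S3", "S4"}): 45.8,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Each linkage method is a rule for combining all cross-cluster distances.
LINKAGE = {"single": min, "complete": max, "average": mean}

def agglomerate(dist, method):
    """Return the merge heights, in order, for the chosen linkage method."""
    combine = LINKAGE[method]
    clusters = [frozenset({s}) for s in sorted({s for pair in dist for s in pair})]

    def between(a, b):
        # Distance between two clusters, built from every cross-cluster pair.
        return combine(dist[frozenset({x, y})] for x in a for y in b)

    heights = []
    while len(clusters) > 1:
        # Merge the two closest clusters and record the height of the join.
        a, b = min(combinations(clusters, 2), key=lambda p: between(*p))
        heights.append(round(between(a, b), 2))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return heights

for method in LINKAGE:
    print(method, agglomerate(D, method))
```

All three methods make the same first two joins (12.4 and 23.7) on these data. They differ only in the height of the final join: 24.2 under single linkage and 45.8 under complete linkage, with average linkage in between.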
Agglomerative Methods
Other (commonly seen) linkage methods
– Centroid
– Ward's minimum variance
– Flexible beta (β = -0.25)
Boreal Bird Example
Ned Johnson 1975
Great Basin, North America
Goal
– Primary – classify bird assemblages
– Secondary – are the bird assemblage types (if they exist) related to type of mountain in some way?
Agglomerative Nesting Cluster Analysis
Important Choices
– The distance metric
– The linkage method
Which to use??
Agglomerative Nesting Cluster Analysis
Measures of Fit—Assessing how well the newly
derived clustering summarizes the structure of
the data set objects
– Have 6 distance measures between all objects in the
distance matrix
– But the cluster tree dendrogram only has 3 distances
(it summarizes the original distance matrix)
– How well do these 3 of the dendrogram estimate the
distances in the original distance matrix??
– How much information does the cluster dendrogram
capture / explain from the original information??
Agglomerative Nesting Cluster Analysis
Measure of Fit
– Calculate derived distances (based on your distance
measure) – BC, C, O, etc.– and all your linkage
methods (single, complete, etc.)
– Compare to original true distances in the full matrix (Euclidean)
One method is cophenetic correlation—matrix comparison
Strategy (i.e., tree) with the highest cophenetic correlation is
the winner (the one that explains most of the variation in the
original distance matrix)
Measures how faithfully a summarization technique
preserves the original info / pairwise distances
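As a rough illustration of cophenetic correlation, the sketch below (Python, not the R used in class) correlates the six original stream distances with the distances implied by two trees. It assumes the single- and complete-linkage trees from the worked example, with the cophenetic distance for a pair taken as the height at which the two objects first share a cluster. The `pearson` helper is written out for illustration.

```python
from math import sqrt

# Pair order: S1-S2, S1-S3, S1-S4, S2-S3, S2-S4, S3-S4
original = [24.2, 42.2, 12.4, 23.7, 32.9, 45.8]

# S1+S4 join at 12.4 and S2+S3 at 23.7 in both trees; every other pair
# first meets at the final join, whose height depends on the linkage.
cophenetic = {
    "single":   [24.2, 24.2, 12.4, 23.7, 24.2, 24.2],
    "complete": [45.8, 45.8, 12.4, 23.7, 45.8, 45.8],
}

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

fit = {method: pearson(original, c) for method, c in cophenetic.items()}
for method, r in fit.items():
    print(f"{method}: cophenetic correlation = {r:.3f}")
```

On these four streams the complete-linkage tree gives the higher correlation of the two, so by this criterion it preserves the original pairwise distances more faithfully.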
Important Points
Not statistical; merely numerical
No assumptions, but be smart….
Summarizes data so you lose something
Cluster analysis Interpretation
Flip around node
The Height axis
Cluster Analysis
Considerations
Once you have the dendrogram, now what?
[Dendrogram: y-axis Euclidean Distance (10 to 50); STR1 and STR4 joined at 12.4; STR2 and STR3 joined at 23.7; the two groups joined at 45.8]
Now look for groups—often a
primary goal of CA
You interpret the value in it
Cluster Analysis
Considerations
1. Determine the groups
– Subjective
– Look for “natural groups” defined by long stems
– Choose some predefined level of similarity
– Cut at 2, 3, 4, 5 groups?
– Can test it
– Objective procedures have been developed
– Tradeoffs between number of groups and the similarity of their elements
What happens as you move up to fewer groups?
What happens as you move down to more groups?
2. NAME THE GROUPS based on relationships to
internal or external variables
3. Quantify relationships among groups
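The tradeoff in step 1 can be seen by cutting the stream dendrogram at different heights. This is a minimal sketch in Python (not the R used in class), assuming the complete-linkage merge history from the stream example. The `cut` function and the way the merges are encoded are illustrative choices, not a standard API.

```python
# Merge history of the complete-linkage tree from the stream example:
# (one member from each side of the join, height of the join).
merges = [
    (("STR1", "STR4"), 12.4),
    (("STR2", "STR3"), 23.7),
    (("STR1", "STR2"), 45.8),  # joins the two existing groups
]
leaves = ["STR1", "STR2", "STR3", "STR4"]

def cut(merges, leaves, height):
    """Return the groups that remain when the tree is cut at `height`."""
    group = {leaf: {leaf} for leaf in leaves}
    for (a, b), h in merges:
        # Only joins at or below the cut height are kept.
        if h <= height and group[a] is not group[b]:
            merged = group[a] | group[b]
            for leaf in merged:
                group[leaf] = merged
    unique = {frozenset(g) for g in group.values()}
    return sorted(sorted(g) for g in unique)

for h in (10, 20, 30, 50):
    print(h, cut(merges, leaves, h))
```

Cutting low gives many small, internally similar groups. Cutting high gives fewer groups whose members are less alike.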
Cluster Analysis
Considerations
Once you have the dendrogram, now what?
4. You may “PRUNE” the tree to hide detail
– If there are too many samples to show
– Pruning: Tradeoffs
What happens as you move up
What happens as you move down
[Figure: pruned dendrogram with leaves labeled by Group Membership (A or B); y-axis Cluster Height]
[Figure: Cluster Analysis--Wards Linkage and Bray-Curtis Distance Metric; leaves labeled BA or BF; axes labeled Species and Agglomeration Level]
Once you have the groups…
5. Describe clusters
– Summary stats on the variables
– Correlate, examine independent variables not
used in clustering
– Try to explain reason for grouping
– Verify groups using other tests
ANOVA
ANOSIM
MeanSim
DFA
PCA
Example
Goal: group toads based on Bufadienolide chemicals
Is the classification related to some external grouping scheme? That would be cool
Is this classification related to species?
Is this classification related to habitat / presence of predators?
Is it related to geography?
Considerations
Standardize data if sample variances do
not reflect the importance of variables
– Example: one variable with large variance will
completely determine distances
– Same reason why we ran PCA on the
correlation matrix
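A small sketch of why standardizing matters, in Python (not the R used in class). The three sites and their conductivity and pH values are made up purely for illustration, and `standardize` is a hypothetical helper that converts each column to z-scores.

```python
from math import dist
from statistics import mean, stdev

# Hypothetical data: (conductivity, pH) per site. Conductivity has a far
# larger variance than pH. Values are invented for illustration only.
sites = {
    "A": (100.0, 6.0),
    "B": (110.0, 8.0),
    "C": (300.0, 6.2),
}

def standardize(table):
    """Center each column on 0 and scale it to unit standard deviation."""
    columns = list(zip(*table.values()))
    centers = [mean(c) for c in columns]
    spreads = [stdev(c) for c in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(row, centers, spreads))
        for name, row in table.items()
    }

z = standardize(sites)
raw_ab, raw_ac = dist(sites["A"], sites["B"]), dist(sites["A"], sites["C"])
z_ab, z_ac = dist(z["A"], z["B"]), dist(z["A"], z["C"])
print(f"raw:          A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"standardized: A-B = {z_ab:.2f}, A-C = {z_ac:.2f}")
```

On the raw values the A-C distance dwarfs the A-B distance because conductivity alone drives it. After standardizing, the large pH difference between A and B counts about as much as the large conductivity difference between A and C.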
Considerations
What distance metric to use
– Any
– Euclidean good for data showing linear patterns
Chemistry
Habitat
– “Ecological indexes” for variables with modal
patterns / relationships
BC, Jaccard, etc.
Can be sensitive, so compare to an
ordination like PCA
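For species abundance data, Bray-Curtis (BC) is one of the ecological indexes mentioned above. The sketch below, in Python (not the R used in class), assumes the usual definition, sum of |x_i - y_i| divided by sum of (x_i + y_i), and applies it to the lake-by-species table from earlier in the lecture. The function name `bray_curtis` is illustrative.

```python
from itertools import combinations

# Site-by-species abundances from the lake table (Spp1..Spp4).
lakes = {
    "Lake1": (13, 56, 32, 19),
    "Lake2": (12, 1, 59, 53),
    "Lake3": (3, 3, 87, 63),
    "Lake4": (43, 3, 12, 7),
    "Lake5": (10, 11, 14, 5),
}

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 0 = identical, 1 = no shared abundance."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

bc = {
    (a, b): bray_curtis(lakes[a], lakes[b])
    for a, b in combinations(lakes, 2)
}

for (a, b), d in bc.items():
    print(f"{a} vs {b}: {d:.3f}")
```

Unlike Euclidean distance, the result is bounded between 0 and 1 for non-negative abundances, which makes values comparable across site pairs with very different total counts.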
Considerations
Be careful with dendrograms that don’t
have long stems with segregated groups
Gleasonian gradients prevail in ecological
data
What to report and do?
Distance measure and justification
Linkage method and justification
Dendrogram?
Pruning level and justification
Group membership of observations that emerge from
pruning
Overlay a priori group membership
Summary stats on groups
Verification of groups
Software