The Multivariate Dataset
Obs   Group   X-set                          Y-set
1     A       a11  a12  a13  ...  a1p        b11  b12  b13  ...  b1m
2     A       a21  a22  a23  ...  a2p        b21  b22  b23  ...  b2m
3     A       a31  a32  a33  ...  a3p        b31  b32  b33  ...  b3m
.     .       .    .    .    ...  .          .    .    .    ...  .
n     A       an1  an2  an3  ...  anp        bn1  bn2  bn3  ...  bnm
n+1   C       c11  c12  c13  ...  c1p
n+2   C       c21  c22  c23  ...  c2p
n+3   C       c31  c32  c33  ...  c3p
.     .       .    .    .    ...  .
N     C       cn1  cn2  cn3  ...  cnp
2-way data matrix
Sites-by-species 2-way data matrix

Sites    A    B    C    D
1        1    9   12    1
2        1    8   11    1
3        1    6   10   10
4       10    0    9   10
5       10    2    8   10
6       10    0    7    2
The Multivariate Dataset: Geometric Representation
- Each site can be represented as a point in a p-dimensional space based on its measured values along each of the p species axes.
- The collection of points forms a "data cloud" in this p-dimensional space.
- The shape, clumping, and dispersion of this data cloud contain the ecological information we seek to describe.
[Figure: "Data Space" plot showing Sites 1, 4, and 5 positioned along species axes.]
Ecological Resemblance
How ecologically similar (or dissimilar) is each site to every other site?
- Similarity characterizes the number of attributes two objects share relative to the total list of attributes between them.
- Objects that have everything in common are identical and have a similarity of 1.0; objects that have nothing in common have a similarity of 0.0.
- Dissimilarity is the complement of similarity: it characterizes the number of attributes unique to one object or the other relative to the total list of attributes between them.
- In general, dissimilarity can be calculated as 1 - similarity.
- Both similarity and dissimilarity range from 0 to 1.
Ecological Distance
- Distance is a geometric conception of the proximity of objects in a high-dimensional space defined by measurements on the attributes.
- How we measure proximity, however, varies among distance measures.
[Figure: "Data Space" plot showing Sites 1, 4, and 5 positioned along species axes.]
Original Data Matrix (6 × 4): sites by species

Sites    A    B    C    D
1        1    9   12    1
2        1    8   11    1
3        1    6   10   10
4       10    0    9   10
5       10    2    8   10
6       10    0    7    2

Dissimilarity Matrix (6 × 6): sites by sites

Sites      1      2      3      4      5      6
1          0
2        1.4      0
3        9.7    9.3      0
4       15.9   15.2   10.9      0
5       15.1   14.4   10.0    2.2      0
6       13.7   12.7   13.8    8.2    8.3      0
Ecological Distance versus Dissimilarity
- In practice, distances and dissimilarities are often used interchangeably. They have quite distinct properties, however.
- Dissimilarities are always bounded on [0, 1]; e.g., once plots have no species in common, they can be no more dissimilar.
- Distances are typically unbounded on the upper end; e.g., the distance between plots that have no species in common depends on the number and abundance of species in the plots, and is thus variable.
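The contrast between a bounded dissimilarity and an unbounded distance can be illustrated with a small sketch. The two plots below are hypothetical (not from the slides), and percentage dissimilarity, covered later in the slides, stands in as the bounded measure, expressed here as a proportion rather than a percentage.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two abundance vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def percentage_dissimilarity(a, b):
    """City-block distance as a proportion of its maximum (range 0 to 1)."""
    return float(np.sum(np.abs(a - b)) / np.sum(a + b))

# Hypothetical plots with no species in common.
sparse_a = np.array([2.0, 0.0, 1.0, 0.0])
sparse_b = np.array([0.0, 3.0, 0.0, 1.0])

# Same species lists, ten times the abundance.
dense_a, dense_b = 10 * sparse_a, 10 * sparse_b

ed_sparse = euclidean_distance(sparse_a, sparse_b)
ed_dense = euclidean_distance(dense_a, dense_b)
pd_sparse = percentage_dissimilarity(sparse_a, sparse_b)
pd_dense = percentage_dissimilarity(dense_a, dense_b)

print("Euclidean distance:", ed_sparse, ed_dense)          # grows with abundance
print("Percentage dissimilarity:", pd_sparse, pd_dense)    # stays at its maximum
```

In both pairs the plots share no species, so the dissimilarity sits at its upper bound, while the Euclidean distance scales with the abundances.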
The Resemblance Transformation
- A resemblance matrix contains a resemblance coefficient for every pair of entities. The result is an entities-by-entities resemblance matrix.
Original data matrix (6 × 4) -> Euclidean distance -> Dissimilarity matrix (6 × 6)

$$ED_{jk} = \sqrt{\sum_{i=1}^{p} (x_{ij} - x_{ik})^2}$$

[Figure: "Dissimilarity Space" plot, with sites positioned by their dissimilarity to Sites 1, 5, and 6.]
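As a rough sketch of the resemblance transformation, the six-site, four-species data matrix from the slides can be turned into a 6 × 6 Euclidean distance matrix with NumPy. The function name `euclidean_distance_matrix` is an illustrative choice, not something named in the slides.

```python
import numpy as np

# Sites-by-species data matrix (rows: sites 1-6; columns: species A-D).
X = np.array([
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
], dtype=float)

def euclidean_distance_matrix(data):
    """Return the sites-by-sites matrix of Euclidean distances."""
    # diff[j, k, i] = difference between sites j and k on species i
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

D = euclidean_distance_matrix(X)
print(np.round(D, 1))
```

Rounded to one decimal place, the lower triangle should match the dissimilarity matrix shown in the slides (e.g., 1.4 between sites 1 and 2).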
Ecological Resemblance
- There is a large number of resemblance measures to choose from.
- The choice of a coefficient will frequently be guided by the type of data, the ecological question, or the type of analysis.
- When the measurement scale is such that several possible coefficients could be used, the choice is often guided by personal preference.
- It is generally advantageous to try several different measures and weigh the results using ecological criteria (i.e., which one produces the most meaningful and interpretable results).
[Figure: original data matrix transformed into a dissimilarity matrix, e.g., by Euclidean distance.]
Euclidean Distance
- Intuitively, the most appealing distance measure.
- Data are usually column-standardized first (if necessary) to remove differences due to measurement units and scale.
- Can be applied to data of any scale.
- Has true 'metric' properties and is used in eigenvector ordinations.

$$ED_{jk} = \sqrt{\sum_{i=1}^{p} (x_{ij} - x_{ik})^2}$$

[Figure: Euclidean distance drawn between points A and B.]

- Often performs poorly in ecological applications because of several problems:
  - Assumes that variables are uncorrelated (never so).
  - Emphasizes outliers.
  - Loses sensitivity more rapidly than other distance measures as the heterogeneity of the data increases.
  - Measures distance through potentially uninhabitable ecological space.
  - Is a nonproportional distance measure.

City-block (Manhattan) Distance
- Most ecologically meaningful dissimilarity measures are of the Manhattan type.
- Compared to ED, gives less weight to outliers (no squared differences).
- Compared to ED, retains sensitivity as the heterogeneity of a data set increases.
- But still a nonproportional distance.

$$CB_{jk} = \sum_{i=1}^{p} |x_{ij} - x_{ik}|$$

[Figure: city-block distance.]
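A minimal sketch of city-block distance on the same six-site data matrix used in the slides; `city_block_distance_matrix` is a hypothetical helper name. Because differences are summed without squaring, a single large difference on one species contributes proportionally rather than quadratically.

```python
import numpy as np

# Sites-by-species data matrix (rows: sites 1-6; columns: species A-D).
X = np.array([
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
], dtype=float)

def city_block_distance_matrix(data):
    """Return the sites-by-sites matrix of city-block (Manhattan) distances."""
    # Sum of absolute differences across species for every pair of sites.
    diff = np.abs(data[:, None, :] - data[None, :, :])
    return diff.sum(axis=2)

CB = city_block_distance_matrix(X)
print(CB)
```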
Proportional Distance Coefficients
E.g., Percentage Dissimilarity (Sorensen Distance or Bray-Curtis Distance)
- City-block (Manhattan) distance measures expressed as proportions of the maximum distance possible.
- If two communities share no species in common, they have the maximum dissimilarity of one.

$$PD_{jk} = 100 \, \frac{\sum_{i=1}^{p} |x_{ij} - x_{ik}|}{\sum_{i=1}^{p} (x_{ij} + x_{ik})}$$

[Figure: city-block distance between A and B; communities A and B along an environmental gradient, overlapping in W.]

Proportional Distance Coefficients
Variations on Percentage Dissimilarity

Each coefficient is given in terms of the data and, as a proportion, in terms of A, B, and w, where A = \sum x_{ij}, B = \sum x_{ik}, and w = \sum \min(x_{ij}, x_{ik}).

Sorensen Distance or Bray-Curtis Distance:
$$PD_{jk} = 100 \left[ 1 - \frac{2 \sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} (x_{ij} + x_{ik})} \right], \qquad 1 - \frac{2w}{A + B}$$

Jaccard Distance:
$$PD_{jk} = 100 \left[ 1 - \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ij} + \sum_{i=1}^{p} x_{ik} - \sum_{i=1}^{p} \min(x_{ij}, x_{ik})} \right], \qquad 1 - \frac{w}{A + B - w}$$

Kulczynski Distance:
$$PD_{jk} = 100 \left[ 1 - \frac{1}{2} \left( \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ij}} + \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ik}} \right) \right], \qquad 1 - \frac{1}{2} \left( \frac{w}{A} + \frac{w}{B} \right)$$
Proportional Distance Coefficients
Percentage Dissimilarity (Sorensen Distance, Bray-Curtis Distance)
- PD is commonly used with species abundance data, but can be applied to data of any scale (e.g., presence/absence data).
- Compared to ED, PD gives less weight to outliers.
- Compared to ED, PD retains sensitivity as the heterogeneity of a data set increases.
- Unlike ED and CB, PD reaches its maximum when no species are shared.
- But PD is not 'metric' and is thus not compatible with many analyses (e.g., DA, CCA).
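The Sorensen (Bray-Curtis), Jaccard, and Kulczynski variants of percentage dissimilarity can be sketched from the shared abundance w and the sample totals A and B. The function names are illustrative, results are returned as proportions (multiply by 100 for percentages), and the sketch assumes neither sample is empty.

```python
import numpy as np

def overlap_terms(a, b):
    """Return (A, B, w): the two sample totals and the shared abundance."""
    return a.sum(), b.sum(), np.minimum(a, b).sum()

def sorensen_distance(a, b):
    """Sorensen / Bray-Curtis distance: 1 - 2w / (A + B)."""
    A, B, w = overlap_terms(a, b)
    return float(1.0 - 2.0 * w / (A + B))

def jaccard_distance(a, b):
    """Jaccard distance: 1 - w / (A + B - w)."""
    A, B, w = overlap_terms(a, b)
    return float(1.0 - w / (A + B - w))

def kulczynski_distance(a, b):
    """Kulczynski distance: 1 - (w/A + w/B) / 2."""
    A, B, w = overlap_terms(a, b)
    return float(1.0 - 0.5 * (w / A + w / B))

# Sites 1 and 4 from the slides' data matrix (species A-D).
site1 = np.array([1.0, 9.0, 12.0, 1.0])
site4 = np.array([10.0, 0.0, 9.0, 10.0])

pd_sorensen = sorensen_distance(site1, site4)
pd_jaccard = jaccard_distance(site1, site4)
pd_kulczynski = kulczynski_distance(site1, site4)
print(pd_sorensen, pd_jaccard, pd_kulczynski)

# The Sorensen form equals city-block distance over the summed totals.
city_block_form = float(np.abs(site1 - site4).sum() / (site1 + site4).sum())
print(city_block_form)
```

All three coefficients equal 1 when the samples share no species (w = 0), which is the property that distinguishes them from ED and CB.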
Euclidean Distance Based on Species Profiles
Chord Distance
- Conceptually similar to ED, but the data are adjusted so that the sums of squares for each row are 1 (i.e., row normalization).
- Useful for species abundance data; effectively removes differences in overall abundances among samples, instead focusing the analysis on the differences in relative abundances among species (i.e., species profiles).

$$x^{*}_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{p} x_{ij}^{2}}}$$

$$ED_{jk} = \sqrt{\sum_{i=1}^{p} (x^{*}_{ij} - x^{*}_{ik})^{2}}$$

$$Chord_{jk} = \sqrt{\sum_{i=1}^{p} \left( \frac{x_{ij}}{\sqrt{\sum_{i=1}^{p} x_{ij}^{2}}} - \frac{x_{ik}}{\sqrt{\sum_{i=1}^{p} x_{ik}^{2}}} \right)^{2}}$$

[Figure: normalized points A* and B* plotted on axes running from 0 to 1.]
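Chord distance can be sketched as Euclidean distance after scaling each sample to unit length. The helper name `chord_distance` and the small test vectors are illustrative assumptions; the sketch also assumes no sample is all zeros.

```python
import numpy as np

def chord_distance(a, b):
    """Euclidean distance between two samples after scaling each to unit length."""
    a_star = a / np.sqrt((a ** 2).sum())
    b_star = b / np.sqrt((b ** 2).sum())
    return float(np.sqrt(((a_star - b_star) ** 2).sum()))

# Site 1 from the slides' data matrix, and a hypothetical sample with the
# same species profile but ten times the overall abundance.
site1 = np.array([1.0, 9.0, 12.0, 1.0])
site1_x10 = 10.0 * site1
same_profile = chord_distance(site1, site1_x10)

# Hypothetical samples with no species in common.
no_overlap = chord_distance(np.array([1.0, 0.0]), np.array([0.0, 3.0]))

print(same_profile)  # essentially zero: only relative abundances matter
print(no_overlap)    # sqrt(2), the largest value for non-negative data
```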
Euclidean Distance Based on Species Profiles
In the standardizations below, i = row (site) and j = column (species); x_{i+} is the total for row i, x_{+j} the total for column j, and x_{++} the matrix total.

Chi-square Distance
- ED computed on relative abundances (species profiles), weighted by the inverse of the square root of the column totals and adjusted by the square root of the matrix total (row chi-square standardization).
$$x^{*}_{ij} = \sqrt{x_{++}} \, \frac{x_{ij}}{x_{i+} \sqrt{x_{+j}}}$$

Species Profile Distance
- ED computed on relative abundances (species profiles, or row total standardization).
$$x^{*}_{ij} = \frac{x_{ij}}{x_{i+}}$$

Hellinger Distance
- ED computed on the square root of relative abundances (species profiles, or row Hellinger standardization).
$$x^{*}_{ij} = \sqrt{\frac{x_{ij}}{x_{i+}}}$$

All these distance measures work well with species data, preserve the 'metric' distance, and are Euclidean.

Proportional Distance Based on Species Profiles
Relative Percentage Dissimilarity (Relative Sorensen, Relative Bray-Curtis)
- PD can be computed on standardized data, similar to chord distance, in which species abundances are divided by sample totals so that each sample contributes equally regardless of total abundance.
- This measure shifts the emphasis of the analysis to proportions of species in a sample unit, rather than absolute abundances.

As in the other PD formulas, i indexes species and j, k index samples:
$$PD_{jk} = 100 \left[ 1 - \sum_{i=1}^{p} \min\left( \frac{x_{ij}}{\sum_{i=1}^{p} x_{ij}}, \frac{x_{ik}}{\sum_{i=1}^{p} x_{ik}} \right) \right]$$
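The row standardizations behind these species-profile distances, and relative percentage dissimilarity, might be sketched as follows. Function names are hypothetical, the chi-square transform follows the verbal description on the slide (row profiles weighted by the inverse square root of column totals, scaled by the square root of the matrix total), and the sketch assumes every row and column total is positive.

```python
import numpy as np

# Sites-by-species data matrix (rows: sites 1-6; columns: species A-D).
X = np.array([
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
], dtype=float)

def species_profiles(data):
    """Row total standardization: relative abundances within each site."""
    return data / data.sum(axis=1, keepdims=True)

def hellinger_transform(data):
    """Square root of the relative abundances."""
    return np.sqrt(species_profiles(data))

def chi_square_transform(data):
    """Profiles weighted by 1/sqrt(column total), scaled by sqrt(matrix total)."""
    row_totals = data.sum(axis=1, keepdims=True)
    col_totals = data.sum(axis=0, keepdims=True)
    return np.sqrt(data.sum()) * data / (row_totals * np.sqrt(col_totals))

def euclidean_distance_matrix(data):
    """Sites-by-sites Euclidean distances on (transformed) data."""
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def relative_percentage_dissimilarity(a, b):
    """Relative Sorensen distance as a proportion: 1 - sum of min profiles."""
    return float(1.0 - np.minimum(a / a.sum(), b / b.sum()).sum())

D_profile = euclidean_distance_matrix(species_profiles(X))
D_hellinger = euclidean_distance_matrix(hellinger_transform(X))
D_chi_square = euclidean_distance_matrix(chi_square_transform(X))
rel_pd_1_4 = relative_percentage_dissimilarity(X[0], X[3])

print(np.round(D_hellinger, 3))
print(rel_pd_1_4)
```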
Correlation Distance
Pearson's Product-Moment Correlation
- Of limited use for community data, but may be ideal with multivariate normal data and linear relationships.
- Intuitively appealing because of familiarity with correlation coefficients.

$$r_{jk} = \frac{\sum_{i=1}^{p} (x_{ij} - \bar{x}_{j})(x_{ik} - \bar{x}_{k})}{\sqrt{\sum_{i=1}^{p} (x_{ij} - \bar{x}_{j})^{2} \sum_{i=1}^{p} (x_{ik} - \bar{x}_{k})^{2}}}$$

CD_jk = (1 - r_jk)/2 or |r_jk|

- Generally only useful when the similarity in average "profile shapes" is considered more important than average "profile levels", because correlation distance is 0 when 2 profiles are parallel, irrespective of how far apart they are in data space.
[Figure: standardized values of Sample j and Sample k across species; for parallel profiles, r_jk = 1 and CD_jk = 0.]
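A short sketch of correlation distance using the (1 - r)/2 form, with hypothetical sample vectors chosen to show that parallel profiles have zero distance regardless of how far apart their levels are.

```python
import numpy as np

def correlation_distance(a, b):
    """Correlation distance (1 - r) / 2 between two sample profiles."""
    r = np.corrcoef(a, b)[0, 1]  # Pearson correlation across species
    return float((1.0 - r) / 2.0)

# Site 1 from the slides, plus two hypothetical samples derived from it.
sample_j = np.array([1.0, 9.0, 12.0, 1.0])
parallel_k = sample_j + 50.0   # same profile shape, much higher level
mirrored_k = 20.0 - sample_j   # exactly opposite profile shape

cd_parallel = correlation_distance(sample_j, parallel_k)
cd_mirrored = correlation_distance(sample_j, mirrored_k)

print(cd_parallel)  # approximately 0: parallel profiles
print(cd_mirrored)  # approximately 1: perfectly opposed profiles
```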
Mahalanobis Distance

$$D^{2}_{jk} = (X_j - X_k)' \, S_x^{-1} \, (X_j - X_k)$$

X_j - X_k = vector of differences between sites j and k
S_x = covariance matrix of X

- Accounts for the variance-covariance structure of the data; inversely weights distance by the variance on each axis and the covariance between axes, so that distance is greater in Case B.
- Reduces to ED when the variables are uncorrelated and have unit variance (i.e., when S_x is the identity matrix).
- Commonly used for multivariate outlier detection.
[Figure: Case A and Case B; MD is greater in Case B.]

Mahalanobis Distance: worked example

Data matrix:
Sites    A    B    C
1        1    9   12
2        1    8   11
3        1    7   10
4       10    0    9
5       10    2    8
6       10    1    7

Covariance matrix, S_x:
         A       B       C
A     24.3   -18.9    -8.1
B    -18.9    15.5     6.5
C     -8.1     6.5     3.5

Inverse, S_x^{-1}:
         A        B        C
A    0.823    0.926    0.185
B    0.926    1.333   -0.333
C    0.185   -0.333    1.333

Site 1-2 differences: X_1 - X_2 = [0, 1, 1]

$$D^{2}_{12} = [0 \;\; 1 \;\; 1] \, S_x^{-1} \, [0 \;\; 1 \;\; 1]' = [1.1 \;\; 1 \;\; 1] \, [0 \;\; 1 \;\; 1]' = 2$$

D^2 matrix:
Sites       1       2       3       4       5       6
1       0.000
2       2.000   0.000
3       8.000   2.000   0.000
4       8.667   6.667   8.667   0.000
5       4.667   4.667   8.667   8.000   0.000
6       8.667   4.667   4.667   8.000   2.000   0.000
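The worked example can be checked numerically. This is a sketch using NumPy's sample covariance (n - 1 denominator), which reproduces the covariance matrix on the slide; the variable names are illustrative.

```python
import numpy as np

# Three-species data matrix from the worked example (rows: sites 1-6).
X = np.array([
    [1, 9, 12],
    [1, 8, 11],
    [1, 7, 10],
    [10, 0, 9],
    [10, 2, 8],
    [10, 1, 7],
], dtype=float)

# Sample covariance matrix of the species (columns) and its inverse.
S = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S)

# diff[j, k, :] is the vector of differences between sites j and k.
diff = X[:, None, :] - X[None, :, :]

# Squared Mahalanobis distance for every pair: diff' * S_inv * diff.
D2 = np.einsum("jki,il,jkl->jk", diff, S_inv, diff)

print(np.round(S, 1))
print(np.round(S_inv, 3))
print(np.round(D2, 3))
```

The printed matrices should agree with the covariance matrix, its inverse, and the D^2 matrix given in the slides (e.g., D^2 = 2 between sites 1 and 2).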
Mahalanobis Distance
- MD is commonly used to measure distance between groups (e.g., in discriminant analysis).
- MD inversely weights the distance between group centroids by the variance, so the distance is greater in Case B than in Case A, even though the centroids are equidistant in hyperspace.

$$D^{2}_{fh} = (n - g) \sum_{i=1}^{p} \sum_{j=1}^{p} w_{ij} (\bar{x}_{if} - \bar{x}_{ih})(\bar{x}_{jf} - \bar{x}_{jh})$$

n = number of sites; g = number of groups
\bar{x}_{if} = mean of variable i in group f
w_{ij} = inverse of the pooled within-groups variance-covariance matrix of X for variables i and j

Sites   Group    A    B    C    D
1       f        1    9   12    1
2       f        1    8   11    1
3       f        1    7   10   10
4       h       10    0    9   10
5       h       10    3    8   10
6       h       10    0    7    1

[Figure: Case A and Case B.]

Association Coefficients
- Applied to categorical data.
- Measure agreement between 2 rows representing 2 sample entities.
- May be special cases of distance measures.
- Most measures are for binary (presence/absence) data.
- Different association coefficients emphasize different aspects of the agreement between samples.

Association table (species counts):
                        Sample k
                        Present   Absent   Total
Sample j    Present     a         b        a+b
            Absent      c         d        c+d
            Total       a+c       b+d      p

Simple Matching Coefficient:
$$C_{SMC} = IA_{JK} = \frac{a + d}{p}$$

Coefficient of Jaccard:
$$C_{CJ} = IA_{JK} = \frac{a}{a + b + c}$$

Coefficient of Community:
$$C_{CC} = IA_{JK} = \frac{2a}{2a + b + c}$$
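The association table and the three coefficients can be sketched for a pair of presence/absence samples. The two samples below are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def association_table(j, k):
    """Return (a, b, c, d) counts for two presence/absence samples."""
    j = np.asarray(j, dtype=bool)
    k = np.asarray(k, dtype=bool)
    a = int(np.sum(j & k))      # present in both samples
    b = int(np.sum(j & ~k))     # present in j only
    c = int(np.sum(~j & k))     # present in k only
    d = int(np.sum(~j & ~k))    # absent from both samples
    return a, b, c, d

def simple_matching(j, k):
    """Simple matching coefficient: (a + d) / p, counting joint absences."""
    a, b, c, d = association_table(j, k)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(j, k):
    """Coefficient of Jaccard: a / (a + b + c), ignoring joint absences."""
    a, b, c, _ = association_table(j, k)
    return a / (a + b + c)

def coefficient_of_community(j, k):
    """Coefficient of community: 2a / (2a + b + c)."""
    a, b, c, _ = association_table(j, k)
    return 2 * a / (2 * a + b + c)

# Hypothetical presence/absence records for five species.
sample_j = [1, 1, 0, 0, 1]
sample_k = [1, 0, 1, 0, 1]

counts = association_table(sample_j, sample_k)
smc = simple_matching(sample_j, sample_k)
cj = jaccard_coefficient(sample_j, sample_k)
cc = coefficient_of_community(sample_j, sample_k)
print(counts, smc, cj, cc)
```

The example shows how the coefficients differ in emphasis: simple matching credits joint absences (d), while the Jaccard and community coefficients ignore them, and the community coefficient double-weights joint presences.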
Choosing a Distance Coefficient
- Availability: many choices, but not all, are available in most computer programs.
- Compatibility: city-block measures are not compatible with many multivariate procedures (DA, CANCOR, CCA).
- Theoretical basis: very little; Euclidean vs. city-block distances in species space.
- Intuitive criteria: effects of outliers; sensitivity with increasing heterogeneity.
[Figure: panels in X1-X2 space, with a centroid marked, illustrating Euclidean distance, city-block distance, and correlation distance.]
Ecological Distance Blues
- Which resemblance measure should I use?
- Should I standardize my data before calculating resemblance?
- If so, which standardization should I use: column standardization, row standardization, or both (Wisconsin); based on norm, range, total, or max, etc.? And how will this affect my interpretation of ecological distance?
And we are just getting started!