the multivariate dataset 2-way data matrix · original data matrix dissimilarity matrix euclidean...


The Multivariate Dataset

2-way data matrix

Obs   Group   X-set                    Y-set
1     A       a11 a12 a13 ... a1p      b11 b12 b13 ... b1m
2     A       a21 a22 a23 ... a2p      b21 b22 b23 ... b2m
3     A       a31 a32 a33 ... a3p      b31 b32 b33 ... b3m
.     .       .   .   .   ... .        .   .   .   ... .
.     .       .   .   .   ... .        .   .   .   ... .
n     A       an1 an2 an3 ... anp      bn1 bn2 bn3 ... bnm
n+1   C       c11 c12 c13 ... c1p
n+2   C       c21 c22 c23 ... c2p
n+3   C       c31 c32 c33 ... c3p
.     .       .   .   .   ... .
.     .       .   .   .   ... .
N     C       cn1 cn2 cn3 ... cnp

The Multivariate Dataset: 2-way data matrix

[Figure: six sites, numbered 1 to 6, and the corresponding sites-by-species 2-way data matrix.]

The Multivariate Dataset: Geometric Representation

• Each site can be represented as a point in a p-dimensional space based on its measured values along each of the p species axes.

• The collection of points forms a "data cloud" in this p-dimensional space.

• The shape, clumping and dispersion of this data cloud contain the ecological information we seek to describe.

[Figure: "Data Space", with Sites 1, 4 and 5 plotted as points along the species axes (Species A labeled).]

Ecological Resemblance

How ecologically similar (or dissimilar) is each site to every other site?

[Figure: the six sites, numbered 1 to 6.]


• Similarity characterizes the number of attributes two objects share relative to the total list of attributes between them.

• Objects that have everything in common are identical and have a similarity of 1.0; objects that have nothing in common have a similarity of 0.0.

• Dissimilarity is the complement of similarity: it characterizes the number of attributes unique to one object or the other relative to the total list of attributes between them.

• In general, dissimilarity can be calculated as 1 - similarity.

Similarity and dissimilarity both range from 0 to 1.
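As a minimal sketch of these definitions, the ratio can be computed for two sets of attributes. The function names and the species sets below are hypothetical, and the ratio used (shared attributes over the total list of attributes) is only one reading of the definition above, the one usually called the Jaccard form.

```python
def similarity(attrs_a, attrs_b):
    """Shared attributes divided by the total list of attributes (Jaccard form)."""
    a, b = set(attrs_a), set(attrs_b)
    total = a | b
    if not total:
        raise ValueError("both objects have no attributes")
    return len(a & b) / len(total)


def dissimilarity(attrs_a, attrs_b):
    """Complement of similarity."""
    return 1.0 - similarity(attrs_a, attrs_b)


# Hypothetical species lists for three sites.
site_1 = {"A", "B", "C"}
site_2 = {"B", "C", "D"}
site_3 = {"E", "F"}

print(similarity(site_1, site_1))     # identical objects
print(similarity(site_1, site_2))     # 2 shared attributes out of 4 in total
print(dissimilarity(site_1, site_3))  # nothing in common
```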

Ecological Distance

• Distance is a geometric conception of the proximity of objects in a high-dimensional space defined by measurements on the attributes.

• How we measure proximity, however, varies among distance measures.

[Figure: "Data Space", with Sites 1, 4 and 5 plotted along the species axes (Species A labeled); the distance between sites is marked "?".]

Original Data Matrix (6×4)

        Species
Sites   A     B     C     D
1       1     9     12    1
2       1     8     11    1
3       1     6     10    10
4       10    0     9     10
5       10    2     8     10
6       10    0     7     2

Dissimilarity Matrix (6×6)

        Sites
Sites   1      2      3      4      5      6
1       0
2       1.4    0
3       9.7    9.3    0
4       15.9   15.2   10.9   0
5       15.1   14.4   10.0   2.2    0
6       13.7   12.7   13.8   8.2    8.3    0

Ecological Distance versus Dissimilarity

• In practice, distances and dissimilarities are often used interchangeably. They have quite distinct properties, however.

• Dissimilarities are always bounded [0,1]; e.g., once plots have no species in common they can be no more dissimilar.

• Distances are typically unbounded on the upper end; e.g., plots that have no species in common have distances that depend on the number and abundance of species in the plots, and are thus variable.

The Resemblance Transformation

• A resemblance matrix contains a resemblance coefficient for every pair of entities. The result is an entities-by-entities resemblance matrix.

Original data matrix (6×4, sites by species) → Euclidean distance → Dissimilarity matrix (6×6, sites by sites)

$$ED_{jk} = \sqrt{\sum_{i=1}^{p} (x_{ij} - x_{ik})^2}$$

Here x_ij is the value of species i in sample j, and p is the number of species.

[Figure: "Dissimilarity Space", with sites plotted by their dissimilarities to the other sites (axis labeled "Site 1 dissimilarity"; Sites 5 and 6 shown).]
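The transformation from the 6×4 data matrix to the 6×6 dissimilarity matrix can be sketched in a few lines. The data values are those of the example matrix in these slides; the function names are illustrative only, and plain Python is used rather than any particular statistics package.

```python
import math

# Sites-by-species data matrix from the example (rows: sites 1-6, columns: species A-D).
data = [
    [1, 9, 12, 1],
    [1, 8, 11, 1],
    [1, 6, 10, 10],
    [10, 0, 9, 10],
    [10, 2, 8, 10],
    [10, 0, 7, 2],
]


def euclidean(u, v):
    """Square root of the summed squared differences across species."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows, dist):
    """Entities-by-entities resemblance matrix for any pairwise measure."""
    return [[dist(r, s) for s in rows] for r in rows]


ed = distance_matrix(data, euclidean)

# Print the lower triangle, one row per site, rounded to one decimal.
for i, row in enumerate(ed):
    print(i + 1, " ".join(f"{d:5.1f}" for d in row[: i + 1]))
```

The printed lower triangle should agree, to one decimal, with the dissimilarity matrix given for the example data.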

• There is a large number of resemblance measures to choose from.

• The choice of a coefficient will frequently be guided by the type of data, the ecological question, or the type of analysis.

• When the measurement scale is such that several possible coefficients could be used, the choice is often guided by personal preference.

• It is generally advantageous to try several different measures and weigh the results using ecological criteria (i.e., which one produces the most meaningful and interpretable results).

Euclidean Distance

$$ED_{jk} = \sqrt{\sum_{i=1}^{p} (x_{ij} - x_{ik})^2}$$

• Intuitively, the most appealing distance measure.

• Data are usually column standardized first to remove differences due to measurement units and scale (if necessary).

• Can be applied to data of any scale.

• Has true 'metric' properties and is used in eigenvector ordinations.

[Figure: Euclidean distance between points A and B in data space.]

Euclidean Distance (continued)

• Often performs poorly in ecological applications due to several problems:

  • Assumes that variables are uncorrelated (never so).

  • Emphasizes outliers.

  • Loses sensitivity more rapidly than other distance measures as heterogeneity of data increases.

  • Measures distance through potentially uninhabitable ecological space.

  • Nonproportional distance measure.

City-block (Manhattan) Distance

$$CB_{jk} = \sum_{i=1}^{p} |x_{ij} - x_{ik}|$$

• Most ecologically meaningful dissimilarity measures are of Manhattan type.

• Compared to ED, gives less weight to outliers (no squared differences).

• Compared to ED, retains sensitivity as the heterogeneity of a data set increases.

• But still a nonproportional distance.

[Figure: city-block distance between two points in data space.]
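The point about outliers can be illustrated with a small sketch. The three samples below are made up for the purpose: both comparisons have the same city-block distance, but Euclidean distance doubles when the whole difference is concentrated in a single species.

```python
import math


def city_block(u, v):
    """Sum of absolute differences across species (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the summed squared differences across species."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical abundances of four species in three samples.
reference = [0, 0, 0, 0]
spread = [2, 2, 2, 2]    # difference spread evenly over all species
outlier = [8, 0, 0, 0]   # same total difference, all in one species

for name, sample in (("spread", spread), ("outlier", outlier)):
    print(name, city_block(reference, sample), euclidean(reference, sample))
```

Squaring the differences is what gives the single large difference its extra weight under ED.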


Proportional Distance Coefficients

E.g., Percentage Dissimilarity (Sorensen Distance or Bray-Curtis Distance)

$$PD_{jk} = 100 \, \frac{\sum_{i=1}^{p} |x_{ij} - x_{ik}|}{\sum_{i=1}^{p} (x_{ij} + x_{ik})}$$

• City-block (Manhattan) distance measures expressed as proportions of the maximum distance possible.

• If two communities share no species, they have the maximum dissimilarity of one (100 on the percentage scale of the formula).

[Figure: samples A and B along an environmental gradient, with shared portion W; city-block distance between A and B.]

Variations on Percentage Dissimilarity

In the forms below, A = Σ x_ij and B = Σ x_ik are the totals of samples j and k, and w = Σ min(x_ij, x_ik) is the amount they share.

Sorensen Distance or Bray-Curtis Distance: 1 - 2w / (A + B)

$$PD_{jk} = 100 \left[ 1 - \frac{2 \sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} (x_{ij} + x_{ik})} \right]$$

Jaccard Distance: 1 - w / (A + B - w)

$$PD_{jk} = 100 \left[ 1 - \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ij} + \sum_{i=1}^{p} x_{ik} - \sum_{i=1}^{p} \min(x_{ij}, x_{ik})} \right]$$

Kulczynski Distance: 1 - (1/2)(w/A + w/B)

$$PD_{jk} = 100 \left[ 1 - \frac{1}{2} \left( \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ij}} + \frac{\sum_{i=1}^{p} \min(x_{ij}, x_{ik})}{\sum_{i=1}^{p} x_{ik}} \right) \right]$$

[Figure: samples A and B along an environmental gradient, with shared portion W.]

Percentage Dissimilarity (Sorensen Distance, Bray-Curtis Distance)

• PD is commonly used with species abundance data, but can be applied to data of any scale (e.g., presence/absence data).

• Compared to ED, PD gives less weight to outliers.

• Compared to ED, PD retains sensitivity as the heterogeneity of a data set increases.

• Unlike ED and CB, PD reaches its maximum when samples share no species.

• But PD is not 'metric' and thus not compatible with many analyses (e.g., DA, CCA).
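A minimal sketch of the three proportional coefficients, written in terms of the shared amount w and the sample totals A and B. The function names are illustrative, the percentage scale (0 to 100) is assumed, and both samples are assumed to have non-zero totals. The last lines contrast bounded PD with unbounded ED for plots that share no species.

```python
import math


def _shared_and_totals(u, v):
    """Return (w, A, B): shared amount and the two sample totals."""
    w = sum(min(a, b) for a, b in zip(u, v))
    return w, sum(u), sum(v)


def sorensen(u, v):
    """Percentage dissimilarity (Sorensen / Bray-Curtis form)."""
    w, A, B = _shared_and_totals(u, v)
    return 100 * (1 - 2 * w / (A + B))


def jaccard(u, v):
    w, A, B = _shared_and_totals(u, v)
    return 100 * (1 - w / (A + B - w))


def kulczynski(u, v):
    w, A, B = _shared_and_totals(u, v)
    return 100 * (1 - 0.5 * (w / A + w / B))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Sites 1 and 4 of the example data matrix (species A-D).
site_1 = [1, 9, 12, 1]
site_4 = [10, 0, 9, 10]
print(sorensen(site_1, site_4), jaccard(site_1, site_4), kulczynski(site_1, site_4))

# Hypothetical plots with no species in common: PD stays at its maximum,
# while ED grows with abundance.
for u, v in (([5, 0], [0, 3]), ([50, 0], [0, 30])):
    print(sorensen(u, v), euclidean(u, v))
```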

Euclidean Distance Based on Species Profiles: Chord Distance

$$x^{*}_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{p} x_{ij}^{2}}} \qquad ED_{jk} = \sqrt{\sum_{i=1}^{p} (x^{*}_{ij} - x^{*}_{ik})^{2}}$$

$$Chord_{jk} = \sqrt{\sum_{i=1}^{p} \left( \frac{x_{ij}}{\sqrt{\sum_{i=1}^{p} x_{ij}^{2}}} - \frac{x_{ik}}{\sqrt{\sum_{i=1}^{p} x_{ik}^{2}}} \right)^{2}}$$

• Similar conceptually to ED, but the data are adjusted so that the sums of squares for each row are 1 (i.e., row normalization).

• Useful in species abundance data; effectively removes differences in overall abundances among samples, instead focusing analysis on the differences in relative abundances among species (i.e., species profiles).

[Figure: points A* and B* after row normalization, on axes scaled from 0 to 1.]
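Chord distance can be sketched as Euclidean distance computed after each sample is divided by the square root of its sum of squares. The function names and the sample vectors are illustrative, and each sample is assumed to contain at least one non-zero value.

```python
import math


def normalize(u):
    """Scale a sample so that its sum of squares is 1."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chord(u, v):
    """Euclidean distance between the normalized samples."""
    return euclidean(normalize(u), normalize(v))


# Hypothetical samples: same species profile, tenfold difference in abundance.
sparse = [1, 2, 3]
dense = [10, 20, 30]
print(euclidean(sparse, dense), chord(sparse, dense))

# Samples with no species in common reach the maximum, the square root of 2.
print(chord([1, 0], [0, 5]))
```

The first pair shows the effect described above: a large ED driven purely by overall abundance, and a chord distance of essentially zero.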


Chi-square Distance

$$x^{*}_{ij} = \sqrt{x_{++}} \, \frac{x_{ij}}{x_{i+} \sqrt{x_{+j}}}$$

• ED computed on relative abundances (species profiles), weighted by the inverse of the square root of column totals, adjusted by the square root of the matrix total (row chi-square standardization).

Species Profile Distance

$$x^{*}_{ij} = \frac{x_{ij}}{x_{i+}}$$

• ED computed on relative abundances (species profiles or row total standardization).

Hellinger Distance

$$x^{*}_{ij} = \sqrt{\frac{x_{ij}}{x_{i+}}}$$

• ED computed on the square root of relative abundances (species profiles or row Hellinger standardization).

In these three standardizations, i = row (site) and j = column (species); x_{i+} is the total of row i, x_{+j} the total of column j, and x_{++} the matrix total.

All these distance measures work well with species data, preserve the 'metric' distance, and are Euclidean.

Proportional Distance Based on Species Profiles

Relative Percentage Dissimilarity (Relative Sorensen, Relative Bray-Curtis)

$$PD_{jk} = 100 \left[ 1 - \sum_{i=1}^{p} \min\left( \frac{x_{ij}}{\sum_{i=1}^{p} x_{ij}}, \frac{x_{ik}}{\sum_{i=1}^{p} x_{ik}} \right) \right]$$

In this formula, as in the other PD formulas, i indexes species and j and k index samples.

• PD can be computed on standardized data, similar to chord distance, in which species abundances are divided by sample totals so that each sample contributes equally regardless of total abundance.

• This measure shifts the emphasis of the analysis to proportions of species in a sample unit, rather than absolute abundances.
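The row standardizations behind these measures can be sketched as follows. The function names and the small matrix are illustrative, rows are sites and columns are species, and all row and column totals are assumed to be non-zero. Each distance is then ordinary Euclidean distance on the standardized rows, and relative percentage dissimilarity is computed directly from the sample proportions.

```python
import math


def species_profiles(X):
    """Divide each value by its row (site) total."""
    return [[x / sum(row) for x in row] for row in X]


def hellinger(X):
    """Square root of the relative abundances."""
    return [[math.sqrt(x / sum(row)) for x in row] for row in X]


def chi_square(X):
    """Relative abundances weighted by 1/sqrt(column total), times sqrt(matrix total)."""
    col_totals = [sum(col) for col in zip(*X)]
    grand_total = sum(col_totals)
    return [
        [math.sqrt(grand_total) * x / (sum(row) * math.sqrt(c))
         for x, c in zip(row, col_totals)]
        for row in X
    ]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relative_pd(u, v):
    """Percentage dissimilarity on sample proportions (relative Sorensen)."""
    pu = [a / sum(u) for a in u]
    pv = [b / sum(v) for b in v]
    return 100 * (1 - sum(min(a, b) for a, b in zip(pu, pv)))


X = [
    [1, 9, 12, 1],
    [2, 18, 24, 2],   # same proportions as the first row, doubled abundance
    [10, 0, 9, 10],
]

for standardize in (species_profiles, hellinger, chi_square):
    Z = standardize(X)
    print(standardize.__name__, euclidean(Z[0], Z[1]), euclidean(Z[0], Z[2]))
print("relative_pd", relative_pd(X[0], X[1]), relative_pd(X[0], X[2]))
```

Rows 1 and 2 differ only in total abundance, so every profile-based distance between them is effectively zero, while raw ED between them is not.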


Pearson's Product-Moment Correlation: Correlation Distance

$$r_{jk} = \frac{\sum_{i=1}^{p} (x_{ij} - \bar{x}_{j})(x_{ik} - \bar{x}_{k})}{\sqrt{\sum_{i=1}^{p} (x_{ij} - \bar{x}_{j})^{2} \, \sum_{i=1}^{p} (x_{ik} - \bar{x}_{k})^{2}}}$$

CD_jk = (1 - r_jk)/2 or 1 - |r_jk|

• Limited use for community data, but may be ideal with multivariate normal data and linear relationships.

• Intuitively appealing because of familiarity with correlation coefficients.

Correlation Distance (continued)

• Generally only useful when the similarity in average "profile shapes" is considered more important than average "profile levels", because correlation distance is 0 when 2 profiles are parallel, irrespective of how far apart they are in data space.

[Figure: standardized values of Sample j and Sample k across species A to D; the two profiles are parallel, so r_jk = 1 and CD_jk = 0.]
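A short sketch of the parallel-profiles point, using the (1 - r)/2 form of correlation distance. The function names are illustrative, and the second sample is a made-up copy of the first shifted upward by 5 in every species; both samples are assumed to have non-zero variance.

```python
import math


def pearson(u, v):
    """Product-moment correlation between two samples across species."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def correlation_distance(u, v):
    """(1 - r) / 2, which runs from 0 (r = 1) to 1 (r = -1)."""
    return (1 - pearson(u, v)) / 2


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


sample_j = [1, 9, 12, 1]
sample_k = [x + 5 for x in sample_j]   # parallel profile at a higher level

print(pearson(sample_j, sample_k))
print(correlation_distance(sample_j, sample_k))
print(euclidean(sample_j, sample_k))
```

The two samples are 10 units apart in data space, yet their correlation distance is zero because only profile shape is compared.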


Mahalanobis Distance

$$D^{2}_{jk} = (X_{j} - X_{k})' \, S_{x}^{-1} \, (X_{j} - X_{k})$$

X_j - X_k = vector of differences between sites j and k

S_x = covariance matrix of X

• Accounts for the variance-covariance structure of data; inversely weights distance by the variance on each axis and the covariance between axes, so that distance is greater in case B.

• Reduces to ED when the variables are uncorrelated and have unit variance (i.e., when S_x is the identity matrix).

• Commonly used for multivariate outlier detection.

[Figure: Case A and Case B; MD is greater in Case B.]

Worked example

Data:

Sites   A     B     C
1       1     9     12
2       1     8     11
3       1     7     10
4       10    0     9
5       10    2     8
6       10    1     7

Covariance matrix, S_x:

        A       B       C
A       24.3    -18.9   -8.1
B       -18.9   15.5    6.5
C       -8.1    6.5     3.5

Inverse, S_x^-1:

        A       B       C
A       0.823   0.926   0.185
B       0.926   1.333   -0.333
C       0.185   -0.333  1.333

Site 1-2 differences: [0, 1, 1]

$$D^{2}_{12} = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0.823 & 0.926 & 0.185 \\ 0.926 & 1.333 & -0.333 \\ 0.185 & -0.333 & 1.333 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = 2$$

D² matrix:

Sites   1       2       3       4       5       6
1       0.000
2       2.000   0.000
3       8.000   2.000   0.000
4       8.667   6.667   8.667   0.000
5       4.667   4.667   8.667   8.000   0.000
6       8.667   4.667   4.667   8.000   2.000   0.000
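The worked example can be checked with a short sketch in plain Python. The data are the six sites and three species of the example; the function names are illustrative, the covariance uses the n - 1 denominator (which reproduces the covariance matrix shown), and the small Gauss-Jordan inverse is only adequate for a well-conditioned matrix like this one.

```python
def covariance(X):
    """Sample covariance matrix (n - 1 denominator) of the columns of X."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    p = len(means)
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in X) / (n - 1)
         for b in range(p)]
        for a in range(p)
    ]


def inverse(M):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mahalanobis_sq(u, v, S_inv):
    """Squared Mahalanobis distance: d' S^-1 d for the difference vector d."""
    d = [a - b for a, b in zip(u, v)]
    return sum(d[a] * S_inv[a][b] * d[b]
               for a in range(len(d)) for b in range(len(d)))


# Sites 1-6, species A-C, from the worked example.
X = [[1, 9, 12], [1, 8, 11], [1, 7, 10], [10, 0, 9], [10, 2, 8], [10, 1, 7]]

S = covariance(X)
S_inv = inverse(S)
print([round(v, 3) for v in S_inv[0]])
print(round(mahalanobis_sq(X[0], X[1], S_inv), 3))  # sites 1 and 2
print(round(mahalanobis_sq(X[0], X[3], S_inv), 3))  # sites 1 and 4
```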

Mahalanobis Distance Between Groups

$$D^{2}_{fh} = (n - g) \sum_{i=1}^{p} \sum_{j=1}^{p} w_{ij} \, (\bar{x}_{if} - \bar{x}_{ih})(\bar{x}_{jf} - \bar{x}_{jh})$$

n = number of sites; g = number of groups; x̄_if = mean of variable i in group f; w_ij = element of the inverse of the pooled within-groups variance-covariance matrix of X for variables i and j.

                Species
Sites   Group   A     B     C     D
1       f       1     9     12    1
2       f       1     8     11    1
3       f       1     7     10    10
4       h       10    0     9     10
5       h       10    3     8     10
6       h       10    0     7     1

• MD is commonly used to measure distance between groups (e.g., in discriminant analysis).

• MD inversely weights the distance between group centroids by the variance, so the distance is greater in case B than in case A, even though the centroids are equidistant in hyperspace.

[Figure: Case A and Case B.]

Association Coefficients

Association table:

                        Sample k
                        Present   Absent   Total
Sample j    Present     a         b        a+b
            Absent      c         d        c+d
            Total       a+c       b+d      p

Simple Matching Coefficient (CSMC): IA_JK = (a + d) / p

Coefficient of Jaccard (CCJ): IA_JK = a / (a + b + c)

Coefficient of Community (CCC): IA_JK = 2a / (2a + b + c)

• Applied to categorical data.

• Measure agreement between 2 rows representing 2 sample entities.

• May be special cases of distance measures.

• Most measures are for binary (presence/absence) data.

• Different association coefficients emphasize different aspects of the agreement between samples.
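The association table and the three coefficients can be sketched for a pair of presence/absence samples. The function names and the two samples are hypothetical; b and c count species present in only one of the two samples, and because the three coefficients treat b and c symmetrically, it does not matter here which sample is taken first.

```python
def association_table(j, k):
    """Return (a, b, c, d) for two presence/absence (1/0) samples."""
    a = sum(1 for x, y in zip(j, k) if x and y)          # present in both
    b = sum(1 for x, y in zip(j, k) if x and not y)      # present in j only
    c = sum(1 for x, y in zip(j, k) if not x and y)      # present in k only
    d = sum(1 for x, y in zip(j, k) if not x and not y)  # absent from both
    return a, b, c, d


def simple_matching(j, k):
    a, b, c, d = association_table(j, k)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(j, k):
    a, b, c, _ = association_table(j, k)
    return a / (a + b + c)


def community_coefficient(j, k):
    a, b, c, _ = association_table(j, k)
    return 2 * a / (2 * a + b + c)


# Hypothetical presence/absence records for six species.
sample_j = [1, 1, 0, 0, 1, 0]
sample_k = [1, 0, 1, 0, 1, 0]

print(association_table(sample_j, sample_k))
print(simple_matching(sample_j, sample_k))
print(jaccard_coefficient(sample_j, sample_k))
print(community_coefficient(sample_j, sample_k))
```

Simple matching counts joint absences (d) as agreement, whereas the other two ignore them, which is one of the "different aspects of agreement" the coefficients emphasize.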

Choosing a Distance Coefficient?

• Availability - many choices, but not all, in most computer programs.

• Compatibility - city-block measures not compatible with many multivariate procedures (DA, CANCOR, CCA).

• Theoretical basis - very little; Euclidean vs city-block distances in species space.

• Intuitive criteria - effects of outliers; sensitivity with increasing heterogeneity.

[Figure: three panels on axes X1 and X2 showing Euclidean distance, city-block distance and correlation distance relative to a centroid.]

Ecological Distance Blues

• Which resemblance measure should I use?

• Should I standardize my data before calculating resemblance?

• If so, which standardization should I use: column standardization, row standardization, or both (Wisconsin); based on norm, range, total, or max, etc.? And how will this affect my interpretation of ecological distance?

And we are just getting started!
