![Page 1: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/1.jpg)
Using distances to address the challenges of heterogeneous data
Susan Holmes, http://www-stat.stanford.edu/~susan/
Bio-X and Statistics, Stanford University
July 29, 2015
![Page 2: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/2.jpg)
The messes we deal with
![Page 3: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/3.jpg)
Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way.
![Page 4: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/4.jpg)
Goals in Modern Biology: Systems Approach
Look at the data / all the data: data integration
![Page 5: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/5.jpg)
[Figure: plot titled "Tumor Cells" (axes from 0 to 20000) and a matrix with entries -1, 0 and 1.]
![Page 6: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/6.jpg)
What do statisticians do?
▶ Design new experiments to test scientific hypotheses.
▶ Visualize and summarize data in ways that account for uncertainties.
▶ Look for meaningful differences or structure in high dimensional noisy data.
▶ Predict the class of new observations given previously observed ones.
▶ Predict the value of a response variable given a whole set of other explanatory variables.
▶ Combine different sources of data to understand complex interactions.
![Page 7: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/7.jpg)
Today's challenge
▶ Data are not uniformly distributed from some manifold.
▶ Data are not an identically distributed random sample.
▶ Data are not independent.
![Page 8: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/8.jpg)
Data can often be seen as points in a state space
[Figure: observations $x_1, x_2, x_3, \ldots, x_i$ drawn as points in $\mathbb{R}^p$.]
![Page 9: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/9.jpg)
Distances in Statistics
▶ Euclidean distances, spatial distances.
▶ Weighted Euclidean distances: Mahalanobis distance for discriminant analysis.
▶ Chi-square distances for contingency tables and discrete data.
▶ Jaccard distance for presence/absence data, one of 50 distances used in ecology.
▶ Earth Mover's distance on trees or graphs.
▶ Biologically meaningful distances (DNA, haplotype, proteins).
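Three of the distances listed above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the toy vectors and the covariance matrix are all assumptions chosen so the values are easy to check by hand.

```python
import numpy as np

def euclidean(x, y):
    """Ordinary Euclidean distance between two vectors."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def mahalanobis(x, y, cov):
    """Weighted Euclidean distance with weights given by the inverse covariance."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def jaccard(a, b):
    """Jaccard distance between two presence/absence vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention for two all-absent vectors (an assumption)
    return float(1.0 - np.logical_and(a, b).sum() / union)

# Toy inputs (hypothetical, chosen for easy hand-checking).
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
cov = np.array([[9.0, 0.0], [0.0, 16.0]])

d_euc = euclidean(x, y)                        # 3-4-5 triangle
d_mah = mahalanobis(x, y, cov)                 # each axis rescaled by its variance
d_jac = jaccard([1, 1, 0, 1], [1, 0, 0, 1])    # 2 shared out of 3 present
```

The Mahalanobis value is smaller than the Euclidean one here because both coordinates have variance larger than one, so differences along them count for less.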
![Page 10: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/10.jpg)
What do statisticians use distances for?
▶ Summaries through Fréchet means and medians and pseudovariances.
▶ Center of a cloud of objects $T_k$ (equal weights): find $T_0$ that minimizes either $\sum_{k=1}^{K} d^2(T_0, T_k)$, which is the ($L_2$) definition of the Fréchet mean object,
▶ or $\sum_{k=1}^{K} d(T_0, T_k)$ ($L_1$ or geometric median).
▶ Pseudovariance $= \frac{1}{K-1} \sum_{k=1}^{K} d^2(T_0, T_k) = s^2$.
▶ Dimension reduction and visualization.
▶ Nearest neighbor methods.
▶ Clustering.
▶ Make network edges from close points.
▶ Prediction by minimizing weighted residual distances.
▶ Cross-products: correlations, autocorrelations.
▶ Generalizations of analysis of variance.
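The Fréchet summaries only need a distance, not coordinates. The sketch below assumes, purely for illustration, that the minimizer $T_0$ is searched among the observed objects themselves (a medoid-style restriction that the slides do not state); the function name and the toy points are hypothetical.

```python
def frechet_summaries(dist):
    """Given a K x K distance matrix (list of lists), return
    (index of L2 Frechet mean, index of L1 median, pseudovariance),
    with the minimization restricted to the K observed objects."""
    K = len(dist)
    sum_sq = [sum(d * d for d in row) for row in dist]   # sum_k d^2(T_0, T_k)
    sum_abs = [sum(row) for row in dist]                 # sum_k d(T_0, T_k)
    mean_idx = min(range(K), key=sum_sq.__getitem__)
    median_idx = min(range(K), key=sum_abs.__getitem__)
    pseudovariance = sum_sq[mean_idx] / (K - 1)
    return mean_idx, median_idx, pseudovariance

# Toy example: five points on a line, the last one an outlier.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
mean_idx, median_idx, s2 = frechet_summaries(dist)
```

In this toy example the $L_2$ center is pulled toward the outlier (it lands on the point 3) while the $L_1$ median stays at the point 2.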
![Page 11: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/11.jpg)
What do statisticians use distances for?
Finding the right distance usually solves the statistical problem.
![Page 12: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/12.jpg)
Part I
The Geometries of Data
![Page 13: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/13.jpg)
First example: cell segmentation
Joint work with Adam Kapelner and PP Lee.
Stained biopsy slides. Multispectral imaging (8 levels/wavelengths).
Stained lymph node. Aim: to identify cells.
Points similar in feature space are of the same type.
![Page 14: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/14.jpg)
Problem: Staining is heterogeneous
Both images are from the same image set. The stained cells are cancer cells stained red with Fast Red. Some regions of the tissue stain like the image on the left and other regions stain like the image on the right. This shows the level of heterogeneity. These are two "subclasses" of the same phenotype (the left is named subclass "A", the right, subclass "B").
![Page 15: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/15.jpg)
Problem: Staining is heterogeneous
Extreme variability in the image colors/intensity/contrast. Pixels from the same cell are not independent and identically distributed across the different slides or across different cell types.
Simple nearest neighbor approach:
- Take the 8-dimensional pixel points.
- Assign each point to its closest neighbor.
![Page 17: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/17.jpg)
[Figure: scatter plot with axes Orange and Red.]
![Page 18: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/18.jpg)
[Figure: the same scatter plot (axes Orange and Red) with a new point ● at (3.2, 2).]
![Page 20: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/20.jpg)
[Figure: scatter plot (axes Orange and Red) with the point ● at (3.2, 2).]
$D_1^2(p, m_1) = 19.7$, $D_2^2(p, m_2) = 16$
![Page 21: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/21.jpg)
Multivariate Normal Data
Mahalanobis transformation. Several different clusters with different variance-covariance matrices and different means: $(\mu_1, \Sigma_1)$, $(\mu_2, \Sigma_2)$.
$$D_1^2(x, \mu_1) = (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)$$
$$D_2^2(x, \mu_2) = (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)$$
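A small sketch of how the two squared Mahalanobis distances can be used to assign a point to a cluster. The means, covariances and test point are made-up values (not the pixel data from the slides), picked so that the Euclidean and Mahalanobis answers differ.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

def nearest_cluster(x, clusters):
    """Return the index of the cluster with the smallest squared distance,
    together with all squared distances."""
    d2 = [mahalanobis_sq(x, mu, sigma) for mu, sigma in clusters]
    return int(np.argmin(d2)), d2

# Hypothetical clusters: a tight one at the origin, a wide one centered at (6, 0).
mu1, sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, sigma2 = np.array([6.0, 0.0]), np.array([[16.0, 0.0], [0.0, 1.0]])

x = np.array([2.5, 0.0])
label, d2 = nearest_cluster(x, [(mu1, sigma1), (mu2, sigma2)])
```

The point is closer to the first mean in Euclidean terms (2.5 versus 3.5), yet it is assigned to the second cluster because that cluster is much more spread out along the first axis.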
![Page 22: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/22.jpg)
Corresponding Data Transformation
$$H = I - \mathbf{1} D_n \mathbf{1}^T, \qquad S = X' H D_n H X$$
$$z_{i\cdot} = S^{-\frac{1}{2}} (x_{i\cdot} - \bar{x})$$
This is sometimes called `data sphering'.
[Figure: scatter plot of the sphered data, axes Sphered Orange and Sphered Red.]
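Sphering can be sketched as follows, under the assumption of uniform observation weights $D_n = \frac{1}{n} I$ and a nonsingular $S$; the simulated data and the function name are illustrative only.

```python
import numpy as np

def sphere(X):
    """Center the rows of X and multiply by S^{-1/2}, where S is the
    covariance matrix computed with uniform weights 1/n (an assumption)."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n
    # Symmetric inverse square root through the eigendecomposition of S.
    eigvals, eigvecs = np.linalg.eigh(S)
    S_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return centered @ S_inv_sqrt

# Simulated correlated data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Z = sphere(X)
cov_Z = Z.T @ Z / Z.shape[0]   # should be the identity matrix
```

After sphering, ordinary Euclidean distances between rows of `Z` equal Mahalanobis distances between the corresponding rows of `X`.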
![Page 23: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/23.jpg)
![Page 24: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/24.jpg)
Output Data: Tumor
[Figure: plot titled "Tumor Cells", both axes from 0 to 20000.]
Number of tumor cells: 27,822
![Page 25: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/25.jpg)
We can add information through choice of distances
Sample data can often be seen as points in a state space $\mathbb{R}^p$. Variables are `vectors' in data point space $\mathbb{R}^n$.
[Figure: left panel, observations as points in $\mathbb{R}^p$; right panel, variables as vectors in $\mathbb{R}^n$.]
$$x^t Q y = \langle x, y \rangle_Q \qquad x^t D y = \langle x, y \rangle_D$$
Duality: transposable data.
![Page 26: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/26.jpg)
Data Analysis: Geometrical Approach
i. The data are $p$ variables measured on $n$ observations.
ii. $X$ with $n$ rows (the observations) and $p$ columns (the variables).
iii. $D$ is an $n \times n$ matrix of weights on the "observations", which is most often diagonal but not always.
iv. Symmetric positive definite matrix $Q$, weights on variables, often
$$Q = \begin{pmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 & \cdots \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 & \cdots \\ 0 & 0 & \ddots & 0 & \cdots \\ \vdots & \vdots & \vdots & 0 & \frac{1}{\sigma_p^2} \end{pmatrix}.$$
[Figure: observations as points in $\mathbb{R}^p$.]
![Page 27: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/27.jpg)
Euclidean Space and dimension reduction
These three matrices form the essential "triplet" $(X, Q, D)$ defining a multivariate data analysis. $Q$ and $D$ define geometries or inner products in $\mathbb{R}^p$ and $\mathbb{R}^n$, respectively, through
$$x^t Q y = \langle x, y \rangle_Q, \quad x, y \in \mathbb{R}^p$$
$$x^t D y = \langle x, y \rangle_D, \quad x, y \in \mathbb{R}^n.$$
This can be extended to more inner products, giving what is known as kernel methods.
![Page 28: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/28.jpg)
Principal Component Analysis: Dimension Reduction
PCA seeks to replace the original (centered) matrix $X$ by a matrix of lower rank; this can be solved using the singular value decomposition of $X$:
$$X = U S V', \text{ with } U' D U = I_n \text{ and } V' Q V = I_p \text{ and } S \text{ diagonal}$$
$$X X' = U S^2 U', \text{ with } U' D U = I_n \text{ and } S^2 = \Lambda$$
PCA is a linear nonparametric multivariate method for dimension reduction. $D$ and $Q$ are the relevant metrics on the dual row and column spaces of $n$ samples and $p$ variables.
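One way to obtain a decomposition with these weighted orthogonality constraints is to rescale $X$, take an ordinary SVD, and scale back. The sketch below assumes diagonal $Q$ and $D$ with positive entries and returns the thin decomposition (so the identities have the size of the number of columns, not $n$); the function name and the simulated data are hypothetical.

```python
import numpy as np

def generalized_svd(X, q, d):
    """Thin SVD of the triplet (X, Q, D) for diagonal metrics.
    q: positive variable weights (diagonal of Q), length p.
    d: positive observation weights (diagonal of D), length n.
    Returns U, s, V with X = U diag(s) V', U' D U = I and V' Q V = I."""
    X_tilde = np.sqrt(d)[:, None] * X * np.sqrt(q)[None, :]   # D^{1/2} X Q^{1/2}
    U_tilde, s, Vt_tilde = np.linalg.svd(X_tilde, full_matrices=False)
    U = U_tilde / np.sqrt(d)[:, None]      # D^{-1/2} U_tilde
    V = Vt_tilde.T / np.sqrt(q)[:, None]   # Q^{-1/2} V_tilde
    return U, s, V

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # centered data matrix
d = np.full(n, 1.0 / n)         # uniform observation weights
q = 1.0 / X.var(axis=0)         # Q = diag(1 / sigma_j^2)

U, s, V = generalized_svd(X, q, d)
```

With these particular weights the analysis corresponds to PCA on standardized variables; other choices of `q` and `d` give other members of the same family.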
![Page 29: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/29.jpg)
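A Commutative Diagram Approach
Caillez and Pages, 1976. Escoufier, 1977.
Statisticians search for approximations with certain properties; for the case of PCA, for instance, we rephrase the problem as follows:
▶ $Q$ can be seen as a linear function from $\mathbb{R}^p$ to $\mathbb{R}^{p*} = L(\mathbb{R}^p)$, the space of scalar linear functions on $\mathbb{R}^p$.
▶ $D$ can be seen as a linear function from $\mathbb{R}^n$ to $\mathbb{R}^{n*} = L(\mathbb{R}^n)$.
▶ $V = X^t D X$ and $W = X Q X^t$ complete the diagram:
$$\begin{array}{ccc}
\mathbb{R}^{p*} & \xrightarrow{\;X\;} & \mathbb{R}^{n} \\
Q \uparrow \; \downarrow V & & D \downarrow \; \uparrow W \\
\mathbb{R}^{p} & \xleftarrow{\;X^t\;} & \mathbb{R}^{n*}
\end{array}$$
This duality gives `transposable' data.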
![Page 30: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/30.jpg)
Properties of the Diagram
Rank of the diagram: $X$, $X^t$, $VQ$ and $WD$ all have the same rank.
For $Q$ and $D$ symmetric matrices, $VQ$ and $WD$ are diagonalisable and have the same eigenvalues.
$$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots \ge \lambda_r \ge 0 \ge \cdots \ge 0.$$
Eigendecomposition of the diagram: $VQ$ is $Q$-symmetric, thus we can find $Z$ such that
$$VQZ = Z\Lambda, \quad Z^t Q Z = I_p, \quad \text{where } \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p). \tag{1}$$
Modern extensions to this approach include kernel methods in machine learning.
![Page 31: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/31.jpg)
Predicting and Summarizing through distances
![Page 32: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/32.jpg)
Comparing Two Diagrams: the RV coefficient
Many problems can be rephrased in terms of comparison of two "duality diagrams" or, put more simply, two characterizing operators, built from two "triplets", usually with one of the triplets being a response or having constraints imposed on it. Most often what is done is to compare two such diagrams, and try to get one to match the other in some optimal way. ($O = WD$)
To compare two symmetric operators, there is either a vector covariance as inner product
$$\mathrm{covV}(O_1, O_2) = \mathrm{Tr}(O_1^t O_2) = \langle O_1, O_2 \rangle$$
or a vector correlation (Escoufier, 1977)
$$RV(O_1, O_2) = \frac{\mathrm{Tr}(O_1^t O_2)}{\sqrt{\mathrm{Tr}(O_1^t O_1)\,\mathrm{Tr}(O_2^t O_2)}}.$$
If we were to compare the two triplets $\left(X_{n \times 1}, 1, \frac{1}{n} I_n\right)$ and $\left(Y_{n \times 1}, 1, \frac{1}{n} I_n\right)$ we would have $RV = \rho^2$.
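The single-variable case can be checked numerically. The sketch below assumes both variables are centered (which the triplet notation leaves implicit) and uses simulated data; the helper names are hypothetical.

```python
import numpy as np

def operator(X, Q, D):
    """Characterizing operator O = W D with W = X Q X^t."""
    return X @ Q @ X.T @ D

def rv_coefficient(O1, O2):
    """Vector correlation between two operators."""
    num = np.trace(O1.T @ O2)
    den = np.sqrt(np.trace(O1.T @ O1) * np.trace(O2.T @ O2))
    return float(num / den)

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=(n, 1))
y = 0.6 * x + rng.normal(size=(n, 1))
x -= x.mean()   # centering is an assumption of this sketch
y -= y.mean()

Q = np.ones((1, 1))      # single variable, weight 1
D = np.eye(n) / n        # uniform observation weights

rv = rv_coefficient(operator(x, Q, D), operator(y, Q, D))
rho = np.corrcoef(x[:, 0], y[:, 0])[0, 1]
# rv agrees with rho ** 2 up to floating point error
```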
![Page 33: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/33.jpg)
Part II
Dimension Reduction: the Euclidean embedding workhorse: MDS
![Page 34: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/34.jpg)
Metric Multidimensional Scaling
Schoenberg (1935)
![Page 35: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/35.jpg)
From Coordinates to Distances and Back
If we started with original data in $\mathbb{R}^p$ that are not centered, $Y$, apply the centering matrix
$$X = HY, \text{ with } H = \left(I - \frac{1}{n} \mathbf{1}\mathbf{1}'\right), \text{ and } \mathbf{1}' = (1, 1, 1, \ldots, 1)$$
Call $B = XX'$. If $D^{(2)}$ is the matrix of squared distances between rows of $X$ in the Euclidean coordinates, we can show that
$$-\frac{1}{2} H D^{(2)} H = B$$
Schoenberg's result: exact Euclidean distance. If $B$ is positive semi-definite then $D$ can be seen as a distance between points in a Euclidean space.
![Page 36: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/36.jpg)
Reverse engineering a Euclidean embedding
We can go backwards from a matrix $D$ to $X$ by taking the eigendecomposition of $B = -\frac{1}{2} H D^{(2)} H$ in much the same way that PCA provides the best rank $r$ approximation for data by taking the singular value decomposition of $X$, or the eigendecomposition of $XX'$.
$$X^{(r)} = U S^{(r)} V' \quad \text{with} \quad S^{(r)} = \begin{pmatrix} s_1 & 0 & 0 & 0 & \cdots \\ 0 & s_2 & 0 & 0 & \cdots \\ 0 & 0 & \ddots & \ddots & \cdots \\ 0 & 0 & \cdots & s_r & \cdots \\ \vdots & \vdots & \vdots & 0 & 0 \end{pmatrix}$$
![Page 37: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/37.jpg)
Multidimensional Scaling (MDS)
Simple classical multidimensional scaling.
▶ Square $D$ elementwise: $D^{(2)} = D^2$.
▶ Compute $-\frac{1}{2} H D^{(2)} H = B$.
▶ Diagonalize $B$ to find the principal coordinates $SV'$.
▶ Choose a number of dimensions by inspecting the scree plot of the eigenvalues.
The advantage is that the original distances don't have to be Euclidean.
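The steps above can be sketched with NumPy. This is an illustrative implementation, not the code used for the talk: it returns the eigenvectors of $B$ scaled by the square roots of the eigenvalues as coordinates, and it clips negative eigenvalues to zero (a choice made here for non-Euclidean inputs). The four test points are made up.

```python
import numpy as np

def classical_mds(D, r):
    """Classical MDS of an n x n distance matrix D into r dimensions.
    Returns the coordinates and all eigenvalues of B (largest first)."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                # doubly centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scale = np.sqrt(np.clip(eigvals[:r], 0.0, None))
    return eigvecs[:, :r] * scale, eigvals

# Four corners of a 3 x 4 rectangle: distances are exactly Euclidean.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, eigvals = classical_mds(D, r=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the input distances come from points in the plane, two dimensions reproduce them exactly and the remaining eigenvalues are numerically zero; the recovered coordinates match the originals only up to rotation, reflection and translation.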
![Page 38: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/38.jpg)
Taking Categorical Data and Making it into a Continuum
Horseshoe example: joint with Persi Diaconis and Sharad Goel (Annals of Applied Stats, 2005). Data from 2005 U.S. House of Representatives roll call votes. We further restricted our analysis to the 401 Representatives that voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a 401 × 669 matrix of voting data.
The Data

|     | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|
| R1  | -1 | -1 | 1  | -1 | 0  | 1  | 1  | 1  | 1  | 1   | ... |
| R2  | -1 | -1 | 1  | -1 | 0  | 1  | 1  | 1  | 1  | 1   | ... |
| R3  | 1  | 1  | -1 | 1  | -1 | 1  | 1  | -1 | -1 | -1  | ... |
| R4  | 1  | 1  | -1 | 1  | -1 | 1  | 1  | -1 | -1 | -1  | ... |
| R5  | 1  | 1  | -1 | 1  | -1 | 1  | 1  | -1 | -1 | -1  | ... |
| R6  | -1 | -1 | 1  | -1 | 0  | 1  | 1  | 1  | 1  | 1   | ... |
| R7  | -1 | -1 | 1  | -1 | -1 | 1  | 1  | 1  | 1  | 1   | ... |
| R8  | -1 | -1 | 1  | -1 | 0  | 1  | 1  | 1  | 1  | 1   | ... |
| R9  | 1  | 1  | -1 | 1  | -1 | 1  | 1  | -1 | -1 | -1  | ... |
| R10 | -1 | -1 | 1  | -1 | 0  | 1  | 1  | 0  | 0  | 0   | ... |
![Page 39: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/39.jpg)
L1 distance
We define a distance between legislators as
$$d(l_i, l_j) = \frac{1}{669} \sum_{k=1}^{669} |v_{ik} - v_{jk}|.$$
Roughly, $d(l_i, l_j)$ is the percentage of roll calls on which legislators $l_i$ and $l_j$ disagreed.
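As a small worked example, the same formula can be applied to the ten votes shown for a few legislators in the data excerpt. Using only these 10 columns instead of all 669 roll calls is an assumption made for illustration, so the numbers below are not the distances used in the actual analysis.

```python
def l1_distance(u, v):
    """Average absolute difference between two vote vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

# First ten votes (V1..V10) for four legislators from the data excerpt.
votes = {
    "R1":  [-1, -1,  1, -1,  0, 1, 1,  1,  1,  1],
    "R3":  [ 1,  1, -1,  1, -1, 1, 1, -1, -1, -1],
    "R7":  [-1, -1,  1, -1, -1, 1, 1,  1,  1,  1],
    "R10": [-1, -1,  1, -1,  0, 1, 1,  0,  0,  0],
}

d_1_3 = l1_distance(votes["R1"], votes["R3"])
d_1_7 = l1_distance(votes["R1"], votes["R7"])
d_1_10 = l1_distance(votes["R1"], votes["R10"])
```

With the -1/0/1 coding, an outright disagreement contributes 2 to the sum and a vote against an abstention contributes 1, which is why the distance is only roughly a percentage of disagreements and can exceed 1.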
![Page 40: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/40.jpg)
[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. We used dissimilarity indices $1 - \exp(-\lambda d(R_1, R_2))$.]
![Page 41: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/41.jpg)
[Figure: 3-dimensional MDS mapping of legislators based on the 2005 U.S. House of Representatives roll call votes. Color has been added to indicate the party affiliation of each representative.]
![Page 42: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/42.jpg)
[Figure: National Journal Score (0 to 100) plotted against MDS Rank (0 to 400). Comparison of the MDS derived rank for Representatives with the National Journal's liberal score.]
![Page 43: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/43.jpg)
An Application: Visualizing Geodesic Distances between Trees
▶ Nearest Neighbor Interchange (NNI). Rotation moves.
[Figure: three trees with root 0 and leaves 1 to 4, in different branching orders.]
▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (2001) (BHV).
The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.
![Page 47: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/47.jpg)
A Cone Path
A path between two trees $T$ and $T'$ always exists. Since all orthants connect at the origin, any two trees $T$ and $T'$ can be connected by a two-segment path; this is called the cone path.
![Page 48: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/48.jpg)
[Figure: paths between trees in tree space, with labels a, b and c.]
Theorem (Billera, Holmes, Vogtmann (BHV, 2001)): Tree space with the BHV metric is a CAT(0) space, that is, it has non-positive curvature.
This implies there is a geodesic between any two trees (Gromov).
Note: this space of trees is not a Euclidean space.
![Page 49: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/49.jpg)
[Figure: paths between trees in tree space, with labels a, b and c.]
The size of the "pseudo-variance" can be estimated from $\sum p_i \, d(T_0, T_i)^2$.
Properties of the Fréchet mean of a set of trees have been studied (Bhattacharya et al. 2010; Miller, Mattingley, Owen, Marron, et al. 2013).
![Page 50: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/50.jpg)
Phylogenetic Trees
Malaria data as seen using ape
[Figure: phylogenetic tree with tips Pre1, Pme2, Plo6, Pga11, Pma3, Pbe5, Pfr7, Pkn8, Pcy9, Pvi10, Pfa4.]
![Page 51: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/51.jpg)
Sampling Distribution for Trees
[Figure: panel labelled "Data".]
![Page 52: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/52.jpg)
[Figure: panels labelled "Data" and "Treespace $T_n$".]
![Page 53: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/53.jpg)
[Figure: panel labelled "Data" and the "True Sampling Distribution".]
![Page 54: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/54.jpg)
[Figure: panel labelled "Data" and the "Bootstrap Sampling Distribution (non parametric)".]
![Page 55: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/55.jpg)
Bootstrap of Malaria Data
![Page 56: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/56.jpg)
Hierarchical Clustering Trees
[Figure: hierarchical clustering display with sample labels of the form HEA25_EFFE_3, MEL39_EFFE_2, HEA31_NAI_2, MEL51_MEM_5 and gene names as row labels.]
![Page 57: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/57.jpg)
[Figure: Eigenvalues of MDS for bootstrapped trees (axis from 0 to 120000).]
![Page 58: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/58.jpg)
[Figure: Bootstrapped trees. Planar plot of points labelled 1 to 100 and "o".]
![Page 59: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/59.jpg)
Part III
Combine and Compare Trees, Graphs and Contingent Count Data
![Page 60: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/60.jpg)
Layers of Data in the Microbiome
Joshua Lederberg: `the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space and have been all but ignored as determinants of health and disease'
Microbiome: Complete collection of genes contained in the genomes of microbes living in a given environment.
Numbers: Humans shelter 100 trillion microbes ($10^{14}$); we are made of $10 \times 10^{12}$ cells.
Metagenome: Composition of all genes present in an environment (soil, gut, seawater), regardless of species.
Transcriptome: The mRNA transcripts in the cell; they reflect the genes that are being actively expressed at any given time.
Metabolome: The metabolites (small molecules: nucleic or fatty acids, sugars, ...) present in the sample, either endogenous or exogenous (medication, pollution).
![Page 61: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/61.jpg)
Source: YK Lee and SK Mazmanian, Science, 2010.
![Page 62: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/62.jpg)
Bacteria etc... and Us
The human microbiome or human microbiota is the assemblage of microorganisms that reside on the surface and in deep layers of skin, in the saliva and oral mucosa, in the conjunctiva, and in the gastrointestinal tracts.
▶ They include bacteria, fungi, and archaea.
▶ Some of these organisms perform tasks that are useful for the human host (live in symbiosis).
▶ The majority have no known beneficial or harmful effect.
![Page 63: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/63.jpg)
Human Microbiome: What are the data?
DNA: The genomic material present (16S rRNA gene especially, but also shotgun).
RNA: What genes are being turned on (gene expression), transcriptomics.
Mass Spec: Specific signatures of chemical compounds present (LC/MS, GC/MS).
Clinical: Multivariate information about patients' clinical status, medication, weight.
Environmental: Location, nutrition, drugs, chemicals, temperature, time.
Domain Knowledge: Metabolic networks, phylogenetic trees, gene ontologies.
![Page 64: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/64.jpg)
Heterogeneous Data Objects
Object-oriented input and data manipulation with phyloseq (McMurdie and Holmes, 2013, PLoS ONE).
Object-oriented data in R:
Component data objects:
▶ OTU Abundance. class: otuTable (matrix); slots: .Data, speciesAreRows
▶ Sample Variables. class: sampleData (data.frame); slots: .Data, names, row.names, .S3Class
▶ Taxonomy Table. class: taxonomyTable (matrix); slots: .Data
▶ Phylogenetic Tree. class: phylo; slots: see ape package
Experiment-level data object:
▶ phyloseq. slots: otuTable, sampleData, taxTab, tre
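The idea of one experiment-level object built from component tables can be sketched outside R as well. The following is a minimal Python analogue, not the phyloseq API: the class name `Experiment`, its field names, and the choice to raise an error when the components disagree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Experiment:
    """Hypothetical analogue of an experiment-level object (not phyloseq itself)."""

    otu_table: Dict[str, Dict[str, int]]    # taxon -> sample -> read count
    sample_data: Dict[str, Dict[str, str]]  # sample -> variable -> value
    tax_table: Dict[str, Dict[str, str]]    # taxon -> rank -> name
    tree_tips: Optional[List[str]] = None   # tip labels of a phylogenetic tree

    def __post_init__(self) -> None:
        # The components must describe the same taxa and the same samples.
        taxa = set(self.otu_table)
        samples = {s for counts in self.otu_table.values() for s in counts}
        if set(self.tax_table) != taxa:
            raise ValueError("taxonomy table and OTU table disagree on taxa")
        if set(self.sample_data) != samples:
            raise ValueError("sample data and OTU table disagree on samples")
        if self.tree_tips is not None and set(self.tree_tips) != taxa:
            raise ValueError("tree tips and OTU table disagree on taxa")

    def sample_sums(self) -> Dict[str, int]:
        """Total read count per sample (the library size)."""
        totals = {s: 0 for s in self.sample_data}
        for counts in self.otu_table.values():
            for sample, k in counts.items():
                totals[sample] += k
        return totals


exp = Experiment(
    otu_table={"otu1": {"s1": 10, "s2": 0}, "otu2": {"s1": 5, "s2": 7}},
    sample_data={"s1": {"site": "gut"}, "s2": {"site": "skin"}},
    tax_table={"otu1": {"Phylum": "Firmicutes"}, "otu2": {"Phylum": "Bacteroidetes"}},
    tree_tips=["otu1", "otu2"],
)
print(exp.sample_sums())
```

Keeping the consistency checks in the constructor means every downstream distance or ordination can assume that counts, covariates, taxonomy and tree refer to the same taxa and samples.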
![Page 65: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/65.jpg)
Part IV
Heteroscedasticity: Mixtures and How to Normalize Them
Source: xkcd.
![Page 66: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/66.jpg)
Points are measured with unequal variance
[Figure: points x1, x2, x3, ..., xi, ..., xn measured with unequal variance.]
![Page 67: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/67.jpg)
Some real data (Caporaso et al., 2011)
> GlobalPatterns
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
> sample_sums(GlobalPatterns)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
864077 1135457 697509 1543451 2076476 718943 433894 186297
.....
NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
1478965 1652754 58688 493126 279704 937466 1211071 1216137
> summary(sample_sums(GlobalPatterns))
Min. 1st Qu. Median Mean 3rd Qu. Max.
58690 567100 1107000 1085000 1527000 2357000
![Page 68: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/68.jpg)
![Page 69: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/69.jpg)
Equalization of variances
In this binomial example the variance of the proportion estimate is $\operatorname{Var}(X/n) = \frac{pq}{n} = \frac{q}{n} E(X/n)$, a function of the mean.
This is a common occurrence and one that is traditionally dealt with in statistics by applying variance-stabilizing transformations. However, in order to find the right transformation, we need a good model for the error.
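As an illustration of what such a transformation does in the binomial case, the sketch below uses the classical arcsine square-root transform, which is a standard textbook choice and not necessarily the one intended on the slide. It computes exact variances by summing over the binomial distribution; the values of n and p are arbitrary.

```python
from math import asin, comb, sqrt


def moments(n, p, g):
    """Exact mean and variance of g(X/n) for X ~ Binomial(n, p)."""
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    vals = [g(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, vals))
    var = sum(w * (v - mean) ** 2 for w, v in zip(probs, vals))
    return mean, var


n = 200  # arbitrary sample size for the illustration
raw = {}
stab = {}
for p in (0.1, 0.3, 0.5):
    # Variance of the raw proportion: equals p*q/n, so it changes with p.
    raw[p] = moments(n, p, lambda x: x)[1]
    # Variance after the arcsine square-root transform: close to 1/(4n) for every p.
    stab[p] = moments(n, p, lambda x: asin(sqrt(x)))[1]
    print(f"p={p}: Var(X/n)={raw[p]:.5f}  Var(asin(sqrt(X/n)))={stab[p]:.5f}")
print("delta-method target 1/(4n) =", 1 / (4 * n))
```

The raw variances differ by more than a factor of two across the three values of p, while the transformed variances stay near the same constant, which is what "variance stabilizing" means.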
![Page 70: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/70.jpg)
Variance Stabilization
We prefer to deal with errors across samples which are independent and identically distributed, in particular with homoscedasticity (equal variances) across all the noise levels. This is not the case when we have unequal sample sizes and variations in the accuracy across instruments. A standard way of dealing with heteroscedastic noise is to try to decompose the sources of heterogeneity and apply transformations that make the noise variance almost constant. These are called variance stabilizing transformations.
![Page 71: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/71.jpg)
Mixture Modeling works Miracles
▶ Beta-Binomial (deepSNV).
▶ Zero-inflated Poisson or Gaussian.
▶ Gamma-Poisson.
Mixtures are ubiquitous because of a mathematical theorem: de Finetti's Theorem.
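To see why a mixture changes the mean-variance relationship, take the Gamma-Poisson entry in the list above. The statement below is the standard textbook result, written with a shape parameter $r$ and mean $\mu$ that are notation introduced here, not taken from the slides.

```latex
K \mid \lambda \sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}(\text{shape } r,\ \text{mean } \mu)
\;\Longrightarrow\;
\mathbb{E}(K) = \mu, \qquad
\operatorname{Var}(K)
  = \mathbb{E}\bigl[\operatorname{Var}(K \mid \lambda)\bigr]
  + \operatorname{Var}\bigl(\mathbb{E}[K \mid \lambda]\bigr)
  = \mu + \frac{\mu^{2}}{r}.
```

The marginal distribution of $K$ is negative binomial. The extra term $\mu^{2}/r$ is the overdispersion relative to a plain Poisson model, and it is this variance function that a variance-stabilizing transformation would have to account for.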
![Page 72: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/72.jpg)
Wolfgang Huber.
![Page 73: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/73.jpg)
Correct transformations are available
McMurdie and Holmes (2014), ``Waste Not, Want Not: Why rarefying microbiome data is inadmissible'', PLOS Computational Biology, Methods.
We propose to model the read counts. If technical replicates have the same number of reads $s_j$, they show Poisson variation with mean $\mu = s_j u_i$, where $u_i$ is the incidence proportion of taxon $i$. The number of reads for sample $j$ and taxon $i$ would be
$K_{ij} \sim \text{Poisson}(s_j u_i)$
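Under this Poisson model the estimated proportion $K_{ij}/s_j$ has variance $u_i/s_j$, so samples with different library sizes are measured with different precision. The sketch below plugs in the smallest and largest library sizes printed in the `sample_sums(GlobalPatterns)` excerpt; the incidence proportion `u_i` is a hypothetical value chosen only for illustration.

```python
from math import sqrt

# Library sizes s_j printed in the sample_sums excerpt (smallest and largest shown).
library_sizes = {"TRRsed1": 58688, "M11Fcsw": 2076476}
u_i = 1e-4  # hypothetical incidence proportion of one taxon


def proportion_sd(s_j, u):
    """Standard deviation of K_ij / s_j when K_ij ~ Poisson(s_j * u)."""
    return sqrt(u / s_j)


sds = {}
for name, s_j in library_sizes.items():
    sds[name] = proportion_sd(s_j, u_i)
    print(f"{name}: expected count {s_j * u_i:.1f}, sd of proportion {sds[name]:.2e}")

# Ratio of standard deviations between the shallowest and deepest sample.
ratio = sds["TRRsed1"] / sds["M11Fcsw"]
print(f"sd ratio: {ratio:.2f}")
```

The same taxon at the same true proportion is therefore estimated several times less precisely in the shallow sample than in the deep one.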
![Page 74: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/74.jpg)
A distance on the known tree
Monge-Kantorovich earth mover's distance on the tree. Used to compare two samples or body sites, for instance. Incorporates taxa abundances and the phylogenetic tree.
[Figure: phylogenetic tree with tips labelled by genus (Epulopiscium, Clostridium, Adlercreutzia, Lachnospira, Alistipes, Roseburia, Coprococcus, Blautia, Dehalobacterium, Coprobacillus, Moryella), with legends for Abundance (1, 25, 625) and Class (Actinobacteria (class), Bacilli, Bacteroidia, Clostridia, Erysipelotrichi, Gammaproteobacteria, Mollicutes, Verrucomicrobiae, YS2).]
Duality diagram methods that can use any dependency structure.
![Page 75: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/75.jpg)
UniFrac Distance (Lozupone and Knight, 2005)
UniFrac is a distance between groups of organisms that are related to each other by a tree. Suppose we have the OTUs present in sample 1 (blue) and in sample 2 (red). Question: Do the two samples differ phylogenetically? It is defined as the ratio of the sum of the lengths of the branches leading to members of group A or members of group B but not both to the total branch length of the tree.
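The definition can be sketched on a toy tree. In this minimal illustration each branch is stored as a pair (length, set of tips below it); the tree, the tip names and the two samples are made up, and the denominator is the total branch length of the whole tree, following the wording on the slide.

```python
def unifrac(branches, present_a, present_b):
    """Unweighted UniFrac: unshared branch length divided by total branch length.

    branches: list of (length, set of tip labels descending from the branch).
    present_a, present_b: sets of tip labels observed in each sample.
    """
    unshared = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & present_a)
        in_b = bool(tips & present_b)
        total += length
        if in_a != in_b:  # branch leads to exactly one of the two groups
            unshared += length
    return unshared / total


# Toy tree ((t1,t2),(t3,t4)): four tip branches and two internal branches.
toy_tree = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (2.0, {"t1", "t2"}),
    (2.0, {"t3", "t4"}),
]
sample_a = {"t1", "t2", "t3"}
sample_b = {"t3", "t4"}
d = unifrac(toy_tree, sample_a, sample_b)
print(d)  # unshared length 5 out of total length 8
```

Only presence or absence enters the computation, which is why rare, noisily detected OTUs can have a large influence on this distance.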
![Page 76: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/76.jpg)
Weighted UniFrac distance
A modification of UniFrac, weighted UniFrac is defined in (Lozupone et al., 2007) as
$\sum_{i=1}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$
▶ $n$ = number of branches in the tree
▶ $b_i$ = length of the $i$th branch
▶ $A_i$ = number of descendants of the $i$th branch in group A
▶ $A_T$ = total number of sequences in group A
▶ $B_i$ and $B_T$ are defined in the same way for group B
[7], [6].
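The weighted formula can be sketched on a small example. The toy tree and the read counts below are made up for illustration; each branch is stored as (length $b_i$, set of tips below it), and $A_i$, $B_i$ are obtained by summing the counts of those tips.

```python
def weighted_unifrac(branches, counts_a, counts_b):
    """Sum over branches of b_i * |A_i/A_T - B_i/B_T|.

    branches: list of (length, set of tip labels descending from the branch).
    counts_a, counts_b: dicts mapping tip label -> number of sequences.
    """
    a_total = sum(counts_a.values())
    b_total = sum(counts_b.values())
    distance = 0.0
    for length, tips in branches:
        a_i = sum(counts_a.get(t, 0) for t in tips)
        b_i = sum(counts_b.get(t, 0) for t in tips)
        distance += length * abs(a_i / a_total - b_i / b_total)
    return distance


# Toy tree ((t1,t2),(t3,t4)): four tip branches and two internal branches.
toy_tree = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (2.0, {"t1", "t2"}),
    (2.0, {"t3", "t4"}),
]
counts_a = {"t1": 6, "t2": 2, "t3": 2}
counts_b = {"t3": 5, "t4": 5}
w = weighted_unifrac(toy_tree, counts_a, counts_b)
print(round(w, 6))
```

Because each branch contributes in proportion to the abundance difference, a few very abundant OTUs can dominate this distance.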
![Page 77: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/77.jpg)
Costello et al. 2010
![Page 78: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/78.jpg)
Rao's Distance
We start with a distance between individuals. The heterogeneity of a population ($H_i$) is the average distance between members of that population. The heterogeneity between two populations ($H_{ij}$) is the average distance between a member of population $i$ and a member of population $j$. The distance between two populations is
$D_{ij} = H_{ij} - \frac{1}{2}(H_i + H_j)$
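A small numerical sketch of this definition follows. It assumes that each population is described by a vector of type frequencies and that "average distance" means the expected distance between two members drawn independently (with replacement); the three types and their distances are made up.

```python
def heterogeneity(p, q, dist):
    """Average distance between a member drawn from p and a member drawn from q."""
    return sum(
        p[a] * q[b] * dist[a][b] for a in range(len(p)) for b in range(len(q))
    )


def rao_distance(p, q, dist):
    """D_ij = H_ij - (H_i + H_j) / 2."""
    h_pq = heterogeneity(p, q, dist)
    return h_pq - 0.5 * (heterogeneity(p, p, dist) + heterogeneity(q, q, dist))


# Three types placed on a line at positions 0, 1, 2 (hypothetical distances).
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
pop_1 = [0.5, 0.5, 0.0]
pop_2 = [0.0, 0.5, 0.5]
d_12 = rao_distance(pop_1, pop_2, dist)
print(d_12)
```

Subtracting the within-population heterogeneities is what makes the distance from a population to itself equal to zero.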
![Page 79: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/79.jpg)
Decomposition of Diversity
If we have populations $1, \dots, k$ with frequencies $\pi_1, \dots, \pi_k$, then the diversity of all the populations together is
$H_0 = \sum_{i=1}^{k} \pi_i H_i + \sum_i \sum_j \pi_i \pi_j D_{ij} = H(w) + D(b)$
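The decomposition follows directly from the definition $D_{ij} = H_{ij} - \frac{1}{2}(H_i + H_j)$. The short derivation below assumes that the pooled diversity $H_0$ is the average distance between two members drawn from the mixture of the $k$ populations, and that $H_{ii} = H_i$.

```latex
\begin{aligned}
H_0 &= \sum_{i}\sum_{j} \pi_i \pi_j H_{ij}, \qquad H_{ii} = H_i, \\
\sum_{i}\sum_{j} \pi_i \pi_j D_{ij}
  &= \sum_{i}\sum_{j} \pi_i \pi_j H_{ij}
     - \tfrac{1}{2} \sum_{i}\sum_{j} \pi_i \pi_j \,(H_i + H_j)
   = H_0 - \sum_{i=1}^{k} \pi_i H_i, \\
\text{so} \quad
H_0 &= \sum_{i=1}^{k} \pi_i H_i + \sum_{i}\sum_{j} \pi_i \pi_j D_{ij}.
\end{aligned}
```

The second line uses $\sum_i \pi_i = 1$. The first term is the within-population part $H(w)$ and the second is the between-population part $D(b)$.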
![Page 80: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/80.jpg)
Double Principal Coordinate Analysis
Pavoine, Dufour and Chessel (2004), Purdom (2010) and Fukuyama et al. (2011).
Suppose we have n species in p locations and a (Euclidean) matrix ∆ giving the squares of the pairwise distances between the species. Then we can
▶ Use the distances between species to find an embedding in (n − 1)-dimensional space such that the Euclidean distances between the species are the same as the distances between the species defined in ∆.
▶ Place each of the p locations at the barycenter of its species profile. The Euclidean distances between the locations will be the same as the square root of the Rao dissimilarity between them.
▶ Use PCA to find a lower-dimensional representation of the locations.
Give the species and communities coordinates such that the inertia decomposes the same way the diversity does.
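The barycenter step can be checked numerically. The sketch below skips the embedding and PCA steps: it assumes the species coordinates are already known (four made-up points in the plane), builds ∆ from them, and verifies that the squared Euclidean distance between two location barycenters equals the Rao dissimilarity computed from ∆. The species coordinates and location profiles are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def barycenter(profile, coords):
    """Weighted mean of the species coordinates, weights = species frequencies."""
    dim = len(coords[0])
    return [sum(w * c[d] for w, c in zip(profile, coords)) for d in range(dim)]


def rao_dissimilarity(p, q, delta):
    """H_pq - (H_p + H_q) / 2, with heterogeneities averaged over delta."""

    def h(u, v):
        return sum(
            u[a] * v[b] * delta[a][b] for a in range(len(u)) for b in range(len(v))
        )

    return h(p, q) - 0.5 * (h(p, p) + h(q, q))


# Hypothetical species coordinates (the result of the embedding step).
species = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
# Matrix of squared pairwise distances between species.
delta = [[sq_dist(x, y) for y in species] for x in species]

# Species frequency profiles of two locations (each sums to 1).
loc_a = [0.5, 0.3, 0.2, 0.0]
loc_b = [0.1, 0.1, 0.3, 0.5]

lhs = sq_dist(barycenter(loc_a, species), barycenter(loc_b, species))
rhs = rao_dissimilarity(loc_a, loc_b, delta)
print(lhs, rhs)
```

The two numbers agree up to floating-point error, so the Euclidean distance between barycenters is the square root of the Rao dissimilarity, as stated in the second step.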
![Page 81: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/81.jpg)
Fukuyama and Holmes, 2012.
| Method | Original description | New formula | Properties |
|---|---|---|---|
| DPCoA | Square root of Rao's distance based on the square root of the patristic distances | $\left[\sum_i b_i (A_i/A_T - B_i/B_T)^2\right]^{1/2}$ | Most sensitive to outliers, least sensitive to noise, upweights deep differences, gives OTU locations |
| wUniFrac | $\sum_i b_i \lvert A_i/A_T - B_i/B_T \rvert$ | $\sum_i b_i \lvert A_i/A_T - B_i/B_T \rvert$ | Less sensitive to outliers and more sensitive to noise than DPCoA |
| UniFrac | Fraction of branches leading to exactly one group | $\sum_i b_i \, \mathbf{1}\left\{ \frac{\lvert A_i/A_T - B_i/B_T \rvert}{A_i/A_T + B_i/B_T} \ge 1 \right\}$ | Sensitive to noise, upweights shallow differences on the tree |

Summary of the methods under consideration. ``Outliers'' refers to highly abundant OTUs, and noise refers to noise in detecting low-abundance OTUs (see the text for more detail).
![Page 82: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/82.jpg)
Antibiotic Time Course Data
Measurements of about 2500 different bacterial OTUs from stool samples of three patients (D, E, F).
Each patient was sampled ∼ 50 times during the course of treatment with ciprofloxacin (an antibiotic).
Times are categorized as Pre Cp, 1st Cp, 1st WPC (week post cipro), Interim, 2nd Cp, 2nd WPC, and Post Cp.
![Page 83: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/83.jpg)
[Figure: three PCoA/MDS panels, points marked by subject (D, E, F). Panel titles and axes: UniFrac (Axis 1: 14.7%, Axis 2: 10.3%); weighted UniFrac (Axis 1: 47.6%, Axis 2: 12.3%); weighted UF on presence/absence (Axis 1: 32.7%, Axis 2: 15.1%).]
Comparing the UniFrac variants. From left to right: PCoA/MDS with unweighted UniFrac, with weighted UniFrac, and with weighted UniFrac performed on presence/absence data extracted from the abundance data used in the other two plots.
![Page 84: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/84.jpg)
[Figure: three panels. (a) MDS of OTUs (Axis 1: 6.2%, Axis 2: 3.7%); (b) DPCoA community plot (Axis 1: 40.9%, Axis 2: 13.3%), points marked by subject (D, E, F); (c) DPCoA OTU plot (axes CS1, CS2). Phylum legend: 4C0d−2, Actinobacteria, Bacteroidetes, Candidate division TM7, Cyanobacteria, Firmicutes, Fusobacteria, Lentisphaerae, Proteobacteria, Synergistetes, Verrucomicrobia.]
(a) PCoA/MDS of the OTUs based on the patristic distance, (b) community and (c) species points for DPCoA after removing two outlying species.
![Page 85: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/85.jpg)
Antibiotic Stress
We next want to visualize the effect of the antibiotic. Ordinations of the communities from DPCoA and UniFrac are annotated with whether the community was stressed or not stressed (pre cipro, interim, and post cipro were considered ``not stressed'', while first cipro, first week post cipro, second cipro, and second week post cipro were considered ``stressed'').
We see that for UniFrac, the first axis seems to separate the stressed communities from the not stressed communities. DPCoA also seems to separate out the stressed communities along the first axis (in the direction associated with Bacteroidetes), although only for subjects D and E.
![Page 86: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/86.jpg)
[Figure: scatter plot, Axis1 versus Axis2. Points are marked by antibiotic stress (1: not stressed, 2: stressed) and by subject (D, E, F); group labels D 1, D 2, E 1, E 2, F 1, F 2.]
PCoA/MDS with unweighted UniFrac. The labels represent subject plus antibiotic condition.
![Page 87: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/87.jpg)
[Figure: scatter plot, Axis1 versus Axis2, with group labels D 1, D 2, E 1, E 2, F 1, F 2.]
Community points as represented by DPCoA. The labels represent subject plus antibiotic condition.
![Page 88: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/88.jpg)
Conclusions for Antibiotic Stress
Since UniFrac emphasizes shallow differences on the tree and since PCoA/MDS with UniFrac seems to separate the subjects from each other better than the other two methods, we can conclude that the differences between subjects are mainly shallow ones. However, DPCoA also separates the subjects and the stressed versus non-stressed communities, and examining the community and OTU ordinations can tell us about the differences in the compositions of these communities.
![Page 89: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/89.jpg)
Distances enable statisticians to....
▶ Summarize data with medians, means and principal directions.
▶ Encode some variations in uncertainty.
▶ Make comparisons of heterogeneous sources of information.
▶ Integrate network and tree information.
▶ Measure diversity, inertia and generalize the notion of variance.
![Page 90: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/90.jpg)
Questions for mathematicians?
▶ How to make a method designed for uniformly distributed points work for points generated by mixtures of heterogeneous distributions?
Examples from work by Edelsbrunner, Carlsson, Zomorodian and co-authors.
Source: Zomorodian.
![Page 91: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/91.jpg)
Questions for mathematicians
▶ How to build distances between images that account for unequal measurement errors, even locally?
[Figure: points x1, x2, x3, ..., xi, ..., xn measured with unequal variance.]
Work by Adler, Taylor and Worsley (2003, 2005, 2007) using Random Fields.
![Page 92: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/92.jpg)
Questions for mathematicians
▶ How well can the Euclidean embedding approximations do?
▶ Are there better ways of approximating the commutative diagrams? This is also an important point of contact with the use of Stein's method in probability theory.
![Page 93: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/93.jpg)
![Page 94: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/94.jpg)
Questions for mathematicians
▶ How to distinguish between the effect of the curvature of a state space and the effect of the unequal sampling?
![Page 95: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/95.jpg)
Answers come from Differential Geometry.
Xavier Pennec, Yann Ollivier, Tom Fletcher, Rabi Bhattacharya.
In particular, these tools enable us to incorporate the relevant data-dependent transformations into localized metrics.
![Page 96: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/96.jpg)
Output showing posterior uncertainty measures
![Page 97: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/97.jpg)
Benefitting from the tools and schools of Statisticians.......
Thanks to the R community:
▶ RStudio for tools for reproducible research and Hadley Wickham for ggplot2.
▶ Ecologists and biologists: Chessel, Jombart, Dray, Thioulouse (ade4) and Emmanuel Paradis (ape).
Collaborators: David Relman, Alfred Spormann, Yves Escoufier, Les Dethlefsen, Justin Sonnenburg, Persi Diaconis, Sergio Bacallado, Elizabeth Purdom.
![Page 98: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/98.jpg)
Lab Group
Postdoctoral Fellows: Paul (Joey) McMurdie, Ben Callahan, Simon Rubinstein-Salzedo, Christof Seiler.
Students: John Chakerian, Julia Fukuyama, Kris Sankaran.
Funding from NIH/NIGMS R01, NSF-VIGRE and NSF-DMS.
![Page 99: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/99.jpg)
References
L. Billera, S. Holmes, and K. Vogtmann. The geometry of tree space. Adv. Appl. Maths, 771--801, 2001.
J. Chakerian and S. Holmes. distory: Distances between trees, 2010.
Daniel Chessel, Anne Dufour, and Jean Thioulouse. The ade4 package - I: One-table methods. R News, 4(1):5--10, 2004.
P. Diaconis, S. Goel, and S. Holmes. Horseshoes in multidimensional scaling and kernel methods. Annals of Applied Statistics, 2007.
Y. Escoufier. Operators related to a data matrix. In J. R. Barra et al., editor, Recent Developments in Statistics, pages 125--131. North Holland, 1977.
![Page 100: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/100.jpg)
Steven N. Evans and Frederick A. Matsen. The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples. arXiv, q-bio.PE, Jan 2010.
M. Hamady, C. Lozupone, and R. Knight. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal, Jan 2009.
Susan Holmes. Multivariate analysis: The French way. In D. Nolan and T. P. Speed, editors, Probability and Statistics: Essays in Honor of David A. Freedman, volume 56 of IMS Lecture Notes--Monograph Series. IMS, Beachwood, OH, 2006.
Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics.
![Page 101: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/101.jpg)
Journal of Computational and Graphical Statistics, 5(3):299--314, 1996.
K. Mardia, J. Kent, and J. Bibby. Multivariate Analysis. Academic Press, NY, 1979.
P. J. McMurdie and S. Holmes. Phyloseq: Reproducible research platform for bacterial census data. PLoS ONE, April 22, 2013.
Serban Nacu, Rebecca Critchley-Thorne, Peter Lee, and Susan Holmes. Gene expression network analysis and applications to immunology. Bioinformatics, 23(7):850--8, Apr 2007.
Sandrine Pavoine, Anne-Béatrice Dufour, and Daniel Chessel.
![Page 102: ABabcdfghiejkl · 2020. 4. 24. · Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way](https://reader034.vdocuments.net/reader034/viewer/2022051806/5ffae1e595c98c368b217ae0/html5/thumbnails/102.jpg)
From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology, 228(4):523--537, 2004.
Elizabeth Purdom. Analysis of a data matrix and a graph: Metagenomic data and the phylogenetic tree. Annals of Applied Statistics, Jul 2010.
C. R. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329--359, 1964.