Development of OTU Analysis in NutriGen
Integrating OTU data with other NutriGen Data
Mateen Shaikh and Joseph Beyene
McMaster University
December 19 2014
Mateen and Joseph (McMaster) Development of OTU Analysis in NutriGen December 19 2014 1 / 18
TOC
Background
    The Data’s Context
    Investigations
Differential Abundance Tests
    Permutation
Various Linear Models
    Candidates
    Exemplifying Results
    Statistical issues
Next Steps
    . . . In this framework
Background The Data’s Context
- ≈ 250 infants contributed microbiome samples from CHILD (processed)
- ≈ 180 infants contributed microbiome samples from START (processing)
- Methods developed from the START samples
- Continuing from the work already completed by Mike Surette’s Lab (JS and MS)
- Picking up at the OTU table
Background Investigations
Goals
Determine relationships between the microbiome and
- Changes in breastfeeding
- Mother’s GDM
- Diet
- Other health outcomes (adiposity, asthma, etc.)
- Introduction of (types of) foods
- Integration with other large data types (genotype, methylation, expression)
Sample from CHILD
     SA10   SA11   SA12   SA13   SA14   SA15   SA16   SA17   SA18   SA19
 1   8933   1967   8145   1423   4035   4468   5174  12909   1763   1046
 2   2321   2148   3708   1655    226   6007   2190   5276   1529   1284
 3    352     88    135     28    867   2452   4069   9971     87   2381
 4      1      2      2   2274      1      1   9198      3   2473      0
 5     72    114    159    165      0   1262    360     63      0     95
 6      0     81      0      0      0      0   1353      2      0      0
 7      0     13      0      0      0      0      0      1      0      0
 8      0      2      0      1      0      1     79      4      0      0
 9      0      0      0      0      0      0      0      0      0      0
10      0      0      1      0      0      0      0      0      0      0
11      0      0      0      0      0      0      0      0      0      0
12      0      0      0      0      0      0      0      0      0      0
13      0      0      0      0      0      0      0      0      0      0
> mean(otutable==0)
[1] 0.9776366
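The R call above reports that roughly 97.8% of the cells in the full OTU table are zero. As a minimal sketch, the same proportion can be computed in plain Python. Here it is applied only to the 13 × 10 excerpt printed on this slide, so the result describes the excerpt, not the full table. The names `excerpt` and `zero_fraction` are illustrative and not part of the original analysis.

```python
# Rows are OTUs 1-13, columns are samples SA10-SA19 (values copied from the slide).
excerpt = [
    [8933, 1967, 8145, 1423, 4035, 4468, 5174, 12909, 1763, 1046],
    [2321, 2148, 3708, 1655, 226, 6007, 2190, 5276, 1529, 1284],
    [352, 88, 135, 28, 867, 2452, 4069, 9971, 87, 2381],
    [1, 2, 2, 2274, 1, 1, 9198, 3, 2473, 0],
    [72, 114, 159, 165, 0, 1262, 360, 63, 0, 95],
    [0, 81, 0, 0, 0, 0, 1353, 2, 0, 0],
    [0, 13, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 2, 0, 1, 0, 1, 79, 4, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def zero_fraction(table):
    """Proportion of cells equal to zero, like mean(otutable == 0) in R."""
    cells = [value for row in table for value in row]
    return sum(value == 0 for value in cells) / len(cells)


print(round(zero_fraction(excerpt), 3))
```

The excerpt is much less sparse than the full table, which suggests that most of the zeros sit in OTUs beyond the first few rows.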
Differential Abundance Tests Permutation
Permutation Tests
- Simple for a few categorical variables
- Prefer a quantile-based measure, because of heavy positive skew, but choosing a quantile (like the median) can be problematic
- Fairly conservative but provides p-values nonetheless
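A permutation test on a quantile-based statistic, as described in the bullets above, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the choice of the 0.75 quantile, the nearest-rank quantile rule, the add-one p-value correction, and all names and counts are hypothetical.

```python
import random


def empirical_quantile(values, q):
    """Nearest-rank empirical quantile of a list of counts."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def permutation_pvalue(counts, in_group, q=0.75, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a between-group difference in the q-quantile."""
    rng = random.Random(seed)

    def statistic(labels):
        inside = [c for c, flag in zip(counts, labels) if flag]
        outside = [c for c, flag in zip(counts, labels) if not flag]
        return abs(empirical_quantile(inside, q) - empirical_quantile(outside, q))

    observed = statistic(in_group)
    shuffled = list(in_group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between group label and count
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts for one OTU in 20 samples (not NutriGen data).
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 50, 60, 0, 70, 80, 90, 0, 100, 110, 120]
in_group = [True] * 10 + [False] * 10
print(permutation_pvalue(counts, in_group))
```

With a zero-heavy OTU the median of both groups is often 0, so the statistic cannot separate them. That is one way the choice of quantile becomes problematic, and an upper quantile is used here for that reason.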
GDM
otu#  otu                             pval
  17  Bacteroidaceae; g Bacteroides   0.005
  91  ostridiales; f Lachnospiraceae  0.020
 564  ostridiales; f Lachnospiraceae  0.024
  46  ostridiales; f Lachnospiraceae  0.025
  18  Lachnospiraceae; g Lachnospira  0.030
 313  acteriaceae; g Bifidobacterium  0.032
 248  ostridiales; f Lachnospiraceae  0.040
  76  ococcaceae; g Faecalibacterium  0.046
 154  Lachnospiraceae; g Lachnospira  0.049
 408  ostridiales; f Lachnospiraceae  0.051
  87  ostridiales; f Lachnospiraceae  0.060
 153  ostridiales; f Lachnospiraceae  0.068
 207  acteriaceae; g Bifidobacterium  0.069
 464  ctinomycetaceae; g Actinomyces  0.071
  10  acteriaceae; g Bifidobacterium  0.072
 263  ostridiales; f Lachnospiraceae  0.075
  45  acteriaceae; g Bifidobacterium  0.081
 410  ostridiales; f Lachnospiraceae  0.082
  82  teriales; f Bifidobacteriaceae  0.083
   6  acteriaceae; g Bifidobacterium  0.085
 109  c Clostridia; o Clostridiales   0.098
  24  omonadaceae; g Parabacteroides  0.103
   7  ococcaceae; g Faecalibacterium  0.104
 397  inobacteria; o Actinomycetales  0.113
 264  ostridiales; f Lachnospiraceae  0.114
 220  ococcaceae; g Faecalibacterium  0.121
  70  eriales; f Alcaligenaceae; g    0.122
 100  ostridiales; f Lachnospiraceae  0.126
 368  Root; p Firmicutes              0.136
  12  acteriaceae; g Bifidobacterium  0.146
Still Breast Feeding
otu#  otu                             pval
  21  Veillonellaceae; g Veillonella  <2e-16
  64  ostridiales; f Veillonellaceae  <2e-16
  45  acteriaceae; g Bifidobacterium  0.001
 112  c Clostridia; o Clostridiales   0.003
  13  acteriaceae; g Bifidobacterium  0.004
  11  erobacteriaceae; g Escherichia  0.006
  35  nterococcaceae; g Enterococcus  0.006
  38  Veillonellaceae; g Veillonella  0.008
  40  teriales; f Enterobacteriaceae  0.018
  12  acteriaceae; g Bifidobacterium  0.024
  41  Veillonellaceae; g Dialister    0.036
 239  ococcaceae; g Faecalibacterium  0.071
  76  ococcaceae; g Faecalibacterium  0.133
   6  acteriaceae; g Bifidobacterium  0.134
  86  acteriaceae; g Bifidobacterium  0.167
   7  ococcaceae; g Faecalibacterium  0.174
   4  acteriaceae; g Bifidobacterium  0.205
 176  nterococcaceae; g Enterococcus  0.338
 220  ococcaceae; g Faecalibacterium  0.368
 109  c Clostridia; o Clostridiales   0.376
  92  s; f Micrococcaceae; g Rothia   0.417
  43  ostridiales; f Ruminococcaceae  0.434
  36  Bacteroidaceae; g Bacteroides   0.439
  22  ucomicrobiaceae; g Akkermansia  0.533
  47  eptococcaceae; g Streptococcus  0.605
  18  Lachnospiraceae; g Lachnospira  0.780
  17  Bacteroidaceae; g Bacteroides   0.864
   1  ostridiales; f Lachnospiraceae  1.000
   2  f Lachnospiraceae; g Blautia    1.000
   3  achnospiraceae; g Ruminococcus  1.000
Delivery (V/CS)
otu#  otu                             pval
 186  Root; p Firmicutes              0.025
  23  Bacteroidaceae; g Bacteroides   0.057
  36  Bacteroidaceae; g Bacteroides   0.066
  14  ostridiales; f Lachnospiraceae  0.080
 105  Ruminococcaceae; g Clostridium  0.088
  20  Bacteroidaceae; g Bacteroides   0.119
 145  c Clostridia; o Clostridiales   0.123
  72  Bacteroidaceae; g Bacteroides   0.124
  17  Bacteroidaceae; g Bacteroides   0.203
 320  c Clostridia; o Clostridiales   0.212
 121  Clostridiaceae; g Clostridium   0.256
  32  es; f Erysipelotrichaceae; g    0.257
  42  ; f Clostridiaceae; g Sarcina   0.337
 217  tinobacteria; c Actinobacteria  0.405
   4  acteriaceae; g Bifidobacterium  0.466
  16  riobacteriaceae; g Collinsella  0.490
  53  f Lachnospiraceae; g Blautia    0.532
  21  Veillonellaceae; g Veillonella  0.552
   1  ostridiales; f Lachnospiraceae  1.000
   2  f Lachnospiraceae; g Blautia    1.000
   3  achnospiraceae; g Ruminococcus  1.000
   5  eptococcaceae; g Streptococcus  1.000
   6  acteriaceae; g Bifidobacterium  1.000
   7  ococcaceae; g Faecalibacterium  1.000
   8  lostridiales; f Clostridiaceae  1.000
   9  lostridiales; f Clostridiaceae  1.000
  10  acteriaceae; g Bifidobacterium  1.000
  11  erobacteriaceae; g Escherichia  1.000
  12  acteriaceae; g Bifidobacterium  1.000
  13  acteriaceae; g Bifidobacterium  1.000
Various Linear Models Candidates
- Poisson regression for count variables
- Issues with model assumptions and fit
- Some strategies to mitigate these
- Handles more complex relationships (non-binary independent variables)
- p-values can be misleadingly low!
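To illustrate how a Poisson p-value can be misleadingly low, the sketch below fits a Poisson log-linear model with one binary covariate, for which the estimates have a closed form (group means). The counts are those of the example OTU tabulated on the "Statistical issues" slide, read as value/frequency pairs by breastfeeding status; that reading is an assumption. The dispersion-scaled (quasi-Poisson style) standard error is shown as one common mitigation and is not necessarily the strategy the authors used. All function names are illustrative.

```python
import math


def poisson_binary_fit(group1, group0):
    """Closed-form Poisson log-linear fit with a single binary covariate.

    Returns the log rate ratio, its model-based standard error, and the
    Pearson dispersion estimate. Both groups need a positive total count.
    """
    n1, n0 = len(group1), len(group0)
    s1, s0 = sum(group1), sum(group0)
    mean1, mean0 = s1 / n1, s0 / n0
    beta = math.log(mean1 / mean0)
    se = math.sqrt(1 / s1 + 1 / s0)
    pearson = (sum((y - mean1) ** 2 / mean1 for y in group1)
               + sum((y - mean0) ** 2 / mean0 for y in group0))
    dispersion = pearson / (n1 + n0 - 2)
    return beta, se, dispersion


def wald_pvalue(beta, se):
    """Two-sided normal-theory p-value for beta / se."""
    return math.erfc(abs(beta / se) / math.sqrt(2))


# Assumed reading of the slide's table: (value, frequency) pairs per group.
bf = [0] * 91 + [5315]
not_bf = [0] * 117 + [1, 9]

beta, se, dispersion = poisson_binary_fit(bf, not_bf)
naive_p = wald_pvalue(beta, se)
scaled_p = wald_pvalue(beta, se * math.sqrt(dispersion))
print(naive_p, dispersion, scaled_p)
```

Under these assumptions the model-based p-value is astronomically small even though the whole signal comes from a single sample with 5315 reads, while the dispersion-scaled p-value is unremarkable.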
Overdispersion \ Zeroes   GLM   Hurdle   Zero-Inflated
Poisson                    •      •           •
Negative Binomial          •      •           •
- All models use canonical link
- When variable is binary, results are comparable to permutation tests
- Run into problems fitting the more flexible models
- Issue with quality of fit on all models (assumption violations, some gross)
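The difference between the Hurdle and Zero-Inflated columns of the grid can be made concrete with the Poisson row. The sketch below uses the textbook definitions. The parameter names `pi` and `p_zero` are illustrative, and the negative binomial row would replace the Poisson mass function with the NB one.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def zero_inflated_poisson_pmf(k, lam, pi):
    """Mixture: a structural zero with probability pi, otherwise Poisson(lam).

    Zeros can come from either part of the mixture.
    """
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_poisson_pmf(k, lam, p_zero):
    """Two-part model: zero with probability p_zero, otherwise a
    zero-truncated Poisson(lam). All zeros come from the first part.
    """
    if k == 0:
        return p_zero
    return (1 - p_zero) * poisson_pmf(k, lam) / (1 - math.exp(-lam))


# Hypothetical parameters chosen to mimic a zero-heavy OTU.
print(zero_inflated_poisson_pmf(0, 3.0, 0.9), hurdle_poisson_pmf(0, 3.0, 0.9))
```

Both variants add a parameter for the excess zeros, which is one reason the more flexible models in the grid are harder to fit on sparse OTUs.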
Model Selection
Quantitatively
- Two criteria (and problems):
    - Choose between models (different methods)
    - Quality of model (what if all available models fit poorly)
- For the first, various principle-of-parsimony heuristics are applicable
- For the second, deviance might work
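Both criteria above can be sketched on a single OTU. AIC is used here as one example of a parsimony heuristic for choosing between an intercept-only Poisson and negative binomial fit, and the Poisson deviance per residual degree of freedom serves as a rough quality-of-fit measure. The counts are hypothetical, the grid search for theta is a crude stand-in for a proper maximum likelihood fit, and whether the deck's "deviance ratio" is defined this way is an assumption.

```python
import math


def poisson_loglik(counts, mu):
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def negbin_loglik(counts, mu, theta):
    # NB in the (mu, theta) form: variance = mu + mu**2 / theta.
    return sum(
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
        for y in counts
    )


def poisson_deviance(counts, mu):
    """Twice the log-likelihood gap to the saturated model (0 * log 0 = 0)."""
    total = 0.0
    for y in counts:
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2 * total


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


# Hypothetical, heavily skewed counts for one OTU (not NutriGen data).
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 35, 0, 0, 120, 0, 3, 0, 0, 0]
mu = sum(counts) / len(counts)  # intercept-only fit: the sample mean

# Crude grid search for theta with mu held at the sample mean.
theta_grid = [0.01 * 1.2 ** i for i in range(60)]
theta = max(theta_grid, key=lambda t: negbin_loglik(counts, mu, t))

aic_poisson = aic(poisson_loglik(counts, mu), 1)
aic_negbin = aic(negbin_loglik(counts, mu, theta), 2)
deviance_per_df = poisson_deviance(counts, mu) / (len(counts) - 1)
print(aic_poisson, aic_negbin, deviance_per_df)
```

A lower AIC only says one candidate is better than the other. A deviance per degree of freedom far above 1 is what signals that the Poisson fit itself is poor.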
Various Linear Models Exemplifying Results
[Figure: three histograms from poisson regression (y-axis: Frequency). Panels: "p-values from poisson regression" (x-axis: p-values, 0 to 1); "Log deviance ratios from poisson regression" (x-axis: Log-deviance ratios, −2 to 10); "Deviance-based areas from poisson regression" (x-axis: Tail areas, 0 to 1).]
otu#  otu                             pval
   1  ostridiales; f Lachnospiraceae  < 2.22e-16
   2  f Lachnospiraceae; g Blautia    < 2.22e-16
   3  achnospiraceae; g Ruminococcus  < 2.22e-16
   4  acteriaceae; g Bifidobacterium  < 2.22e-16
   5  eptococcaceae; g Streptococcus  < 2.22e-16
   6  acteriaceae; g Bifidobacterium  < 2.22e-16
   8  lostridiales; f Clostridiaceae  < 2.22e-16
   9  lostridiales; f Clostridiaceae  < 2.22e-16
  10  acteriaceae; g Bifidobacterium  < 2.22e-16
  11  erobacteriaceae; g Escherichia  < 2.22e-16
  13  acteriaceae; g Bifidobacterium  < 2.22e-16
  14  ostridiales; f Lachnospiraceae  < 2.22e-16
  15  ostridiales; f Lachnospiraceae  < 2.22e-16
  16  riobacteriaceae; g Collinsella  < 2.22e-16
  17  Bacteroidaceae; g Bacteroides   < 2.22e-16
  18  Lachnospiraceae; g Lachnospira  < 2.22e-16
  19  ococcaceae; g Faecalibacterium  < 2.22e-16
  21  Veillonellaceae; g Veillonella  < 2.22e-16
  22  ucomicrobiaceae; g Akkermansia  < 2.22e-16
  23  Bacteroidaceae; g Bacteroides   < 2.22e-16
  24  omonadaceae; g Parabacteroides  < 2.22e-16
  25  ostridiales; f Lachnospiraceae  < 2.22e-16
  26  ostridiales; f Lachnospiraceae  < 2.22e-16
  27  acteriaceae; g Bifidobacterium  < 2.22e-16
  28  Bacteroidaceae; g Bacteroides   < 2.22e-16
  30  treptococcaceae; g Lactococcus  < 2.22e-16
  31  ipelotrichaceae; g Clostridium  < 2.22e-16
  32  es; f Erysipelotrichaceae; g    < 2.22e-16
  33  uminococcaceae; g Ruminococcus  < 2.22e-16
  34  bacillales; f Streptococcaceae  < 2.22e-16
[Figure: three histograms from NB regression (y-axis: Frequency). Panels: "p-values from NB regression" (x-axis: p-values, 0 to 1); "Log deviance ratios from NB regression" (x-axis: Log-deviance ratios, −5 to 5); "Deviance-based areas from NB regression" (x-axis: Tail areas, 0 to 1).]
otu#  otu                             pval
 116  cteriales; f Coriobacteriaceae  < 2.22e-16
 183  obacteriaceae; g Adlercreutzia  < 2.22e-16
 470  ostridiales; f Lachnospiraceae  < 2.22e-16
 500  lonellaceae; g Acidaminococcus  < 2.22e-16
  57  tobacillaceae; g Lactobacillus  < 2.22e-16
 428  Veillonellaceae; g Dialister    < 2.22e-16
 449  Coriobacteriaceae; g Slackia    < 2.22e-16
 324  tinobacteria; c Actinobacteria  < 2.22e-16
 394  cteriales; f Coriobacteriaceae  < 2.22e-16
 373  Moraxellaceae; g Acinetobacter  1.1682e-15
 151  ostridiales; f Lachnospiraceae  8.0287e-14
 743  teriales; f Enterobacteriaceae  2.5791e-11
 335  ipelotrichaceae; g Clostridium  3.1231e-11
 115  tobacillaceae; g Lactobacillus  1.9556e-10
 293  uminococcaceae; g Ruminococcus  3.7384e-10
 299  Veillonellaceae; g Megasphaera  5.6156e-10
  68  tinobacteria; c Actinobacteria  8.3059e-10
 533  Root; p Firmicutes              4.3500e-09
 121  Clostridiaceae; g Clostridium   5.2253e-09
   8  lostridiales; f Clostridiaceae  1.1172e-08
 132  c Clostridia; o Clostridiales   2.9136e-08
  64  ostridiales; f Veillonellaceae  9.2874e-08
  94  c Clostridia; o Clostridiales   1.1626e-07
 301  ostridiales; f Lachnospiraceae  1.3031e-07
  53  f Lachnospiraceae; g Blautia    1.5183e-07
 238  ostridiales; f Lachnospiraceae  2.3633e-07
 791  Prevotellaceae; g Prevotella    2.4478e-07
  51  omonadaceae; g Parabacteroides  2.4608e-07
 693  es; f Erysipelotrichaceae; g    2.5293e-07
 106  tobacillaceae; g Lactobacillus  2.7752e-07
Various Linear Models Statistical issues
- Traditional diagnostics would make the NB and its variants appealing
- Concerning distributional issues with OTUs
- Example of a significant OTU: NB(µ = 21.39, θ = 0.0017)
bf     value         0   5315
       frequency    91      1
¬bf    value         0      1      9
       frequency   117      1      1
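What the quoted parameters imply can be worked out directly from the NB mass function in its (µ, θ) form, where P(Y = 0) = (θ / (θ + µ))^θ and Var(Y) = µ + µ²/θ. The sketch below uses only the two values from the slide. The observed zero fraction relies on reading the table as value/frequency pairs, which is an assumption.

```python
import math

mu, theta = 21.39, 0.0017  # values quoted on the slide

nb_p_zero = (theta / (theta + mu)) ** theta  # NB probability of a zero count
nb_variance = mu + mu ** 2 / theta           # NB variance in the (mu, theta) form
poisson_p_zero = math.exp(-mu)               # zero probability under a Poisson with the same mean

# Assumed reading of the table: 91 + 117 zeros among 92 + 119 labelled samples.
observed_zero_fraction = (91 + 117) / (92 + 119)

print(f"NB P(Y = 0)            = {nb_p_zero:.3f}")
print(f"observed zero fraction = {observed_zero_fraction:.3f}")
print(f"Poisson P(Y = 0)       = {poisson_p_zero:.2e}")
print(f"NB variance            = {nb_variance:.0f}")
```

A θ this close to zero lets the NB reproduce the zero fraction almost exactly, at the cost of an enormous variance. The fit can therefore look acceptable by traditional diagnostics while the significance is driven by a handful of non-zero samples.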
Which OTUs are picked as significant is highly characteristic of the individual method (inflated false positives)
Next Steps . . . In this framework
- Poorly fitting models may benefit from regression with finite mixtures of Poisson/NB to split the extremes (group starvation is an issue)
- Adjustments by cohort when START arrives
- For variables with ordinality, apply model selection among the well-fitting model components.
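The finite-mixture idea in the first bullet can be sketched with a plain two-component Poisson mixture fitted by EM. This simplified version has no covariates, whereas a mixture regression would let each component's mean depend on them. The starting values, the floors that guard against a component being starved of observations, and the counts are all assumptions made for illustration.

```python
import math


def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture.

    Returns (weight of the low component, low mean, high mean).
    """
    ordered = sorted(counts)
    half = len(ordered) // 2
    # Crude start: means of the lower and upper halves, nudged away from zero.
    lam_low = max(sum(ordered[:half]) / half, 0.1)
    lam_high = max(sum(ordered[half:]) / (len(ordered) - half), lam_low + 1.0)
    weight = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each count belongs to the low component.
        resp = []
        for k in counts:
            log_a = math.log(weight) + poisson_logpmf(k, lam_low)
            log_b = math.log(1 - weight) + poisson_logpmf(k, lam_high)
            top = max(log_a, log_b)
            a, b = math.exp(log_a - top), math.exp(log_b - top)
            resp.append(a / (a + b))
        # M-step: weighted means, floored so a starved component stays finite.
        total_low = sum(resp)
        total_high = len(counts) - total_low
        lam_low = max(sum(r * k for r, k in zip(resp, counts)) / max(total_low, 1e-12), 1e-6)
        lam_high = max(sum((1 - r) * k for r, k in zip(resp, counts)) / max(total_high, 1e-12), 1e-6)
        weight = min(max(total_low / len(counts), 1e-6), 1 - 1e-6)
    return weight, lam_low, lam_high


# Hypothetical counts: many near-zero samples plus a few extremes (not NutriGen data).
counts = [0, 1, 0, 2, 1, 0, 1, 0, 0, 3, 48, 52, 50, 55, 45]
weight, lam_low, lam_high = fit_poisson_mixture(counts)
print(weight, lam_low, lam_high)
```

When only one or two samples are extreme, the high component is estimated from almost no data, which is the group starvation issue the bullet mentions.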
Fin