microbiome analysis · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per...
TRANSCRIPT
![Page 1: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/1.jpg)
MICROBIOME ANALYSISA WORKSHOP WITH CODE
![Page 2: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/2.jpg)
OUR MISSION
• Untangle complex system interactions
• Interactions are bidirectional
• There is always a three-way (at least) influence
Conditions
MicrobiotaHost
![Page 3: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/3.jpg)
OUR DATA
Sequence 1
SampleData 1
Sequence 2…
Sequence p
Sample 1 Sample 2 … Sample n
Data 2 Data m…
Sample 2…
Sample n
Counts (0 to ∞)
Continuous and categorical values
K1, p1, c1…Taxonomy
K1, p2, c2…
……
Amplicon sequence variants from DADA2 Phylogenetic affiliations
Observation data
![Page 4: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/4.jpg)
A GENERIC DATA SCIENCE WORKFLOW
Objective
• Define the
problem
• Define the
hypothesis
• Gather data
Exploration
• Bird's eye
view
• VISUAL!!
Cleaning
• Fix empty
values
• Check outliers
Feature
engineering
• Creating new
variables
• Deleting
variables
• Encoding of
categorical
variables
• Eigenvectors
Statistical
analysis
• Correlation
• Significances
• Distances
• Model
COMMUNICATION
• Meaningful
graphs
• Meaningful
numbers
• Poster,
presentations,
papers, etc.
![Page 5: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/5.jpg)
A GENERIC DATA SCIENCE WORKFLOW
COMMUNICATION
Objective
Exploration
CleaninigFeature engineering
Statistical analysis
![Page 6: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/6.jpg)
OBJECTIVE
BIOMICS project
• Tomato plants in the greenhouse treated with 4 different biostimulant products containing different types of ”beneficial microbes”
• Endomaxx, Inocucor, Pathway, (Earthcare with) Sumagrow
• Sacrificial sampling (tot 120 samples)
• 4 time points (0, 3, 6, 10 weeks)
• 5 treatments
• 6 replicate per treatment
• HYPOTHESIS:
• Biostimulant products have beneficial effects on the plants through impacts on the microbiome
• ”Beneficial microbes” in the products survive and/or thrive in the soil
• GATHERED DATA
• Leaf and root area and dry weight
• Chlorophyll content of the leaves (SPAD)
• Soil microbiota (bulk and rhizosphere not differentiated) -> 16S and ITS rRNA genes
Objective
•Define the problem
•Define hypothesis
•Gather data
![Page 7: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/7.jpg)
IMPORTANT!!
SAMPLE_IDs V1 V2 V3 etcSample1Sample2…
NO IN
DEXE
S
VARIABLES
Exploration
•Bird’s eye view of your data
•VISUAL!!!
![Page 8: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/8.jpg)
EXPLORATION
• Boxplots or violin plots for numeric variables
• Preliminary analysis of distributions
Exploration
•Bird’s eye view of your data
•VISUAL!!!
![Page 9: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/9.jpg)
WHY IS IT IMPORTANT?
![Page 10: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/10.jpg)
GARBAGE IN GARBAGE OUT
Your results will be only as good as your data.
![Page 11: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/11.jpg)
THE PROBLEM OF MISSING DATA
• Empty values (NAs or NaNs)
• NOT ZEROS!! ZERO IS A VALUE!!
• Can have three reasons:
• MAR (missing at random): not related to the origin but related to the observation
• MCAR (completely at random): not related to the sampling nor the observation
• MNAR (not at random): depending on some variables (i.e. rich people do not share their income data)
• NO SINGLE SOLUTION!
Imputation
General
Categorical
NAs = new level
Multiple imputation
Logistic Regression
Continuous
Mean
Median
Mode
Multiple Imputation
Linear RegressionTime Series
With Seasonality but no Trend
With Trend but no seasonality
With no trend and no
seasonality
Deletion
Deleting Rows
Pairwise Deletion
Deleting columns
Cleaning
•Fix empty values
•Check outliers
![Page 12: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/12.jpg)
OUTLIERS
• Data entry errors (human errors)• Measurement errors (instrument errors)• Experimental errors (data extraction or experiment
planning/executing errors)• Intentional (dummy outliers made to test detection
methods)• Data processing errors (data manipulation or data
set unintended mutations)• Sampling errors (extracting or mixing data from
wrong or various sources)• Natural (not an error, novelties in data)• NO SINGLE SOLUTION!!
Cleaning
•Fix empty values
•Check outliers
![Page 13: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/13.jpg)
BASICS OF FEATURE ENGINEERING
Feature engineering
•Creating new variables
•Deleting variables
•Encoding of categorical variables
•Eigenvectors
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Prof. Andrew Ng
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
Dr. Jason Brownlee
![Page 14: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/14.jpg)
WHAT CAN WE DO
CATEGORICAL VARIABLES• Combine variables
• Aggregate variables into superior categories
• Encoding variables
CONTINUOUS VARIABLES• Extract mean, quartiles, median etc.
• Mathematical combinations
• Transform
• Normalization
• Log-transform
• Rank/Binarize
• Bin
• PCA, LDA, etcFeature engineering
•Creating new variables
•Deleting variables
•Encoding of categorical variables
•Eigenvectors
State Florida Washington
Florida 1 0
Colorado 0 0
Colorado 0 0
Washington à 1 1
Florida 1 0
Washington 0 1
Florida 1 0
Colorado 0 0
![Page 15: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/15.jpg)
PRINCIPAL COMPONENT ANALYSIS
• Reduces the number of features (!!!) while preserving as much as the randomness of the original datasets
• Many algorithms fail with too many features
• Dimensionality problem
Feature engineering
•Creating new variables
•Deleting variables
•Encoding of categorical variables
•Eigenvectors
![Page 16: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/16.jpg)
PCA: PRINCIPAL COMPONENT ANALYSIS
• Standardize (i.e. normalize) data
• Find the eigenvalues and eigenvectors of covariance matrices
• Transform the data by projecting them on the eigenvectors
• Plots the components
Feature engineering
•Creating new variables
•Deleting variables
•Encoding of categorical variables
•Eigenvectors
![Page 17: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/17.jpg)
LDA: LINEAR DISCRIMINANT ANALYSIS
• CLASSIFICATION method
• Similar to PCA but:• Forces the components on determined variables
instead of calculating them
• Maximize distances between categories
• We will use it to combine categorical variables and continuous variables
![Page 18: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/18.jpg)
CORRELATIONS
WHY• Find dependencies between variables in your
data
• Strongly correlated variables break models
• Determine which variable needs to be eliminated
WHY NOT• NOT CAUSATION
• May miss unknown (confounding) variables
• Require normal distributions
Statistical analysis
•Correlations•Significances•Distances•Modeling
Correlation: 0.992082
![Page 19: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/19.jpg)
ANOVA
• Compares differences between groups• Normalizes distribution of each group into t-
Student distribution
• Applies two-way t-test for each pair:
• i.e. is the overlapping region of the distribution greater or lesser than the significance value?
• F = variability between groups/ within groupsStatistical analysis
•Correlations•Significances•Distances•Modeling
![Page 20: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/20.jpg)
TUKEY’S HSD TEST
• ANOVA tells you IF there are differences. Tukey’s tells you WHERE there are differences.
• More general than two-way ANOVA but less powerful
Statistical analysis
•Correlations•Significances•Distances•Modeling
![Page 21: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/21.jpg)
HOW ARE WE DOING?
A B C
![Page 22: MICROBIOME ANALYSIS · • 4 time points (0, 3, 6, 10 weeks) • 5 treatments • 6 replicate per treatment •HYPOTHESIS: • Biostimulantproducts have beneficial effects on the](https://reader034.vdocuments.net/reader034/viewer/2022042415/5f30337081d5a621d35b596a/html5/thumbnails/22.jpg)
CODE