Multivariate and network tools for biological data analysis
Dmitry Grapov and Oliver Fiehn
University of California, Davis
Multivariate Analysis and Visualization Tools for Metabolomic Data
State of the art facility producing massive amounts of biological data…
>20-30K samples/yr
>200 studies
[Figure: data matrix with Sample and Variable axes]
Data Analysis and Visualization
• Quality Assessment: use replicated measurements and/or internal standards to estimate analytical variance
• Statistical and Multivariate: use the experimental design to test hypotheses and/or identify trends in analytes
• Functional: use statistical and multivariate results to identify impacted biochemical domains
• Network: integrate statistical and multivariate results with the experimental design and analyte metadata
Sample: experimental design (organism, sex, age, etc.)
Variable: analyte description and metadata (biochemical class, mass spectra, etc.)
Data Quality Assessment
Drift in >400 replicated measurements across >100 analytical batches for a single analyte
[Figure: abundance vs. acquisition batch]
QCs embedded among >5,5000 samples (1:10) collected over 1.5 yrs
[Figure: Principal Component Analysis (PCA) of all analytes, showing QC sample scores]
If the biological effect size is smaller than the analytical variance, the experiment will incorrectly yield nonsignificant results
Data Quality Assessment
Analyte specific data quality overview
Sample specific normalization can be used to estimate and remove analytical variance
Normalizations need to be numerically and visually validated
[Figure labels: %RSD vs. log mean; low precision, high precision; Raw Data, Normalized Data; Samples, QCs]
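A minimal sketch of the numerical validation step, assuming sum normalization as the sample specific method (the slides do not name one) and using made-up QC values. It compares per-analyte %RSD across QC injections before and after normalization; all function and variable names are hypothetical.

```python
from statistics import mean, stdev

def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)

def sum_normalize(rows):
    """Divide each sample by its total signal (one factor per sample).
    Assumption: the slides do not say which normalization was used."""
    return [[v / sum(row) for v in row] for row in rows]

def rsd_per_analyte(rows):
    """%RSD of each analyte (column) across samples (rows)."""
    return [percent_rsd(list(col)) for col in zip(*rows)]

# Hypothetical QC injections (rows) by analytes (columns). Each row is the
# same pooled material scaled by an injection-specific factor, so this toy
# data is an idealized case of purely sample-wise analytical variance.
qc_raw = [
    [100.0, 50.0, 10.0],
    [120.0, 60.0, 12.0],
    [80.0, 40.0, 8.0],
    [110.0, 55.0, 11.0],
]

rsd_raw = rsd_per_analyte(qc_raw)
rsd_norm = rsd_per_analyte(sum_normalize(qc_raw))

for i, (before, after) in enumerate(zip(rsd_raw, rsd_norm)):
    print(f"analyte {i}: %RSD raw = {before:.1f}, normalized = {after:.1f}")
```

On real data the QC %RSD will not collapse to zero as it does for this idealized input; the point is only that QC precision should improve, and that the improvement should be checked per analyte rather than assumed.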
Statistical and Multivariate Analyses
What analytes are different between the two groups of samples (Group 1 vs. Group 2)?
• Statistical (t-Test): significant differences lacking rank and context
• Multivariate (O-PLS-DA): ranked differences lacking significance and context
• Network Mapping: ranked statistically significant differences within a biochemical context
Statistics + Multivariate + Context = Network Mapping
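A minimal sketch of the per-analyte statistical comparison, using made-up abundances. To stay within the Python standard library it computes Welch's t statistic and derives the p-value from an exact permutation test rather than from the t distribution, so it is a stand-in for the t-Test named on the slide, not the authors' procedure. Analyte names and values are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent groups."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

def permutation_p(a, b):
    """Two-sided exact permutation p-value with |t| as the test statistic.
    Enumerates every way to split the pooled values into two groups."""
    pooled = a + b
    observed = abs(welch_t(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so splits that tie with the observed one count.
        if abs(welch_t(g1, g2)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical abundances: analyte -> (Group 1, Group 2)
data = {
    "glucose": ([5.1, 5.3, 4.9, 5.2], [7.8, 8.1, 7.9, 8.3]),
    "alanine": ([2.0, 2.4, 1.9, 2.2], [2.1, 2.3, 2.0, 2.5]),
}

p_values = {name: permutation_p(g1, g2) for name, (g1, g2) in data.items()}
for name, p in p_values.items():
    flag = "significant" if p < 0.05 else "not significant"
    print(f"{name}: p = {p:.4f} ({flag})")
```

With many analytes the p-values would also need a multiple-testing correction, which this sketch omits.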
To see the big picture it is necessary to view the data from multiple different angles
DeviumWeb: https://github.com/dgrapov/DeviumWeb
• visualization • statistics • clustering • PCA • O-PLS
Functional Analysis
Identify changes or enrichment in biochemical domains
[Figure legend: • decrease • increase]
Nucl. Acids Res. (2008) 36 (suppl 2): W423-W426. doi: 10.1093/nar/gkn282
Functional Analysis: opportunity for ‘Omic integration
Use domain knowledge databases to integrate genomic, proteomic and metabolomic data
Current approaches can be limited to pathway level analyses
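A minimal sketch of a domain enrichment calculation, assuming a one-sided hypergeometric (over-representation) test. The slides do not say which enrichment statistic the cited tool uses, so this is an illustration of the general idea only; the counts and the function name are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(total, in_domain, changed, changed_in_domain):
    """P(X >= changed_in_domain) when `changed` analytes are drawn without
    replacement from `total`, of which `in_domain` belong to the domain."""
    upper = min(in_domain, changed)
    favorable = sum(
        comb(in_domain, k) * comb(total - in_domain, changed - k)
        for k in range(changed_in_domain, upper + 1)
    )
    return favorable / comb(total, changed)

# Hypothetical counts: 100 measured analytes, 10 in the domain of interest,
# 20 significantly changed overall, 6 of those inside the domain.
p_value = hypergeom_enrichment_p(total=100, in_domain=10,
                                 changed=20, changed_in_domain=6)
print(f"enrichment p-value: {p_value:.4g}")
```

The background set (`total`) should be the analytes actually measured, not every compound in the pathway database, otherwise the p-value is biased.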
Networks
• Biochemical: reaction, domain
• Structural: molecular fingerprints, mass spectra
• Empirical: correlation, partial correlation
BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99
Mapped Network: displaying metabolic differences in control vs. malignant lung tissue
Biochemical Relationships
http://www.genome.jp/dbget-bin/www_bget?rn:R00975
Structural Similarity
http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
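A minimal sketch of turning structural similarity into network edges, assuming the Tanimoto coefficient on fingerprint bit sets. Tanimoto is a common choice for fingerprint comparison; whether it matches the exact scoring behind the linked service is an assumption. The fingerprints, compound names, and the 0.7 cutoff are hypothetical.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical fingerprints: compound -> positions of bits that are set.
fingerprints = {
    "compound_A": {1, 2, 3, 4, 5, 6, 7, 8},
    "compound_B": {1, 2, 3, 4, 5, 6, 7, 9},
    "compound_C": {1, 20, 21, 22},
}

THRESHOLD = 0.7  # arbitrary cutoff chosen for illustration

structural_edges = [
    (a, b)
    for a, b in combinations(sorted(fingerprints), 2)
    if tanimoto(fingerprints[a], fingerprints[b]) >= THRESHOLD
]
print(structural_edges)
```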
Empirical Networks
Use experiment specific or data driven relationships to gain novel insight into biochemical relationships
[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]
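A minimal sketch of an empirical network built from pairwise Pearson correlations, using made-up abundance profiles. Analyte names, values, and the 0.8 cutoff are hypothetical, and partial correlation (also listed on the Networks slide) is not implemented here.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_edges(profiles, threshold=0.8):
    """Return (analyte_a, analyte_b, r) for pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

# Hypothetical abundance profiles across five samples.
profiles = {
    "ornithine": [1.0, 2.0, 3.0, 4.0, 5.0],
    "citrulline": [2.1, 3.9, 6.2, 8.0, 9.9],
    "urea": [5.0, 4.1, 2.9, 2.0, 1.1],
    "glucose": [3.0, 1.0, 4.0, 1.0, 5.0],
}

edges = correlation_edges(profiles)
for a, b, r in edges:
    print(f"{a} -- {b} (r = {r})")
```

Plain correlation links analytes that merely share a common driver; partial correlation is the usual way to prune such indirect edges.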
Mass Spectral Networks
Use mass spectra as a proxy for structure to help make sense of unknown compounds’ biochemical identities
Watrous J et al. PNAS 2012;109:E1743-E1752
unknown compounds are likely phytosterol esters
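A minimal sketch of spectral similarity, assuming plain cosine similarity between spectra stored as nominal m/z to intensity mappings. Published molecular networking methods use more elaborate scoring, so this shows only the basic idea of treating spectra as vectors; the spectra below are made up.

```python
from math import sqrt

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity}.
    Only peaks at identical (nominal) m/z contribute to the dot product."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = sqrt(sum(v * v for v in spec_a.values()))
    norm_b = sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical spectra (nominal m/z -> intensity).
unknown = {73: 100.0, 129: 45.0, 147: 30.0, 357: 12.0}
reference = {73: 95.0, 129: 50.0, 147: 25.0, 343: 10.0}
unrelated = {55: 80.0, 91: 100.0}

print(f"unknown vs reference: {cosine_similarity(unknown, reference):.3f}")
print(f"unknown vs unrelated: {cosine_similarity(unknown, unrelated):.3f}")
```

A high score only suggests structural relatedness; it narrows the candidates for an unknown but does not identify it.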
Mass Spectral Networks
Use mass spectra and empirical relationships to narrow down the biochemical roles for unknown compounds
Rigorous chemical experiments identified the unknown compounds as partial derivatization products of glucose
MetaMapR: https://github.com/dgrapov/MetaMapR
Analysis at the Metabolomic Scale and Beyond
[Figure labels: pyruvate, lactate, enzyme, gene A, gene B]
Pathway independent metabolomic (known and unknown), proteomic and genomic data integration
Software and Resources
• DeviumWeb: dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• Data analysis case studies and examples
  url: http://imdevsoftware.wordpress.com/
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154