High-Dimensional Biological Data Analysis and Visualization

Post on 10-May-2015

DESCRIPTION

Examples of data analysis and visualization of high dimensional metabolomic data.

TRANSCRIPT

Dmitry Grapov, PhD

Metabolomic Data Analysis for the Study of Diseases

State of the art facility producing massive amounts of biological data…

>13,000 samples/yr

>160 studies

~32,000 data points/study

Goals?

Analysis at the Metabolomic Scale

Univariate vs. Multivariate

Univariate: Hypothesis testing (t-Test, ANOVA, etc.)

Multivariate: PCA

Predictive Modeling: O-/PLS/-DA

[Figure: Group 1 vs. Group 2, univariate/bivariate vs. multivariate views]

Mixed up samples? Outliers?

Univariate vs. Multivariate

Data Complexity

[Figure: data matrix of n samples by m variables; complexity increases from 1-D to 2-D to m-D]

Meta Data

Experimental Design =

Variable # = dimensionality

Statistical Analysis

• Identify differences in sample population means

• sensitive to distribution shape

• parametric = assumes normality

• error in Y, not in X (Y = mX + error)

• optimal for long data

• assumed independence

• false discovery rate (FDR)

[Figure: data shapes: long, wide, n-of-one]

Achieving “significance” is a function of:

significance level (α) and power (1-β)

effect size (standardized difference in means)

sample size (n)
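
How these quantities trade off can be sketched numerically. The example below is a minimal sketch, assuming a two-sided, two-sample comparison of means and the normal approximation rather than an exact t-based calculation; the function name `samples_per_group` is hypothetical and not from the slides.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of means.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardized difference in means (effect size).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more samples at the same alpha and power.
for d in (0.2, 0.5, 0.8):
    print(d, samples_per_group(d))
```

Halving the effect size roughly quadruples the required sample size, which is why small metabolic differences are hard to detect in small studies.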

• Type I Error: False Positives

• Type II Error: False Negatives

• Type I risk = 1 - (1 - p.value)^m

m = number of variables tested
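
The growth of Type I risk with the number of tests can be checked directly. This is a minimal sketch that reads the p.value in the formula as the per-test significance threshold and assumes the m tests are independent; the function name `type_one_risk` is hypothetical.

```python
def type_one_risk(threshold, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - threshold) ** m


# Risk of at least one false positive at a 0.05 threshold.
for m in (1, 10, 148):
    print(m, round(type_one_risk(0.05, m), 4))
```

With even a modest number of variables the chance of at least one false positive approaches certainty, which motivates the corrections that follow.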

FDR correction

• p-value adjustment or estimate of FDR (Fdr, q-value)

False Discovery Rate (FDR)

Bioinformatics (2008) 24 (12):1461-1462

FDR correction

[Figure: FDR adjusted p-value vs. p-value]

Benjamini & Hochberg (1995) (“BH”)

• Accepted standard

Bonferroni

• Very conservative

• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
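
Both adjustments are short enough to write out. The following is a minimal sketch in plain Python, assuming a flat list of raw p-values; the function names and the example p-values are illustrative and not taken from the slides.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """BH adjusted p-values: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.27]  # made-up values for illustration
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On the same input, the BH values are never larger than the Bonferroni values, which reflects why Bonferroni is described as very conservative.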

Multivariate Analysis

Clustering

• Grouping based on similarity/dissimilarity

Principal Components Analysis (PCA)

• Identify modes of variance in the data

Partial Least Squares (PLS)

• Identify modes of variance in the data correlated with a hypothesis

Cluster Analysis

Use similarity/dissimilarity to group a collection of samples or variables

Approaches

• hierarchical (HCA)

• non-hierarchical (k-NN, k-means)

• distribution (mixture models)

• density (DBSCAN)

• self organizing maps (SOM)

[Figure panels: Linkage, k-means, Distribution, Density]

Hierarchical Cluster Analysis

similarity/dissimilarity defines “nearness” or distance

objects are grouped based on linkage methods
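
The two ingredients above, a distance and a linkage rule, are enough for a working example. This is a minimal sketch assuming Euclidean distance and single linkage (one of several possible linkage methods); the function name `single_linkage` and the toy samples are hypothetical.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Agglomerative clustering: merge the two nearest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per sample

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Toy samples measured on two variables: three near the origin, two far away.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
print(single_linkage(samples, 2))
```

Recording the distance at each merge, rather than stopping at a fixed number of clusters, would give the full hierarchy that a dendrogram displays.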

Hierarchy of Similarity

[Figure: hierarchy plotted against a Similarity axis]

How does my metadata match my data structure?

Hierarchy of effect sizes

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)

• unsupervised

• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)

• supervised

• maximize covariance (Y ~ X)
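
The difference between maximizing variance and maximizing covariance can be seen on a toy data set. This is a minimal sketch in plain Python, assuming power iteration for the leading PCA direction and the standard first PLS weight vector (proportional to X'y on centered data); all function names and data values are hypothetical.

```python
from math import sqrt


def center(columns):
    """Subtract the mean from each variable (column)."""
    return [[v - sum(col) / len(col) for v in col] for col in columns]


def normalize(vec):
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def first_pc(columns, n_iter=200):
    """Leading PCA loading vector via power iteration on the scatter matrix X'X."""
    xc = center(columns)
    p = len(xc)
    scatter = [[sum(a * b for a, b in zip(xc[i], xc[j])) for j in range(p)] for i in range(p)]
    w = normalize([1.0] * p)
    for _ in range(n_iter):
        w = normalize([sum(scatter[i][j] * w[j] for j in range(p)) for i in range(p)])
    return w


def first_pls_weight(columns, y):
    """First PLS weight vector: the direction in X with maximum covariance with y."""
    xc = center(columns)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    return normalize([sum(a * b for a, b in zip(col, yc)) for col in xc])


# Variable 1 varies a lot but ignores y; variable 2 varies little but tracks y.
X = [[-10.0, 0.0, 10.0, -10.0, 0.0, 10.0], [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
print(first_pc(X))             # dominated by variable 1
print(first_pls_weight(X, y))  # dominated by variable 2
```

PCA follows the high-variance variable regardless of the hypothesis, while the PLS weight points at the variable that covaries with y.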

James X. Li, 2009, VisuMap Tech.

[Figure axes: PC1, PC2]

http://www.scholarpedia.org/article/Eigenfaces

Raw data PCA dimensions

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings

How are scores and loadings related?

Centering and Scaling

PMID: 16762068
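
Two common scaling choices can be written out for a single variable. This is a minimal sketch, assuming autoscaling (unit variance) and Pareto scaling as the examples; the slide itself does not say which methods were shown, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def autoscale(values):
    """Mean-center, then divide by the standard deviation (unit variance)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def pareto_scale(values):
    """Mean-center, then divide by the square root of the standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sqrt(sd) for v in values]


abundant = [1000.0, 2000.0, 3000.0]  # made-up high-abundance metabolite
trace = [1.0, 2.0, 3.0]              # made-up low-abundance metabolite
print(autoscale(abundant), autoscale(trace))
print(pareto_scale(abundant), pareto_scale(trace))
```

Autoscaling puts both variables on the same footing, while Pareto scaling shrinks the gap between them but keeps some of the original magnitude difference.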

Use PLS to test a hypothesis

time = 0 120 min.

Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)

PCA PLS

PLS model validation is critical

Determine in-sample (Q²) and out-of-sample error (RMSEP) and compare to a random model

• permutation tests

• training/testing
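
The logic of comparing against a random model can be illustrated without fitting a full PLS model. This is a minimal sketch of a label permutation test, assuming the absolute difference in group means of a single score as a stand-in for a model performance statistic such as Q²; the function name, the data, and the choice of statistic are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(scores, labels, n_permutations=999, seed=0):
    """Fraction of label shuffles whose statistic is at least the observed one."""

    def statistic(lbls):
        group1 = [s for s, l in zip(scores, lbls) if l == 1]
        group0 = [s for s, l in zip(scores, lbls) if l == 0]
        return abs(mean(group1) - mean(group0))

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between scores and labels
        if statistic(shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


scores = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # made-up model scores
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(permutation_p_value(scores, labels))
```

In a real validation, the statistic would be recomputed by refitting the model on each shuffled Y, so that the permuted results reflect what the whole modeling procedure finds in random data.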

Databases for organism specific biochemical information:

Multiple organisms

• KEGG

• BioCyc

• Reactome

Human

• HMDB

• SMPDB

Biochemical domain information

Pathway Enrichment Analysis

http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp

[Figure axes: enrichment, topological importance]
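
Over-representation of a pathway among altered metabolites is commonly scored with a hypergeometric test. The sketch below assumes that approach for illustration and does not claim to reproduce the calculation used by MetaboAnalyst; the function name, argument names, and counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, selected, pathway_size, background):
    """P(X >= hits) when `selected` metabolites are drawn from `background`,
    of which `pathway_size` belong to the pathway (hypergeometric upper tail)."""
    upper = min(selected, pathway_size)
    total = comb(background, selected)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up counts: 10 altered metabolites, 5 of them in a 20-member pathway,
# out of 200 measured metabolites.
print(enrichment_p_value(5, 10, 20, 200))
```

Because many pathways are tested at once, these p-values would normally be adjusted for multiple testing as well.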

Biochemical

Network Mapping

doi:10.1186/1471-2105-13-99

Structural Similarity

Data visualization as form of analysis

DM

Liver CYP2D6

Dextromethorphan = additives in

dextrorphan

• high fructose corn syrup

• antioxidants

• flavor

Identification of relationships between altered metabolites

[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]

Identification of treatment effects

Analysis of differential metabolic responses

Treatment 1 Treatment 2

Resources

• DeviumWeb: Dynamic multivariate data analysis and visualization platform
url: https://github.com/dgrapov/DeviumWeb

• imDEV: Microsoft Excel add-in for multivariate analysis
url: http://sourceforge.net/projects/imdev/

• MetaMapR: Network analysis tools for metabolomics
url: https://github.com/dgrapov/MetaMapR

• TeachingDemos: Tutorials and demonstrations
url: http://sourceforge.net/projects/teachingdemos/?source=directory
url: https://github.com/dgrapov/TeachingDemos

• CDS Blog: Data analysis case studies
url: http://imdevsoftware.wordpress.com/

dgrapov@ucdavis.edu metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154
