data normalization approaches for large-scale biological studies

Data Normalization Approaches for Large-Scale Metabolomic Studies Dmitry Grapov, PhD

Upload: dmitry-grapov

Post on 10-May-2015


DESCRIPTION

Overview of how to estimate data quality and validate normalization approaches to remove analytical variance. See here for animations used in the presentation: http://imdevsoftware.wordpress.com/2014/06/04/using-repeated-measures-to-remove-artifacts-from-longitudinal-data/

TRANSCRIPT

Page 1: Data Normalization Approaches for Large-scale Biological Studies

Data Normalization Approaches for Large-Scale Metabolomic Studies

Dmitry Grapov, PhD

Page 2: Data Normalization Approaches for Large-scale Biological Studies

Analytical Variance

Variation in sample measurements stemming from sample handling, data acquisition, processing, etc.
• Can modify or mask true biological variability
• Calculated based on variance in replicated measurements
• Can be accounted for using data normalization approaches

Goal: minimize analytical variance using data normalization

Drift in >400 replicated measurements across >100 batches

Page 3: Data Normalization Approaches for Large-scale Biological Studies

Need for Normalization

To remove non-biological (e.g. analytical) drift/variance/artifacts in measurements

Acquisition order Processing/acquisition batches

Samples; Quality Controls (QCs)

Page 4: Data Normalization Approaches for Large-scale Biological Studies

Quantifying Data Quality (precision)

Calculate median inter- and intra-batch %RSD (for replicated measurements)

Analyte specific performance across whole study

Within batch performance

Page 5: Data Normalization Approaches for Large-scale Biological Studies

Visualizing Performance

Intra-batch (within) precision for normalization methods

Inter-batch (across) precision for normalization methods

RSD = relative standard deviation = standard deviation/mean
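As a minimal sketch of the precision metrics named on these slides, the snippet below computes %RSD for one analyte's QC replicates, both within each batch (intra-batch, summarized by the median) and across the whole study (inter-batch). The function names and the QC values are hypothetical and are not taken from the presentation or from the author's R code.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation as a percentage: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)


def median_intra_batch_rsd(values, batches):
    """Median of the per-batch %RSD for one analyte's QC measurements."""
    by_batch = defaultdict(list)
    for value, batch_id in zip(values, batches):
        by_batch[batch_id].append(value)
    # %RSD needs at least two replicates, so single-QC batches are skipped
    rsds = [percent_rsd(v) for v in by_batch.values() if len(v) > 1]
    return median(rsds)


# Hypothetical QC intensities for one analyte measured in two batches
qc = [100.0, 110.0, 90.0, 150.0, 165.0, 135.0]
batch = ["A", "A", "A", "B", "B", "B"]

intra = median_intra_batch_rsd(qc, batch)  # within-batch precision
inter = percent_rsd(qc)                    # precision across the whole study
print(round(intra, 1), round(inter, 1))
```

In this made-up example each batch is precise on its own, but the shift between batches inflates the study-wide %RSD, which is the kind of gap that normalization is meant to close.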

Page 6: Data Normalization Approaches for Large-scale Biological Studies

Visualizing Metabolite Performance

acquisition time

batch

Univariate Multivariate

PCA

Page 7: Data Normalization Approaches for Large-scale Biological Studies

Common Normalization Approaches

Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.

Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

Quality control (QC) or reference sample
• Batch ratio (mean, median)
• LOESS (doi:10.1038/nprot.2011.335; locally estimated scatterplot smoothing)
• Hierarchical mixed effects (Jauhiainen et al., 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)

Variance based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalization (Huber et al., 2002)

Page 8: Data Normalization Approaches for Large-scale Biological Studies

Evaluation of Normalizations

Use QC to define:
• Median within batch %RSD
• Median analyte study wide %RSD
• All normalization specific parameters

• Split QCs into training and test set
• Optimize tuning parameters using leave-one-out cross-validation
• Assess performance on test set

Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r

Page 9: Data Normalization Approaches for Large-scale Biological Studies

Scalar Normalization

Calculate a sample-specific scalar to ensure each sample’s signal (sum, mean, median, etc.) is equivalent

• Using sum signal normalization (sum norm) assumes equivalent total metabolite signal per sample

• Can correct for batch effects when valid

BMC Bioinformatics 2007, 8:93  doi:10.1186/1471-2105-8-93

These normalizations may hide true biological trends or create false ones

After sum norm, phospholipids seem lower in ob/ob when in reality they are the same as in wt samples
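The artifact described above can be reproduced with a toy example. This sketch assumes two analytes per sample with invented intensities; the triglyceride values and the `sum_normalize` helper are hypothetical and only illustrate how a sum normalization turns absolute signals into proportions.

```python
def sum_normalize(sample):
    """Scale one sample so that its analyte signals sum to 1."""
    total = sum(sample.values())
    return {name: value / total for name, value in sample.items()}


# Hypothetical intensities: phospholipid signal is identical in both groups,
# while a second analyte is elevated in ob/ob.
wt = {"phospholipid": 100.0, "triglyceride": 100.0}
ob = {"phospholipid": 100.0, "triglyceride": 300.0}

wt_norm = sum_normalize(wt)
ob_norm = sum_normalize(ob)

# Same raw phospholipid signal, different normalized values
print(wt_norm["phospholipid"], ob_norm["phospholipid"])  # 0.5 0.25
```

Because every value is divided by the sample total, an increase in one analyte lowers the normalized value of all others, even when their raw signal is unchanged.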

Page 10: Data Normalization Approaches for Large-scale Biological Studies

Batch Ratio (BR) Normalization

Use QCs to calculate:
1. Batch/analyte specific correction factor = (batch median / global median)
2. Apply ratio to samples

• simple
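A minimal sketch of batch ratio normalization for a single analyte follows. It assumes that "apply ratio" means dividing every measurement in a batch by that batch's correction factor, and that every batch contains at least one QC; the function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_normalize(values, batches, is_qc):
    """Divide each value by (batch QC median / global QC median)."""
    qc_by_batch = defaultdict(list)
    for value, batch_id, qc_flag in zip(values, batches, is_qc):
        if qc_flag:
            qc_by_batch[batch_id].append(value)
    global_median = median(v for v, q in zip(values, is_qc) if q)
    # One correction factor per batch for this analyte
    factor = {b: median(v) / global_median for b, v in qc_by_batch.items()}
    # A batch without any QC would raise KeyError here (assumption: none exist)
    return [v / factor[b] for v, b in zip(values, batches)]


# Hypothetical analyte: batch B reads twice as high as batch A
values = [100.0, 80.0, 120.0, 200.0, 160.0, 240.0]
batches = ["A", "A", "A", "B", "B", "B"]
is_qc = [True, False, False, True, False, False]

normalized = batch_ratio_normalize(values, batches, is_qc)
print([round(v, 1) for v in normalized])
```

After correction the QCs of both batches sit at the global QC median, and the samples are rescaled by the same per-batch factor.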

Page 11: Data Normalization Approaches for Large-scale Biological Studies

LOESS Normalization (local smoothing)

For each analyte use QCs to:
• Tune the LOESS model (span or degree of smoothing)
• Apply the LOESS model to remove analytical variance from samples
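The idea of training a smoother on QCs and applying it to samples could be sketched as below. This is a simplified stand-in for LOESS, not the author's implementation: it uses a tricube-weighted local linear fit without robustness iterations, widens the bandwidth slightly so the farthest neighbour keeps a small weight, and rescales corrected values to the QC median. All names and data are hypothetical.

```python
from statistics import median


def loess_predict(x0, xs, ys, span):
    """Predict y at x0 from a tricube-weighted local linear fit (needs >= 2 points)."""
    n = len(xs)
    k = max(2, min(n, int(round(span * n))))   # points in the local window
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] * 1.01 + 1e-12            # slightly widened bandwidth
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx < 1e-12:
        return my                              # degenerate window: local mean
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + sxy / sxx * (x0 - mx)


def loess_normalize(order, values, is_qc, span=0.75):
    """Fit the drift on QCs only, then divide every measurement by the fitted trend."""
    qc_x = [o for o, q in zip(order, is_qc) if q]
    qc_y = [v for v, q in zip(values, is_qc) if q]
    target = median(qc_y)                      # keep the original signal scale
    return [v * target / loess_predict(o, qc_x, qc_y, span)
            for o, v in zip(order, values)]


# Hypothetical run: 1% signal gain per injection, a QC at every 4th position
order = list(range(20))
is_qc = [o % 4 == 0 for o in order]
values = [(100.0 if q else 50.0) * (1 + 0.01 * o) for o, q in zip(order, is_qc)]

normalized = loess_normalize(order, values, is_qc, span=0.75)
qc_after = [v for v, q in zip(normalized, is_qc) if q]
print(round(min(qc_after), 3), round(max(qc_after), 3))
```

The model only ever sees QC measurements, so any trend in the samples that the QCs do not share is left in place or, in the worst case, distorted.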

raw LOESS normalized

Page 12: Data Normalization Approaches for Large-scale Biological Studies

LOESS Normalization

LOESS span has a large effect on model fit

span (α) defines the degree of smoothing and is critical for controlling overfitting
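One way to choose the span is leave-one-out cross-validation on the training QCs. The sketch below assumes a simplified tricube-weighted local linear smoother as a stand-in for LOESS, simulated QC data with a fixed random seed, and an arbitrary list of candidate spans; none of these come from the presentation.

```python
import math
import random
from statistics import mean


def loess_predict(x0, xs, ys, span):
    """Predict y at x0 from a tricube-weighted local linear fit (needs >= 2 points)."""
    n = len(xs)
    k = max(2, min(n, int(round(span * n))))   # points in the local window
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] * 1.01 + 1e-12            # slightly widened bandwidth
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx < 1e-12:
        return my
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + sxy / sxx * (x0 - mx)


def loocv_mse(xs, ys, span):
    """Mean squared error when each QC is predicted from all the other QCs."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        errors.append((ys[i] - loess_predict(xs[i], train_x, train_y, span)) ** 2)
    return mean(errors)


# Simulated QC series: smooth drift plus measurement noise (fixed seed)
rng = random.Random(0)
qc_order = list(range(0, 300, 10))
qc_signal = [100 + 10 * math.sin(x / 50) + rng.gauss(0, 2) for x in qc_order]

candidates = [0.1, 0.25, 0.5, 0.75, 1.0]
errors = {span: loocv_mse(qc_order, qc_signal, span) for span in candidates}
best_span = min(errors, key=errors.get)
print(best_span, round(errors[best_span], 2))
```

A very small span chases the noise in individual QCs and a very large one misses the drift; the held-out error is used here to pick a value between those extremes.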

Page 13: Data Normalization Approaches for Large-scale Biological Studies

LOESS Normalization

raw samples (red) normalized based on QCs (black)

model is trained on QCs and applied to samples

span: too high just right?

Cannot assume convergence of training and test performance because test data have analytical + biological variance

Page 14: Data Normalization Approaches for Large-scale Biological Studies

LOESS Normalization

Avoiding overfitting is critical when using the LOESS normalization

Page 15: Data Normalization Approaches for Large-scale Biological Studies

Example LOESS Normalization

raw; span = 0.75; span = 0.005

Page 16: Data Normalization Approaches for Large-scale Biological Studies

Metabolomic Data Case Study I

GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No internal standards (ISTDs)

Normalizations implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization

Page 17: Data Normalization Approaches for Large-scale Biological Studies

Batch Performance (GC-TOF Raw)

Within batch
• Median: 26
• Min: 19
• Max: 69

Median RSD   Count   Cumulative %
10-20            3              2
20-30           98             76
30-40           26             96
40-50            3             98
50-60            1             99
60-70            1            100

Page 18: Data Normalization Approaches for Large-scale Biological Studies

Median RSD   Count   Cumulative %
0-10            10              3
10-20           83             30
20-30          100             62
30-40           69             84
40-50           32             94
50-60            6             96
60-70            3             97
70-80            5             98
80-90            1             99
90-100           1            100

Analyte Performance (GC-TOF Raw)

Within batch
• Median: 24
• Min: 7
• Max: 79

Page 19: Data Normalization Approaches for Large-scale Biological Studies

PCA (GC-TOF Raw)

Page 20: Data Normalization Approaches for Large-scale Biological Studies

Within batches
• Median: 23
• Min: 17
• Max: 69

Median RSD   Count   Cumulative %
10-20           25             23
20-30           67             85
30-40           15             99
40-50            1            100
60-70            1            101

Batch Performance (GC-TOF BR)

Page 21: Data Normalization Approaches for Large-scale Biological Studies

Median RSD   Count   Cumulative %
0-10            17              6
10-20          103             39
20-30          112             75
30-40           57             93
40-50           12             97
50-60            5             99
60-70            3            100
70-80            1            100

Across batches
• Median: 24
• Min: 7
• Max: 79

Batch Performance (GC-TOF BR)

Page 22: Data Normalization Approaches for Large-scale Biological Studies

PCA (GC-TOF BR)

Page 23: Data Normalization Approaches for Large-scale Biological Studies

BR Normalization Limitations

• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match

Page 24: Data Normalization Approaches for Large-scale Biological Studies

Within batches
• Median: 19
• Min: 11
• Max: 58

Median RSD   Count   Cumulative %
10-20           75             57
20-30           51             96
30-40            4             99
40-50            1             99
50-60            1            100

Batch Performance (GC-TOF LOESS)

Page 25: Data Normalization Approaches for Large-scale Biological Studies

Median RSD   Count   Cumulative %
0-10            17              6
10-20          103             39
20-30          112             75
30-40           57             93
40-50           12             97
50-60            5             99
60-70            3            100
70-80            1            100

Across batches
• Median: 19
• Min: 2.9
• Max: 66

Batch Performance (GC-TOF LOESS)

Page 26: Data Normalization Approaches for Large-scale Biological Studies

PCA (GC-TOF LOESS)

Page 27: Data Normalization Approaches for Large-scale Biological Studies

LOESS Normalization Limitations

raw normalized

LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set

Page 28: Data Normalization Approaches for Large-scale Biological Studies

Sum mTIC Normalization (GC-TOF)

Improved performance over raw and BR, but alters data from magnitudinal to compositional

Page 29: Data Normalization Approaches for Large-scale Biological Studies

Sum mTIC Normalization (GC-TOF)

Poor removal of trends due to acquisition time, but limits the magnitude of outlier samples compared to other approaches

time

Raw

mTIC Normalized

Page 30: Data Normalization Approaches for Large-scale Biological Studies

Metabolomic Data Case Study II

LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)

• NOMIS (IS = ISTD)
• qcISTD

Page 31: Data Normalization Approaches for Large-scale Biological Studies

Internal Standards Normalization

Figure axes: analyte vs. retention time

Internal standards (ISTD)
• qcISTD (QC optimized metabolite/ISTD)

• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)

• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

NOMIS

Page 32: Data Normalization Approaches for Large-scale Biological Studies

ISTD Based Normalizations (LC/Q-TOF)

• NOMIS (linear combination of optimal ISTDs; Sysi-Aho et al., 2007)

• qcISTD (QC optimized ISTD strategy)

PC 38:6

Poor performance with NOMIS

Page 33: Data Normalization Approaches for Large-scale Biological Studies

qcISTD Normalization

Use QC samples to:
1. Evaluate analyte %RSD before and after corrections using all ISTDs
2. Select analyte/ISTD combinations with %RSD improvement over raw data at some threshold (e.g. 10%)
3. Correct sample analytes with QC defined ISTD if ISTD recovery is above some minimal threshold (e.g. > 20% of median)

• Subject to overfitting
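The three steps above could be sketched as follows for a single analyte. Several details are assumptions rather than the author's method: the 10% improvement is read as a relative reduction in QC %RSD, low-recovery samples are left uncorrected, and corrected ratios are rescaled by the ISTD's QC median to keep the original magnitude. Function names and data are hypothetical.

```python
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def select_istd(qc_analyte, qc_istds, min_improvement=0.10):
    """Return the ISTD with the lowest QC %RSD, or None if no ISTD beats
    the raw %RSD by at least min_improvement (relative reduction)."""
    raw = percent_rsd(qc_analyte)
    best_name, best_rsd = None, raw * (1 - min_improvement)
    for name, istd in qc_istds.items():
        rsd = percent_rsd([a / s for a, s in zip(qc_analyte, istd)])
        if rsd <= best_rsd:
            best_name, best_rsd = name, rsd
    return best_name


def correct_samples(analyte, istd, istd_qc_median, min_recovery=0.20):
    """Ratio-correct each sample unless its ISTD recovery is too low."""
    corrected = []
    for a, s in zip(analyte, istd):
        if s > min_recovery * istd_qc_median:
            corrected.append(a / s * istd_qc_median)
        else:
            corrected.append(a)  # assumption: keep the raw value
    return corrected


# Hypothetical QC data: istd_a tracks the analyte, istd_b does not
qc_analyte = [100.0, 120.0, 80.0, 110.0]
qc_istds = {
    "istd_a": [1.0, 1.2, 0.8, 1.1],
    "istd_b": [1.0, 0.8, 1.2, 0.9],
}
chosen = select_istd(qc_analyte, qc_istds)

# Hypothetical samples; the third has very low ISTD recovery
sample_analyte = [200.0, 90.0, 50.0]
sample_istd = [2.0, 0.9, 0.1]
corrected = correct_samples(sample_analyte, sample_istd,
                            median(qc_istds[chosen]))
print(chosen, [round(v, 1) for v in corrected])
```

Because the ISTD is picked on the same QCs that are used to judge improvement, the selection can overfit; holding out part of the QCs for evaluation is one way to check this.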

191 of 326 (60%) are ISTD corrected

Page 34: Data Normalization Approaches for Large-scale Biological Studies

qcISTD Normalization

ISTD used by retention time (Rt); total number of analytes corrected by ISTD

Page 35: Data Normalization Approaches for Large-scale Biological Studies

Optimal Lipidomic ISTDs

Page 36: Data Normalization Approaches for Large-scale Biological Studies

Normalizations (LC-Q-TOF)

LOESS performs very poorly for two metabolites

• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision

Page 37: Data Normalization Approaches for Large-scale Biological Studies

PCA (LC/Q-TOF)

Raw (%RSD = 13); qcISTD (9); LOESS (12); qcISTD + LOESS (8)

Only normalizations that include LOESS effectively remove analytical batch effects

Page 38: Data Normalization Approaches for Large-scale Biological Studies

Conclusion

• Comparison of common data normalization approaches suggests that in addition to ISTD corrections, LOESS (analyte-specific, non-linear adjustment based on QC performance at various data acquisition times) is superior to batch based corrections.

• Further validations need to be completed to confirm the effects of normalizations on samples’ variance

• These findings suggest that inclusion of “batch” as a covariate in statistical models will not fully account for analytical variance

R code for all normalization functions can be found at: https://github.com/dgrapov/devium/blob/master/R/Devium%20Normalization.r

Page 39: Data Normalization Approaches for Large-scale Biological Studies

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154