removing unwanted variation from high-throughput omic...
TRANSCRIPT
![Page 1: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/1.jpg)
Removing Unwanted Variation from High-throughput Omic Data
With Johann Gagnon-Bartsch & Laurent Jacob UC Berkeley, WEHI & CNRS, Lyon"
"AMSI-SSAI Lecture, ABS, August 26, 2013"
1
![Page 2: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/2.jpg)
The problem
" Genomic and other omic data can be affected by
unwanted variation. "" For example, batch effects due to time, space,
equipment, operators, reagents, sample source, sample quality, environmental conditions,…the list goes on… "
" Also we often wish to combine data, both within and
across platforms. Differences between studies and platforms need to be dealt with."
2
![Page 3: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/3.jpg)
A few examples"
3
![Page 4: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/4.jpg)
Data structure"
In each of following examples, our data has the form"
4
m rows = samples typically 10s, 100s and at 7mes 1,000s
n columns = genes (~20,000), or SNPs = DNA variants (up to 2 million) , or …
m
n
![Page 5: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/5.jpg)
Snapshot view (SVD, PCA, MDS…)"
5
Data matrix ≈
m
n 2 n
2
m
If we apply the (suitably scaled) row eigenvectors to the rows of the data matrix, we get 2 values for each sample. These we plot, see next few slides.
![Page 6: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/6.jpg)
Artifact (batch) can overwhelm biology Gene expression microarrays
"
Adapted from Lazar C et al. Brief Bioinform 2013
batch 1 batch 2
Principal component 1
Principal com
pone
nt 2
![Page 7: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/7.jpg)
From: J Novembre et al. Nature 456 (2008)
SNP genotypes: popula7on structure within Europe.
There are situa7ons in which we would like to remove such structure!
![Page 8: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/8.jpg)
A microarray experiment with central retina tissue from the rd1 mouse: 4 times x 3!
Light blue: 2 months Dark blue: 4 months Purple: 6 months Red: 8 months
Ideally we would have seen 4 7ght groups of 3 !, !, ! and ! resp.
Principal com
pone
nt 2
Principal component 1
Very severe batch effects
rd1 is a mouse model of re8ni8s pigmentosa: loss of rod photoreceptors, followed by that of cone photoreceptors
![Page 9: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/9.jpg)
RNA-seq data: batch corresponds to plate barcode "
82 AML samples"
PC 1
PC 2
![Page 10: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/10.jpg)
PC2 vs PC1 for 12 zebrafish RNA-seq runs: 3 treated vs 3 control (in duplicate)"
10
−150 −100 −50 0 50 100
−100
−50
050
All genes
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
First principal component
Second
prin
cipal com
pone
nt
The biology is not evident in the first 2 PCs
![Page 11: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/11.jpg)
11
Nature Reviews Gene/cs, vol 11, October 2010, p. 733
They iden7fy fatally flawed studies!
![Page 12: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/12.jpg)
Combining 3 experiments"
• Three microarray gene expression experiments carried out at different times are all comparisons of the form"
Knock-Out (3X) vs Wild-Type (3X) "• All are in T-cells, and while the three KOs differ
(Id2, Tbet, Blimp), the WT mice are the same. "• The idea is to combine the three experiments into
one, to benefit from the increased WT replication, and to compare the different KOs. "
12
![Page 13: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/13.jpg)
Raw Quantile-normalized"
13
−50 0 50
−40
−20
020
Raw data
PC1
PC2
−60 −40 −20 0 20 40
−60
−40
−20
020
40
Standard normalization
PC1PC
2
−20 −10 0 10
−30
−20
−10
010
2030
After RUV−rand
PC1
PC2
Blue: wild-‐type, Red: knock-‐out. Shapes: Different experiments (KOs)
![Page 14: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/14.jpg)
LC-MS"
GC-MS"
Mi
Illumina GA-‐2 Affymetrix
Agilent Illumina
Microarrays
HiSeq
MiSeq
![Page 15: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/15.jpg)
Illumina Infinium Human Methylation Beadchips : a special problem "
27k 450k
The 27k probes are on the 450k chip. "Wanted: to combine data from these two arrays."
![Page 16: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/16.jpg)
Some scientific goals sought using gene expression microarrays and
analogous platforms "
• Quantification of expression"• Differential Expression (DE) "• Classification "• Clustering"• Correlating"
16
![Page 17: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/17.jpg)
Some consequences of Unwanted Variation"
• Poor quantification of expression"• False discoveries (type 1 errors)"• Missed discoveries (type 2 errors)"• Incorrect predictions"• Artificial clusters"• Wrong correlations"
17
![Page 18: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/18.jpg)
Aim for today "
To describe some ways of" "• identifying and removing (i.e. adjusting for)
unwanted factors, when aiming to achieve these goals, and "
• telling whether or not it helped."
18
![Page 19: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/19.jpg)
I will begin with Differential Expression"
19
![Page 20: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/20.jpg)
The model we and others use"
m samples, n genes, k unwanted factors" " Ym×n = Xm×pβp×n + Wm×kαk×n + εm×n %% where " Y is a matrix of gene expression measurments, observed," X carries the factors of interest, observed," β are gene coefficients, unobserved, " W carries unwanted factors, unobserved, " α are gene coefficients, unobserved, " ε are errors, unobserved."
20
![Page 21: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/21.jpg)
The model we use in pictures "
21
yij = xiβj + wiαj + εij
![Page 22: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/22.jpg)
Relation to an econometric model"
% % % %Yit = Xit’β + uit , "where Xit is a p×1 vector of observable regressors, β is a p×1 vector of unknown coefficients, and uit has a common factor structure"% % % %uit = λi’Ft + εit , "
where λi is a vector of factor loadings and Ft is a vector of common factors, and the εit are idiosyncratic errors, i=1,…N cross-sectional units, t=1,…,T time periods. This is a model for panel data, Bai (2005), where interest is in estimating β. Often N >> T. Note the difference between the 2 models.""
![Page 23: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/23.jpg)
The model, 2" %Our goal: for differential expression, to estimate β.%"Note: W is unobserved, Otherwise, this is a standard
linear model. "Our strategy: use factor analysis to estimate W%"There are identifiability issues"• The correlation between X and W is unknown"• β and α are not identifiable"%(The examples we use below have p=1.)"
23
![Page 24: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/24.jpg)
Identifiability: we don’t know the correlation of W (k=1) with X!
24
Two samples Each dot a gene
![Page 25: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/25.jpg)
Some ways of dealing with these problems with gene expression microarrays"
• Standard linear regression"• EB linear regression (ComBat)"• Naïve factor analysis (SVD)"• Full Bayes using MCMC "• Variational Bayes (VIBES, Infer.NET, PEER)"• Surrogate Variable Analysis (SVA)"• Linear model with sparsity (LEAPP)"• Mixed model analysis (ICE)"
"
25
![Page 26: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/26.jpg)
We might have genes not affected by X!
26
Call such genes nega7ve controls.
![Page 27: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/27.jpg)
Our solution: Use control genes "
Negative controls: Assume βj = 0.%"Positive controls: Assume βj ≠ 0.%
27
“controls” in this context means "“controls w.r.t. differential expression”"
![Page 28: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/28.jpg)
Some history"
• Lucas et al (2006) Sparse Statistical Modelling in Gene Expression Genomics, created covariates from PCA based on signal from control and housekeeping probes"
• Behzadi et al, (2007) A component based noise correction method (CompCor) for BOLD and perfusion based fMRI Neuroimaging.Ccreated covariates from PCA based on signal from “noise ROI” (white matter, CSF)"
• Tradition in analytical chemistry/metabolomics: use of “internal standards” "
28
![Page 29: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/29.jpg)
Using the negative controls c " Yc = Wαc + εc "
"Just do a factor analysis on the negative controls!""Examples of negative controls "• housekeeping genes, "• spiked-in controls"• genes chosen carefully"
This works!"29
![Page 30: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/30.jpg)
Introducing the two-step: RUV-2"
1. Do a factor analysis on Yc to estimate W.%2. Then regress Y on X and the estimated W to get
an estimate of β adjusted for W.%
"
30
There are many ways to do the factor analysis, including "SVD, the EM-algorithm, and using Infer.NET (variational "Bayes), the last two needing a probability model.""SVD: Write Yc = UΛVT , then put W^ = UΛk , Λk = k largest. ""
![Page 31: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/31.jpg)
Ex: gender differences in the brain(Vawter et al, Neuropsychopharmacology 2004)"
• 5 men, 5 women"• 3 brain regions (AnCing, DLPFC, Cb)"• Each sample done in 3 labs"• 2 Affymetrix chip types: HGU95a, HGU95av2"• There should be (5+5) × 3 × 3 = 90 arrays, but 6
are missing, so there are just 84."" We’ll ignore regions, and focus on gender. "
31
![Page 32: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/32.jpg)
Ex: gender differences in the brain, 2"
• 12,685 probe sets"• 799 housekeeping genes, 33 spike-in negative
controls"• Positive controls: genes on the Y and X chromosomes"
There’s no connection between Y and X here and the Y and X in my model – they are italicised, and colored!"
32
![Page 33: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/33.jpg)
Gender differences in the brain# X/Y genes in the top 40 "
33 Preprocessing = standard RMA"
Method W/o preprocessing With preprocessing
No 7 13
Regression 6 16
SVA (IRW) 6 17
ComBat 14 17
RUV2-‐SVD 22 20
RUV2-‐EM 22 22
![Page 34: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/34.jpg)
How did we find k?"
34
![Page 35: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/35.jpg)
Possible ways to determine k!
• Scree plots"• Quality measures/plots" - p-value histograms" - RLE plots"• More math" - hypothesis tests" - move beyond factor "analysis"
• Positive controls"" 35
![Page 36: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/36.jpg)
Number of X/Y genes in Top 20 /40
Top 20 Top 40
36
![Page 37: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/37.jpg)
What next?"• We have an alternative to RUV2 called RUV4,
which has some advantages. "• We have a form of RUV4 called RUVinv for
which we do not need to estimate k. "• In all applications, the main issue is: what do we
use as negative controls ? We can derive empirical negative control genes. "
• We can ridge to improve conditioning"• We can smooth the gene-specific variances and
get better Type 1 error control"Details in UC Berkeley Statistics Technical Report #820"
"37
![Page 38: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/38.jpg)
Gender data, 4: not preprocessed"
38
Method #X/Y in top 100 Type 1 error × 100
Unadjusted 10 0 SVA-‐IRW 12 0 LEAPP 19 1 ICE 17 0 RUV4 (HK) 29 12 RUVinv (HK) 26 7 RUVinv-‐evar (HK) 26 6 RUVrinv-‐evar (HK) 28 6 RUVrinv-‐evar (full) 32 6 RUVrinv-‐evar (emp) 30 6
![Page 39: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/39.jpg)
Relation of negative controls to instrumental variables"
Instruments are variables that are correlated with the factor of interest but uncorrelated with the error term (or in our case, the unwanted variation). "They can be used to obtain unbiased es7mates of the effect of interest (in our case, β ). "Let V be a full rank m×r matrix of instruments such that m > r ≥ p, such that V’W = 0, and such that V’X is full rank. The IVLS estimator of β would be " " " %%
39
[X 'V (V 'V )−1V 'X]−1X 'V (V 'V )−1V 'Y
![Page 40: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/40.jpg)
Analogous formulae"
Alterna7vely, we may write the IVLS es7mator as """"Compare this to the RUV-‐2 es7mator ""
40
(X 'RW X)−1X 'RWY
(X 'PVX)−1X 'PVY
![Page 41: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/41.jpg)
Comparison"
With IVLS we iden7fy a “safe” subspace using instruments. Instruments are variables that we assume lie within the “safe” subspace. With RUV-‐2 we iden7fy a “safe” subspace using nega7ve controls. Nega7ve controls are variables that we assume lie within the “dangerous” subspace that is the orthogonal complement of the “safe” subspace. With both IVLS and RUV-‐2 there is the caveat that X must not be orthogonal to the “safe” subspace. In the case of IVLS, this means that V must be reasonably correlated with X; we want to avoid weak instruments. In the case of RUV-‐2, this means that X must lie outside R(W^); the control genes must not be influenced by X.
41
![Page 42: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/42.jpg)
What next?"
• Next I’ll give a quick look at some applications of these ideas to various examples."
• In all applications, the main issue is: what do we use as negative controls and positive controls, if any. "
42
![Page 43: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/43.jpg)
MicroArray Quality Control dataset"
• Two mRNA samples (Stratagene Universal Human Reference RNA, and Ambion Human Brain RNA)"
• Each sample was assayed 5 times at each of 6 sites on the Affymetrix HU133Plus2.0 platform: 60 arrays in all."
• The labs at the different sites have all done a pretty good job on their assays. However, one lab lacked experience."
• Here we let our approach discover the site effects, not including them as dummy variables (you will see why not)."
43
![Page 44: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/44.jpg)
The figure (w1) shows clear site effects (different colors represent different sites)"
Note the purple: whatever factor is varying from site to site is also varying within this site. Dummy variables would not have worked as well here. The effects are small.
44
![Page 45: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/45.jpg)
Removing severe batch effects"• Back to our mouse model of retinitis pigmentosa
(loss of rod and later cone photoreceptors). "
• Initially no significantly downregulated retinal genes were found between 2 and 8 months (left volcano plot on the next slide)."
• Using RUV (right plot on the next slide), we were able to find several significantly down-regulated retinal, even cone-specific genes, which were later confirmed. "
45
![Page 46: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/46.jpg)
Standard analysis "
46 log2(fold change) 8m/2m
-‐log 1
0 (p-‐value)
Green dots: genes expressed in the re7na
![Page 47: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/47.jpg)
Standard analysis Analysis with RUV"
47 log2(fold change) 8m/2m log2(fold change) 8m/2m
-‐log 1
0 (p-‐value)
-‐log 1
0 (p-‐value)
Green dots: genes expressed in the re7na
Includes cone genes
![Page 48: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/48.jpg)
Back to our 3 treatment vs 3 control (in duplicate) RNA-seq zf experiment"
48
![Page 49: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/49.jpg)
PC2 vs PC1 of normalized data "
49
−100 0 100 200
−100
−50
050
RAW
First PC
Seco
nd P
CCtl. 1
Ctl. 3Ctl. 5
Trt. 9
Trt. 11Trt. 13
−150 0 100
−50
050
100
TC
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3Ctl. 5
Trt. 9
Trt. 11
Trt. 13
−50 50 150
−50
050
100
150
UQ
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
−150 −50 50
−50
050
100
150
TMM
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
−150 −50 50
−50
050
100
150
AH
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
−50 50 150
−100
−50
050
FQ
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
−50 50 150
−100
−50
050
CL
First PC
Seco
nd P
CCtl. 1
Ctl. 3
Ctl. 5
Trt. 9
Trt. 11Trt. 13
−100 0 50
−50
050
100
RUV
First PC
Seco
nd P
C
Ctl. 1
Ctl. 3Ctl. 5
Trt. 9Trt. 11
Trt. 13
We’d hope to see the trt vs. ctl difference wouldn’t we?
![Page 50: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/50.jpg)
Back to combining 3 sets of 3 KO vs 3 WT T-cell microarray experiments (with same WT)"
50
![Page 51: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/51.jpg)
Raw Q-norm RUVrandom"
51
−50 0 50
−40
−20
020
Raw data
PC1
PC2
−60 −40 −20 0 20 40
−60
−40
−20
020
40
Standard normalization
PC1
PC2
−20 −10 0 10
−30
−20
−10
010
2030
After RUV−rand
PC1
PC2
Blue: wild-‐type, Red: knock-‐out. Shapes: Different experiments (KOs)
![Page 52: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/52.jpg)
Summary" With very simple statistical methods, we can: ""• Use negative control genes to estimate the
unwanted factors,"• Use positive control genes or other methods to
estimate the number of unwanted factors."" With slightly more complex statistical methods, we
can avoid estimating the number of unwanted factors, and relax the control gene assumption. "
" 52
![Page 53: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/53.jpg)
In later work we"
• Apply these differential expression ideas in other contexts; microarray methylation data, mass spec metabolomic data, RNA-seq gene expression data,…"
• We have analogous results for prediction (classification), clustering and correlating"
• We can combine different studies on the same platform (e.g. two or more Affymetrix studies), on similar but distinct platforms (e.g. Affymetrix, Agilent and Illumina microarray studies), and studies on totally different platforms, e.g. GC-MS and LC-MS metabolomic data, microarray and RNA-seq data. "
53
![Page 54: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/54.jpg)
Acknowledgements"Johann Gagnon-Bartsch"Department of Statistics, "
University of California at Berkeley"Laurent Jacob"
Laboratoire Biométrie et Biologie Evolutive, Université Lyon, CNRS, UMR 5558, Villeurbanne"
Francois Collin, Genomic Health"Moshe Olshansky, WEHI"
Alysha De Livera, Univ Melbourne"Jovana Maksimovic, MCRI"
54
![Page 55: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/55.jpg)
Clustering or “cleaning”"
55
![Page 56: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/56.jpg)
The problem"
We now assume we don't know X any more, e.g. for clustering, or cleaning a dataset. "
" We can still estimate W as before, using Yc , but
then we can't do the regression step. " " We have several statistical approaches to this
problem, details omitted. One is RUV-random."""
56
![Page 57: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/57.jpg)
![Page 58: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/58.jpg)
We know of 5 age-related mets: 3 going up, 2 going down. Look at volcano plots of age effects, adjusted for sex and BMI"
RUV-‐random pulls two mets up out of the pool
![Page 59: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/59.jpg)
The biological solution"
" Reference controls are simply technical replicates,
but replicates whose variation might well be representative of the very unwanted variation we wish to remove. That’s going to be our hope when we use them. (We’ll check the results, of course.) Any replicates will help, but reference controls have a better chance of spanning the space of UV."
59
![Page 60: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/60.jpg)
Diagram illustrating a reference control in 6 batches of 5 samples of 2 types"
60
Note that a naïve batch adjustment here would equalize red and green, on average.
Walker et al BMC Genomics (2008)
![Page 61: Removing Unwanted Variation from High-throughput Omic Dataresearch.amsi.org.au/wp-content/uploads/sites/3/2014/10/Speed_AMSI... · The problem " Genomic and other omic data can be](https://reader031.vdocuments.net/reader031/viewer/2022022706/5be1c14a09d3f288328bd14a/html5/thumbnails/61.jpg)
How do we use the reference control replicates? Simplest version."
• Note that the reference control Ys have the same (unknown) X, and so their row differences Yd satisfy"
Yd = Wdα + εd %
• Estimate α from the svd of the left hand side, say α^ = EkQT, where Yd = PEQT."
• Plug α^ into the formula Yc = Wαc + εc for negative control genes, and estimate W by linear regression."
• Once W and α have been estimated, subtract W^α^.% This too works! (but we can do better)"
61