unwanted variation - biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to...
TRANSCRIPT
Unwanted variationWe have seen that microarray expression data suffers from “unwanted variation”.
We have discussed a number of preprocessing algorithms intended to increase the signal-to-noise in such data.
But sometimes (unwanted) variation remains.
Batch effects
Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.
As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.
#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,
microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.
)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.
One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study
Nature Reviews | Genetics
Expr
essio
n
Sample
Batch 1 Batch 2
Expr
essio
n
14
16
a
c
b
d
12
10
8
6
4Ex
pres
sion
2
12
14
10
8
6
4
Sample Sample
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
malNor
mal
10
9
8
7
6
5
4
(KIWTG���̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO���YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG�����D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC����4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU����E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�
2'452'%6 +8'5
����| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU
© 20 Macmillan Publishers Limited. All rights reserved10
8 normal samplescolor: processing date
Batch effects
Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.
As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.
#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,
microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.
)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.
One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study
Nature Reviews | Genetics
Expr
essio
n
Sample
Batch 1 Batch 2
Expr
essio
n
14
16
a
c
b
d
12
10
8
6
4
Expr
essio
n
2
12
14
10
8
6
4
Sample Sample
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
mal
Nor
malNor
mal
10
9
8
7
6
5
4
(KIWTG���̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO���YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG�����D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC����4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU����E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�
2'452'%6 +8'5
����| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU
© 20 Macmillan Publishers Limited. All rights reserved10
One geneClustering
Batch effectsBatch effects: unwanted variation remaining after normalization.
Often associated with processing date or processing batch.
Often thought to be technical, but can be biological (later).
Two extreme situations:(1) Batch effect is orthogonal to comparison of interest. This will increase noise and dilute signal, but otherwise ok.(2) Batch effect is confounded with comparison of interest. This will make any conclusion highly circumspect (and likely wrong).
Sources of Heterogeneity
The Effect of Heterogeneity Color = Environment
(Idaghdour et al. 2008)
Color = Processing Year
(Cheung et al. 2008)
Color = Allele
(Brem et al. 2005)
ASimpleSimulatedExample
IndependentE DependentE
Genes
Genes
Arrays Arrays
GenebyGeneModel
expression=b0+b1×group+noise
Testwhetherb1=0<=>T‐testforgeneI
CalculateaP‐value
NullP‐ValueDistribu$ons
IndependentE
DependentE
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
P‐value P‐value P‐value P‐value
P‐value P‐value P‐value P‐value
NullP‐ValueDistribu$ons
|ρ| = 0.40 |ρ| = 0.31 |ρ| = 0.10 |ρ| = 0.00 Correla$on
IndependentE
DependentE
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
Frequency
P‐value P‐value P‐value P‐value
P‐value P‐value P‐value P‐value
FalseDiscoveryRateEs$mates
IndependentE DependentE
RankingEs$mates
IndependentE DependentE
PrincipalComponentsAnalysis/
SingularValueDecomposi$on
• Amethodtoiden$fypaternsinthedatathat
explainalargepercentageofthevaria$on
• PCAandSVDhavedifferentmathema$cal
goalsbutendupes$ma$ngthesamething
• FirstproposedforgenomicsbyAlteretal.
(2000)PNAS
SingularValueDecomposi$on
samples
genes = × ×
U D VT
Dataeigenarrays/
leusingularvectors/
loadings
singular
values
eigengenes/
rightsingularvectors/
principalcomponents
Proper$esofSVDsamples
genes = × ×
U D VT
ColumnsofVTdescribepaternsacrossgenes
ColumnsofUdescribepaternsacrossarrays
isthepercentofvaria$onexplainedbytheithcolumnofV
€
di
2/ d
i
2
i=1
n
∑
ColumnsofVT/rowsofUareorthogonalandcalculatedoneata$me
1Patern1stSV
= × +
1stColumnofU 1stColumnofVT
2Paterns,1stSV
= × +
1stColumnofU 1stColumnofVT
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
Surrogate Variable Analysis TheData
TrueBatchEs$mateofBatch
Pr(!Group&Batch)
SVAAdjustedGenebyGeneModel
expression=b0+b1×group+surrogates+noise
Testwhetherb1=0
CalculateaP‐value
Ranking Estimates
IndependentE DependentEDependentE+IRW‐SVA
RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise
AverageRankingbyT‐Sta$s$c
AverageRankingbyT‐Sta$s$c
AverageRankingbyT‐Sta$s$c
Methods(IRW-) SVA by Leek and Storey.
RUV by Gagnon-Bartsch and Speed
Many methods are small modifications of SVA.
The idea behind RUV is to use control (genes / probes) to estimate batch effects. Controls could be probes unaffected by biology (negative controls) or genes known to be differentially expressed.
An example of “biological” batch effect Changes in cell type composition.
Expression changes associated with Age or Cell cycle.