unwanted variation - biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to...

Unwanted variationWe have seen that microarray expression data suffers from “unwanted variation”.

We have discussed a number of preprocessing algorithms intended to increase the signal-to-noise in such data.

But sometimes (unwanted) variation remains.

Batch effects

Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.

As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.

#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,

microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.

)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.

One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study

Nature Reviews | Genetics

Expr

essio

n

Sample

Batch 1 Batch 2

Expr

essio

n

14

16

a

c

b

d

12

10

8

6

4Ex

pres

sion

2

12

14

10

8

6

4

Sample Sample

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

malNor

mal

10

9

8

7

6

5

4

(KIWTG��̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO��YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG��D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC��4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU��E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�

2'452'%6 +8'5

��| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU

© 20 Macmillan Publishers Limited. All rights reserved10

8 normal samplescolor: processing date

Batch effects

Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.

As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.

#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,

microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.

)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.

One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study

Nature Reviews | Genetics

Expr

essio

n

Sample

Batch 1 Batch 2

Expr

essio

n

14

16

a

c

b

d

12

10

8

6

4

Expr

essio

n

2

12

14

10

8

6

4

Sample Sample

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

malNor

mal

10

9

8

7

6

5

4

(KIWTG��̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO��YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG��D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC��4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU��E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�

2'452'%6 +8'5

��| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU

© 20 Macmillan Publishers Limited. All rights reserved10

One geneClustering

Batch effectsBatch effects: unwanted variation remaining after normalization.

Often associated with processing date or processing batch.

Often thought to be technical, but can be biological (later).

Two extreme situations:(1) Batch effect is orthogonal to comparison of interest. This will increase noise and dilute signal, but otherwise ok.(2) Batch effect is confounded with comparison of interest. This will make any conclusion highly circumspect (and likely wrong).

Sources of Heterogeneity

The Effect of Heterogeneity Color = Environment

(Idaghdour et al. 2008)

Color = Processing Year

(Cheung et al. 2008)

Color = Allele

(Brem et al. 2005)

ASimpleSimulatedExample

IndependentE DependentE

Genes

Genes

Arrays Arrays

GenebyGeneModel

expression=b0+b1×group+noise

Testwhetherb1=0<=>T‐testforgeneI

CalculateaP‐value

NullP‐ValueDistribu$ons

IndependentE

DependentE

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

P‐value P‐value P‐value P‐value


NullP‐ValueDistribu$ons

|ρ| = 0.40 |ρ| = 0.31 |ρ| = 0.10 |ρ| = 0.00 Correla$on

IndependentE

DependentE

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency



FalseDiscoveryRateEs$mates


RankingEs$mates


PrincipalComponentsAnalysis/

SingularValueDecomposi$on

•  Amethodtoiden$fypaternsinthedatathat

explainalargepercentageofthevaria$on

•  PCAandSVDhavedifferentmathema$cal

goalsbutendupes$ma$ngthesamething

•  FirstproposedforgenomicsbyAlteretal.

(2000)PNAS

SingularValueDecomposi$on

samples

genes = × ×

U D VT

Dataeigenarrays/

leusingularvectors/

loadings

singular

values

eigengenes/

rightsingularvectors/

principalcomponents

Proper$esofSVDsamples

genes = × ×

U D VT

ColumnsofVTdescribepaternsacrossgenes

ColumnsofUdescribepaternsacrossarrays

isthepercentofvaria$onexplainedbytheithcolumnofV

€

di

2/ d

i

2

i=1

n

∑

ColumnsofVT/rowsofUareorthogonalandcalculatedoneata$me

1Patern1stSV

= × +

1stColumnofU 1stColumnofVT

2Paterns,1stSV

= × +

1stColumnofU 1stColumnofVT

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

SVAAdjustedGenebyGeneModel

expression=b0+b1×group+surrogates+noise

Testwhetherb1=0

CalculateaP‐value

Ranking Estimates

IndependentE DependentEDependentE+IRW‐SVA

RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise

AverageRankingbyT‐Sta$s$c



Methods(IRW-) SVA by Leek and Storey.

RUV by Gagnon-Bartsch and Speed

Many methods are small modifications of SVA.

The idea behind RUV is to use control (genes / probes) to estimate batch effects. Controls could be probes unaffected by biology (negative controls) or genes known to be differentially expressed.

An example of “biological” batch effect Changes in cell type composition.

Expression changes associated with Age or Cell cycle.

unwanted variation - biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to...

Documents