unwanted variation - biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to...

27
Unwanted variation We have seen that microarray expression data suers from “unwanted variation”. We have discussed a number of preprocessing algorithms intended to increase the signal-to-noise in such data. But sometimes (unwanted) variation remains.

Upload: others

Post on 08-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Unwanted variationWe have seen that microarray expression data suffers from “unwanted variation”.

We have discussed a number of preprocessing algorithms intended to increase the signal-to-noise in such data.

But sometimes (unwanted) variation remains.

Page 2: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Batch effects

Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.

As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.

#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,

microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.

)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.

One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study

Nature Reviews | Genetics

Expr

essio

n

Sample

Batch 1 Batch 2

Expr

essio

n

14

16

a

c

b

d

12

10

8

6

4Ex

pres

sion

2

12

14

10

8

6

4

Sample Sample

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

malNor

mal

10

9

8

7

6

5

4

(KIWTG���̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO���YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG�����D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC����4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU����E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�

2'452'%6 +8'5

����| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU

© 20 Macmillan Publishers Limited. All rights reserved10

8 normal samplescolor: processing date

Page 3: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Batch effects

Normalization is a data analysis technique that adjusts global properties of measure-ments for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments6. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. In some cases, these normalization procedures may even exacerbate technical artefacts in high-throughput measurements, as batch and other technical effects violate the assumptions of normalization methods. Although specific normalization methods have been developed for microarray studies that take into account study design7 or otherwise correct for the batch problem8, they are still not widely used.

As described here, in surveying a large body of published data involving high-throughput studies and a number of technol-ogy platforms, we have found evidence of batch effects. In many cases we have found that these can lead to erroneous biological conclusions, supporting the conclusions of previous publications4,5. Here we analyse the extent of the problem, review the critical downstream consequences of batch effects and describe experimental and computa-tional solutions to reduce their impact on high-throughput data.

#P�KNNWUVTCVKQP�QH�DCVEJ�GHHGEVUTo introduce the batch effect problem we used data from a previously published bladder cancer study9 (FIG. 1). In this study,

microarray expression profiling was used to examine the gene expression patterns in superficial transitional cell carcinoma (sTCC) with and without surrounding carci-noma in situ (CIS). Hierarchical cluster anal-ysis separated the sTCC samples according to the presence or absence of CIS. However, the presence or absence of CIS was strongly confounded with processing date, as reported in REF. 10. In high-throughput studies it is typical for global properties of the raw data distribution to vary strongly across arrays, as they do for this data set (FIG. 1a). After normalization, these global differences are greatly reduced (FIG. 1b), and the sTCC study properly normalized the data. However, it is typical to observe substantial batch effects on subsets of specific genes that are not addressed by normalization (FIG. 1c). In gene expression studies, the greatest source of dif-ferential expression is nearly always across batches rather than across biological groups, which can lead to confusing or incorrect biological conclusions owing to the influ-ence of technical artefacts. For example, the control samples in the sTCC study clustered perfectly by the processing date (FIG. 1d), and the processing date was confounded with the presence/absence status.

)TQWR�CPF�FCVG�CTG�QPN[�UWTTQICVGUThe processing of samples using protocols that differ among laboratories has been linked to batch effects. In such cases, the samples that have been processed using the same protocol are known as processing groups. For example, multiple laboratory comparisons of microarray experiments have shown strong laboratory-specific effects11. In addition, in nearly every gene expression study, large variations are associated with the processing date12, and in microarray studies focusing on copy number variation, large effects are associ-ated with DNA preparation groups13. The processing group and date are therefore commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates for other sources of variation, such as ozone levels, labora-tory temperatures and reagent quality12,14. Unfortunately, many possible sources of batch effects are not recorded, and data analysts are left with just processing group and date as surrogates.

One way to quantify the affect of non-biological variables is to examine the principal components of the data. Principal components are estimates of the most com-mon patterns that exist across features. For example, if most genes in a microarray study

Nature Reviews | Genetics

Expr

essio

n

Sample

Batch 1 Batch 2

Expr

essio

n

14

16

a

c

b

d

12

10

8

6

4

Expr

essio

n

2

12

14

10

8

6

4

Sample Sample

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

mal

Nor

malNor

mal

10

9

8

7

6

5

4

(KIWTG���̂ �&GOQPUVTCVKQP�QH�PQTOCNK\CVKQP�CPF�UWTXKXKPI�DCVEJ�GHHGEVU��(QT�C�RWDNKUJGF�DNCFFGT�ECPEGT�OKETQCTTC[�FCVC�UGV�QDVCKPGF�WUKPI�CP�#HH[OGVTKZ�RNCVHQTO���YG�QDVCKPGF�VJG�TCY�FCVC�HQT�QPN[�VJG�PQTOCN�UCORNGU��*GTG��ITGGP�CPF�QTCPIG�TGRTGUGPV�VYQ�FKHHGTGPV�RTQEGUUKPI�FCVGU��C�^�$QZ�RNQV�QH�TCY�IGPG�GZRTGUUKQP�FCVC�NQI�DCUG�����D�̂ �$QZ�RNQV�QH�FCVC�RTQEGUUGF�YKVJ�4/#��C�YKFGN[�WUGF�RTGRTQE�GUUKPI�CNIQTKVJO�HQT�#HH[OGVTKZ�FCVC����4/#�CRRNKGU�SWCPVKNG�PQTOCNK\CVKQP�t�C�VGEJPKSWG�VJCV�HQTEGU�VJG�FKUVTKDWVKQP�QH�VJG�TCY�UKIPCN�KPVGPUKVKGU�HTQO�VJG�OKETQCTTC[�FCVC�VQ�DG�VJG�UCOG�KP�CNN�UCORNGU����E�^�'ZCORNG�QH�VGP�IGPGU�VJCV�CTG�UWUEGRVKDNG�VQ�DCVEJ�GHHGEVU�GXGP�CHVGT�PQTOCNK\CVKQP��*WPFTGFU�QH�IGPGU�UJQY�UKOKNCT�DGJCXKQWT�DWV��HQT�ENCTKV[��CTG�PQV�UJQYP��F�̂ �%NWUVGTKPI�QH�UCORNGU�CHVGT�PQTOCNK\CVKQP��0QVG�VJCV�VJG�UCORNGU�RGTHGEVN[�ENWUVGT�D[�RTQEGUUKPI�FCVG�

2'452'%6 +8'5

����| OCTOBER 2010 | VOLUME 11 �YYY�PCVWTG�EQO�TGXKGYU�IGPGVKEU

© 20 Macmillan Publishers Limited. All rights reserved10

One geneClustering

Page 4: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Batch effectsBatch effects: unwanted variation remaining after normalization.

Often associated with processing date or processing batch.

Often thought to be technical, but can be biological (later).

Two extreme situations:(1) Batch effect is orthogonal to comparison of interest. This will increase noise and dilute signal, but otherwise ok.(2) Batch effect is confounded with comparison of interest. This will make any conclusion highly circumspect (and likely wrong).

Page 5: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Sources of Heterogeneity

Page 6: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

The Effect of Heterogeneity Color = Environment

(Idaghdour et al. 2008)

Color = Processing Year

(Cheung et al. 2008)

Color = Allele

(Brem et al. 2005)

Page 7: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

ASimpleSimulatedExample

IndependentE DependentE

Genes

Genes

Arrays Arrays

Page 8: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

GenebyGeneModel

expression=b0+b1×group+noise

Testwhetherb1=0<=>T‐testforgeneI

CalculateaP‐value

Page 9: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

NullP‐ValueDistribu$ons

IndependentE

DependentE

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

P‐value P‐value P‐value P‐value

P‐value P‐value P‐value P‐value

Page 10: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

NullP‐ValueDistribu$ons

|ρ| = 0.40 |ρ| = 0.31 |ρ| = 0.10 |ρ| = 0.00 Correla$on

IndependentE

DependentE

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

Frequency

P‐value P‐value P‐value P‐value

P‐value P‐value P‐value P‐value

Page 11: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

FalseDiscoveryRateEs$mates

IndependentE DependentE

Page 12: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

RankingEs$mates

IndependentE DependentE

Page 13: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

PrincipalComponentsAnalysis/

SingularValueDecomposi$on

•  Amethodtoiden$fypaternsinthedatathat

explainalargepercentageofthevaria$on

•  PCAandSVDhavedifferentmathema$cal

goalsbutendupes$ma$ngthesamething

•  FirstproposedforgenomicsbyAlteretal.

(2000)PNAS

Page 14: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

SingularValueDecomposi$on

samples

genes = × ×

U D VT

Dataeigenarrays/

leusingularvectors/

loadings

singular

values

eigengenes/

rightsingularvectors/

principalcomponents

Page 15: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Proper$esofSVDsamples

genes = × ×

U D VT

ColumnsofVTdescribepaternsacrossgenes

ColumnsofUdescribepaternsacrossarrays

isthepercentofvaria$onexplainedbytheithcolumnofV

di

2/ d

i

2

i=1

n

ColumnsofVT/rowsofUareorthogonalandcalculatedoneata$me

Page 16: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

1Patern1stSV

= × +

1stColumnofU 1stColumnofVT

Page 17: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

2Paterns,1stSV

= × +

1stColumnofU 1stColumnofVT

Page 18: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 19: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 20: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 21: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 22: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 23: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Surrogate Variable Analysis TheData

TrueBatchEs$mateofBatch

Pr(!Group&Batch)

Page 24: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

SVAAdjustedGenebyGeneModel

expression=b0+b1×group+surrogates+noise

Testwhetherb1=0

CalculateaP‐value

Page 25: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Ranking Estimates

IndependentE DependentEDependentE+IRW‐SVA

RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise RankingbyTrueSignaltoNoise

AverageRankingbyT‐Sta$s$c

AverageRankingbyT‐Sta$s$c

AverageRankingbyT‐Sta$s$c

Page 26: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

Methods(IRW-) SVA by Leek and Storey.

RUV by Gagnon-Bartsch and Speed

Many methods are small modifications of SVA.

The idea behind RUV is to use control (genes / probes) to estimate batch effects. Controls could be probes unaffected by biology (negative controls) or genes known to be differentially expressed.

Page 27: Unwanted variation - Biostatisticskhansen/teaching/2014/140.668/batch.pdf · commonly used to account for batch effects. However, in a typical experiment these are probably only surrogates

An example of “biological” batch effect Changes in cell type composition.

Expression changes associated with Age or Cell cycle.