research article the impact of normalization methods on...

11
Research Article The Impact of Normalization Methods on RNA-Seq Data Analysis J. Zyprych-Walczak, 1 A. Szabelska, 1 L. Handschuh, 2,3 K. Górczak, 1 K. Klamecka, 1 M. Figlerowicz, 2 and I. Siatkowski 1 1 Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, 60-637 Poznan, Poland 2 Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61-704 Poznan, Poland 3 Department of Hematology and Bone Marrow Transplantation, Poznan University of Medical Sciences, 60-569 Poznan, Poland Correspondence should be addressed to J. Zyprych-Walczak; [email protected] Received 20 March 2015; Revised 17 May 2015; Accepted 18 May 2015 Academic Editor: Ernesto Picardi Copyright © 2015 J. Zyprych-Walczak et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. Massive and complex data sets produced by the sequencers create a need for development of statistical and computational methods that can tackle the analysis and management of data. e data normalization is one of the most crucial steps of data processing and this process must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. e described workflow includes calculation of the bias and variance values for the control genes, sensitivity and specificity of the methods, and classification errors as well as generation of the diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably. 1. Introduction In recent years, RNA sequencing (RNA-seq) technology has become a mainstay of biomedical research and is an attrac- tive alternative to microarrays. RNA-seq technologies have several advantages over microarrays, including less noise (the technical biases inherent in microarray technology are not present in RNA-seq experiments) [1], the possibility of detecting alternative splicing isoforms [2, 3], and the power to detect novel genes, gene promoters, isoforms, and allele-specific expression [4]. e decreasing cost of next- generation sequencing (NGS) is an additional argument for the selection of RNA-seq instead of microarray-based gene expression analysis. Despite the lower bias in RNA- seq experiment, there are still some sources of systematic variation that should be eliminated from RNA-seq data before the differential expression (DE) analysis. In particular, these variations include between-sample differences such as library size (sequencing depth) [5] or within-sample differences, for example, in gene length [6], guanine-cytosine (GC) content [7, 8], or unwanted variation introduced by batch effect [9]. Experience with microarray data has repeatedly shown that normalization aims to ensure that expression estimates are more comparable across features (genes) and samples. However, there are still a lot of questions about the impact of the normalization method on the results of RNA-seq data analysis. e significance of RNA-seq data normalization was demonstrated in [10]. eir main finding was that the choice of normalization procedure affects the results of DE analysis: sensitivity varies more between normalization procedures than between test statistics. Authors of [10, 11] showed that the normalization step raises questions, and there is still a need Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 621690, 10 pages http://dx.doi.org/10.1155/2015/621690

Upload: others

Post on 02-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

Research ArticleThe Impact of Normalization Methods onRNA-Seq Data Analysis

J Zyprych-Walczak1 A Szabelska1 L Handschuh23 K Goacuterczak1 K Klamecka1

M Figlerowicz2 and I Siatkowski1

1Department of Mathematical and Statistical Methods Poznan University of Life Sciences 60-637 Poznan Poland2Institute of Bioorganic Chemistry Polish Academy of Sciences 61-704 Poznan Poland3Department of Hematology and Bone Marrow Transplantation Poznan University of Medical Sciences 60-569 Poznan Poland

Correspondence should be addressed to J Zyprych-Walczak joannazyprychgmailcom

Received 20 March 2015 Revised 17 May 2015 Accepted 18 May 2015

Academic Editor Ernesto Picardi

Copyright copy 2015 J Zyprych-Walczak et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

High-throughput sequencing technologies such as the Illumina Hi-seq are powerful new tools for investigating a wide range ofbiological and medical problems Massive and complex data sets produced by the sequencers create a need for development ofstatistical and computational methods that can tackle the analysis and management of data The data normalization is one of themost crucial steps of data processing and this process must be carefully considered as it has a profound effect on the results of theanalysis In this work we focus on a comprehensive comparison of five normalization methods related to sequencing depth widelyused for transcriptome sequencing (RNA-seq) data and their impact on the results of gene expression analysis Based on this studywe suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular dataset The described workflow includes calculation of the bias and variance values for the control genes sensitivity and specificityof the methods and classification errors as well as generation of the diagnostic plots Combining the above information facilitatesthe selection of the most appropriate normalization method for the studied data sets and determines which methods can be usedinterchangeably

1 Introduction

In recent years RNA sequencing (RNA-seq) technology hasbecome a mainstay of biomedical research and is an attrac-tive alternative to microarrays RNA-seq technologies haveseveral advantages over microarrays including less noise(the technical biases inherent in microarray technology arenot present in RNA-seq experiments) [1] the possibilityof detecting alternative splicing isoforms [2 3] and thepower to detect novel genes gene promoters isoforms andallele-specific expression [4] The decreasing cost of next-generation sequencing (NGS) is an additional argumentfor the selection of RNA-seq instead of microarray-basedgene expression analysis Despite the lower bias in RNA-seq experiment there are still some sources of systematicvariation that should be eliminated fromRNA-seq data before

the differential expression (DE) analysis In particular thesevariations include between-sample differences such as librarysize (sequencing depth) [5] or within-sample differences forexample in gene length [6] guanine-cytosine (GC) content[7 8] or unwanted variation introduced by batch effect[9] Experience with microarray data has repeatedly shownthat normalization aims to ensure that expression estimatesare more comparable across features (genes) and samplesHowever there are still a lot of questions about the impactof the normalization method on the results of RNA-seq dataanalysisThe significance of RNA-seq data normalization wasdemonstrated in [10] Their main finding was that the choiceof normalization procedure affects the results of DE analysissensitivity varies more between normalization proceduresthan between test statistics Authors of [10 11] showed that thenormalization step raises questions and there is still a need

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 621690 10 pageshttpdxdoiorg1011552015621690

2 BioMed Research International

to provide useful practical tips and construct clear guidelinesfor researchers who may be unsure about how to choosea normalization method To this end we propose to apply acombination of graphical and statistical methods to comparethe impact of the particular normalization on the resultsof DE analysis Our investigation concerns five normaliza-tion methods widely used for normalization of RNA-seqdata Trimmed Mean of119872-values Upper Quartile MedianQuantile and PoissonSeq normalization implemented inR packages edgeR (v324) DESeq (v1121) EBSeq (v131)and PoissonSeq 3 (v112) respectively The comparison wasbased on the analysis of three data sets Two of them arepublicly available data sets (Bodymap and Cheung data) andone is the data set (AML data) coming from one of ourprojects (not published yet) In this paper we outline a simpleand effective method for comparing different normalizationapproaches and we show how researchers can improve theresults of differential expression analysis by including in thenormalization step different aspects based on the biology orinformatics

2 Materials and Methods

In this section we describe the normalization methods tobe compared and the data sets used in our study Next wepropose the criteria for comparison of the impact of thenormalization methods on the results of DE analysis

21 Normalization Methods Since the emergence of RNA-seq technology a number of normalization methods havebeen developed In our work we mainly focus on a com-parison of five of the most popular normalization methodsused for DE analysis of RNA-seq data implemented in fourBioconductor packages TrimmedMean of119872-values (TMM)[11] and Upper Quartile (UQ) [10] both implemented inthe edgeR Bioconductor package [12] Median (DES) imple-mented in the DESeq Bioconductor package [13] Quantile(EBS) [10] implemented in the EBSeq Bioconductor package[14] and PoissonSeq (PS) normalization implemented in thePoissonSeq package [15] All packages are available fromCRAN (httpcranr-projectorgwebpackages) and Biocon-ductor (httpwwwbioconductororgpackagesrelease)

Because the basic source of variations between samples isthe difference in library size (RNA samplesmay be sequencedto different depths) each library is assigned a normalizationfactor There are several ways to calculate a normalizationfactorWe consider five differentmethods described as below

Let us assume that we have 119866 genes and 119898 samples Let119870119892119895 and119870119892119903 denote read counts for gene 119892 and samples 119895 and119903 respectively The 119889119895 is the scaling factor for the 119895th sample

The first presented method calculates a Trimmed Meanof119872-values between each pair of samples This method wasproposed by Robinson and Oshlack [11] and is based on thehypothesis that most genes are not differentially expressedThe authors defined a normalization factor for a studiedsample with a reference sample as follows

log2(119889

TMM119895

) =

sum119892isin1198661015840

119908119892119895119872119892119895

sum119892isin1198661015840

119908119892119895

(1)

where 119872119892119895 = log2((119870119892119895119873119895)(119870119892119903119873119903)) 119908119892119895 = (119873119895 minus

119870119892119895)119873119895119870119892119895 + (119873119903 minus 119870119892119903)119873119903119870119892119903 119870119892119895 119870119892119903 gt 0 119873119895 119873119903 arerespectively the total number of reads for sample 119895th andreference sample 119903 and 1198661015840 represents the set of genes withnot trimmed119872119892 and 119860119892 (absolute expression levels) values(according to [11] trimmedweightedmean is the average afterremoving the upper and lower percentage of the data 119872119892values are trimmed by 30 and 119860119892 values are trimmed by5) According to the assumption thatmost genes are notDE119889TMM119903

should be close to 1 and log2(119889TMM119903

) = 0The next method elaborated in [10] is Upper Quartile

(UQ) implemented in the edgeR package Here the scalefactor is calculated from the 75th percentile of the counts foreach library after removing transcripts which are zero in alllibraries It has the following form

119889UQ119895

= UQ(

119870119892119895

sum119866

119892=1119870119892119895) (2)

where UQ(119883) is the upper quartile of sample119883 which is 119895thsample with normalized counts and119870119892119895 gt 0

Anders and Huber [13] suggested the Median normal-ization method implemented in the DESeq Bioconductorpackage This method makes the same assumption as theTMM method (the majority of genes are not DE) A scalingfactor for a given 119895th sample takes the median of the ratiosof observed 119895th samplersquos counts to the geometric mean acrosssamples (ie a pseudoreference sample)

119889MED119895

= median119892119870119892119895

(prod119898

V=1119870119892V)1119898 (3)

It is also feasible to perform Quantile normalizationacross samples as is often done in the case of microarraydata [10] Here we used Quantile normalization that isimplemented in the EBSeq Bioconductor package [14] Thisnormalization method estimates the sequencing depth of anexperiment by an upper quartile of its counts and has thefollowing form

119889119876

119895= 10log10119876119895minus(1119898)sum

119898

V=1 log10119876V (4)

where 119876119895 119876V are the upper quartiles of the 119895th sample andVth sample respectively This method aligns the distributionof all samples

Finally we also tested a normalization method (PS)proposed in [15] implemented in the PoissonSeq package Ithas the following formula

119889PS119895=

sum119892isin11986610158401015840

119870119892119895

sum119892isin11986610158401015840

sum119898

119895=1119870119892119895 (5)

where11986610158401015840 is the set of genes found by goodness-of-fit statisticsof the form

GOF119892 =119898

sum

119895=1

(119870119892119895 minus 119889TC119895sdot sum119898

119895=1119870119892119895)2

119889TC119895sdot sum119898

119895=1119870119892119895 (6)

BioMed Research International 3

where

119889TC119895

=

sum119866

119892=1119870119892119895

sum119866

119892=1sum119898

119895=1119870119892119895(7)

is the Total Count (TC) scaling factor We chose genes tothe set 11986610158401015840 for which the values of GOF119892 are in interval(025 075)

Furthermore we compared the performance of all nor-malization methods mentioned above against unnormalizeddata denoted by ldquoraw datardquo (RD)

In addition another source of variation related to GC-content and batch effect was also considered Obtainedresults revealed that inclusion of GC content analysis did notcontribute to the proposed strategy of comparison and didnot influence the summary ranking (see Figure S5 and TableS4 in Supplementary Material available online at httpdxdoiorg1011552015621690) Therefore this normalizationstrategy was not included in further analysis

Theundesirable variation coming frombatch effects suchas sampling time different technology can be adjusted withthe help of methods implemented in sva package [9] Weonly had information about the batches in the AML datathat could be introduced by two sampling dates That is whywe only considered such approach for AML data Beforeincluding additional normalization we checked whether thepresence of batch effects in AML data sets existed Detectingthe presence of batch effects in AML data was accomplishedin two ways We have provided a hierarchical clustering (seeFigure S8) together with principal component analysis (seeFigure S9) Our results did not confirm existence of batcheffect connected with sampling date in the case of AMLdata Therefore this normalization strategy was neglected infurther analysis

22 Data Sources The five normalization methods werecompared based on the analysis of three real RNA-seq datasets Two data sets included in this study were obtained frompublicly available resources (Bodymap and Cheung data)and one was obtained from the project performed at theInstitute of Bioorganic Chemistry in Poznan (AML data)The Bodymap data set was published in [16] where the tran-scriptome of nontransformed human mammary epithelialcells in reference to Illumina Bodymap data collected fromnormal tissues was studied The Cheung data set [17] comesfrom the study of gene expression of human B cells fromindividuals belonging to large families The aim of the studywas the identification of polymorphic transregulators Bothof these data sets are deposited in the Recount database athttpbowtie-biosourceforgenetrecount [18]

The AML data set comes from our as yet unpublishedRNA-seq experiment We studied gene expression profiles in30 peripheral blood (PB) and bone marrow (BM) samplesobtained from 25 adult AML (acute myeloid leukaemia)patients cured at the Department of Haematology and BoneMarrow Transplantation Poznan University of Medical Sci-ences Two samples per lane were sequenced at the Instituteof Bioorganic Chemistry in Poznan using single-read flow-cell (SR-FC) and Genome Analyzer IIx (GAIIx Illumina)

As a control one BM sample and a pool of 12 PB samplesobtained from healthy volunteers were sequenced on oneadditional lane The libraries were prepared from up to4 120583g total RNA with the use of the TruSeq RNA SamplePreparation Kit v2 (Illumina) according to the instructionsof the manufacturer and validated with a DNA 1000 Chip(Agilent) Each library generated approximately 20 million50-nt-long reads processed in CASAVA FastQC and theNGS QC Toolkit The reads were mapped to the referencehuman genome UCSC hg19 with TopHat run (v206)

Before all calculations we filtered the data sets to obtaina similar amount of genes in each data set From the Cheungand Bodymap data sets we chose only those genes for whichthe mean of the counts for all samples was greater than0 whereas in the AML data set we chose 50 as the cut-off value for the mean of the counts in all samples Thesummarised information about each data set is presentedin Table 1 while the details can be found in SupplementaryTable S1 and Figure S1 As we can see the data sets variedwith the sample sizes and gene numbers as well as thegene expression levels In Cheung data set genes with lowlevels of expression predominate In Bodymap data set thebiggest group is constituted by the genes with high levels ofexpression while in the AML data set the biggest group isconstituted by the genes with medium levels of expressionSuch variety of data sets enables revealing differences ofperformance of normalization methods in case of data withdifferent structure of genes

23 Analytical Criteria for Comparison ofNormalization Methods

231 Selection of the Housekeeping Genes (HG) The ideaof using HG arose from the previous research concerningmicroarray experiment In that experiment dye-swappedmicroarrays were used Chosen dyes played the role ofhousekeeping genes In considered RNA-seq experiments wedo not have the lists of housekeeping genes straightforwardlyThat is why we decided to perform our investigation of theimpact of normalizationmethods based on analytical versionof HG lists genes similarly expressed across samples

The square root of the mean square error was used as ameasure The housekeeping genes were selected based on theraw data by using the following formula

MSE119892 =radicsum119898

119895=1 (119870119892119895 minus 119870119892sdot)2

119898

(8)

where 119870119892sdot is the mean of the counts for gene 119892 We decidedto choose housekeeping genes for all data sets based on aparticular cut-off value and applied a linear transformation ofthe results to the interval [0 1]usingmin-max normalization

MSE119892 =MSE119892 minusmin (MSE119892)

max (MSE119892) minusmin (MSE119892) (9)

Then we selected 1 of all genes with the lowest MSEvalues as the housekeeping genes Although the number of

4 BioMed Research International

Table 1 Summary of data set information

Data set Number of samples Number of genesNumber ofgenes (afterfiltering)

Min of averageabundance of genefrom all genes

Max of averageabundance of genefrom all genes

Number of HK(housekeeping genes)

Cheung 41 52 580 12 410 0024 90180 124Bodymap 16 52 580 13 131 0 934100 131AML 27 57 736 12 749 5004 482400 127

housekeeping genes was the same in each data set differentgenes were included for each set The tables and bar plotsconcerning housekeeping gene selection are provided inSupplementary Materials (Table S2 and Figure S2) Theseindicate the same facts as for the case with all genes taken intoaccount There is some variability in data sets with respect tonumber of HG genes for each abundance level of reads Inall data sets we can see that there is the highest number ofHG genes with number of reads in genes below 500 (relativemedium abundance level)

232 Bias and Variance To evaluate the normalizationmethods used for the processing of RNA-seq data we appliedthe bias and variance criterion proposed by Argyropoulos etal [19] for the analysis of double-channelmicroarray dataWeadjusted earlier this method for one-channel microarray data[20] In this paper we transform themethod to be suitable forthe RNA-seq data We can use the following formulae of biasand variance as proxies accuracy and precision respectively

bias119894 = radic1119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

) minus True Log Ratio)2

(10)

variance119894

=1

119898 minus 1

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

)minus log2 (119870119894119895

119870119894sdot

)

119894

)

2

(11)

where 119870119894119895 denotes read counts for 119894th control gene fromthe 119895th sample 119870119894sdot is the mean counts for the 119894th controlgene and log2(119870119894119895119870119894sdot) is the mean value of log2(119870119894119895119870119894sdot) forthe 119894th control gene and True Log Ratio is the true valueof log2(119870119894119895119870119894sdot) for the 119894th control gene from the definitionof HG The accuracy of a measurement is the degree ofcloseness of measurements of a quantity to the true valueThe precision of a measurement is the degree to whichrepeated measurements under unchanged conditions showthe same results Based on the definition each gene whichis considered as HG should have the same number of countsin each sample as well as mean value of counts for this geneThus the True Log Ratio is equal to 0 and the formula (10) forthe bias reduces to the root mean square error (RMSE)

bias119894 = radic1

119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

))

2

(12)

The ratios (bias and variance) for each normalizationmethodwill be the mean of bias and variance values computed forall of the control (housekeeping) genes The normalizationmethod ldquoArdquo would be preferred over the method ldquoBrdquo if it isassociated with the smallest bias and variance [19]

233 Differential Expression Analysis Our goal was to findout how normalizationmethods affect differential expressionresults Therefore after application of each normalizationmethod differential expression analysis was performed usingthe edgeR method from the edgeR Bioconductor package[12] This method was chosen because it showed reliableperformance in wide range of experiments and it allowseasy inclusion of scaling factors into the statistical test TheedgeR method was specifically developed to model countdata dispersion and it is designed for overdispersed RNA-seqdata Briefly it is assumed that counts are a negative binomialdistribution according to

119870119892119895 sim 119873119861 (119889119895120582119892119895 120601119892) (13)

where 120582119892119895 is the expression level of gene 119892 in sample 119895 119889119895 isthe normalization factor in sample 119895 and 120601119892 is the dispersionof gene 119892 The method of differential expression analysisimplemented in the edgeR package extends Fisherrsquos exacttest

234 Prediction Errors Discriminant analysis was appliedto determine the effectiveness of the sample classificationbased on found DEGs identified in each data set after eachnormalization procedure Different classification methodsmay lead to different prediction errors We estimated theclassification errors based on the five classifiers naive Bayesneural network 119896-nearest neighbour and support vectormachines and random forest that are presented in Table 2Analyzing each data set we used leave-one-out cross valida-tion (LOOCV) to obtain the estimate of the predictor errorsof the test set that results from using different classifiers For adata set with 119899 samples this method involves 119899 separate runsFor each run a number of samples minus one data pointare used to train the model and then prediction is performedon the remaining data point The overall prediction erroris the sum of the errors for all 119899 runs [21] As input forthe classifiers in the discriminant analysis we chose theseinformative genes having high power of discrimination thatare differentially expressed The set of genes was obtainedby gene selection process from a test statistic after eachnormalization procedure

BioMed Research International 5

Table 2 Classifiers used in the calculations function in R and the name and the version of R package

Classification method Function in R R package VersionNaive Bayes NaiveBayesI MLInterfaces 1400Neural network nnetI MLInterfaces 1400119896ndashnearest neighbour knnI MLInterfaces 1400Support vector machines svmI MLInterfaces 1400Random forest randomForestI MLInterfaces 1400

Cheung

20

15

10

5

0

Num

ber o

f DEG

s

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

Bodymap

100

50

0

AML

3000

2000

1000

0

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Figure 1 Bar plots of the DEGs with specified levels of count abundance in all studied data sets On the 119909-axis the methods of normalizationare featured whereas the 119910-axis represents the number of DEGs determined after each normalization procedure The bar colours representthe groups of genes of particular level of expression

All computations and diagnostic plots were performedwith R 302 [22]

3 Results

The aim of our study was to compare various normalizationapproaches and outline the workflow that will help to selectthe normalization method appropriate for a particular dataset We decided to test five of the most commonly usedmethods of RNA-seq data normalization Trimmed Meanof 119872-values (TMM) Upper Quartile (UQ) Median (DES)Quantile (EBS) and PoissonSeq (PS) described in detailin the Methods section All methods were tested on threedifferent data sets Bodymap Cheung and AML data Thedetails concerning data sets are described in Section 2

31 Differential Expression (DE) Analysis The main purposeof many RNA-seq experiments is to identify genes that aredifferentially expressed between the compared conditions

Analyzing three mentioned above data sets we checked thedirect impact of the normalization methods described aboveon the results of DE analysis After each normalizationwe determined the lists of differentially expressed genes(DEGs) using the statistic test from the edgeR package Inthe case of each data set we compared gene expression levelsbetween two types of biological samples and ranked the genesaccording to adjusted 119875 values Genes that had adjusted 119875

values lt 005 were selected as differentially expressedThe results of DE analysis can be compared based on the

number and content of DEGs Visualisation of the DEGs ispresented in Figure 1 The bar plots show the contributionof genes with particular expression levels in DEG lists forall data sets selected from data submitted to five testedmethods of normalization and raw data (RD) As we cansee the influence of normalization methods differs betweendata sets In the case of the Cheung data set all methodsresult in a similar number of DEGs and their significantpart constitutes genes with low levels of expression A more

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

2 BioMed Research International

to provide useful practical tips and construct clear guidelinesfor researchers who may be unsure about how to choosea normalization method To this end we propose to apply acombination of graphical and statistical methods to comparethe impact of the particular normalization on the resultsof DE analysis Our investigation concerns five normaliza-tion methods widely used for normalization of RNA-seqdata Trimmed Mean of119872-values Upper Quartile MedianQuantile and PoissonSeq normalization implemented inR packages edgeR (v324) DESeq (v1121) EBSeq (v131)and PoissonSeq 3 (v112) respectively The comparison wasbased on the analysis of three data sets Two of them arepublicly available data sets (Bodymap and Cheung data) andone is the data set (AML data) coming from one of ourprojects (not published yet) In this paper we outline a simpleand effective method for comparing different normalizationapproaches and we show how researchers can improve theresults of differential expression analysis by including in thenormalization step different aspects based on the biology orinformatics

2 Materials and Methods

In this section we describe the normalization methods tobe compared and the data sets used in our study Next wepropose the criteria for comparison of the impact of thenormalization methods on the results of DE analysis

21 Normalization Methods Since the emergence of RNA-seq technology a number of normalization methods havebeen developed In our work we mainly focus on a com-parison of five of the most popular normalization methodsused for DE analysis of RNA-seq data implemented in fourBioconductor packages TrimmedMean of119872-values (TMM)[11] and Upper Quartile (UQ) [10] both implemented inthe edgeR Bioconductor package [12] Median (DES) imple-mented in the DESeq Bioconductor package [13] Quantile(EBS) [10] implemented in the EBSeq Bioconductor package[14] and PoissonSeq (PS) normalization implemented in thePoissonSeq package [15] All packages are available fromCRAN (httpcranr-projectorgwebpackages) and Biocon-ductor (httpwwwbioconductororgpackagesrelease)

Because the basic source of variations between samples isthe difference in library size (RNA samplesmay be sequencedto different depths) each library is assigned a normalizationfactor There are several ways to calculate a normalizationfactorWe consider five differentmethods described as below

Let us assume that we have 119866 genes and 119898 samples Let119870119892119895 and119870119892119903 denote read counts for gene 119892 and samples 119895 and119903 respectively The 119889119895 is the scaling factor for the 119895th sample

The first presented method calculates a Trimmed Meanof119872-values between each pair of samples This method wasproposed by Robinson and Oshlack [11] and is based on thehypothesis that most genes are not differentially expressedThe authors defined a normalization factor for a studiedsample with a reference sample as follows

log2(119889

TMM119895

) =

sum119892isin1198661015840

119908119892119895119872119892119895

sum119892isin1198661015840

119908119892119895

(1)

where 119872119892119895 = log2((119870119892119895119873119895)(119870119892119903119873119903)) 119908119892119895 = (119873119895 minus

119870119892119895)119873119895119870119892119895 + (119873119903 minus 119870119892119903)119873119903119870119892119903 119870119892119895 119870119892119903 gt 0 119873119895 119873119903 arerespectively the total number of reads for sample 119895th andreference sample 119903 and 1198661015840 represents the set of genes withnot trimmed119872119892 and 119860119892 (absolute expression levels) values(according to [11] trimmedweightedmean is the average afterremoving the upper and lower percentage of the data 119872119892values are trimmed by 30 and 119860119892 values are trimmed by5) According to the assumption thatmost genes are notDE119889TMM119903

should be close to 1 and log2(119889TMM119903

) = 0The next method elaborated in [10] is Upper Quartile

(UQ) implemented in the edgeR package Here the scalefactor is calculated from the 75th percentile of the counts foreach library after removing transcripts which are zero in alllibraries It has the following form

119889UQ119895

= UQ(

119870119892119895

sum119866

119892=1119870119892119895) (2)

where UQ(119883) is the upper quartile of sample119883 which is 119895thsample with normalized counts and119870119892119895 gt 0

Anders and Huber [13] suggested the Median normal-ization method implemented in the DESeq Bioconductorpackage This method makes the same assumption as theTMM method (the majority of genes are not DE) A scalingfactor for a given 119895th sample takes the median of the ratiosof observed 119895th samplersquos counts to the geometric mean acrosssamples (ie a pseudoreference sample)

119889MED119895

= median119892119870119892119895

(prod119898

V=1119870119892V)1119898 (3)

It is also feasible to perform Quantile normalizationacross samples as is often done in the case of microarraydata [10] Here we used Quantile normalization that isimplemented in the EBSeq Bioconductor package [14] Thisnormalization method estimates the sequencing depth of anexperiment by an upper quartile of its counts and has thefollowing form

119889119876

119895= 10log10119876119895minus(1119898)sum

119898

V=1 log10119876V (4)

where 119876119895 119876V are the upper quartiles of the 119895th sample andVth sample respectively This method aligns the distributionof all samples

Finally we also tested a normalization method (PS)proposed in [15] implemented in the PoissonSeq package Ithas the following formula

119889PS119895=

sum119892isin11986610158401015840

119870119892119895

sum119892isin11986610158401015840

sum119898

119895=1119870119892119895 (5)

where11986610158401015840 is the set of genes found by goodness-of-fit statisticsof the form

GOF119892 =119898

sum

119895=1

(119870119892119895 minus 119889TC119895sdot sum119898

119895=1119870119892119895)2

119889TC119895sdot sum119898

119895=1119870119892119895 (6)

BioMed Research International 3

where

119889TC119895

=

sum119866

119892=1119870119892119895

sum119866

119892=1sum119898

119895=1119870119892119895(7)

is the Total Count (TC) scaling factor We chose genes tothe set 11986610158401015840 for which the values of GOF119892 are in interval(025 075)

Furthermore we compared the performance of all nor-malization methods mentioned above against unnormalizeddata denoted by ldquoraw datardquo (RD)

In addition another source of variation related to GC-content and batch effect was also considered Obtainedresults revealed that inclusion of GC content analysis did notcontribute to the proposed strategy of comparison and didnot influence the summary ranking (see Figure S5 and TableS4 in Supplementary Material available online at httpdxdoiorg1011552015621690) Therefore this normalizationstrategy was not included in further analysis

Theundesirable variation coming frombatch effects suchas sampling time different technology can be adjusted withthe help of methods implemented in sva package [9] Weonly had information about the batches in the AML datathat could be introduced by two sampling dates That is whywe only considered such approach for AML data Beforeincluding additional normalization we checked whether thepresence of batch effects in AML data sets existed Detectingthe presence of batch effects in AML data was accomplishedin two ways We have provided a hierarchical clustering (seeFigure S8) together with principal component analysis (seeFigure S9) Our results did not confirm existence of batcheffect connected with sampling date in the case of AMLdata Therefore this normalization strategy was neglected infurther analysis

22 Data Sources The five normalization methods werecompared based on the analysis of three real RNA-seq datasets Two data sets included in this study were obtained frompublicly available resources (Bodymap and Cheung data)and one was obtained from the project performed at theInstitute of Bioorganic Chemistry in Poznan (AML data)The Bodymap data set was published in [16] where the tran-scriptome of nontransformed human mammary epithelialcells in reference to Illumina Bodymap data collected fromnormal tissues was studied The Cheung data set [17] comesfrom the study of gene expression of human B cells fromindividuals belonging to large families The aim of the studywas the identification of polymorphic transregulators Bothof these data sets are deposited in the Recount database athttpbowtie-biosourceforgenetrecount [18]

The AML data set comes from our as yet unpublishedRNA-seq experiment We studied gene expression profiles in30 peripheral blood (PB) and bone marrow (BM) samplesobtained from 25 adult AML (acute myeloid leukaemia)patients cured at the Department of Haematology and BoneMarrow Transplantation Poznan University of Medical Sci-ences Two samples per lane were sequenced at the Instituteof Bioorganic Chemistry in Poznan using single-read flow-cell (SR-FC) and Genome Analyzer IIx (GAIIx Illumina)

As a control one BM sample and a pool of 12 PB samplesobtained from healthy volunteers were sequenced on oneadditional lane The libraries were prepared from up to4 120583g total RNA with the use of the TruSeq RNA SamplePreparation Kit v2 (Illumina) according to the instructionsof the manufacturer and validated with a DNA 1000 Chip(Agilent) Each library generated approximately 20 million50-nt-long reads processed in CASAVA FastQC and theNGS QC Toolkit The reads were mapped to the referencehuman genome UCSC hg19 with TopHat run (v206)

Before all calculations we filtered the data sets to obtaina similar amount of genes in each data set From the Cheungand Bodymap data sets we chose only those genes for whichthe mean of the counts for all samples was greater than0 whereas in the AML data set we chose 50 as the cut-off value for the mean of the counts in all samples Thesummarised information about each data set is presentedin Table 1 while the details can be found in SupplementaryTable S1 and Figure S1 As we can see the data sets variedwith the sample sizes and gene numbers as well as thegene expression levels In Cheung data set genes with lowlevels of expression predominate In Bodymap data set thebiggest group is constituted by the genes with high levels ofexpression while in the AML data set the biggest group isconstituted by the genes with medium levels of expressionSuch variety of data sets enables revealing differences ofperformance of normalization methods in case of data withdifferent structure of genes

23 Analytical Criteria for Comparison ofNormalization Methods

231 Selection of the Housekeeping Genes (HG) The ideaof using HG arose from the previous research concerningmicroarray experiment In that experiment dye-swappedmicroarrays were used Chosen dyes played the role ofhousekeeping genes In considered RNA-seq experiments wedo not have the lists of housekeeping genes straightforwardlyThat is why we decided to perform our investigation of theimpact of normalizationmethods based on analytical versionof HG lists genes similarly expressed across samples

The square root of the mean square error was used as ameasure The housekeeping genes were selected based on theraw data by using the following formula

MSE119892 =radicsum119898

119895=1 (119870119892119895 minus 119870119892sdot)2

119898

(8)

where 119870119892sdot is the mean of the counts for gene 119892 We decidedto choose housekeeping genes for all data sets based on aparticular cut-off value and applied a linear transformation ofthe results to the interval [0 1]usingmin-max normalization

MSE119892 =MSE119892 minusmin (MSE119892)

max (MSE119892) minusmin (MSE119892) (9)

Then we selected 1 of all genes with the lowest MSEvalues as the housekeeping genes Although the number of

4 BioMed Research International

Table 1 Summary of data set information

Data set Number of samples Number of genesNumber ofgenes (afterfiltering)

Min of averageabundance of genefrom all genes

Max of averageabundance of genefrom all genes

Number of HK(housekeeping genes)

Cheung 41 52 580 12 410 0024 90180 124Bodymap 16 52 580 13 131 0 934100 131AML 27 57 736 12 749 5004 482400 127

housekeeping genes was the same in each data set differentgenes were included for each set The tables and bar plotsconcerning housekeeping gene selection are provided inSupplementary Materials (Table S2 and Figure S2) Theseindicate the same facts as for the case with all genes taken intoaccount There is some variability in data sets with respect tonumber of HG genes for each abundance level of reads Inall data sets we can see that there is the highest number ofHG genes with number of reads in genes below 500 (relativemedium abundance level)

232 Bias and Variance To evaluate the normalizationmethods used for the processing of RNA-seq data we appliedthe bias and variance criterion proposed by Argyropoulos etal [19] for the analysis of double-channelmicroarray dataWeadjusted earlier this method for one-channel microarray data[20] In this paper we transform themethod to be suitable forthe RNA-seq data We can use the following formulae of biasand variance as proxies accuracy and precision respectively

bias119894 = radic1119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

) minus True Log Ratio)2

(10)

variance119894

=1

119898 minus 1

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

)minus log2 (119870119894119895

119870119894sdot

)

119894

)

2

(11)

where 119870119894119895 denotes read counts for 119894th control gene fromthe 119895th sample 119870119894sdot is the mean counts for the 119894th controlgene and log2(119870119894119895119870119894sdot) is the mean value of log2(119870119894119895119870119894sdot) forthe 119894th control gene and True Log Ratio is the true valueof log2(119870119894119895119870119894sdot) for the 119894th control gene from the definitionof HG The accuracy of a measurement is the degree ofcloseness of measurements of a quantity to the true valueThe precision of a measurement is the degree to whichrepeated measurements under unchanged conditions showthe same results Based on the definition each gene whichis considered as HG should have the same number of countsin each sample as well as mean value of counts for this geneThus the True Log Ratio is equal to 0 and the formula (10) forthe bias reduces to the root mean square error (RMSE)

bias119894 = radic1

119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

))

2

(12)

The ratios (bias and variance) for each normalizationmethodwill be the mean of bias and variance values computed forall of the control (housekeeping) genes The normalizationmethod ldquoArdquo would be preferred over the method ldquoBrdquo if it isassociated with the smallest bias and variance [19]

233 Differential Expression Analysis Our goal was to findout how normalizationmethods affect differential expressionresults Therefore after application of each normalizationmethod differential expression analysis was performed usingthe edgeR method from the edgeR Bioconductor package[12] This method was chosen because it showed reliableperformance in wide range of experiments and it allowseasy inclusion of scaling factors into the statistical test TheedgeR method was specifically developed to model countdata dispersion and it is designed for overdispersed RNA-seqdata Briefly it is assumed that counts are a negative binomialdistribution according to

119870119892119895 sim 119873119861 (119889119895120582119892119895 120601119892) (13)

where 120582119892119895 is the expression level of gene 119892 in sample 119895 119889119895 isthe normalization factor in sample 119895 and 120601119892 is the dispersionof gene 119892 The method of differential expression analysisimplemented in the edgeR package extends Fisherrsquos exacttest

234 Prediction Errors Discriminant analysis was appliedto determine the effectiveness of the sample classificationbased on found DEGs identified in each data set after eachnormalization procedure Different classification methodsmay lead to different prediction errors We estimated theclassification errors based on the five classifiers naive Bayesneural network 119896-nearest neighbour and support vectormachines and random forest that are presented in Table 2Analyzing each data set we used leave-one-out cross valida-tion (LOOCV) to obtain the estimate of the predictor errorsof the test set that results from using different classifiers For adata set with 119899 samples this method involves 119899 separate runsFor each run a number of samples minus one data pointare used to train the model and then prediction is performedon the remaining data point The overall prediction erroris the sum of the errors for all 119899 runs [21] As input forthe classifiers in the discriminant analysis we chose theseinformative genes having high power of discrimination thatare differentially expressed The set of genes was obtainedby gene selection process from a test statistic after eachnormalization procedure

BioMed Research International 5

Table 2 Classifiers used in the calculations function in R and the name and the version of R package

Classification method Function in R R package VersionNaive Bayes NaiveBayesI MLInterfaces 1400Neural network nnetI MLInterfaces 1400119896ndashnearest neighbour knnI MLInterfaces 1400Support vector machines svmI MLInterfaces 1400Random forest randomForestI MLInterfaces 1400

Cheung

20

15

10

5

0

Num

ber o

f DEG

s

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

Bodymap

100

50

0

AML

3000

2000

1000

0

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Figure 1 Bar plots of the DEGs with specified levels of count abundance in all studied data sets On the 119909-axis the methods of normalizationare featured whereas the 119910-axis represents the number of DEGs determined after each normalization procedure The bar colours representthe groups of genes of particular level of expression

All computations and diagnostic plots were performedwith R 302 [22]

3 Results

The aim of our study was to compare various normalizationapproaches and outline the workflow that will help to selectthe normalization method appropriate for a particular dataset We decided to test five of the most commonly usedmethods of RNA-seq data normalization Trimmed Meanof 119872-values (TMM) Upper Quartile (UQ) Median (DES)Quantile (EBS) and PoissonSeq (PS) described in detailin the Methods section All methods were tested on threedifferent data sets Bodymap Cheung and AML data Thedetails concerning data sets are described in Section 2

31 Differential Expression (DE) Analysis The main purposeof many RNA-seq experiments is to identify genes that aredifferentially expressed between the compared conditions

Analyzing three mentioned above data sets we checked thedirect impact of the normalization methods described aboveon the results of DE analysis After each normalizationwe determined the lists of differentially expressed genes(DEGs) using the statistic test from the edgeR package Inthe case of each data set we compared gene expression levelsbetween two types of biological samples and ranked the genesaccording to adjusted 119875 values Genes that had adjusted 119875

values lt 005 were selected as differentially expressedThe results of DE analysis can be compared based on the

number and content of DEGs Visualisation of the DEGs ispresented in Figure 1 The bar plots show the contributionof genes with particular expression levels in DEG lists forall data sets selected from data submitted to five testedmethods of normalization and raw data (RD) As we cansee the influence of normalization methods differs betweendata sets In the case of the Cheung data set all methodsresult in a similar number of DEGs and their significantpart constitutes genes with low levels of expression A more

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

BioMed Research International 3

where

119889TC119895

=

sum119866

119892=1119870119892119895

sum119866

119892=1sum119898

119895=1119870119892119895(7)

is the Total Count (TC) scaling factor We chose genes tothe set 11986610158401015840 for which the values of GOF119892 are in interval(025 075)

Furthermore we compared the performance of all nor-malization methods mentioned above against unnormalizeddata denoted by ldquoraw datardquo (RD)

In addition another source of variation related to GC-content and batch effect was also considered Obtainedresults revealed that inclusion of GC content analysis did notcontribute to the proposed strategy of comparison and didnot influence the summary ranking (see Figure S5 and TableS4 in Supplementary Material available online at httpdxdoiorg1011552015621690) Therefore this normalizationstrategy was not included in further analysis

Theundesirable variation coming frombatch effects suchas sampling time different technology can be adjusted withthe help of methods implemented in sva package [9] Weonly had information about the batches in the AML datathat could be introduced by two sampling dates That is whywe only considered such approach for AML data Beforeincluding additional normalization we checked whether thepresence of batch effects in AML data sets existed Detectingthe presence of batch effects in AML data was accomplishedin two ways We have provided a hierarchical clustering (seeFigure S8) together with principal component analysis (seeFigure S9) Our results did not confirm existence of batcheffect connected with sampling date in the case of AMLdata Therefore this normalization strategy was neglected infurther analysis

22 Data Sources The five normalization methods werecompared based on the analysis of three real RNA-seq datasets Two data sets included in this study were obtained frompublicly available resources (Bodymap and Cheung data)and one was obtained from the project performed at theInstitute of Bioorganic Chemistry in Poznan (AML data)The Bodymap data set was published in [16] where the tran-scriptome of nontransformed human mammary epithelialcells in reference to Illumina Bodymap data collected fromnormal tissues was studied The Cheung data set [17] comesfrom the study of gene expression of human B cells fromindividuals belonging to large families The aim of the studywas the identification of polymorphic transregulators Bothof these data sets are deposited in the Recount database athttpbowtie-biosourceforgenetrecount [18]

The AML data set comes from our as yet unpublishedRNA-seq experiment We studied gene expression profiles in30 peripheral blood (PB) and bone marrow (BM) samplesobtained from 25 adult AML (acute myeloid leukaemia)patients cured at the Department of Haematology and BoneMarrow Transplantation Poznan University of Medical Sci-ences Two samples per lane were sequenced at the Instituteof Bioorganic Chemistry in Poznan using single-read flow-cell (SR-FC) and Genome Analyzer IIx (GAIIx Illumina)

As a control one BM sample and a pool of 12 PB samplesobtained from healthy volunteers were sequenced on oneadditional lane The libraries were prepared from up to4 120583g total RNA with the use of the TruSeq RNA SamplePreparation Kit v2 (Illumina) according to the instructionsof the manufacturer and validated with a DNA 1000 Chip(Agilent) Each library generated approximately 20 million50-nt-long reads processed in CASAVA FastQC and theNGS QC Toolkit The reads were mapped to the referencehuman genome UCSC hg19 with TopHat run (v206)

Before all calculations we filtered the data sets to obtaina similar amount of genes in each data set From the Cheungand Bodymap data sets we chose only those genes for whichthe mean of the counts for all samples was greater than0 whereas in the AML data set we chose 50 as the cut-off value for the mean of the counts in all samples Thesummarised information about each data set is presentedin Table 1 while the details can be found in SupplementaryTable S1 and Figure S1 As we can see the data sets variedwith the sample sizes and gene numbers as well as thegene expression levels In Cheung data set genes with lowlevels of expression predominate In Bodymap data set thebiggest group is constituted by the genes with high levels ofexpression while in the AML data set the biggest group isconstituted by the genes with medium levels of expressionSuch variety of data sets enables revealing differences ofperformance of normalization methods in case of data withdifferent structure of genes

23 Analytical Criteria for Comparison ofNormalization Methods

231 Selection of the Housekeeping Genes (HG) The ideaof using HG arose from the previous research concerningmicroarray experiment In that experiment dye-swappedmicroarrays were used Chosen dyes played the role ofhousekeeping genes In considered RNA-seq experiments wedo not have the lists of housekeeping genes straightforwardlyThat is why we decided to perform our investigation of theimpact of normalizationmethods based on analytical versionof HG lists genes similarly expressed across samples

The square root of the mean square error was used as ameasure The housekeeping genes were selected based on theraw data by using the following formula

MSE119892 =radicsum119898

119895=1 (119870119892119895 minus 119870119892sdot)2

119898

(8)

where 119870119892sdot is the mean of the counts for gene 119892 We decidedto choose housekeeping genes for all data sets based on aparticular cut-off value and applied a linear transformation ofthe results to the interval [0 1]usingmin-max normalization

MSE119892 =MSE119892 minusmin (MSE119892)

max (MSE119892) minusmin (MSE119892) (9)

Then we selected 1 of all genes with the lowest MSEvalues as the housekeeping genes Although the number of

4 BioMed Research International

Table 1 Summary of data set information

Data set Number of samples Number of genesNumber ofgenes (afterfiltering)

Min of averageabundance of genefrom all genes

Max of averageabundance of genefrom all genes

Number of HK(housekeeping genes)

Cheung 41 52 580 12 410 0024 90180 124Bodymap 16 52 580 13 131 0 934100 131AML 27 57 736 12 749 5004 482400 127

housekeeping genes was the same in each data set differentgenes were included for each set The tables and bar plotsconcerning housekeeping gene selection are provided inSupplementary Materials (Table S2 and Figure S2) Theseindicate the same facts as for the case with all genes taken intoaccount There is some variability in data sets with respect tonumber of HG genes for each abundance level of reads Inall data sets we can see that there is the highest number ofHG genes with number of reads in genes below 500 (relativemedium abundance level)

232 Bias and Variance To evaluate the normalizationmethods used for the processing of RNA-seq data we appliedthe bias and variance criterion proposed by Argyropoulos etal [19] for the analysis of double-channelmicroarray dataWeadjusted earlier this method for one-channel microarray data[20] In this paper we transform themethod to be suitable forthe RNA-seq data We can use the following formulae of biasand variance as proxies accuracy and precision respectively

bias119894 = radic1119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

) minus True Log Ratio)2

(10)

variance119894

=1

119898 minus 1

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

)minus log2 (119870119894119895

119870119894sdot

)

119894

)

2

(11)

where 119870119894119895 denotes read counts for 119894th control gene fromthe 119895th sample 119870119894sdot is the mean counts for the 119894th controlgene and log2(119870119894119895119870119894sdot) is the mean value of log2(119870119894119895119870119894sdot) forthe 119894th control gene and True Log Ratio is the true valueof log2(119870119894119895119870119894sdot) for the 119894th control gene from the definitionof HG The accuracy of a measurement is the degree ofcloseness of measurements of a quantity to the true valueThe precision of a measurement is the degree to whichrepeated measurements under unchanged conditions showthe same results Based on the definition each gene whichis considered as HG should have the same number of countsin each sample as well as mean value of counts for this geneThus the True Log Ratio is equal to 0 and the formula (10) forthe bias reduces to the root mean square error (RMSE)

bias119894 = radic1

119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

))

2

(12)

The ratios (bias and variance) for each normalizationmethodwill be the mean of bias and variance values computed forall of the control (housekeeping) genes The normalizationmethod ldquoArdquo would be preferred over the method ldquoBrdquo if it isassociated with the smallest bias and variance [19]

233 Differential Expression Analysis Our goal was to findout how normalizationmethods affect differential expressionresults Therefore after application of each normalizationmethod differential expression analysis was performed usingthe edgeR method from the edgeR Bioconductor package[12] This method was chosen because it showed reliableperformance in wide range of experiments and it allowseasy inclusion of scaling factors into the statistical test TheedgeR method was specifically developed to model countdata dispersion and it is designed for overdispersed RNA-seqdata Briefly it is assumed that counts are a negative binomialdistribution according to

119870119892119895 sim 119873119861 (119889119895120582119892119895 120601119892) (13)

where 120582119892119895 is the expression level of gene 119892 in sample 119895 119889119895 isthe normalization factor in sample 119895 and 120601119892 is the dispersionof gene 119892 The method of differential expression analysisimplemented in the edgeR package extends Fisherrsquos exacttest

234 Prediction Errors Discriminant analysis was appliedto determine the effectiveness of the sample classificationbased on found DEGs identified in each data set after eachnormalization procedure Different classification methodsmay lead to different prediction errors We estimated theclassification errors based on the five classifiers naive Bayesneural network 119896-nearest neighbour and support vectormachines and random forest that are presented in Table 2Analyzing each data set we used leave-one-out cross valida-tion (LOOCV) to obtain the estimate of the predictor errorsof the test set that results from using different classifiers For adata set with 119899 samples this method involves 119899 separate runsFor each run a number of samples minus one data pointare used to train the model and then prediction is performedon the remaining data point The overall prediction erroris the sum of the errors for all 119899 runs [21] As input forthe classifiers in the discriminant analysis we chose theseinformative genes having high power of discrimination thatare differentially expressed The set of genes was obtainedby gene selection process from a test statistic after eachnormalization procedure

BioMed Research International 5

Table 2 Classifiers used in the calculations function in R and the name and the version of R package

Classification method Function in R R package VersionNaive Bayes NaiveBayesI MLInterfaces 1400Neural network nnetI MLInterfaces 1400119896ndashnearest neighbour knnI MLInterfaces 1400Support vector machines svmI MLInterfaces 1400Random forest randomForestI MLInterfaces 1400

Cheung

20

15

10

5

0

Num

ber o

f DEG

s

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

Bodymap

100

50

0

AML

3000

2000

1000

0

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Figure 1 Bar plots of the DEGs with specified levels of count abundance in all studied data sets On the 119909-axis the methods of normalizationare featured whereas the 119910-axis represents the number of DEGs determined after each normalization procedure The bar colours representthe groups of genes of particular level of expression

All computations and diagnostic plots were performedwith R 302 [22]

3 Results

The aim of our study was to compare various normalizationapproaches and outline the workflow that will help to selectthe normalization method appropriate for a particular dataset We decided to test five of the most commonly usedmethods of RNA-seq data normalization Trimmed Meanof 119872-values (TMM) Upper Quartile (UQ) Median (DES)Quantile (EBS) and PoissonSeq (PS) described in detailin the Methods section All methods were tested on threedifferent data sets Bodymap Cheung and AML data Thedetails concerning data sets are described in Section 2

31 Differential Expression (DE) Analysis The main purposeof many RNA-seq experiments is to identify genes that aredifferentially expressed between the compared conditions

Analyzing three mentioned above data sets we checked thedirect impact of the normalization methods described aboveon the results of DE analysis After each normalizationwe determined the lists of differentially expressed genes(DEGs) using the statistic test from the edgeR package Inthe case of each data set we compared gene expression levelsbetween two types of biological samples and ranked the genesaccording to adjusted 119875 values Genes that had adjusted 119875

values lt 005 were selected as differentially expressedThe results of DE analysis can be compared based on the

number and content of DEGs Visualisation of the DEGs ispresented in Figure 1 The bar plots show the contributionof genes with particular expression levels in DEG lists forall data sets selected from data submitted to five testedmethods of normalization and raw data (RD) As we cansee the influence of normalization methods differs betweendata sets In the case of the Cheung data set all methodsresult in a similar number of DEGs and their significantpart constitutes genes with low levels of expression A more

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

4 BioMed Research International

Table 1 Summary of data set information

Data set Number of samples Number of genesNumber ofgenes (afterfiltering)

Min of averageabundance of genefrom all genes

Max of averageabundance of genefrom all genes

Number of HK(housekeeping genes)

Cheung 41 52 580 12 410 0024 90180 124Bodymap 16 52 580 13 131 0 934100 131AML 27 57 736 12 749 5004 482400 127

housekeeping genes was the same in each data set differentgenes were included for each set The tables and bar plotsconcerning housekeeping gene selection are provided inSupplementary Materials (Table S2 and Figure S2) Theseindicate the same facts as for the case with all genes taken intoaccount There is some variability in data sets with respect tonumber of HG genes for each abundance level of reads Inall data sets we can see that there is the highest number ofHG genes with number of reads in genes below 500 (relativemedium abundance level)

232 Bias and Variance To evaluate the normalizationmethods used for the processing of RNA-seq data we appliedthe bias and variance criterion proposed by Argyropoulos etal [19] for the analysis of double-channelmicroarray dataWeadjusted earlier this method for one-channel microarray data[20] In this paper we transform themethod to be suitable forthe RNA-seq data We can use the following formulae of biasand variance as proxies accuracy and precision respectively

bias119894 = radic1119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

) minus True Log Ratio)2

(10)

variance119894

=1

119898 minus 1

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

)minus log2 (119870119894119895

119870119894sdot

)

119894

)

2

(11)

where 119870119894119895 denotes read counts for 119894th control gene fromthe 119895th sample 119870119894sdot is the mean counts for the 119894th controlgene and log2(119870119894119895119870119894sdot) is the mean value of log2(119870119894119895119870119894sdot) forthe 119894th control gene and True Log Ratio is the true valueof log2(119870119894119895119870119894sdot) for the 119894th control gene from the definitionof HG The accuracy of a measurement is the degree ofcloseness of measurements of a quantity to the true valueThe precision of a measurement is the degree to whichrepeated measurements under unchanged conditions showthe same results Based on the definition each gene whichis considered as HG should have the same number of countsin each sample as well as mean value of counts for this geneThus the True Log Ratio is equal to 0 and the formula (10) forthe bias reduces to the root mean square error (RMSE)

bias119894 = radic1

119898

119898

sum

119895=1

(log2 (119870119894119895

119870119894sdot

))

2

(12)

The ratios (bias and variance) for each normalizationmethodwill be the mean of bias and variance values computed forall of the control (housekeeping) genes The normalizationmethod ldquoArdquo would be preferred over the method ldquoBrdquo if it isassociated with the smallest bias and variance [19]

233 Differential Expression Analysis Our goal was to findout how normalizationmethods affect differential expressionresults Therefore after application of each normalizationmethod differential expression analysis was performed usingthe edgeR method from the edgeR Bioconductor package[12] This method was chosen because it showed reliableperformance in wide range of experiments and it allowseasy inclusion of scaling factors into the statistical test TheedgeR method was specifically developed to model countdata dispersion and it is designed for overdispersed RNA-seqdata Briefly it is assumed that counts are a negative binomialdistribution according to

119870119892119895 sim 119873119861 (119889119895120582119892119895 120601119892) (13)

where 120582119892119895 is the expression level of gene 119892 in sample 119895 119889119895 isthe normalization factor in sample 119895 and 120601119892 is the dispersionof gene 119892 The method of differential expression analysisimplemented in the edgeR package extends Fisherrsquos exacttest

234 Prediction Errors Discriminant analysis was appliedto determine the effectiveness of the sample classificationbased on found DEGs identified in each data set after eachnormalization procedure Different classification methodsmay lead to different prediction errors We estimated theclassification errors based on the five classifiers naive Bayesneural network 119896-nearest neighbour and support vectormachines and random forest that are presented in Table 2Analyzing each data set we used leave-one-out cross valida-tion (LOOCV) to obtain the estimate of the predictor errorsof the test set that results from using different classifiers For adata set with 119899 samples this method involves 119899 separate runsFor each run a number of samples minus one data pointare used to train the model and then prediction is performedon the remaining data point The overall prediction erroris the sum of the errors for all 119899 runs [21] As input forthe classifiers in the discriminant analysis we chose theseinformative genes having high power of discrimination thatare differentially expressed The set of genes was obtainedby gene selection process from a test statistic after eachnormalization procedure

BioMed Research International 5

Table 2 Classifiers used in the calculations function in R and the name and the version of R package

Classification method Function in R R package VersionNaive Bayes NaiveBayesI MLInterfaces 1400Neural network nnetI MLInterfaces 1400119896ndashnearest neighbour knnI MLInterfaces 1400Support vector machines svmI MLInterfaces 1400Random forest randomForestI MLInterfaces 1400

Cheung

20

15

10

5

0

Num

ber o

f DEG

s

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

Bodymap

100

50

0

AML

3000

2000

1000

0

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Figure 1 Bar plots of the DEGs with specified levels of count abundance in all studied data sets On the 119909-axis the methods of normalizationare featured whereas the 119910-axis represents the number of DEGs determined after each normalization procedure The bar colours representthe groups of genes of particular level of expression

All computations and diagnostic plots were performedwith R 302 [22]

3 Results

The aim of our study was to compare various normalizationapproaches and outline the workflow that will help to selectthe normalization method appropriate for a particular dataset We decided to test five of the most commonly usedmethods of RNA-seq data normalization Trimmed Meanof 119872-values (TMM) Upper Quartile (UQ) Median (DES)Quantile (EBS) and PoissonSeq (PS) described in detailin the Methods section All methods were tested on threedifferent data sets Bodymap Cheung and AML data Thedetails concerning data sets are described in Section 2

31 Differential Expression (DE) Analysis The main purposeof many RNA-seq experiments is to identify genes that aredifferentially expressed between the compared conditions

Analyzing three mentioned above data sets we checked thedirect impact of the normalization methods described aboveon the results of DE analysis After each normalizationwe determined the lists of differentially expressed genes(DEGs) using the statistic test from the edgeR package Inthe case of each data set we compared gene expression levelsbetween two types of biological samples and ranked the genesaccording to adjusted 119875 values Genes that had adjusted 119875

values lt 005 were selected as differentially expressedThe results of DE analysis can be compared based on the

number and content of DEGs Visualisation of the DEGs ispresented in Figure 1 The bar plots show the contributionof genes with particular expression levels in DEG lists forall data sets selected from data submitted to five testedmethods of normalization and raw data (RD) As we cansee the influence of normalization methods differs betweendata sets In the case of the Cheung data set all methodsresult in a similar number of DEGs and their significantpart constitutes genes with low levels of expression A more

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

BioMed Research International 5

Table 2 Classifiers used in the calculations function in R and the name and the version of R package

Classification method Function in R R package VersionNaive Bayes NaiveBayesI MLInterfaces 1400Neural network nnetI MLInterfaces 1400119896ndashnearest neighbour knnI MLInterfaces 1400Support vector machines svmI MLInterfaces 1400Random forest randomForestI MLInterfaces 1400

Cheung

20

15

10

5

0

Num

ber o

f DEG

s

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

TMM UQ

DES EB

S PS RD

Methods

Bodymap

100

50

0

AML

3000

2000

1000

0

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Abundance of geneslt100

100ndash5000gt5000

Figure 1 Bar plots of the DEGs with specified levels of count abundance in all studied data sets On the 119909-axis the methods of normalizationare featured whereas the 119910-axis represents the number of DEGs determined after each normalization procedure The bar colours representthe groups of genes of particular level of expression

All computations and diagnostic plots were performedwith R 302 [22]

3 Results

The aim of our study was to compare various normalizationapproaches and outline the workflow that will help to selectthe normalization method appropriate for a particular dataset We decided to test five of the most commonly usedmethods of RNA-seq data normalization Trimmed Meanof 119872-values (TMM) Upper Quartile (UQ) Median (DES)Quantile (EBS) and PoissonSeq (PS) described in detailin the Methods section All methods were tested on threedifferent data sets Bodymap Cheung and AML data Thedetails concerning data sets are described in Section 2

31 Differential Expression (DE) Analysis The main purposeof many RNA-seq experiments is to identify genes that aredifferentially expressed between the compared conditions

Analyzing three mentioned above data sets we checked thedirect impact of the normalization methods described aboveon the results of DE analysis After each normalizationwe determined the lists of differentially expressed genes(DEGs) using the statistic test from the edgeR package Inthe case of each data set we compared gene expression levelsbetween two types of biological samples and ranked the genesaccording to adjusted 119875 values Genes that had adjusted 119875

values lt 005 were selected as differentially expressedThe results of DE analysis can be compared based on the

number and content of DEGs Visualisation of the DEGs ispresented in Figure 1 The bar plots show the contributionof genes with particular expression levels in DEG lists forall data sets selected from data submitted to five testedmethods of normalization and raw data (RD) As we cansee the influence of normalization methods differs betweendata sets In the case of the Cheung data set all methodsresult in a similar number of DEGs and their significantpart constitutes genes with low levels of expression A more

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

6 BioMed Research International

Table 3 Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets

Normalization method Cheung data set rank (bias value) Bodymap data set rank (bias value) AML data set rank (bias value)TMM 35 (0890) 4 (1123) 6 (0587)UQ 35 (0890) 5 (1139) 4 (0564)DES 1 (0885) 2 (1102) 1 (0512)EBS 2 (0887) 3 (1111) 3 (0532)PS 5 (0893) 1 (1098) 2 (0523)RD 6 (0908) 6 (1159) 5 (0581)

Table 4 Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested datasets

Normalization method Cheung data set rank(variance value)

Bodymap data setrank (variance value)

AML data set rank(variance value)

TMM 4 (0779) 4 (1305) 6 (0364)UQ 3 (0778) 5 (1330) 4 (0335)DES 1 (0768) 2 (1270) 1 (0283)EBS 2 (0771) 3 (1283) 3 (0300)PS 5 (0782) 1 (1268) 2 (0291)RD 6 (0812) 6 (1390) 5 (0355)

balanced contribution ofDEGswas observed in the Bodymapdata set the number of DEGs with a middle expression levelis slightly higher than the number of very highly or weaklyexpressed genes In AML data set genes with an averageexpression level clearly predominate on the list of DEGsHere there are also the most significant differences in thenumber of DEGs between variously normalized data Themost restrictive method seemed to be TMM whereas thehighest numbers of DEGs were obtained by EBS and PSmethods The contribution of all groups of genes is the samefor each normalization method

MA plots available in Supplementary Materials (FigureS3) were generated for additional visualization of DEGsThey present the relationship between base mean and log2FCof the counts The results show that DEGs in each MA plotafter normalization are slightly more spread out compared tothe raw data However the location of DEG differs dependingon the normalization method for AML and Bodymap datasets In the case of AML data it is also worth pointing outthat we can observe more DEGs that are overexpressed thanunderexpressed

As it is difficult to evaluate which normalization methodismore suitable for a data set based onMAplot analysismoreprecise verification is necessary To this endwe calculated biasand variance values

32 Bias andVarianceCalculation Based ondetails describedin the previous section we selected housekeeping genes forwhich we calculated bias and variance values Following theidea described in the study [19] we assumed that the mostappropriate normalization method is that which generatesthe lowest bias and variance values for the control genesThe bias and variance values were calculated accordingto formulae (11) and (12) for all the housekeeping genesselected separately for each data set as described in Section 2

Then for each data set the mean of bias and variancevalues for each normalization method and for all controlgenes was calculated Tables 3 and 4 present the rankingof normalization methods based on these bias and variancevalues the method with the lowest bias or variance wasranked as 1 The highest bias and variance values at leastin the Cheung and Bodymap data sets were observed forthe unnormalized raw data (RD) included into the tablesfor comparison Taking into account all of these data setsthe conclusion is that the best results were obtained usingDES EBS and PS normalization methods and the methodswhich generated the highest bias and variance values wereTMM and UQ In the case of AML data application of TMMmethod led to increase of the bias and the variance presentin RD It is worth pointing out that the differences betweenthem are small in all data sets

33 Sensitivity and Specificity The sensitivity and the speci-ficity of the normalization methods were investigated usingthe AML data set based on our earlier experience withthe analysis of microarray data described in [20] as wellas evidence from literature [23 24] First the sets of geneswere selected as positive and negative controls The positivecontrols consisted of the genes that were strong candidatesfor DEGs For the set of positive controls we selected genesthat were validated by real-time polymerase chain reaction(PCR) analysis or described in the literature as overexpressed(or less frequently underexpressed) in AML or immaturehematopoietic cells [23] The negative controls included thegenes that are not DEGs (their expression levels should notdiffer between these two types of samples) [24] In total44 genes were chosen as the positive controls and 44 geneswere chosen as the negative controls The lists of genescan be found in Table S3 The sensitivity and specificityof the normalization methods were calculated respectively

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

BioMed Research International 7

Table 5 Sensitivity and specificity of the studied five normalizationmethods and RD applied to the AML data set

Methods TMM UQ DES EBS PS RDSensitivity () 114 205 182 455 409 182Rank of sensitivity 6 3 45 1 2 45Specificity () 977 841 977 523 659 977Rank of specificity 2 4 2 6 5 2

as a percentage of positive controls that were present andthe percentage of negative controls that were absent ineach list of differentially expressed genes The normalizationmethods with the highest values of specificity better shownondifferentially expressed genes while on the other handthemethodswith the highest sensitivity values indicate a highprobability of finding DEGs that are truly DEGs The resultsof this analysis are presented in Table 5

Table 5 shows that the specificity values for EBS andPS methods are substantially lower than for the remainingones The sensitivity values were less divergent between themethods and were generally low in the range between 10 and31 Furthermore for TMM and EBS methods we obtainedthe most divergent results the highest specificity but thelowest sensitivity for TMM and the opposite for EBS Inthe case of specificity we can see that most of the methodsproduce values of specificity over 80

34 Prediction Errors Table 6 presents a comparison of thefive normalization methods when for each data set thedifferent numbers of informative genes were used basedon five classifiers and LOOCV In each case we selectedthe number of differentially expressed genes as 75 of thenumber of samples Therefore for the Cheung Bodymapand AML data sets we have respectively 30 (075 lowast 41) 12(075 lowast 16) and 20 (074 lowast 27) differentially expressed genesThe results in the table suggest that in all data sets PS DESand EBS perform better than TMM and UQ

35 Diagnostic Plots Besides the analyticalmethods for com-parison of normalization methods we suggest using addi-tional determinants that may be helpful for the rejectionof the most commonly used methods that evidently fail Itis possible that the normalization methods yield differentresults for different data sets Therefore we suggest theapplication of the following workflow based on diagnosticplots to determine which normalization method is optimalfor a specific data set Here we focus only on the AML dataset In Figure 2(a) we can observe the differences introducedto normalization factors obtained by each normalizationmethod From this figure we can conclude that normalizationcoefficients determined with TMM and UQ methods grouptogether and divergent from the rest of normalization meth-ods

The results in Table 6 may be summarised consider-ing the average errors expressed through figures For theAML data set the performances of the five normalizationmethods can be ranked The percentages of classificationerrors obtained by normalization methods could be used

Table 6 Performances of the normalization methods with infor-mative genes based on five classifiers and LOOCV applied to theCheung Bodymap and AML data sets The first number in eachcell denotes the percentage of the average of five prediction errorsfrom five different classification methods The second number ineach cell that is in brackets is the percentage of the median of thefive prediction errors

TMM UQ DES EBS PS RD

Cheung 111 107 45 44 39 116(98) (73) (04) (00) (00) (122)

Bodymap 162 162 160 158 160 162(188) (188) (177) (167) (177) (188)

AML 81 80 77 74 73 83(74) (74) (74) (74) (74) (74)

to obtain a 95 confidence interval of the mean of theproportion of classification errors for the normalization Thecorresponding bar chart plotting confidence intervals is givenin Figure 2(b)This plot indicate that the TMMUQ andDESmethods outperform the two other methods with respect toclassification performance

36 Common DEGs To compare the number of DE genesand the number of common DE genes found among the nor-malization methods performed for a particular data set wegenerate balloon plots (Figure 2(c) and Figure S5) and Venndiagrams (see Figure S4) Balloon plots represent percentagesof the number of commonly detected differentially expressedgenes between the 119894th and 119895th methods First we calculatein the (119894 119895)th cell a proportion of common detection withrespect to the 119894th method

119875119894119895 =

119863119894119895

119863119894

sdot 100 (14)

where 119894 = 119895 119863119894119895 is the number of differentially expressedgenes commonly detected by the 119894th method and the 119895thmethod and 119863119894 is the number of differentially expressedgenes detected by the 119894th method Next we take the averagepercentage value of common genes between each pair ofmethods The most preferable method is the one with thehighest number of common DEGs

For the AML data set (Figure 2(c)) the lowest number ofcommonDEGs with othermethods is produced by the TMMnormalization method and the best by the EBS methodThe lowest number of common DEGs was obtained for theEBSndashTMM and PSndashTMM comparisons From this analysiswe concluded that the set of genes identified as DEGs is notstable and depends on both themethod of data normalizationand the data set itself However in the case of AML data setthere is a set of 227 genes common for all tested methods(Supplementary Figure S4) and these genes can be consideredas the strongest candidates for DEGs (Figure S4) In the caseof the Cheung data set we noted that most of the methodsappear to perform similarly For the Bodymap data set allnormalization methods yielded slightly lower percentages ofcommonly detected DEGs than in the case of the Cheung

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

8 BioMed Research International

050075100125150

1 3 5 7 9 11 13 15 17 19 21 23 25 27Samples

Nor

mal

izat

ion

coeffi

cien

t

MethodsTMMUQDES

EBSPSRD

(a)

6

9

12

15

TMM UQ

DES EB

S PS RD

Methods

Clas

sifica

tion

erro

r

(b)

60 68 53 54 70

60 80 68 70 76

68 80 60 62 90

53 68 60 94 60

54 70 62 94 60

70 76 90 60 60RDPS

EBSDESUQ

TMM

TMM UQ

DES EB

S PS RD

Methods

Met

hods

0

25

50

75

()

(c)

0

50

100

150

TMM UQ

DES RD EB

S PS

(d)

Figure 2 Diagnostic plots for the AML data set Besides the five normalization methods raw data (RD) were also included in the plots as abenchmark (a) presents calculated normalization factor values across the samples by eachmethodThe samples are ordered by theminimumvalues of normalization factors (b) shows 95 confidence intervals for the mean of the percentages of classification errors calculated for eachmethod based on five selected classifiers (c) shows the numbers of common DE genes across each pair of normalization methods The sizeand shading of the circles represent the average percentage value of common genes between each pair of methods (d) presents the results ofclustering of the normalization methods based on 20 common DE genes found by each of these methods A dendrogram was created withhierarchical clustering based on Wardrsquos method

data set For the raw data (RD) the number of common genesfor any pair of methods was below 50 (see Figure S5) TheVenn diagrams for the Cheung and Bodymap data sets alsoconfirmed this tendency (Figure S4)

37 Grouping of Normalization Methods Clustering isanother approach to compare the performance of the normal-ization methods Based on the similarities between the DEgene lists we generated dendrograms to easily observe whichmethods group together The exact measure of similarity wasthe DE gene rank In the case of each data set for five variantsof normalized data as well as for raw data (RD) particularsets of differentially expressed genes were obtained First wechose genes common to all six DEG lists Since we obtained20 genes in common between six methods we performedthe clustering of the normalization methods based onthese common DEGs Then for all methods we rankedthese genes thus obtaining ranking lists of genes Basedon these lists we computed the distance matrix by usingEuclidian distance and plot dendrograms (Figure 2(d)) Thedendrograms were constructed from hierarchical clusteringusing Wardrsquos methodThis criterion is another approach thatcompares the DEGs lists The previous criterion (commonDEGs) gives us the information about the percentage of DEGin common per pair of methods whereas this criterion gives

the information about the similarity between the methodsbased on the order of common DEGs in each list

38 Summary of Ranks Combining all of the criteriadescribed in the previous sections we would like to deter-mine which of the five tested normalization methods wouldbe appropriate for the AML data set Table 7 summarises theranks obtained based on the bias and variance values theprediction errors sensitivity specificity and the number ofcommon DEGs for the AML data set In the case when allthe criteria included in Table 7 are equally important theinvestigator can base the decision on the final rank calculatedas the mean of the ranks established separately for eachcriterion using chosen normalization methods

4 Discussion

In practice normalization of high-throughput data stillremains an important question and has received a lot of atten-tion in the literatureThe increasing number of normalizationmethods makes it difficult for scientists to decide whichmethod should be used for which particular data set Basedon the results presented in this paper as well as in [25] we canconclude that normalization affects differential expressionanalysis therefore an important aspect is how to choose

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

BioMed Research International 9

Table 7 Summary of comparison results for the five normalizationmethods under considerationThefinal rank is based on the bias andvariance values the prediction errors sensitivity specificity valuesand the number of common DEGs for AML data

TMM UQ DES EBS PSBias 50 40 10 30 20Variance 50 40 10 30 20Sensitivity 50 30 40 10 20Specificity 15 30 15 50 40Prediction errors 50 40 30 20 10Common DEGs 50 20 10 40 30

the most sensible method for the data In this paper we haveshown that depending on the data structure the influenceof normalization differs (see Figures 1 and 2(d)) In our workwe have demonstrated that based on some of these criteriathe choice of normalization method can be more suitableand robust and can be made more automatically Coefficientssuch as bias and variance can be considered as criteria for acomparison of normalization methods It is worth pointingout that in our investigation in most cases including thenormalization reduces the bias and variance values comparedto raw data which confirms the need of normalizationWhenthe differences between the bias and variance values aresignificant the usage of ranks of these values reflects the realdifferences more precisely However in our investigation thedifferences between obtained bias and variance values forall methods in all data sets are small In such case usingranks does not reflect the true differences between methodsand additional criteria are neededThe diagnostic plots couldserve as such additional determinants and may be helpful forthe rejection of the most leading methods that evidently failOur study indicates that the use of TMM method in mostcases is displayed poorly This conclusion is not in agreementwith the evaluation made by [26 27] Their study indicatedthat the use of TMM method led to good performance onsimulated data sets One reason for the disagreement of theseresults can be due to different approaches of comparison andusage of criteria not considered by other authors Anotherreason of differences in the conclusions could arise from thenumber of biological replicates used in our study or could berelated to the particular data sets used in these papers

Furthermore we find that other conclusions described in[26] are consistentwith our own resultsThese results confirmthe satisfactory results of the DESeq method In Table 7 wepresented the summary of ranks for the AML data set It ispossible that the normalization of other data sets can yieldresults that are different from those obtained with the AMLdata set (in this paper we have shown that depending onthe data structure the influence of normalization differs)In general each criterion proposed in the paper focuses ondifferent aspects of comparison Depending on the mainobjectives of the research some of the criteria could be moreuseful for example if impact lies in good prediction basedon chosen genes the important aspect will be predictionerrors However in some cases the results of criteria areinconclusive In such situation we suggest the application of

the following workflow to determine which normalizationmethod is optimal for a specific data set (i) normalize thedata using considered methods (ii) calculate the ldquobiasrdquo andldquovariancerdquo and rank the methods based on these values (iii)after each normalization perform differential analysis anddetermine DEG lists found by each normalization method(iv) select a subset of genes that can serve as positive andnegative controls to investigate the sensitivity and specificityof normalization methods and rank the methods basedon these criteria (v) calculate the percentage of the meanof the prediction errors obtained using chosen classifiersfor DEGs found by each normalization method and rankthem (vi) draw Venn diagrams or balloon plots based onthe number of differentially expressed genes and rank themethods based on the number of common DEG valuesand (vii) based on the summary of ranks choose the mostappropriate normalization method of the investigated dataset We notice here that the normalization method caninfluence the expression results leading to erroneous DEanalysis therefore it is very important to put effort at thisstage of the analysis Finally we wanted to draw attentionthat our paper does not indicate clearly which normalizationmethod is the best but it adds a new look at how to choose thenormalization for RNA-Seq data analysis to avoid erroneousDE analysis It can be applied not only tomethods concerningsequencing depth but the proposed algorithm is also suitableto compare normalizations that take into account othersources of unwanted variation

5 Conclusions

In the study new coefficients such as bias and variancewere proposed as objective criteria for a comparison of nor-malization methods In conclusion our results suggest thatdepending on the RNA-seq data structure and the appliedmethod the influence of normalization differs Howeverthe presented criteria in particular bias and variance cansupport the choice of normalization method optimal for aspecific data set

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the reviewer for the thor-ough evaluation of the paper and valuable comments Thiswork was supported by the Ministry of Science and HigherEducation of the Republic of Poland by KNOW program

References

[1] S Zhao W-P Fung-Leung A Bittner K Ngo and X LiuldquoComparison of RNA-Seq and microarray in transcriptomeprofiling of activated T cellsrdquo PLoS ONE vol 9 no 1 ArticleID e78644 2014

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

10 BioMed Research International

[2] E T Wang R Sandberg S Luo et al ldquoAlternative isoformregulation in human tissue transcriptomesrdquoNature vol 456 no7221 pp 470ndash476 2008

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoDeepsurveying of alternative splicing complexity in the human tran-scriptome by high-throughput sequencingrdquo Nature Geneticsvol 40 no 12 pp 1413ndash1415 2008

[4] W M Landau and P Liu ldquoDispersion estimation and its effecton test performance in RNA-seq data analysis a simulation-based comparison of methodsrdquo PLoS ONE vol 8 no 12 ArticleID e81415 2013

[5] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[6] A Oshlack andM JWakefield ldquoTranscript length bias in RNA-seq data confounds systems biologyrdquo Biology Direct vol 4article 14 2009

[7] J K Pickrell J C Marioni A A Pai et al ldquoUnderstandingmechanisms underlying human gene expression variation withRNA sequencingrdquoNature vol 464 no 7289 pp 768ndash772 2010

[8] D Risso K Schwartz G Sherlock et al ldquoGC-content normal-ization for RNA-seqrdquo Tech Rep 291 Division of BiostatisticsUniversity of California Berkeley 2011

[9] J T Leek R B Scharpf H C Bravo et al ldquoTacklingthe widespread and critical impact of batch effects in high-throughput datardquo Nature Reviews Genetics vol 11 no 10 pp733ndash739 2010

[10] J H Bullard E Purdom K D Hansen and S Dudoit ldquoEvalu-ation of statistical methods for normalization and differentialexpression in mRNA-Seq experimentsrdquo BMC Bioinformaticsvol 11 article 94 2010

[11] M D Robinson and A Oshlack ldquoA scaling normalizationmethod for differential expression analysis of RNA-seq datardquoGenome Biology vol 11 no 3 article r25 2010

[12] M D Robinson D J McCarthy and G K Smyth ldquoEdgeRa bioconductor package for differential expression analysis ofdigital gene expression datardquo Bioinformatics vol 26 no 1 pp139ndash140 2010

[13] S Anders and W Huber ldquoDifferential expression analysis forsequence count datardquo Genome Biology vol 11 no 10 articleR106 2010

[14] N Leng J Dawson J Thomson et al ldquoEBSeq an empiricalbayes hierarchical model for inference in RNA-seq experi-mentsrdquo Tech Rep 226 University of Wisconsin 2012

[15] J Li D M Witten I M Johnstone and R TibshiranildquoNormalization testing and false discovery rate estimation forRNA-sequencing datardquo Biostatistics vol 13 no 3 pp 523ndash5382012

[16] Y W Asmann B M Necela K R Kalari et al ldquoDetection ofredundant fusion transcripts as biomarkers or disease-specifictherapeutic targets in breast cancerrdquo Cancer Research vol 72no 8 pp 1921ndash1928 2012

[17] V G Cheung R R Nayak I X Wang et al ldquoPolymorphic cis-and Trans-Regulation of human gene expressionrdquo PLoS Biologyvol 8 no 9 Article ID e1000480 2010

[18] A C Frazee B Langmead and J T Leek ldquoRecount a multi-experiment resource of analysis-ready RNA-seq gene countdatasetsrdquo BMC Bioinformatics vol 12 article 449 2011

[19] C Argyropoulos A A Chatziioannou G Nikiforidis AMoustakas G Kollias and V Aidinis ldquoOperational criteria forselecting a cDNA microarray data normalization algorithmrdquoOncology Reports vol 15 no 4 pp 983ndash996 2006

[20] B Uszczynska J Zyprych-Walczak L Handschuh et al ldquoAnaly-sis of boutique arrays a universalmethod for the selection of theoptimal data normalization procedurerdquo International Journal ofMolecular Medicine vol 32 no 3 pp 668ndash684 2013

[21] D Chen Z Liu X Ma and D Hua ldquoSelecting genes by teststatisticsrdquo Journal of Biomedicine and Biotechnology vol 2005no 2 pp 132ndash138 2005

[22] RDevelopment Core Team R A Language and Environment forStatistical Computing R Foundation for Statistical ComputingVienna Austria 2011

[23] M Goswami N Hensel B D Smith et al ldquoExpression ofputative targets of immunotherapy in acute myeloid leukemiaand healthy tissuesrdquo Leukemia vol 28 no 5 pp 1167ndash1170 2014

[24] E Eisenberg and E Y Levanon ldquoHuman housekeeping genesrevisitedrdquo Trends in Genetics vol 29 no 10 pp 569ndash574 2013

[25] F Seyednasrollah A Laiho and L L Elo ldquoComparison ofsoftware packages for detecting differential expression in RNA-seq studiesrdquo Briefings in Bioinformatics vol 16 no 1 pp 59ndash702015

[26] M-A Dillies A Rau J Aubert et al ldquoA comprehensive evalu-ation of normalization methods for Illumina high-throughputRNA sequencing data analysisrdquo Briefings in Bioinformatics vol14 no 6 pp 671ndash683 2013

[27] E Maza P Frasse P Senin M Bouzayen and M ZouineldquoComparison of normalization methods for differential geneexpression analysis in RNA-Seq experiments a matter ofrelative size of studied transcriptomesrdquo Communicative ampIntegrative Biology vol 6 no 6 Article ID e25849 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article The Impact of Normalization Methods on ...downloads.hindawi.com/journals/bmri/2015/621690.pdf · normalization step di erent aspects based on the biology or informatics

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology