ieee/acm transactions on computational biology and ...weisun/research/tcbb-2009-09-0156-2.pdf · 2...

9
Incorporating Nonlinear Relationships in Microarray Missing Value Imputation Tianwei Yu, Hesen Peng, and Wei Sun Abstract—Microarray gene expression data often contain missing values. Accurate estimation of the missing values is important for downstream data analyses that require complete data. Nonlinear relationships between gene expression levels have not been well- utilized in missing value imputation. We propose an imputation scheme based on nonlinear dependencies between genes. By simulations based on real microarray data, we show that incorporating nonlinear relationships could improve the accuracy of missing value imputation, both in terms of normalized root-mean-squared error and in terms of the preservation of the list of significant genes in statistical testing. In addition, we studied the impact of artificial dependencies introduced by data normalization on the simulation results. Our results suggest that methods relying on global correlation structures may yield overly optimistic simulation results when the data have been subjected to row (gene)-wise mean removal. Index Terms—gene expression, statistical analysis, missing value. Ç 1 INTRODUCTION M ICROARRAY and other high-throughput data often contain missing values that hamper the application of many unsupervised and supervised learning techniques. A number of methods were developed for the imputation of the missing values, some of which were reviewed and compared by Brock et al. [1]. Most imputation methods stem from the observation that the relationship between gene expression levels can be approximated by linear models. There are two broad classes of imputation methods. The first class utilizes local information, i.e., neighbors of the gene of interest. A representative of this class is the widely used K-nearest neighbor (KNN) method, which fills the missing value by the weighted average of the expression of K neighboring genes in the same array [2]. The LSimpute method applies linear regression between the gene with missing value and neighboring genes [3]. Robust regression using the principal components of the neighbors was also proposed [4]. Local least squares (LLSs) imputation uses multiple regression of the gene with missing value against the neighboring genes [5]. The method often achieves optimal performance with a very large number of neigh- bors, some times over 50 percent of the genes in the array [1], [5]. This suggests the method partly takes advantage of the global correlation structures between arrays. A multi- stage clustering method which incorporates missing value imputation was also proposed [6]. The second class of imputation methods utilizes the global correlation structure. The SVD method uses the dependency of the experiments on the first several singular vectors of the data matrix to impute the missing values [2]. A more complex and better performing variant, BPCA, uses probabilistic PCA and Bayesian estimation [7]. To go beyond the linear relation- ship between experiments, the Support Vector Regression imputation maps the samples to higher dimensional space with kernel functions [8]. In addition to using a single imputation scheme, the idea of ensemble learning has been applied, in which imputed values from multiple methods were combined [9], [10], [11]. Another way of improving imputation accuracy is to recruit external information. Examples include the use of other microarray data sets from the same organism [12], functional annotation of genes [13], knowledge of synchro- nization loss in time series data [14], and epigenetic information [15]. Dependencies between genes can be complex and far from linear [16], [17]. General dependency has been shown to be effective for the inference of gene regulatory networks [17], [18], [19]. These results suggest that nonlinear relation- ships between genes are biologically relevant, and may provide extra information in missing value imputation. A parametric nonlinear regression method has been proposed to utilize such information, which however still preselects neighboring genes based on euclidean distance [20]. Here, we present a missing value imputation scheme that follows the nearest neighbor principal. The definition of “neigh- bors” is broadened to incorporate nonlinear dependency. The method is based on both Pearson’s correlation coefficient and a new metric for detecting nonlinear relationships, named similarity based on conditional ordered list (SCOL). For neighbors selected by SCOL, the points corresponding to the expression levels may not be geometrically close to each other. Consequently, the imputation is done using the principle of estimating nonlinear response curves by kernel smoother between the gene of interest and its neighbors. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011 1 . T. Yu and H. Peng are with the Department of Biostatistics and Bioinformatics, 1518 Clifton Rd., N.E., 3rd Floor, Rollins School of Public Health, Emory University, Atlanta, GA 30322. E-mail: {tyu8, hpeng5}@emory.edu. . W. Sun is with the Department of Biostatistics and the Department of Genetics, University of North Carolina, 4115E McGavran-Greenberg Hall, Chapel Hill, NC 27599-7420. E-mail: [email protected]. Manuscript received 14 Sept. 2009; revised 4 Jan. 2010; accepted 30 Mar. 2010; published online 20 Aug. 2010. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2009-09-0156. Digital Object Identifier no. 10.1109/TCBB.2010.73. 1545-5963/11/$26.00 ß 2011 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Upload: others

Post on 28-Jan-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

Incorporating Nonlinear Relationshipsin Microarray Missing Value Imputation

Tianwei Yu, Hesen Peng, and Wei Sun

Abstract—Microarray gene expression data often contain missing values. Accurate estimation of the missing values is important for

downstream data analyses that require complete data. Nonlinear relationships between gene expression levels have not been well-

utilized in missing value imputation. We propose an imputation scheme based on nonlinear dependencies between genes. By

simulations based on real microarray data, we show that incorporating nonlinear relationships could improve the accuracy of missing

value imputation, both in terms of normalized root-mean-squared error and in terms of the preservation of the list of significant genes in

statistical testing. In addition, we studied the impact of artificial dependencies introduced by data normalization on the simulation

results. Our results suggest that methods relying on global correlation structures may yield overly optimistic simulation results when the

data have been subjected to row (gene)-wise mean removal.

Index Terms—gene expression, statistical analysis, missing value.

Ç

1 INTRODUCTION

MICROARRAY and other high-throughput data oftencontain missing values that hamper the application

of many unsupervised and supervised learning techniques.A number of methods were developed for the imputation ofthe missing values, some of which were reviewed andcompared by Brock et al. [1]. Most imputation methodsstem from the observation that the relationship betweengene expression levels can be approximated by linearmodels. There are two broad classes of imputation methods.The first class utilizes local information, i.e., neighbors ofthe gene of interest. A representative of this class is thewidely used K-nearest neighbor (KNN) method, which fillsthe missing value by the weighted average of the expressionof K neighboring genes in the same array [2]. The LSimputemethod applies linear regression between the gene withmissing value and neighboring genes [3]. Robust regressionusing the principal components of the neighbors was alsoproposed [4]. Local least squares (LLSs) imputation usesmultiple regression of the gene with missing value againstthe neighboring genes [5]. The method often achievesoptimal performance with a very large number of neigh-bors, some times over 50 percent of the genes in the array[1], [5]. This suggests the method partly takes advantage ofthe global correlation structures between arrays. A multi-stage clustering method which incorporates missing valueimputation was also proposed [6]. The second class ofimputation methods utilizes the global correlation structure.

The SVD method uses the dependency of the experimentson the first several singular vectors of the data matrix toimpute the missing values [2]. A more complex and betterperforming variant, BPCA, uses probabilistic PCA andBayesian estimation [7]. To go beyond the linear relation-ship between experiments, the Support Vector Regressionimputation maps the samples to higher dimensional spacewith kernel functions [8].

In addition to using a single imputation scheme, theidea of ensemble learning has been applied, in whichimputed values from multiple methods were combined [9],[10], [11]. Another way of improving imputation accuracyis to recruit external information. Examples include the useof other microarray data sets from the same organism [12],functional annotation of genes [13], knowledge of synchro-nization loss in time series data [14], and epigeneticinformation [15].

Dependencies between genes can be complex and farfrom linear [16], [17]. General dependency has been shownto be effective for the inference of gene regulatory networks[17], [18], [19]. These results suggest that nonlinear relation-ships between genes are biologically relevant, and mayprovide extra information in missing value imputation. Aparametric nonlinear regression method has been proposedto utilize such information, which however still preselectsneighboring genes based on euclidean distance [20]. Here,we present a missing value imputation scheme that followsthe nearest neighbor principal. The definition of “neigh-bors” is broadened to incorporate nonlinear dependency.The method is based on both Pearson’s correlationcoefficient and a new metric for detecting nonlinearrelationships, named similarity based on conditionalordered list (SCOL). For neighbors selected by SCOL, thepoints corresponding to the expression levels may not begeometrically close to each other. Consequently, theimputation is done using the principle of estimatingnonlinear response curves by kernel smoother betweenthe gene of interest and its neighbors.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011 1

. T. Yu and H. Peng are with the Department of Biostatistics andBioinformatics, 1518 Clifton Rd., N.E., 3rd Floor, Rollins School of PublicHealth, Emory University, Atlanta, GA 30322.E-mail: {tyu8, hpeng5}@emory.edu.

. W. Sun is with the Department of Biostatistics and the Department ofGenetics, University of North Carolina, 4115E McGavran-Greenberg Hall,Chapel Hill, NC 27599-7420. E-mail: [email protected].

Manuscript received 14 Sept. 2009; revised 4 Jan. 2010; accepted 30 Mar.2010; published online 20 Aug. 2010.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TCBB-2009-09-0156.Digital Object Identifier no. 10.1109/TCBB.2010.73.

1545-5963/11/$26.00 � 2011 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Page 2: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

2 METHODS

2.1 Similarity Based on Conditional Ordered List

Here, we present a similarity score between two vectors xxand yy, the purpose of which is to measure the spread of theconditional distribution YY jXX in a nonparametric manner.First, each vector is standardized using normal scoretransformation. The mm values of the vector are comparedto obtain the ranks R1; R2; . . . ; Rm, and then each xi of thevector is replaced by ��1ðRi=ðmþ 1ÞÞ, where �ð:Þ is thecumulative normal distribution. Second, we sort the datafðxi; yiÞg by the XX values: fðx�i ; y�i Þg; x�1 � x�2 � � � � � x�p.Third, the sum of absolute differences between the adjacentYY values in the ordered list is obtained as the raw distance,

Dcolðx;yÞ ¼Xmi¼2

jy�i � y�i�1j:

This is equivalent to summing the Manhattan distancesbetween adjacent points in the list ordered by XX, up to aconstant. Last, the raw distance is standardized such that itis between zero and one. The theoretical minimum of thedistance is

Dmincol ¼ ��1ðm=ðmþ 1ÞÞ � ��1ð1=ðmþ 1ÞÞ;

which is obtained when the two vectors have the exact samerank order. When m is odd, the theoretical maximumpossible distance is

Dmaxcol ¼ 2

Xi

j��1ði=ðmþ 1ÞÞj þ ��1ððm� 1Þ=2ðmþ 1ÞÞ:

When m is even,

Dmaxcol ¼ 2

Xi

j��1ði=ðmþ 1ÞÞj þ 2��1ðm=2ðmþ 1ÞÞ:

These maximum possible values are obtained when all theadjacent YY values in the ordered list have opposite signsand the two end yy values have the smallest absolute values.To standardize the raw score such that it is between zeroand one, we obtain the SCOL similarity by

Sstdcol ðx;yÞ ¼ 1�Dcolðx;yÞ �Dmincol

Dmaxcol �Dmin

col

:

This SCOL similarity is asymmetric, as it focuses on thespread of YY jX (Figs. 1a & 1b). For the purpose of missingvalue imputation, we use the asymmetric version Sstdcol ðx;yÞwhere YY is the gene with missing value to be imputed andXX is any other gene. To obtain a symmetric version formeasuring nonlinear relationship, we can take

Sstdcol;symðx;yÞ ¼ max Sstdcol ðx;yÞ; Sstdcol y;xð Þ� �

:

2.2 Selecting Informative Genes

To impute the missing values of a gene, we utilize bothnonlinear dependency (measured by SCOL) and lineardependency (measured by the absolute value of Pearson’scorrelation coefficient) to select informative genes. Thereason for combining CC and SCOL in gene selection isbecause CC is more responsive when a slight curvatureexist, yet the relationship is still monotone, while SCOL ismore responsive when the curvature is large and/or therelation is nonmonotone.

First, we compute the absolute correlation matrix CC andthe SCOL matrix BB, in which bij ¼ Sstdcol ðgi;gjÞ. To computethe ith row in BB, we reorder the columns of the expressionmatrix Gn�m based on the rank order of the ith gene togenerateG�n�m. Then, absolute differences are taken betweenadjacent columns of the reordered matrix to generate the

2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011

Fig. 1. Illustration of the similarity based on conditional ordered list. (a) An example where YY and XX have a nonlinear relationship and the spread ofYY jXX is small. (b) The same data as in (a) with the role of XX and YY reversed, resulting in much lower similarity score because the spread of YY jX islarge. Red vertical line segments: distance between adjacent YY values based on the list ordered by the value of XX. Blue dotted horizontal linesegments: distance between adjacent XX values.

Page 3: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

difference matrix F : f j ¼ jg�jþ1 � g�j j; j ¼ 1; . . . ;m� 1. When

missing values exist in the ith gene, the corresponding

column of the expression matrix is ignored. There are

missing entries in the FF matrix caused by missing values in

other genes. We then take the row-wise mean values of the FF

matrix, ignoring missing entries, and multiply by the

number of columns ofFF . After transforming the raw distance

values to SCOL scores, the vector is used for the ith row of

the BB matrix.Second, for every gene that has missing values, its top kk

neighbors as defined by absolute CC and top k neighbors as

defined by SCOL are found from the corresponding

columns of the CC and BB matrices. Combining these two

sets of neighbors gives us the set of informative genes.

Overlaps may exist, which means the total number of

informative genes may be smaller than 2k. In practice, the

optimal k can be selected by masking some observed values

and then evaluating the imputation accuracy using different

values of k.

2.3 Imputing the Missing Value

Since nonlinear relations are involved, and each informative

gene may show a different response pattern with the gene of

interest, we use a procedure based on kernel smoother in

conjunction with weighted averaging for the imputation. In

steps 1 to 4 below, we find an imputed value from one

informative gene. We use gx to denote the gene expression

profile with a missing value to be imputed, and let gx1 be

missing. We use gi to denote the expression profile of an

informative gene. To simplify the discussion, we assume

both expression profiles are nonmissing at positions 2; ::::; p.

Steps 1 to 3 are simply steps of fitting a kernel smoother.

Because our interest is in the fit at the missing value location,

only a single point of the kernel smoother is calculated.

1. Find the difference between the nonmissing valuesof the two expression profiles.

dij ¼ gxj � gij; j ¼ 2; . . . ; p:

The purpose of using the difference, rather than

gx itself, is to alleviate underestimation of extreme

values when the relation is monotone.2. Find the weight based on the Gaussian density.

Observations with gij closer to gi1 receive higherweights.

wij ¼ e�ðgij�gi1Þ2=2�2

i ; j ¼ 2; . . . ; p;

where �i is one half of the 1/3 quantile of jgij � gi1j;j ¼ 2; . . . ; p. This ensures that 1/3 of the samples are

within 2�i and make major contribution to the

imputation.3. Impute the missing value by weighted mean,

di1 ¼Xp

j¼2

dijwij

.Xp

j¼2

wij

gix1 ¼ di1 þ gi1:

4. Take the weighted variance of the d vector for ameasure of confidence,

s2i ¼

Ppj¼2 wijðdij � di1Þ

2

p�2p�1

Ppj¼2 wij

:

5. Merge the imputed values from different informativegenes by weighted averaging. The weight has twocomponents: 1) genes that have higher SCOL simi-larity with the gene of interest receive higher weight,and 2) predictions with smaller weighted variancereceive higher weight. We assign the weight by

�i ¼ 1=e�6Sstdcolðgi;gxÞsi:

The reason the parameter value 6 was chosen for theexponential decay function was that we observed empiri-cally most top neighbors showed similarity scores between0.5 and 0.9. With the parameter 6, the weight increasesroughly 10 fold from an SCOL score of 0.5 to an SCOL scoreof 0.9. The final imputed value is the weighted average

gx1 ¼Xni¼1

�igix1

.Xni¼1

�i;

where n is the number of informative genes used.

2.4 Simulation Study

2.4.1 Data Sets

Six data sets were used in the simulation study. Theyincluded the B-cell lymphoma profiling data [21], the dataset of yeast transcriptome/translatome comparison [22], theNCI60 cell line gene expression data [23], and the GSE19119data set on Atlantic salmon [24]. Two yeast cell cycle timeseries [25], the alpha factor data set and the elutriation dataset, were used to probe the effect of data normalization onsimulation results in imputation studies. The number ofgenes/conditions of each data set is listed in Tables 1 and 2.

2.4.2 Methods Compared

Four popular imputation methods were used for compar-ison. They included the KNN method [2], the Bayesian PCA(BPCA) method [7], the LLS method [5], and the SVDmethod [2]. The R implementations of these methods wereused, which involved packages “impute” and “pca-Methods” [27]. In the results, we refer to our method as“NL.” For NL and KNN, the data matrix (after maskingknown expression values) was first row (gene)-standardizedbefore the imputation, and the values were transformedback after imputation.

2.4.3 Simulating Missing Patterns

Different percentages of missing patterns (1 percent,5 percent, 10 percent, 15 percent and 20 percent) weresimulated. To simulate data with realistic missing patterns,we used a scheme to learn the missing patterns directly fromthe real data. For every data set, we divided the data into twoparts—matrix AA containing all the genes without anymissing value, and matrix BB containing genes with missingvalues. The simulated data were generated from matrix AA.Assume matrix AA had n rows and m columns, and wetargeted to obtain a simulated data matrix with p percent

YU ET AL.: INCORPORATING NONLINEAR RELATIONSHIPS IN MICROARRAY MISSING VALUE IMPUTATION 3

Page 4: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

missing values. For every gene gx in matrix AA, we first took a

random sample from the Poisson distribution with � ¼mp=100 to assign the number of missing entries to the gene.Assume the assigned number of missing entries was l. Wethen selected all the genes from matrix BB that contained l

missing entries, and randomly sampled one gene from them,denoted g1. We then found the positions of the g1 vector thatwere missing values, and assigned N/A to the samepositions of the gx vector. By carrying out this procedure

for every gene in matrix AA, we obtained a simulated matrixAA� that contained p percent missing values.

2.4.4 Parameter Setting

A preliminary simulation study was conducted to select thebest parameter setting for each method. We tested differentvalues of the tuning parameters at every missing percentage,by repeating the mask-impute procedure three times. Theparameter values tested for the NL method were k ¼ 5, 10,15, 20; the values for the KNN method were k ¼ 5, 10, 15, 20;the values for LLS were k ¼ 5 percent, 10 percent, 20 percent,30 percent; . . . . . . ; 90 percent of the number of genes of thedata set; the values for SVD were nPcs = 2, 4, 6, 8, 10, 15, 20,

30, 40 (up to the number of arrays in the data set). No

parameter selection was done for the BPCA method,

because it’s been established that the recommended value

for the nPcs parameter, which is the number of samples

minus 1, performs the best [7]. Other than the major tuning

parameters, we relied on the default of each method for

other parameters.To measure the success of imputation and compare the

performance between different parameter settings, we used

the normalized root mean squared error (NRMSE),

nrmseðy; yÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1n

Pni¼1 ðyi � yiÞ

2qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1n�1

Pni¼1 ðyi � �yÞ2

q :

It measures the deviation of the imputed values from the

hidden true values. For each method, the parameter setting

that achieved the lowest average NRMSE in the preliminary

simulation study (Tables 1 and 2) was selected for the larger

scale simulation study, in which we repeated the mask-

impute procedure 50 times for every data set at each of the

five missing percentage settings.

4 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011

TABLE 1Data Sets and Parameter Settings of the Simulation Study

TABLE 2Data Sets and Parameter Settings for the Study of Normalization Effect

Page 5: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

2.4.5 Comparing the Performance

To compare the performance of the methods, we employedtwo methods. The first was NRMSE. The second methodmeasured the impact of the imputation on the statisticaltesting results. The general idea was to measure how muchthe imputation changed the list of significant genes instatistical testing, as compared to the test results generatedfrom the original data without missing values. First, testingwas conducted on the complete data, and the list ofsignificant genes taken as the reference. Then after eachmask-impute process, the same testing procedure wasapplied and the list of significant genes was comparedwith the reference list. The false positive (FP) and falsenegative (FN) counts were analyzed to judge the success ofthe imputation. It should be noted that the reference listfrom the original data was not the biological truth. Thus, theresults measured only relative changes, and the FP and FNcounts carried no direct biological interpretation. The ideawas that the imputation method should change the list ofsignificant genes as little as possible.

Different statistical tests were selected based on thecharacteristics of each data set. All the tests were done on agene-by-gene basis. After testing, the p-values from all thegenes were transformed to false discovery rate (FDR) [28],and genes with FDR less than a threshold (detailed in theResults section) were considered significant. Different FDRcutoffs were selected for different data sets, in order toreach a significant gene list of reasonable size for theassessment of FPs and FNs rates after imputation.

For the B-cell lymphoma profiling (Alizadeh) data set[21], the main clinical interest was the survival outcome.The accelerated failure time (AFT) model assuming theWeibull distribution was fitted for each gene using thesamples with survival information. For the yeast transcrip-tome/translatome comparison (Halbeisen) data set [22], wefitted a multiple linear regression model for each gene, withRNA source (transcriptome/translatome) as one predictor,stress treatments as another predictor, and the geneexpression value as the response. Because the main interestof the study was the contrast between transcriptome andtranslatome, we focused on genes significantly associatedwith the RNA source predictor. For the NCI60 cell line(Ross) data set [23], a regression model was fitted for eachgene, with the tissue origin of the cancer cell lines as thepredictor, and the gene expression value as the response.The F-test p-value for the overall fit, which measured theassociation of the gene expression with the tissue origin,was collected. For the yeast cell cycle time series (Spellman)data sets [25], because the main interest was in genesshowing periodic behavior, the Fisher’s exact g test forperiodicity [29] was used to identify periodically expressedgenes. The only data set for which no test was performedwas the GSE19119 (Tymchuk) data set [24], as no clearoutcome could be used for testing.

3 RESULTS AND DISCUSSIONS

We illustrate the calculation of SCOL in Fig. 1. The datapoints are normal score transformed and ordered by the XXvalues, and the absolute distance between adjacent YY values

in the list are summed up (Fig. 1, red bars). After furthertransformation as described in the Methods section, theSCOL similarity measure is obtained. When the spread ofY jXY jX is small (Fig. 1a), the distances are small and thesimilarity is high. When the spread of YjX is large (Fig. 1b),the distances are big and the similarity is low. Clearly,SCOL similarity score is asymmetric as demonstrated by thetwo panels in Fig. 1 (same data; role of XX and YY reversed).Because in missing value imputation, we only focus on theconditional variance of the gene with a missing value giventhe value of the informative gene, the asymmetric version ofSCOL serves the purpose well. In the case of Fig. 1a, anobserved value in XX will be helpful to predict the value ofYY . Yet in the case of Fig. 1b, an observed value in XX will notbe as helpful in predicting the value of YY .

3.1 Simulation Results

By masking observed values at different percentages, wecompared the performance of the five methods by findingthe NRMSE, and by examining the impact of the imputationon statistical testing results. The results are summarized inFig. 2. With the Alizadeh, Ross, and Tymchuk data sets, NLshowed an advantage over the other methods in terms ofNRMSE, while with the Halbeisen data set, LLS and BPCAperformed better (Fig. 2).

We further explored the impact of the imputation on theselection of significant genes. The list of significant genesafter imputation was compared to the reference listgenerated from the original data. The “false positives” and“false negatives” in the subsequent discussion are onlyrelative to the reference list, not the unknown biologicaltruth. We selected different FDR cutoffs between 0.01 and 0.2for different data sets in order to achieve a reasonable size ofthe reference list to facilitate FP and FN analysis (SupportingTable 1). For the Alizadeh data set, the AFT model was usedand the FDR cutoff of 0.2 (corresponding p-value cutoff0.0025) was chosen, which yielded 65 (1.1 percent) signifi-cant genes. For the Halbeisen data set, multiple regressionwas used and the FDR cutoff of 0.01 (corresponding p-valuecutoff 0.0016) was chosen, which yielded 1,184 (15.9 percent)significant genes. For the Ross data set, ANOVA was usedand the FDR cutoff of 0.01 (corresponding p-value cutoff0.0022) was chosen, which yielded 510 (22.5 percent)significant genes.

The simulation results showed that NL’s lead in NRMSEtranslated to a lead in FPþ FN counts, in both the Alizadehdata set and the Ross data set (Fig. 2, column 2). In addition,KNN showed stronger performance in FPþ FN counts, ascompared to it’s own performance in NRMSE. Interestingly,for the Halbeisen data set, BPCA and LLS’s lead in NRMSEdidn’t translate to a clear lead in FPþ FN counts. NL’sperformance in FPþ FN was extremely close to LLS andBPCA up to 15 percent missing, while it showed loweraverage FPþ FN counts at 20 percent missing. Pronounceddifferences were found between the methods in terms ofFP/FN ratio (Fig. 2, column 3). NL was almost always themost conservative in this aspect, producing FP/FN ratiosthat were low. Using both the Halbeisen data set and theRoss data set, NL was the only method producing FP/FNratios close to 1 (dashed line in the plots, zero in log scale).With high-throughput data, the key concern is controlling

YU ET AL.: INCORPORATING NONLINEAR RELATIONSHIPS IN MICROARRAY MISSING VALUE IMPUTATION 5

Page 6: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

the FDR. Hence, an imputation method that doesn’t

introduce too many false positives is preferable. In this

aspect, NL clearly outperformed the other methods.

3.2 The Impact of Rowwise Mean Removal on theEvaluation of Imputation Methods

The most common method of evaluating the performance of

imputation methods is through simulations using existing

microarray data sets. However, it should be noted that some

preprocessing of the data sets may cause artifacts in the

evaluation. Particularly, overly optimistic results may be

obtained for imputation methods that use global correlation

structures, when the processing of the data introduces

artificial dependency among the columns (arrays) of the

data matrix. This is the case because although some data

points are changed to “missing” in the simulation, their true

values have played a role in the normalization already, and

the information needed to retrieve the true values is in fact

6 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011

Fig. 2. Simulation results for four data sets based on 50 simulations. The data sets (labeled on the left side) are in rows and the type of results(labeled on top) are in columns. Error bars represent standard errors. No testing was performed for the Tymchuk data, because no clear outcomecan be used for testing.

Page 7: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

buried in other columns. In other words, with the row-wisenormalization, each column of the matrix becomes perfectlylinearly dependent on other columns. If an imputationmethod utilizes this dependency, it will recover the missingvalue almost perfectly when there’s only one value missingin a row. However, this perfect recovery only happens insimulations. When a value is truly missing, it cannot haveplayed a role in the normalization. With random noise inmicroarray measurements, perfect linear dependency be-tween columns doesn’t exist in the raw data.

The Spellman’s cell cycle data were widely used in theevaluation and comparison of imputation methods. It hasbeen the data set that showed the most dramatic differencesbetween methods. When downloaded in matrix form at thecell cycle study website (http://cellcycle-www.stanford.edu/), the data were already subjected to normalizationincluding row-wise mean removal within every time seriesexperiment, i.e., every gene’s expression levels werenormalized to obtain mean zero within every time series.This is suggested in the original publication [25] andevident by examining the data. The data were rounded totwo decimal places, hence the row means can deviate veryslightly from zero. When we used the two normalizedSpellman time series data sets in the simulations, LLS andBPCA were dramatically better in terms of NRMSE than theother three methods (Fig. 3, column 1), which wasconsistent with previous studies [1], [5], [7].

In order to make a comparison, we downloaded the rawdata (original log ratios) from the Stanford Microarray

Database [30], and assembled the raw data matrix. When agene was represented by more than one probes on thearray, the mean of log ratios was taken to obtain a singlereading. We then took the genes that were included in thesimulation study with normalized data. Due to smalldifferences in missing patterns, the numbers of genes wereslightly different from the normalized data (Table 2). Usingthe non-normalized log ratios, we conducted anothersimulation study (parameters listed in Table 2). The resultsshowed that with non-normalized data, although LLS andBPCA still led in performance in NRMSE, the differencebetween the methods was drastically smaller (Fig. 3,compare column 2 with column 1). To prove the pointfrom the other end, we conducted another simulation usingthe NCI60 data. To introduce a dependency that is at similarlevel as the cell cycle data, we partitioned the NCI60 datamatrix into three equal-sized submatrices, by selecting threegroups of continuous columns—columns 1-20, 21-40, and41-60. Within every submatrix, we conducted row-wisemean removal. Then, we subjected the whole matrix to themask-impute simulation procedure. The simulation yieldeda dramatic decreases in the NRMSE from LLS and BPCA(Fig 3, row 3). Comparatively, the SVD method, althoughutilizing global dependency, did not make as much artificialgains when the data were row-normalized. This is becausealthough the last three eigen values were driven to zero byrow-normalization, the relative scale of the larger eigenvalues didn’t change much. The SVD method largely relieson the first several singular vectors, which prevented it

YU ET AL.: INCORPORATING NONLINEAR RELATIONSHIPS IN MICROARRAY MISSING VALUE IMPUTATION 7

Fig. 3. Simulation results for three data sets before/after row-wise mean removal. The data sets (labeled on the left side) are in rows and the type ofresults (labeled on top) are in columns. Error bars represent standard errors. The testing results (columns 3 and 4) were generated from the non-normalized data. The testing results for the Ross data are reported in Fig. 2.

Page 8: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

from taking advantage of the perfect dependency createdby row-wise mean removal. Based on these results, we cansee that the impact of row-wise mean removal on theNRMSE results could be substantial. In addition toinfluencing the performance estimations in simulationstudies, normalization can also impact real-data analysis,in which we often resort to cross validation using thecomplete portion of the data to select the method andparameter settings. The same artifact as detailed abovecould influence the choice of method and parametersettings, if the datum is subjected to row-wise meanremoval before the cross validation.

Although LLS and BPCA didn’t show as much advan-tage using the original data, they still led in the performanceas judged by NRMSE. This is expected because the cell cyclearrays contain strong between-array dependencies [1]. Wefurther explored whether the better NRMSE results trans-lated to better testing results. Here, we used the Fisher’sexact g test to find periodic behaviors in the genes, becausethe Spellman experiment was designed to find cell cycle-related genes. The FDR cutoff of 0.2 was used, whichyielded �10% significant genes in both data sets (Support-ing Table 1). NL and KNN showed a clear lead in FPþ FNcounts despite they trailed in NRMSE results. Again NLwas the most conservative method in terms of FP/FN ratio,though for the elutriation data set all method yielded manymore false positives than false negatives (Fig. 3, column 4).

To better understand why NL tended to be moreconservative in terms of FP/FN ratio, we selected one ofthe scenarios, 10 percent missing in the Halbeisen data set, forfurther examination. We repeated the mask-impute process16 times and recorded the test p-values for all the genes. Wethen compared the test p-values after imputation against thecomplete data p-values. For the NL method, 46.2 percent ofthe test p-values after imputation were smaller than thecorresponding p-values from the complete data, while for allthe other methods, over 50 percent of the test p-values afterimputation were smaller than the corresponding p-valuesfrom the complete data (KNN: 52.6 percent, BPCA: 54.4 per-cent; LLS: 53.2 percent; SVD: 55.8 percent). Given that most ofthe genes were insignificant in the complete data, thetendency of the methods to decrease the p-values, evenslightly, made a lot of false positives. NL was conservative inthe sense that on average it tends to decrease the significanceof a gene, albeit slightly, when imputation is done.

In summary, the nonlinear imputation method based onSCOL and kernel smoother showed strong performance inthe imputation of missing data, especially when judged bythe impact on the statistical testing results. Aside fromgenerating lower FPþ FN counts, its property of beingconservative in terms of FP/FN ratio is desirable, ascontrolling false discovery rate is a key concern whenconducting tests using high-throughput data.

ACKNOWLEDGMENTS

The authors thank Dr. Qi Long for helpful discussions. Thisresearch is partially supported by NIH grants1P01ES016731-01, 2U19AI057266-06, 5P30AI50409-10, and1UL1RR025008-02, and a grant from the University Re-search Committee of Emory University.

REFERENCES

[1] G.N. Brock et al., “Which Missing Value Imputation Method toUse in Expression Profiles: A Comparative Study and TwoSelection Schemes,” BMC Bioinformatics, vol. 9, article no. 12, 2008.

[2] O. Troyanskaya et al., “Missing Value Estimation Methods forDNA Microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520-525, June2001.

[3] T.H. Bo, B. Dysvik, and I. Jonassen, “LSimpute: AccurateEstimation of Missing Values in Microarray Data with LeastSquares Methods,” Nucleic Acids Research, vol. 32, no. 3, p. e34,2004.

[4] D. Yoon, E.K. Lee, and T. Park, “Robust Imputation Method forMissing Values in Microarray Data,” BMC Bioinformatics, vol. 8,S6, 2007.

[5] H. Kim, G.H. Golub, and H. Park, “Missing Value Estimation forDNA Microarray Gene Expression Data: Local Least SquaresImputation,” Bioinformatics, vol. 21, no. 2, pp. 187-198, Jan. 2005.

[6] D.S. Wong, F.K. Wong, and G.R. Wood, “A Multi-Stage Approachto Clustering and Imputation of Gene Expression Profiles,”Bioinformatics, vol. 23, no. 8, pp. 998-1005, Apr. 2007.

[7] S. Oba et al., “A Bayesian Missing Value Estimation Method forGene Expression Profile Data,” Bioinformatics, vol. 19, no. 16,pp. 2088-2096, Nov. 2003.

[8] X. Wang et al., “Missing Value Estimation for DNA MicroarrayGene Expression Data by Support Vector Regression Imputationand Orthogonal Coding Scheme,” BMC Bioinformatics, vol. 7,article no. 32, 2006.

[9] R. Jornsten et al., “DNA Microarray Data Imputation andSignificance Analysis of Differential Expression,” Bioinformatics,vol. 21, no. 22, pp. 4155-4161, Nov. 2005.

[10] M.S. Sehgal, I. Gondal, and L.S. Dooley, “Collateral Missing ValueImputation: A New Robust Missing Value Estimation Algorithmfor Microarray Data,” Bioinformatics, vol. 21, no. 10, pp. 2417-2423,May 2005.

[11] M.S. Sehgal et al., “Ameliorative Missing Value Imputation forRobust Biological Knowledge Inference,” J. Biomedical Informatics,vol. 41, no. 4, pp. 499-514, Aug. 2008.

[12] R. Jornsten, M. Ouyang, and H.Y. Wang, “A Meta-Data BasedMethod for DNA Microarray Imputation,” BMC Bioinformatics,vol. 8, article no. 109, 2007.

[13] J. Tuikkala et al., “Improving Missing Value Estimation inMicroarray Data with Gene Ontology,” Bioinformatics, vol. 22,no. 5, pp. 566-572, Mar. 2006.

[14] X. Gan, A.W. Liew, and H. Yan, “Microarray Missing DataImputation Based on a Set Theoretic Framework and BiologicalKnowledge,” Nucleic Acids Research, vol. 34, no. 5, pp. 1608-1619,2006.

[15] Q. Xiang et al., “Missing Value Imputation for Microarray GeneExpression Data Using Histone Acetylation Information,” BMCBioinformatics, vol. 9, article no. 252, 2008.

[16] K.C. Li et al., “A System for Enhancing Genome-Wide Coexpres-sion Dynamics Study,” Proc. Nat’l Academy of Sciences USA,vol. 101, no. 44, pp. 15561-15566, Nov. 2004.

[17] T. Suzuki et al., “Mutual Information Estimation Reveals GlobalAssociations between Stimuli and Biological Processes,” BMCBioinformatics, vol. 10, S52, 2009.

[18] W. Luo, K.D. Hankenson, and P.J. Woolf, “Learning Transcrip-tional Regulatory Networks from High Throughput Gene Expres-sion Data Using Continuous Three-Way Mutual Information,”BMC Bioinformatics, vol. 9, article no. 467, 2008.

[19] P.E. Meyer, F. Lafitte, and G. Bontempi, “minet: A R/Bioconduc-tor Package for Inferring Large Transcriptional Networks UsingMutual Information,” BMC Bioinformatics, vol. 9, article no. 461,2008.

[20] X. Zhou, X. Wang, and E.R. Dougherty, “Missing-Value Estima-tion Using Linear and Non-Linear Regression with Bayesian GeneSelection,” Bioinformatics, vol. 19, no. 17, pp. 2302-2307, Nov. 2003.

[21] A.A. Alizadeh et al., “Distinct Types of Diffuse Large B-CellLymphoma Identified by Gene Expression Profiling,” Nature,vol. 403, no. 6769, pp. 503-511, Feb. 2000.

[22] R.E. Halbeisen and A.P. Gerber, “Stress-Dependent Coordinationof Transcriptome and Translatome in Yeast,” PLoS Biology, vol. 7,no. 5, e105, May 2009.

[23] D.T. Ross et al., “Systematic Variation in Gene Expression Patternsin Human Cancer Cell Lines,” Nature Genetics, vol. 24, no. 3,pp. 227-235, Mar. 2000.

8 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X, XXXXXXX 2011

Page 9: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...weisun/research/TCBB-2009-09-0156-2.pdf · 2 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. X,

[24] http://0-www.ncbi.nlm.nih.gov.millennium.unicatt.it/projects/geo/query/acc.cgi?acc=GSE19119, 2009.

[25] P.T. Spellman et al., “Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae byMicroarray Hybridization,” Molecular Biology of the Cell, vol. 9,no. 12, pp. 3273-3297, Dec. 1998.

[26] K.C. Li, M. Yan, and S.S. Yuan, “A Simple Statistical Model forDepicting the cdc15-Synchronized Yeast Cell-Cycle RegulatedGene Expression Data,” Statistica Sinica, vol. 12, no. 1, pp. 141-158,Jan. 2002.

[27] W. Stacklies et al., “pcaMethods—A Bioconductor PackageProviding PCA Methods for Incomplete Data,” Bioinformatics,vol. 23, no. 9, pp. 1164-1167, May 2007.

[28] J.D. Storey and R. Tibshirani, “Statistical Significance for Genome-wide Studies,” Proc. Nat’l Academy of Sciences USA, vol. 100, no. 16,pp. 9440-9445, Aug. 2003.

[29] S. Wichert, K. Fokianos, and K. Strimmer, “Identifying Periodi-cally Expressed Transcripts in Microarray Time Series Data,”Bioinformatics, vol. 20, no. 1, pp. 5-20, Jan. 2004.

[30] J. Demeter et al., “The Stanford Microarray Database: Implemen-tation of New Analysis Tools and Open Source Release ofSoftware,” Nucleic Acids Research, vol. 35, pp. D766-770, Jan. 2007.

Tianwei Yu received the PhD degree instatistics from the University of California, LosAngeles, in 2005. Since 2006, he has been anassistant professor in the Department of Biosta-tistics and Bioinformatics at Emory University.His current research interests include high-dimensional data analysis, biological networks,and data preprocessing in metabolomics.

Hesen Peng received the BS degree in statisticsfrom Fudan University, Shanghai, China, in2008, and is, currently, a second year PhDdegree student in biostatistics at Emory Univer-sity. In 2008, he was a research intern at ChineseAcademy of Science—Max Planck Group Part-ner Institute of Computational Biology. Hisresearch interests include high-dimensional dataanalysis, nonlinear multidimensional networkdiscovery, and time series analysis.

Wei Sun received the PhD degree in statisticsfrom the University of California, Los Angeles, in2007. Since 2007, he is an assistant professor inthe Department of Biostatistics and the Depart-ment of Genetics, University of North Carolina atChapel Hill. His current research interestsinclude developing statistical/computationalmethods and software in the area of statisticalgenetics or genomics. In particular, he isinterested in the topics of gene expression

QTL (eQTL), RNA-seq and allele-specific expression, ChIP-chip/ChIP-seq analysis, and copy number alterations.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

YU ET AL.: INCORPORATING NONLINEAR RELATIONSHIPS IN MICROARRAY MISSING VALUE IMPUTATION 9