
Statistical Analysis of Big Data on Pharmacogenomics

Jianqing Fan and Han Liu

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA

Abstract

This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively overview several prominent statistical methods: estimating a large covariance matrix for understanding correlation structure, estimating an inverse covariance matrix for network modeling, large-scale simultaneous tests for selecting significantly differentially expressed genes, proteins, and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules and understanding molecular mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of Big data analysis, including complex data distributions, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed.

Keywords: Big data, High dimensional statistics, Approximate factor model, Graphical model, Multiple testing, Variable selection, Marginal screening, Robust statistics.

1. Introduction

The ultimate goal of pharmacogenomics is to improve health care based on individual genomic profiles [1, 2, 3, 4, 5, 6, 7]. Together with other factors that may affect drug response — such as diet, age, diseases, lifestyle, environment, and state of health — pharmacogenomics has the potential to facilitate the creation of individualized treatment plans for patients and to advance the overarching goal of personalized medicine. Similar to other areas of human genomics, pharmacogenomics is experiencing an explosion of data, driven by overloads of omics information (genomes, transcriptomes, and other data from cells, tissues, and drug effects) [8]. Pharmacogenomics research is entering the era of "Big data" — a term that refers to the explosion of available information. The vast amount of data brings both opportunities and new challenges, and statistical analysis for Big data is becoming increasingly important [9, 10, 11, 12].

This paper selectively overviews several state-of-the-art statistical methods for analyzing large pharmacogenomics data. In particular, we emphasize the fundamental problem of estimating two types of correlation structure: marginal correlation and conditional correlation. The former represents the correlation between two variables ignoring the remaining variables, while the latter represents the correlation between two variables conditioning on the remaining ones. The marginal correlation can be estimated using the covariance matrix; the conditional correlation can be estimated using the inverse covariance matrix. We introduce several recently developed statistical methods for estimating high dimensional covariance and inverse covariance matrices. We also introduce cutting-edge methods to estimate false discovery proportions for large-scale simultaneous tests, and to select important molecules or SNPs in high dimensional regression models. The former corresponds to finding molecules or SNPs that have significant marginal correlations with biological outcomes, while the latter finds conditional correlations with biomedical responses in the presence of many other molecules or SNPs.

The rationale behind this paper is our belief that Big data provide new opportunities for estimating complex correlation structures among a large number of variables. The methods surveyed here are new to the pharmacogenomics community and have the potential to play important roles in analyzing the next generation of Big data in this area.

A notable feature of most methods introduced in this paper is the exploitation of a sparsity assumption, which is an essential concept for modern statistical methods applied to high dimensional data. For covariance estimation, we briefly introduce the thresholding approach [13] and its extension called POET (Principal Orthogonal complEment Thresholding) [14, 15], which provides a unified view of most previous methods. For inverse covariance estimation, we mainly focus on two methods named CLIME [16] and TIGER [17], which stand respectively for "Constrained L1-Minimization for Inverse Matrix Estimation" and "Tuning-Insensitive Graph Estimation and Regression". Both methods estimate the inverse covariance matrix in a column-by-column fashion and achieve the minimax optimal rates of convergence under different norms. For applications, we discuss how the estimated covariance matrix and inverse covariance matrix can be exploited for gene network estimation and large-scale simultaneous tests. We introduce the principal factor approximation (PFA) of [18] for estimating and controlling false discovery proportions in large-scale simultaneous tests that are dependent. In addition, we introduce penalized likelihood [19], variable screening [20], and their iterated version of screening and selection for high dimensional variable selection. All these problems are important and widely applicable in different subareas of pharmacogenomics. Besides surveying existing methods, we also discuss new challenges and possible solutions of Big data analysis. Topics include modeling non-Gaussianity, handling missing data, dealing with measurement error, spurious correlation, and endogeneity, and developing robust methods.

The rest of this paper is organized as follows. In §2, we summarize the notation. In §3, we introduce statistical methods for estimating large covariance matrices. In §4, we introduce statistical methods for estimating large inverse covariance matrices. In §5, we apply the results of §3 and §4 to large-scale multiple tests. In §6, we introduce high dimensional variable selection. In §7, we provide case studies of several of the described methods. In §8, we discuss future directions.

2. Notation

In this section, we summarize some notation. Let $A = (a_{jk}) \in \mathbb{R}^{d \times d}$, $B = (b_{jk}) \in \mathbb{R}^{d \times d}$ and $v = (v_1, \ldots, v_d)^T \in \mathbb{R}^d$. Denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the smallest and largest eigenvalues of $A$. The inner product of $A$ and $B$ is defined as $\langle A, B \rangle = \mathrm{tr}(A^T B)$. Define the vector norms $\|v\|_1 = \sum_j |v_j|$, $\|v\|_2^2 = \sum_j v_j^2$, and $\|v\|_\infty = \max_j |v_j|$. We also define the matrix operator and elementwise norms $\|A\|_2^2 = \lambda_{\max}(A^T A)$ and $\|A\|_F^2 = \sum_{j,k} a_{jk}^2$. The notation $A \succ 0$ means that $A$ is positive definite, and $a_n \asymp b_n$ means that there are positive constants $c_1$ and $c_2$ independent of $n$ such that $c_1 b_n \le a_n \le c_2 b_n$.

3. Estimating Large Covariance Matrix

Estimating a large covariance or correlation matrix under a small sample size is a fundamental problem which has many applications in pharmacogenomics.

For example, in functional genomics, an important problem is to cluster genes into different groups based on the similarities of their microarray expression profiles. One popular measure of the similarity between a pair of genes is the correlation of their expression profiles. Thus, if d genes are being analyzed (with d ranging from the order of ~1,000 to ~10,000), a correlation matrix of size d × d needs to be estimated. Note that a 1,000 × 1,000 covariance matrix already involves over half a million elements. Yet the sample size n is of order ~100, which is significantly smaller than the dimensionality d. Thus, the sample covariance matrix degenerates and needs regularization.

As another example, many multivariate statistical methods that are useful in pharmacogenomics, including principal component analysis (PCA), linear discriminant analysis (LDA), multivariate regression analysis, the Hotelling $T^2$-statistic, and multivariate likelihood ratio tests such as those used in finding quantitative trait loci based on longitudinal data [21], require the estimation of the covariance matrix as their first step. Thus a reliable estimate of the covariance matrix is of paramount importance.

Moreover, the correlation matrix itself can be used to construct a co-expression gene network, which is an undirected graph whose edges indicate marginal correlations among the d genes. The network is built by drawing edges between those pairs of genes whose pairwise correlation coefficients exceed a certain threshold in magnitude. More applications of large covariance matrix estimation will be discussed in §7.

A common key problem underlying all these examples is as follows. Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be $n$ independent observations of a $d$-dimensional random vector $X = (X_1, \ldots, X_d)^T$. Without loss of generality, we assume $\mathbb{E}X = 0$. We want to find a reliable estimate of the population covariance matrix $\Sigma^* = \mathbb{E}XX^T$. At first sight, this problem does not seem to be challenging. In the literature, $\Sigma^*$ was traditionally estimated by the sample covariance matrix

$S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ with $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$,  (3.1)

which has many good theoretical properties when the dimensionality $d$ is small. However, in the more realistic settings where the dimensionality $d$ is comparable to or even larger than $n$ (i.e., $d/n$ goes to a nonzero constant or to infinity), the sample covariance matrix in (3.1) is no longer a good estimate of the population covariance matrix $\Sigma^*$. More details are explained below.

3.1. Inconsistency of Sample Covariance in High Dimensions

We use a simple simulation to illustrate the inconsistency of the sample covariance matrix in high dimensions. Specifically, we sample n data points from a d-dimensional spherical Gaussian distribution X ∼ N(0, Σ∗) with Σ∗ = I_d, where I_d is the d-dimensional identity matrix. We consider different setups including d/n = 2, d/n = 1, d/n = 0.2, and d/n = 0.1. The results are summarized in Figure 1.

Figure 1 shows the sorted eigenvalues of the sample covariance matrix S (black curve) together with the true eigenvalues (dashed red curve) for d = 1,000. By examining these plots, we see that when the dimensionality d is large, the eigenvalues of S deviate significantly from their true values. In fact, even when n is reasonably large compared with d (d/n = 0.1), the result is still not accurate.

This phenomenon can be characterized by random matrix theory. Let $d/n \to \gamma$ with $\gamma \in (0, 1)$ and let $\lambda_{\max}(S)$ be the largest eigenvalue of $S$. It has been shown that $\lambda_{\max}(S)$ converges to $(1 + \sqrt{\gamma})^2$ almost surely [22, 23], i.e.,

$P\big( \lim_{n \to \infty} \lambda_{\max}(S) = (1 + \sqrt{\gamma})^2 \big) = 1$.

This result illustrates that, for large d, the largest eigenvalue of the sample covariance matrix S does not converge to that of the population covariance matrix Σ∗.


Figure 1: Sorted eigenvalues of the sample covariance matrix S (black curve) and of the population covariance matrix Σ∗ (dashed red curve). In the simulation, we always use d = 1,000 but different ratios of d/n. The subfigures correspond to d/n = 2, d/n = 1, d/n = 0.2 and d/n = 0.1.
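To see this phenomenon concretely, the following minimal R sketch (our own illustration, not code from the paper) repeats the Figure 1 experiment and compares the largest sample eigenvalue with the random matrix limit $(1 + \sqrt{d/n})^2$; the true largest eigenvalue is 1.

# Eigenvalue inflation of the sample covariance matrix when Sigma* = I_d.
set.seed(1)
d <- 1000
for (ratio in c(2, 1, 0.2, 0.1)) {
  n <- d / ratio
  X <- matrix(rnorm(n * d), nrow = n, ncol = d)          # rows are i.i.d. N(0, I_d)
  S <- cov(X)                                            # sample covariance matrix
  lambda_max <- eigen(S, symmetric = TRUE, only.values = TRUE)$values[1]
  cat(sprintf("d/n = %.1f: largest sample eigenvalue = %.2f, (1 + sqrt(d/n))^2 = %.2f\n",
              ratio, lambda_max, (1 + sqrt(ratio))^2))
}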

3.2. Sparse Covariance Matrix Estimation Methods

To handle the inconsistency issue of the high dimensional sample covariance matrix, most existing methods assume that the population covariance matrix is sparse, i.e., many off-diagonal entries of Σ∗ are zero. Such a sparsity assumption is reasonable in many pharmacogenomics applications. For example, in a longitudinal study where variables have a natural order, it is natural to assume that variables are weakly correlated when they are far apart [24]. Under the sparsity assumption, many regularization based statistical methods have been proposed to estimate the large covariance matrix [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]. These methods usually only require elementwise thresholding procedures and are computationally efficient. The simplest thresholding estimator is the hard-thresholding estimator [35, 13]

$\widehat{\Sigma} = \big( s_{ij}\, I(|s_{ij}| \ge \tau_{ij}) \big)$,  (3.2)

where $I(\cdot)$ is the indicator function and $s_{ij}$ is the sample covariance between $X_i$ and $X_j$. Here $\tau_{ij}$ is a thresholding parameter. Another example is the adaptive thresholding [29], which takes $\tau_{ij} = \mathrm{SD}(s_{ij})\,\tau$, where

$\mathrm{SD}(s_{ij}) = \sqrt{ \frac{1}{n} \sum_{k=1}^{n} (x_{ki} x_{kj} - s_{ij})^2 }$

is the estimated standard error of $s_{ij}$ and $\tau$ is a user-specified parameter (e.g., $\tau = \sqrt{(2\log d)/n}$). [15] suggests a simplified version $\tau_{ij} = \sqrt{s_{ii} s_{jj}}\,\tau$, so that the correlation is thresholded at level $\tau$ (e.g., $\tau = 0.2$). Estimator (3.2) is not necessarily positive definite. However, when $\tau$ is sufficiently large, it is positive definite with high probability.
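As a quick illustration, the following base-R sketch (our own, with the threshold level tau = 0.2 purely illustrative) implements the hard-thresholding estimator (3.2) with the correlation-level threshold $\tau_{ij} = \sqrt{s_{ii} s_{jj}}\,\tau$ just described.

# Hard-thresholding estimator (3.2) with correlation-level thresholds.
hard_threshold_cov <- function(S, tau = 0.2) {
  tau_ij <- tau * sqrt(outer(diag(S), diag(S)))   # tau_ij = sqrt(s_ii * s_jj) * tau
  Sig <- S * (abs(S) >= tau_ij)                   # keep only entries exceeding the threshold
  diag(Sig) <- diag(S)                            # the diagonal is never thresholded
  Sig
}

set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)             # n = 100 samples, d = 50 variables
S_hat <- hard_threshold_cov(cov(X), tau = 0.2)
min(eigen(S_hat, symmetric = TRUE, only.values = TRUE)$values)  # may be negative; see Section 3.3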

One additional example is the soft-thresholding estimator [15]:

$\widehat{\Sigma} = \big( \mathrm{sgn}(s_{ij})\, (|s_{ij}| - \tau_{ij})_+ \big)$,  (3.3)

where $(a)_+ = \max\{0, a\}$ and $\mathrm{sgn}(s_{ij}) = I(s_{ij} > 0) - I(s_{ij} < 0)$. Soft thresholding makes the estimator (3.3) positive definite for a wider range of $\tau_{ij}$ than the hard-thresholding estimator (3.2) [15]. Although estimators (3.2) and (3.3) suffice for many applications, more general sparse covariance estimators can be obtained via the following penalized least-squares problem [36]:

$\widehat{\Sigma} = \mathrm{argmin}_{\Sigma = (\sigma_{jk})} \Big\{ \frac{1}{2} \|S - \Sigma\|_F^2 + \sum_{j,k} P_{\lambda,\gamma}(\sigma_{jk}) \Big\}$,  (3.4)

where the term $\|S - \Sigma\|_F^2$ measures the goodness of fit and $\sum_{j,k} P_{\lambda,\gamma}(\sigma_{jk})$ is a penalty that encourages sparsity [19]. The tuning parameter λ controls the bias-variance tradeoff, and γ is a possible fine-tuning parameter which controls the concavity of the penalty. Popular choices of the penalty function $P_{\lambda,\gamma}(\cdot)$ include the hard-thresholding, soft-thresholding [37, 38], smoothly clipped absolute deviation (SCAD, [19]), and minimax concavity (MCP, [39]) penalties. In the following we explain these penalties in more detail.

Penalized least-squares, or more generally penalized likelihood, is a generally applicable method for variable selection [19, 40]. We introduce it in the context of sparse covariance matrix estimation. First, it is easy to see that the optimization problem in (3.4) decomposes into many univariate subproblems:

$\widehat{\sigma}_{jk} = \mathrm{argmin}_u \big\{ \tfrac{1}{2} (s_{jk} - u)^2 + P_{\lambda,\gamma}(u) \big\}$,  (3.5)

whose solution is carefully studied in [41]. Each subproblem can be solved analytically for the following commonly used penalty functions:

(1) Complexity penalty: $P_{\lambda,\gamma}(u) = \lambda\, I(u \neq 0)$.
(2) Hard-thresholding penalty: $P_{\lambda,\gamma}(u) = \lambda^2 - (\lambda - |u|)_+^2$.
(3) $L_1$-penalty: $P_{\lambda,\gamma}(u) = \lambda |u|$.
(4) Smoothly clipped absolute deviation (SCAD) penalty: with $\gamma > 2$,

$P_{\lambda,\gamma}(u) = \begin{cases} \lambda|u| & \text{if } |u| \le \lambda, \\ \dfrac{u^2 - 2\gamma\lambda|u| + \lambda^2}{2(1-\gamma)} & \text{if } \lambda < |u| \le \gamma\lambda, \\ \dfrac{(\gamma+1)\lambda^2}{2} & \text{if } |u| > \gamma\lambda. \end{cases}$

(5) Minimax concavity penalty (MCP): with $\gamma > 1$,

$P_{\lambda,\gamma}(u) = \begin{cases} \lambda\Big(|u| - \dfrac{u^2}{2\lambda\gamma}\Big) & \text{if } |u| \le \lambda\gamma, \\ \dfrac{\lambda^2\gamma}{2} & \text{if } |u| > \lambda\gamma. \end{cases}$

Note that for problem (3.5) both the complexity penalty and the hard-thresholding penalty produce the same solution, which is the hard-thresholding rule. Figure 2 visualizes these penalty functions for λ = 1. All penalty functions are folded concave, but the $L_1$-penalty is also convex. The parameter γ in SCAD and MCP controls the degree of concavity. From Figure 2(e) and Figure 2(g), we see that a smaller value of γ results in more concave penalties. When γ becomes larger, the SCAD and MCP penalties converge to the $L_1$-penalty.


Figure 2: Visualization of the penalty functions (top) and the corresponding thresholding operators (bottom) for the hard-thresholding, soft-thresholding, SCAD and MCP penalties. In all cases, λ = 1. For SCAD and MCP, different values of γ are shown (best viewed in color).

MCP is a generalization of the hard-thresholding penalty, which corresponds to γ = 1.

With the aforementioned penalties, the closed-form solution to the penalized least-squares problem (3.5) can be found. The complexity and hard-thresholding penalties give the hard-thresholding rule (3.2), and the $L_1$-penalty yields the soft-thresholding rule (3.3). The SCAD thresholding operator is given by

$\widehat{\sigma}_{jk} = \begin{cases} 0 & \text{if } |s_{jk}| \le \lambda, \\ \mathrm{sgn}(s_{jk})(|s_{jk}| - \lambda) & \text{if } \lambda < |s_{jk}| \le 2\lambda, \\ \dfrac{(\gamma-1)s_{jk} - \mathrm{sgn}(s_{jk})\gamma\lambda}{\gamma - 2} & \text{if } 2\lambda < |s_{jk}| \le \gamma\lambda, \\ s_{jk} & \text{if } |s_{jk}| > \gamma\lambda, \end{cases}$

and the MCP thresholding rule is

$\widehat{\sigma}_{jk} = \begin{cases} 0 & \text{if } |s_{jk}| \le \lambda, \\ \dfrac{\mathrm{sgn}(s_{jk})(|s_{jk}| - \lambda)}{1 - 1/\gamma} & \text{if } \lambda < |s_{jk}| \le \gamma\lambda, \\ s_{jk} & \text{if } |s_{jk}| > \gamma\lambda. \end{cases}$

When γ → ∞, the last two operators become the soft-thresholding operator.

How shall we choose among these thresholding operators in applications? We offer some rough insights and suggestions. From Figure 2(b), we see that the hard-thresholding operator does not introduce extra bias for the large nonzero entries; however, it is highly discontinuous. Unlike the hard-thresholding operator, the soft-thresholding operator in Figure 2(d) is continuous, but it introduces bias for the nonzero entries. From Figure 2(f) and Figure 2(h), we see that the SCAD and MCP thresholding operators are both continuous and do not introduce extra bias for nonzero entries with large absolute values. In applications, we therefore recommend using either the SCAD or MCP thresholding rule, since they combine the advantages of the hard- and soft-thresholding operators.
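The following base-R sketch (our own illustration; the values of λ and γ are only examples) codes the soft, SCAD and MCP thresholding operators above so that their bias behavior can be checked numerically.

# Soft-, SCAD- and MCP-thresholding operators applied entrywise to a value z.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

scad_threshold <- function(z, lambda, gamma = 3.7) {
  az <- abs(z)
  ifelse(az <= lambda, 0,
  ifelse(az <= 2 * lambda, sign(z) * (az - lambda),
  ifelse(az <= gamma * lambda,
         ((gamma - 1) * z - sign(z) * gamma * lambda) / (gamma - 2),
         z)))
}

mcp_threshold <- function(z, lambda, gamma = 3) {
  az <- abs(z)
  ifelse(az <= lambda, 0,
  ifelse(az <= gamma * lambda, sign(z) * (az - lambda) / (1 - 1 / gamma), z))
}

z <- c(0.5, 1.5, 5)
soft_threshold(z, lambda = 1)   # 0.00 0.50 4.00: large entries are shrunk (biased)
scad_threshold(z, lambda = 1)   # 0.00 0.50 5.00: large entries are left untouched
mcp_threshold(z, lambda = 1)    # 0.00 0.75 5.00: large entries are left untouched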

3.3. Positive Definite Sparse Covariance Matrix Estimation

The above thresholding methods, though simple, do not guarantee the positive definiteness of the estimated covariance matrix. This may cause trouble in downstream analyses such as evaluating the predictive likelihood or linear discriminant analysis. To handle this problem, we introduce a covariance estimation method named EC2 (Estimation of Covariance with Eigenvalue Constraints), which explicitly enforces a positive definiteness constraint [42].

EC2 solves the following constrained optimization problem:

$\widehat{\Sigma} = \mathrm{argmin}_{\lambda_{\min}(\Sigma) \ge \tau} \Big\{ \frac{1}{2}\|S - \Sigma\|_F^2 + \sum_{j,k} P_{\lambda,\gamma}(\sigma_{jk}) \Big\}$,  (3.6)

where τ > 0 is a pre-specified tuning parameter that controls the smallest eigenvalue of the estimated covariance matrix $\widehat{\Sigma}$. Compared with (3.4), the optimization problem in (3.6) is not decomposable due to the constraint $\lambda_{\min}(\Sigma) \ge \tau$. [42] proves that the optimization problem in (3.6) is convex when the penalty function is convex, and develops an efficient ISP (Iterative Soft-thresholding and Projection) algorithm to solve it. More details about ISP can be found in [42]. A similar algorithm is also proposed in [43].

Other sparse covariance matrix estimation methods with positive definiteness constraints include [28] and [44]. Both use a log-determinant function to enforce the positive definiteness of the estimated covariance matrix. Their main difference is that [44] adopts a convex least-squares formulation, while [28] adopts the penalized Gaussian log-likelihood approach.

3.4. Theory of Sparse Covariance Matrix Estimation

Under the sparsity assumption, the above sparse covariance estimators have good theoretical properties. In particular, for the positive definite sparse covariance matrix estimator $\widehat{\Sigma}$ defined in (3.6), it has been proven by various authors (for more details, see [42]) that

$\sup_{\Sigma^* \in \mathcal{M}_k} \mathbb{E}\, \|\widehat{\Sigma} - \Sigma^*\|_2 \asymp k \sqrt{\frac{\log d}{n}}$,

where $\mathcal{M}_k$ represents the model class

$\mathcal{M}_k = \Big\{ \Sigma : \Sigma \succ 0 \ \text{and} \ \max_{1 \le j \le d} \sum_{\ell=1}^d I(\Sigma_{j\ell} \neq 0) \le k \Big\}$.  (3.7)

This rate of convergence is minimax optimal. Similar results also hold for the different types of thresholding estimators [36] defined in (3.4). Using the triangle inequality, we get

$\big| \mathbb{E}\,\lambda_{\max}(\widehat{\Sigma}) - \lambda_{\max}(\Sigma^*) \big| \le \mathbb{E}\, \|\widehat{\Sigma} - \Sigma^*\|_2 \asymp k \sqrt{\frac{\log d}{n}}$.

This result implies that $\lim_{n \to \infty} \mathbb{E}\,\lambda_{\max}(\widehat{\Sigma}) = \lambda_{\max}(\Sigma^*)$ as long as $k\sqrt{(\log d)/n} \to 0$. Thus the inconsistency issue of the sample covariance matrix discussed in §3.1 is avoided.

3.5. POET: New Insight on Large Covariance Matrix Estimation

All the aforementioned methods assume that Σ∗ is sparse. Though this assumption is reasonable for many pharmacogenomics applications, it is not always appropriate. For example, all the genes from the same pathway may be co-regulated by a small number of regulatory factors, which makes the gene expression data highly correlated; when genes are stimulated by cytokines, their expressions are also highly correlated. The sparsity assumption is obviously unrealistic in these situations. To address this problem, we introduce the POET method from the recent discussion paper [15], which provides an integrated framework for combining latent factor analysis and sparse covariance matrix estimation.

3.5.1. The POET Method

The POET estimator is formed by directly running the singular value decomposition on the sample covariance matrix S. It keeps the covariance matrix formed by the first K principal components and applies a thresholding procedure to the residual covariance matrix. The final covariance matrix estimate is then obtained by combining these two components.

Let $\widehat{\lambda}_1 \ge \widehat{\lambda}_2 \ge \cdots \ge \widehat{\lambda}_d$ be the ordered eigenvalues of S and $\widehat{\xi}_1, \ldots, \widehat{\xi}_d$ be their corresponding eigenvectors. Then S has the following spectral decomposition:

$S = \sum_{m=1}^{K} \widehat{\lambda}_m \widehat{\xi}_m \widehat{\xi}_m^T + R_K$,  (3.8)

where the first K principal components $\{\widehat{\xi}_m\}_{m=1}^K$ estimate the latent factors that drive the common dependence. We then apply sparse thresholding to $R_K$ by solving

$\widehat{R}_K^{\lambda} = \mathrm{argmin}_{\Sigma} \Big\{ \frac{1}{2}\|R_K - \Sigma\|_F^2 + \sum_{j,k} P_{\lambda,\gamma}(\sigma_{jk}) \Big\}$.  (3.9)

For example, one can apply the soft-thresholding rule (3.3) to $R_K$. The POET estimator of Σ∗ is defined as

$\widehat{\Sigma}_K = \sum_{m=1}^{K} \widehat{\lambda}_m \widehat{\xi}_m \widehat{\xi}_m^T + \widehat{R}_K^{\lambda}$.  (3.10)

It is a nonparametric estimator and can be positive definite when λ is large [15]. When K > 0 and we use the thresholding method in §3.2 with $\tau_{ij} = \sqrt{r_{K,ii}\, r_{K,jj}}$ (namely, τ = 1), where $r_{K,ii}$ is a diagonal element of $R_K$, we obtain $\widehat{R}_K^{\lambda} = \mathrm{diag}(R_K)$, and the resulting estimator $\widehat{\Sigma}_K$ is positive definite even when d ≫ n. This is called the estimator based on the strict factor model [26]. The POET method is available in the R package named POET. When K = 0, the POET estimator reduces to the thresholding estimator in §3.2.

3.5.2. Approximate Factor Model

POET exploits an approximate factor structure [14, 15] and works best under such a model. The approximate factor model assumes

X = BU + ε (3.11)

where B is a d × K loading matrix, U is a K × 1 latent factor vector, and ε is a d-dimensional random error term that is uncorrelated with U. The model-implied covariance structure is

Σ∗ = BΣUBT + Σε , (3.12)

where $\Sigma_U = \mathrm{Var}(U)$ and $\Sigma_\varepsilon = \mathrm{Var}(\varepsilon)$. We assume $\Sigma_\varepsilon$ is sparse. This can be interpreted as a conditional sparse covariance model: given the latent factor U, the conditional covariance matrix of X (after taking the linear projection on U out) is sparse.

Model (3.11) is not identifiable. The linear space spanned by the first K principal components of $B \Sigma_U B^T$ is the same as that spanned by the columns of B. Without loss of generality, we impose the identifiability condition that the columns of B are orthogonal and $\Sigma_U = I_K$.


3.5.3. Modeling Assumption

POET assumes that the factors U are pervasive and that $\Sigma_\varepsilon$ is sparse. An example of pervasiveness is that the factor loadings $\{b_i\}_{i=1}^d$ (the transposes of the rows of B) for $\{X_i\}_{i=1}^d$ are an independent realization of a certain population quantity $b$. In this case,

$\frac{1}{d} B^T B = \frac{1}{d} \sum_{i=1}^d b_i b_i^T$

converges to a non-degenerate matrix $\mathbb{E}\, b b^T$ as $d \to \infty$. Therefore, the matrix $B B^T$ (recalling $\Sigma_U = I_K$) in (3.12) has spiked eigenvalues. The sparsity condition is imposed so that $d^{-1}\|\Sigma_\varepsilon\|_2 \to 0$. The rationale of this assumption is that, after taking out the common factors, many residual pairs become weakly correlated. Therefore, the first term in (3.12) dominates, and principal component analysis is approximately the same as factor analysis when d is large.

The spiked eigenvalue assumption still needs to be verified in more applications. Since POET works very well in the presence of spiked principal eigenvalues, where most of the aforementioned sparse covariance matrix estimation methods may fail, it has many potential applications in pharmacogenomics.

4. Large Inverse Covariance Matrix Estimation

Estimating a large inverse covariance matrix $\Theta^* = (\Sigma^*)^{-1}$ is another fundamental problem in Big data analysis. Unlike the covariance matrix Σ∗, which only captures the marginal correlations among $X_1, \ldots, X_d$, the inverse covariance matrix Θ∗ captures the conditional correlations among these variables and is closely related to undirected graphs under a Gaussian model.

More specifically, we define an undirected graph G = (V, E), where V contains d nodes corresponding to the d variables in X, and the edge (j, k) ∈ E if and only if $\Theta^*_{jk} \neq 0$. Let $X_{-\{j,k\}} = \{X_\ell : \ell \neq j, k\}$. Under a Gaussian model X ∼ N(0, Σ∗), $X_j$ is independent of $X_k$ given $X_{-\{j,k\}}$ for all (j, k) ∉ E. Therefore, the graph estimation problem reduces to estimating the inverse covariance matrix Θ∗ [45].


Figure 3: A sparse inverse covariance matrix Θ∗ and the corresponding undirected graph defined by Θ∗. Panel (a) shows the undirected graph over X1, ..., X5; panel (b) shows the sparsity pattern of Θ∗.

Figure 3 illustrates the difference between marginal and conditional uncorrelatedness. From Figure 3(b), we see that the inverse covariance matrix Θ∗ has many zero entries, so the undirected graph defined by Θ∗ is sparse. The covariance matrix $\Sigma^* = (\Theta^*)^{-1}$ is

$\Sigma^* = \begin{pmatrix} 1.05 & -0.23 & 0.05 & -0.02 & 0.05 \\ -0.23 & 1.45 & -0.25 & 0.10 & -0.25 \\ 0.05 & -0.25 & 1.10 & -0.24 & 0.10 \\ -0.02 & 0.10 & -0.24 & 1.10 & -0.24 \\ 0.05 & -0.25 & 0.10 & -0.24 & 1.10 \end{pmatrix}$.

It is a dense matrix, which implies that every pair of variables is marginally correlated. Thus the covariance matrix and the inverse covariance matrix encode different relationships. For example, in Figure 3(a), even though X1 and X5 are conditionally uncorrelated given the other variables, they are marginally correlated since both of them are correlated with X2.
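As a quick numerical check (our own, using only the rounded entries displayed above), inverting this dense Σ∗ in R approximately recovers a sparse inverse covariance matrix, matching the picture in Figure 3(b).

# Invert the dense covariance matrix of the Figure 3 example.
Sigma <- matrix(c( 1.05, -0.23,  0.05, -0.02,  0.05,
                  -0.23,  1.45, -0.25,  0.10, -0.25,
                   0.05, -0.25,  1.10, -0.24,  0.10,
                  -0.02,  0.10, -0.24,  1.10, -0.24,
                   0.05, -0.25,  0.10, -0.24,  1.10), 5, 5, byrow = TRUE)
Theta <- solve(Sigma)      # inverse covariance matrix
round(Theta, 2)            # entries near zero correspond to absent edges in the graph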

In the following subsections, we first introduce the Gaussian graphical model, which provides a general framework for inverse covariance matrix estimation. We then introduce several estimation methods that are computationally simple and suitable for Big data analysis. In particular, we highlight the two methods CLIME [16] and TIGER [17], which have the potential to be widely applied in different pharmacogenomics applications. We also briefly explain how to combine the ideas of POET with TIGER.

We first introduce some additional notation. Let A be a symmetric matrix and let I and J be index sets. We denote by $A_{I,J}$ the submatrix of A with rows and columns indexed by I and J. Let $A_{*j}$ be the jth column of A and $A_{*-j}$ be the submatrix of A with the jth column removed. We define the matrix norm

$\|A\|_q = \max_{\|v\|_q = 1} \|A v\|_q$.

It is easy to see that $\|A\|_\infty = \|A\|_1$ since A is symmetric.

4.1. Gaussian Graphical Model

Let X ∼ N(0, Σ∗), $\alpha_j = (\Sigma^*_{-j,-j})^{-1} \Sigma^*_{-j,j}$ and $\sigma_j^2 = \Sigma^*_{jj} - \Sigma^*_{j,-j} (\Sigma^*_{-j,-j})^{-1} \Sigma^*_{-j,j}$. Then we have

$X_j = \alpha_j^T X_{-j} + \varepsilon_j$,  (4.1)

where $\varepsilon_j \sim N(0, \sigma_j^2)$ is independent of $X_{-j} = \{X_\ell : \ell \neq j\}$. Using the block matrix inversion formula, we have

$\Theta^*_{jj} = \sigma_j^{-2}$,  (4.2)
$\Theta^*_{-j,j} = -\sigma_j^{-2} \alpha_j$.  (4.3)

Therefore, we can recover Θ∗ by estimating the regression coefficients $\alpha_j$ and the residual variances $\sigma_j^2$. Indeed, [46] estimates $\alpha_j$ by solving the Lasso problem

$\widehat{\alpha}_j = \mathrm{argmin}_{\alpha_j \in \mathbb{R}^{d-1}} \Big\{ \frac{1}{2n} \|X_{*j} - X_{*-j} \alpha_j\|_2^2 + \lambda_j \|\alpha_j\|_1 \Big\}$,  (4.4)

where $\lambda_j$ is a tuning parameter. Once $\widehat{\alpha}_j$ is given, we obtain the neighborhood edges of node j by reading out its nonzero coefficients. The final graph estimate $\widehat{G}$ is obtained by combining the neighborhoods of all d nodes. However, this method only estimates the graph G, not the inverse covariance matrix Θ∗.
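A minimal R sketch of this neighborhood selection idea (4.4) is given below; it is our own illustration and assumes the glmnet package is installed (the tuning parameter lambda = 0.1 is arbitrary). An edge is drawn whenever either of the two node-wise regressions selects the pair.

# Neighborhood selection (4.4): lasso-regress each node on all the others.
library(glmnet)

neighborhood_graph <- function(X, lambda) {
  d <- ncol(X)
  adj <- matrix(FALSE, d, d)
  for (j in 1:d) {
    fit <- glmnet(X[, -j], X[, j], family = "gaussian", lambda = lambda)
    beta_j <- as.numeric(coef(fit))[-1]          # drop the intercept
    adj[j, setdiff(1:d, j)] <- beta_j != 0       # neighbors of node j
  }
  adj | t(adj)                                   # symmetrize with the "or" rule
}

set.seed(1)
X <- matrix(rnorm(200 * 30), 200, 30)
G_hat <- neighborhood_graph(X, lambda = 0.1)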


4.2. Penalized Likelihood Estimation

When the penalized likelihood approach is applied to estimate Θ∗, it becomes

$\widehat{\Theta} = \mathrm{argmin}_{\Theta = (\theta_{ij})} \Big\{ -\log|\Theta| + \mathrm{tr}(S\Theta) + \sum_{i,j} P_{\lambda,\gamma}(\theta_{ij}) \Big\}$,  (4.5)

where the first two terms form the negative Gaussian log-likelihood and $P_{\lambda,\gamma}(\cdot)$, defined in the same way as in §3.2, is a penalty term that encourages sparse solutions. The optimization problem in (4.5) can be solved by the graphical lasso [47] and by first-order methods [48], and has also been studied in [27, 28].

4.3. The CLIME Estimator

To estimate both the inverse covariance matrix Θ∗ and the graph G, [16] proposes the CLIME estimator, which directly estimates the jth column of Θ∗ by solving

$\widehat{\Theta}_{*j} = \mathrm{argmin}_{\Theta_{*j}} \|\Theta_{*j}\|_1$ subject to $\|S \Theta_{*j} - e_j\|_\infty \le \delta_j$,  (4.6)

where $e_j$ is the jth canonical basis vector (i.e., the vector with the jth element being 1 and the remaining elements being 0) and $\delta_j$ is a tuning parameter. The method borrows heavily from the idea of the Dantzig selector [49]. [16] shows that this convex optimization problem can be formulated as a linear program and thus has the potential to scale to large problems. Once $\widehat{\Theta}$ is obtained, we use another tuning parameter τ to threshold it and estimate the graph G.

4.4. The TIGER Estimator

There are several ways to implement model (4.1). Examples include the graphical Dantzig selector [50], the scaled-Lasso estimator [51], and the SICO estimator [52]. Each of these involves d tuning parameters, such as $\{\lambda_j\}_{j=1}^d$ in (4.4) and $\{\delta_j\}_{j=1}^d$ in (4.6). Yet model (4.1) is heteroscedastic, with the noise variance depending on the unknown $\sigma_j^2$. Therefore, these tuning parameters should depend on j, which is difficult to implement in practice.

The TIGER method [17] overcomes these scaling problems. Let D = diag(S) be the d-dimensional diagonal matrix whose diagonal elements are the same as those of S. Let

$Z = (Z_1, \ldots, Z_d)^T = D^{-1/2} X$, $\beta_j = s_{jj}^{-1/2} D_{-j,-j}^{1/2} \alpha_j$ and $\tau_j^2 = \sigma_j^2 s_{jj}^{-1}$,

where $s_{jj}$ is the jth diagonal element of S. Model (4.1) can then be standardized as

$Z_j = \beta_j^T Z_{-j} + s_{jj}^{-1/2} \varepsilon_j$.  (4.7)

We define $\widehat{R} = D^{-1/2} S D^{-1/2}$ to be the sample correlation matrix and let $\mathbf{Z} \in \mathbb{R}^{n \times d}$ be the normalized data matrix, i.e., $\mathbf{Z}_{*j} = s_{jj}^{-1/2} \mathbf{X}_{*j}$ for j = 1, ..., d. Motivated by the model in (4.7), we propose the following inverse covariance matrix estimator. More specifically,

$\widehat{\beta}_j = \mathrm{argmin}_{\beta_j \in \mathbb{R}^{d-1}} \Big\{ \frac{1}{\sqrt{n}} \|\mathbf{Z}_{*j} - \mathbf{Z}_{*-j} \beta_j\|_2 + \lambda \|\beta_j\|_1 \Big\}$,  (4.8)

$\widehat{\tau}_j = \frac{1}{\sqrt{n}} \|\mathbf{Z}_{*j} - \mathbf{Z}_{*-j} \widehat{\beta}_j\|_2$,

$\widehat{\Theta}_{jj} = \widehat{\tau}_j^{-2} s_{jj}^{-1}$ and $\widehat{\Theta}_{-j,j} = -\widehat{\tau}_j^{-2} s_{jj}^{-1/2} D_{-j,-j}^{-1/2} \widehat{\beta}_j$.

Once we have $\widehat{\Theta}$, the estimated graph $\widehat{G} = (V, \widehat{E})$ contains the edge (j, k) if and only if $\widehat{\Theta}_{jk} \neq 0$. The formulation in (4.8) is called a SQRT-Lasso problem. [53] has shown that the SQRT-Lasso is tuning-insensitive, which explains the tuning-insensitive property of the TIGER method. The TIGER method is available in the R package flare.

An equivalent way to estimate the jth column of Θ∗ is to solve

$\widehat{\beta}_j = \mathrm{argmin}_{\beta_j \in \mathbb{R}^{d-1}} \Big\{ \sqrt{1 - 2\beta_j^T \widehat{R}_{-j,j} + \beta_j^T \widehat{R}_{-j,-j} \beta_j} + \lambda \|\beta_j\|_1 \Big\}$,  (4.9)

$\widehat{\tau}_j = \sqrt{1 - 2\widehat{\beta}_j^T \widehat{R}_{-j,j} + \widehat{\beta}_j^T \widehat{R}_{-j,-j} \widehat{\beta}_j}$.

In (4.9), λ is a tuning parameter. [17] shows that, by choosing $\lambda = \pi \sqrt{\frac{\log d}{2n}}$, the resulting estimator achieves the minimax optimal rate of convergence; thus the procedure is asymptotically tuning-free. For finite samples, [17] suggests setting

$\lambda = \zeta \pi \sqrt{\frac{\log d}{2n}}$,  (4.10)

where ζ can be chosen from a grid in $[\sqrt{2/\pi}, 1]$. Since the choice of ζ does not depend on any unknown quantities, we call the TIGER procedure tuning-insensitive. Empirically, we can simply set $\zeta = \sqrt{2/\pi}$, and the resulting estimator works well in many applications.

If a symmetric inverse covariance matrix estimate is preferred, we make the correction $\widehat{\Theta}_{jk} \leftarrow \min\{\widehat{\Theta}_{jk}, \widehat{\Theta}_{kj}\}$ for all k ≠ j. Another correction method is

$\widehat{\Theta} \leftarrow \frac{\widehat{\Theta} + \widehat{\Theta}^T}{2}$.  (4.11)

As has been shown by [16], the symmetrized estimator achieves the same rate of convergence as $\widehat{\Theta}$.

4.5. Combining POET with TIGER: Conditional Sparse Graphical Model

The TIGER estimator can be integrated into the POET framework. Recall that the main idea of the POET method is to exploit conditional sparsity. The common factors BU are extracted from (3.11) via a principal component analysis, resulting in the residual matrix $R_K$ in (3.8). The thresholding rules are applied directly to $R_K$ because of the assumed conditional sparsity of the covariance matrix. If we instead assume that $(\Sigma_\varepsilon)^{-1}$ is sparse, namely that the genomic network is sparse conditioned on the unobservable factors, it is inappropriate to apply POET; in this case, it is more appropriate to apply the TIGER method to $R_K$ to obtain the final estimate in (3.10). The graph induced by the TIGER method is a conditional graph, after taking out the common dependence on the latent factors.

5. Large-Scale Simultaneous Tests and False Discovery Control

Selecting differentially expressed genes and proteins, as well as finding SNPs that are associated with phenotypes or gene expressions, gives rise to large-scale simultaneous tests in pharmacogenomics [54, 55, 56]. Such tests examine the marginal effects of the treatments and have gained substantial popularity over the last decades [57, 58, 59, 60]. These procedures are designed predominantly under the assumption that the test statistics are independent or weakly dependent [61, 62, 63]. Yet biological outcomes such as gene expressions are often correlated, and so are the statistical tests in genome-wide association studies (GWAS). It is very important to incorporate the dependence information in controlling the false discovery proportion (FDP) and to improve the power of the tests [64, 65, 66, 67]. The recent discussion paper [18] convincingly demonstrates that FDP varies substantially from data set to data set, but can still be well estimated and controlled for each given data set when the dependence information is properly used.

5.1. The Rise of Large-Scale Hypothesis Tests

Suppose that gene or protein expression profiles $\{X_i\}_{i=1}^m$ and $\{Y_i\}_{i=1}^n$ are collected for m and n individuals from the treatment and control groups, respectively. The problem of selecting significantly expressed genes is the following large-scale two-sample testing problem:

$H_{0j}: \mu_{X,j} = \mu_{Y,j}$ vs $H_{1j}: \mu_{X,j} \neq \mu_{Y,j}$, $j = 1, \ldots, d$,

based on the assumption that

$X_i \sim N(\mu_X, \Sigma^*)$ and $Y_i \sim N(\mu_Y, \Sigma^*)$,  (5.1)

where $\mu_{X,j}$ and $\mu_{Y,j}$ are the mean expressions of the jth gene in the treatment and control groups, respectively. Letting $\bar{X}$ and $\bar{Y}$ be the sample means, the two-sample Z-statistic is

$Z_1 = \sqrt{mn/(m+n)}\, (\bar{X} - \bar{Y}) \sim N(\mu_1, \Sigma^*)$

with $\mu_1 = \sqrt{mn/(m+n)}\, (\mu_X - \mu_Y)$. Let D be the diagonal matrix with the pooled estimates of the marginal standard deviations on its diagonal. The vector of two-sample t-test statistics then has the approximate distribution

$Z = D^{-1} Z_1 \stackrel{a}{\sim} N(\mu, R)$,  (5.2)

where $\mu = D^{-1} \mu_1$, R is the correlation matrix of Σ∗, and $\stackrel{a}{\sim}$ means "distributed approximately as". Our testing problem then reduces to

$H_{0j}: \mu_j = 0$ vs $H_{1j}: \mu_j \neq 0$, $j = 1, \ldots, d$,  (5.3)

based on the vector of test statistics Z. In pharmacogenomics applications, the correlation matrix R represents the co-expression of molecules and is in general unknown.

Another concrete example arises in GWAS, where we wish to associate SNPs with phenotypes or gene expressions. For individual i, let $X_{ij}$ be the genotype of the jth SNP and $Y_i$ be the associated outcome. The association between the jth SNP and the response is measured through the marginal regression [18]:

$(\alpha_j, \beta_j) = \mathrm{argmin}_{a_j, b_j}\, n^{-1} \sum_{i=1}^n \mathbb{E}(Y_i - a_j - b_j X_{ij})^2$.

Our simultaneous test in GWAS becomes testing

$H_{0j}: \beta_j = 0$ vs $H_{1j}: \beta_j \neq 0$, $j = 1, \ldots, d$.  (5.4)

Let $\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_d)^T$ be the vector of least-squares estimates of the regression coefficients, obtained from the marginal regressions based on the data $\{(x_{ij}, y_i)\}_{i=1}^n$. Denoting by $Z = (Z_1, \ldots, Z_d)^T$ the t-test statistics for problem (5.4), it is easy to show that $Z \stackrel{a}{\sim} N(\beta, R)$, where R is the correlation matrix of $\{X_i\}_{i=1}^n$. In this case, the correlation matrix is known.

The above problems can be summarized as follows. Let Y ∼ N(µ, Σ) with unit standard deviations ($\sigma_{11}^2 = \cdots = \sigma_{dd}^2 = 1$) be a vector of dependent test statistics. We wish to simultaneously test

$H_{0j}: \mu_j = 0$ vs $H_{1j}: \mu_j \neq 0$, $j = 1, \ldots, d$.  (5.5)

In pharmacogenomics applications, the number of hypotheses d is large, while the number of genes, proteins, or SNPs of interest is small, so that most of $\{\mu_j\}_{j=1}^d$ are zero.

5.2. False Discovery Rate and Proportion

Let $S_0 = \{j : \mu_j = 0\}$ be the set of true nulls and $\Phi(\cdot)$ be the cumulative distribution function of the standard Gaussian random variable. We denote by $P_j = 2(1 - \Phi(|Z_j|))$ the P-value for the jth test of problem (5.5) based on the dependent test statistics (5.2). Using a threshold t, we define

$\mathrm{FDP}(t) = V(t)/R(t)$ and $\mathrm{FDR}(t) = \mathbb{E}\{\mathrm{FDP}(t)\}$

to be the false discovery proportion and the false discovery rate, where $V(t) = \#\{j \in S_0 : P_j \le t\}$ and $R(t) = \#\{j : P_j \le t\}$ are the number of false rejections and the total number of rejections, respectively. Our aim is to accurately estimate FDP(t) for a large class of Σ∗. Clearly, FDP is more relevant, since it pertains to the data at hand, whereas FDR only controls the average FDP. To control FDP at a prescribed level α, we choose the smallest $t_0$ such that the estimated $\mathrm{FDP}(t_0) \le \alpha$ [56, 18].

Note that R(t) is the number of rejected hypotheses, or the number of discoveries. It is an observable random variable. Yet V(t) is unknown to us and needs to be estimated. When the test statistics are independent, V(t) is the number of successes in a sequence of independent Bernoulli trials, with the number of trials $d_0 = |S_0|$ (the number of true nulls) and success probability t. Therefore, by the law of large numbers, FDP and FDR are close, both approximately $d_0 t / R(t) \approx d t / R(t)$ due to sparsity.


For dependent test statistics, however, FDP varies substantially from one data set to another, even when the FDR is the same. We use a simple example from [18] to illustrate this point.

Let $\mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^d$. For the equally correlated matrix $\Sigma^* = (1-\rho) I_d + \rho \mathbf{1}_d \mathbf{1}_d^T$, we consider the test statistics

$Y_j = \mu_j + \rho U + \sqrt{1 - \rho^2}\, \varepsilon_j$,

where U ∼ N(0, 1) is independent of $\varepsilon_j \sim N(0, 1)$. Without loss of generality, we assume that $S_0$ consists of the first $d_0$ genes. Let $z_{t/2}$ be the upper t/2-quantile of the standard normal distribution and $|Y_j| \ge z_{t/2}$ be the critical region. Then, by the law of large numbers,

$V(t) = \sum_{j=1}^{d_0} I(|Y_j| \ge z_{t/2}) \approx d_0\, P(t, \eta, \rho)$,  (5.6)

where $P(t, \eta, \rho) = P(|Y_j| \ge z_{t/2} \mid U, \mu_j = 0)$ is given by

$P(t, \eta, \rho) = \Phi\Big( \frac{-z_{t/2} - \eta}{\sqrt{1 - \rho^2}} \Big) + 1 - \Phi\Big( \frac{z_{t/2} - \eta}{\sqrt{1 - \rho^2}} \Big)$,  (5.7)

with η = ρU. It is clear that the result depends critically on the realization of U. For example, when ρ = 0.8, $d_0 = 1{,}000$ and $z_{t/2} = 2.5$,

$V(t) \approx 1000 \times \Big\{ \Phi\Big( \frac{-2.5 - 0.8U}{0.6} \Big) + 1 - \Phi\Big( \frac{2.5 - 0.8U}{0.6} \Big) \Big\}$,

which is 0, 2.3, 66.8, and 433.6 for U = 0, 1, 2, and 3, respectively. Clearly, V(t) depends on the realization of U. Yet the realized factor U is estimable: for example, using sparsity, $\bar{Y} \approx \rho U$ and thus U can be estimated by $\bar{Y}/\rho$. On the other hand, $\mathbb{E}V(t) = d_0 t = 12.4$. Clearly, V(t) or FDP(t) is far more relevant than EV(t) or FDR(t).

In summary, FDP is the quantity of primary interest for dependent tests.
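The quoted values can be checked in a few lines of R (our own verification of the arithmetic above):

# Expected number of false rejections given the realized factor U (rho = 0.8, d0 = 1000, z = 2.5).
U  <- c(0, 1, 2, 3)
Vt <- 1000 * (pnorm((-2.5 - 0.8 * U) / 0.6) + 1 - pnorm((2.5 - 0.8 * U) / 0.6))
round(Vt, 1)                           # approximately 0.0, 2.3, 66.8, 433.6
round(1000 * 2 * (1 - pnorm(2.5)), 1)  # unconditional E V(t) = d0 * t = 12.4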

5.3. Principal Factor Approximation

When the correlation matrix Σ∗ is known, as in GWAS, [18] gives a principal factor approach to approximate FDP(t) for an arbitrary Σ∗. It generalizes formula (5.6). Let $\{\lambda_j\}_{j=1}^d$ be the ordered eigenvalues of Σ∗ and $\{\xi_j\}_{j=1}^d$ be their corresponding eigenvectors. We can decompose Σ∗ as

$\Sigma^* = \sum_{j=1}^d \lambda_j \xi_j \xi_j^T = B B^T + A$,

where $B = (\sqrt{\lambda_1}\xi_1, \ldots, \sqrt{\lambda_K}\xi_K)$ consists of the first K unnormalized principal components and $A = \sum_{j=K+1}^d \lambda_j \xi_j \xi_j^T$. Then, as in (3.11), we can write

$Y = \mu + BU + \varepsilon$, $U \sim N(0, I_K)$, $\varepsilon \sim N(0, A)$.  (5.8)

Since the principal factors U that drive the dependence are taken out, ε can be assumed to be weakly correlated, and FDP(t) can be approximated by

$\mathrm{FDP}_A(t) = \sum_{j=1}^d P(t, \eta_j, \|b_j\|_2) / R(t)$, $\eta_j = b_j^T U$,  (5.9)

under some mild conditions [18], where $b_j^T$ is the jth row of B and the function P(·) is given in (5.7). Note that the covariance matrix is used to calculate $\eta_j$ and $\|b_j\|_2$, and that formula (5.9) is a generalization of (5.6). Thus we only need to estimate the realized but unobserved factors U in order to use $\mathrm{FDP}_A(t)$. [18] provides a penalized estimator as described below.

Since µ is sparse, by (5.8), a natural estimator is the minimizer of

$\min_{\mu, U} \|Y - \mu - BU\|_2^2 + \lambda \|\mu\|_1$,  (5.10)

which is equivalent [68] to minimizing Huber's ψ-loss with respect to U: $\min_U \psi(Y - BU)$. An alternative is to find the minimizer of the $L_1$-loss:

$\min_U \|Y - BU\|_1$.  (5.11)

The sampling properties of the estimator based on (5.10) have been established in [68].
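The following base-R sketch (our own simplification; the R package PFA mentioned in §5.5 provides a full implementation) pieces (5.8)-(5.11) together: it extracts K principal factors from a known correlation matrix, estimates the realized factors by a crude least-absolute-deviations fit, and returns the approximated FDP(t). All function and variable names are ours.

# Principal factor approximation of FDP(t) for test statistics Y with known correlation Sigma.
pfa_fdp <- function(Y, Sigma, K, t) {
  eg <- eigen(Sigma, symmetric = TRUE)
  B  <- eg$vectors[, 1:K, drop = FALSE] %*% diag(sqrt(eg$values[1:K]), K)  # first K factors
  # estimate the realized factors U by minimizing ||Y - B U||_1, cf. (5.11)
  U_hat <- optim(rep(0, K), function(u) sum(abs(Y - B %*% u)))$par
  eta   <- as.numeric(B %*% U_hat)                     # eta_j = b_j' U
  a     <- sqrt(pmax(1 - rowSums(B^2), 1e-8))          # standard deviation of epsilon_j
  z_t   <- qnorm(1 - t / 2)
  V_hat <- sum(pnorm((-z_t - eta) / a) + 1 - pnorm((z_t - eta) / a))  # numerator of (5.9)
  R_t   <- sum(2 * (1 - pnorm(abs(Y))) <= t)           # total number of rejections
  V_hat / max(R_t, 1)                                  # estimated FDP(t)
}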

5.4. Factor Adjustment: an Alternative Ranking of Molecules

With U estimated from (5.10), the common factors can be subtracted out from model (5.8), resulting in

$Z = Y - B\widehat{U} \approx \mu + \varepsilon$.  (5.12)

The test statistics Z are more powerful than those based on Y, since ε has marginal variances no larger than 1. In fact, $Z_i \sim N(\mu_i, 1 - \|b_i\|_2^2)$ after ignoring the approximation error. Hence, $Z_i$ has a larger signal-to-noise ratio than $Y_i \sim N(\mu_i, 1)$. In addition, the ranking of important genes/proteins based on the standardized test statistics $\{|Z_i| / (1 - \|b_i\|_2^2)^{1/2}\}_{i=1}^d$ can be very different from the ranking based on $\{|Y_i|\}_{i=1}^d$. This method is called the dependence-adjusted procedure in [18]. It gives pharmaco-scientists a chance to discover important genes, proteins, and SNPs that are not highly ranked by the conventional methods based on $\{|Y_i|\}_{i=1}^d$.

After the common factors are taken out, the elements of ε are weakly correlated, so there is no need to apply PFA to the test statistics Z in (5.12). In other words, the FDP should be approximately the same as the FDR, and applying PFA to $\{|Z_i| / (1 - \|b_i\|_2^2)^{1/2}\}_{i=1}^d$ will not give results very different from the case K = 0.

5.5. Estimating FDP with Unknown Dependence

In many genomic applications, such as selecting significant genes from microarray data, the covariance matrix Σ∗ is typically unknown. A natural approach is to estimate it from the data by POET, as described in §3.5.1, and then proceed as if the correlation matrix were given. The resulting method is named POET-PFA. More details on the development of this method can be found in [69]. An R package, called PFA, implements this procedure.


6. High Dimensional Variable Selection

With proper coding of the treatment and control groups, it is easy to see that the two-sample test statistics in (5.2) are equivalent to the marginal correlations between the expressions of molecules and the response [20]. To examine the conditional contribution of a molecule to the response Y given the rest of the molecules, one often employs a sparse generalized linear model under the canonical link: up to a normalization constant, the conditional density of Y given X is

$\log p(y \mid x, \beta_0) = (x^T \beta_0) y - b(x^T \beta_0)$.  (6.1)

Here, sparsity means that a majority of the regression coefficients in $\beta_0 = (\beta_{0,1}, \ldots, \beta_{0,d})^T$ are zero. The regression coefficient $\beta_{0,j}$ is frequently regarded as a measure of the conditional contribution of the molecule $X_j$ to the response Y given the rest of the molecules, similar to the interpretation of the sparse inverse covariance matrix in §4.

When the response Y is a continuous measurement, as in eQTL studies, one takes $b(\theta) = \theta^2/2$. For a binary response, one often employs sparse logistic regression, in which $b(\theta) = \log(1 + \exp(\theta))$. The latter is closely related to classification problems, as will be described in §6.4.

Over the last decade, there has been a surge of interest in sparse regression. For an overview, see [40, 70, 71]. The basic principles are screening and penalization. We give only a brief summary due to space limitations.

6.1. Penalized Likelihood Method

As has been shown in [19], the sparse vector $\beta_0$ in (6.1) can be estimated by the penalized likelihood, which minimizes

$n^{-1} \sum_{i=1}^n \big\{ b(x_i^T \beta) - (x_i^T \beta) y_i \big\} + \sum_{j=1}^d P_{\lambda,\gamma}(|\beta_j|)$  (6.2)

for a folded concave penalty function $P_{\lambda,\gamma}(\cdot)$. This approach has been systematically studied by [72, 73].

There are many algorithms to compute the minimizer of (6.2). Examples include local quadratic approximation [19], local linear approximation [74], the coordinate descent algorithm [75], and iterative shrinkage-thresholding algorithms [76, 77]. For example, given the estimate $\beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_d^{(k)})^T$ at the kth iteration, by Taylor's expansion,

$P_{\lambda,\gamma}(|\beta_j|) \approx P_{\lambda,\gamma}(|\beta_j^{(k)}|) + P'_{\lambda,\gamma}(|\beta_j^{(k)}|)\big(|\beta_j| - |\beta_j^{(k)}|\big)$.  (6.3)

Thus, at the (k+1)th iteration, we minimize

$n^{-1} \sum_{i=1}^n \big\{ b(x_i^T \beta) - (x_i^T \beta) y_i \big\} + \sum_{j=1}^d w_{k,j} |\beta_j|$,  (6.4)

where $w_{k,j} = P'_{\lambda,\gamma}(|\beta_j^{(k)}|)$. Note that problem (6.4) is convex, so a convex solver can be used. If one further approximates the likelihood part in (6.4) by a quadratic function via Taylor expansion, then the LARS algorithm [78] can be used.
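The sketch below (our own illustration, assuming the glmnet package is installed) carries out a few local linear approximation (LLA) steps (6.3)-(6.4) for SCAD-penalized sparse logistic regression, using glmnet's penalty.factor argument as the weighted-L1 solver; the rescaling by mean(w) is only a device for passing absolute weights, and all names and tuning values are ours.

# Local linear approximation (LLA) steps for SCAD-penalized logistic regression.
library(glmnet)

scad_deriv <- function(beta, lambda, gamma = 3.7) {      # P'_{lambda,gamma}(|beta|)
  ab <- abs(beta)
  lambda * (ab <= lambda) + pmax(gamma * lambda - ab, 0) / (gamma - 1) * (ab > lambda)
}

lla_step <- function(X, y, beta_old, lambda, gamma = 3.7) {
  w <- scad_deriv(beta_old, lambda, gamma)               # weights w_{k,j} in (6.4)
  fit <- glmnet(X, y, family = "binomial", lambda = mean(w),
                penalty.factor = w / mean(w))            # weighted L1 problem (6.4)
  as.numeric(coef(fit))[-1]                              # drop the intercept
}

set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))
beta <- rep(0, 50)                                       # first step is the ordinary lasso
for (k in 1:3) beta <- lla_step(X, y, beta, lambda = 0.05)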

6.2. Screening Method

Sure independence screening [20, 79] is another effective method for reducing the dimensionality in sparse regression problems. It utilizes the marginal contribution of a covariate to probe its importance in the joint regression model. For example, assuming each covariate has been standardized, the marginal contribution of the jth covariate can be measured by the magnitude of $\widehat{\beta}_j^M$, which, along with $\widehat{\alpha}_j^M$, minimizes the negative marginal likelihood

$n^{-1} \sum_{i=1}^n \big\{ b(\alpha_j + \beta_j x_{ij}) - (\alpha_j + \beta_j x_{ij}) y_i \big\}$.  (6.5)

The set of covariates that survive the marginal screening is

$\widehat{S} = \{ j : |\widehat{\beta}_j^M| \ge \delta \}$  (6.6)

for a given threshold δ. One can also measure the importance of a covariate $X_j$ by its deviance reduction. For the least-squares problem, both methods reduce to ranking the importance of predictors by the magnitudes of their marginal correlations with the response Y. [20, 79] give conditions under which the sure screening property holds and the false selection rate is controlled.

The idea of sure screening is very effective in dramatically reducing the computational burden of Big data analysis, and it has been extended in various directions. For example, generalized correlation screening was used in [80] and nonparametric screening was proposed by [81]. In addition, [82] utilizes the distance correlation to conduct screening and [83] employs rank correlation.

6.3. Iterative Sure Independence Screening

An iterative sure independence screening (ISIS) method was developed in [84]. The basic idea is to alternate between the large-scale screening (6.6) and a moderate-scale selection step, which applies the penalized likelihood method (6.2) to the variables that survive the screening (6.6). This idea is implemented in the ISIS software, available as an R package, which also allows one to compute the penalized likelihood (6.2) or the screening (6.6) alone.
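A rough sketch of this alternation for a binary response is given below. It uses an ℓ1-penalized logistic fit from scikit-learn as the moderate-scale selection step and a simple marginal-association score as a stand-in for the conditional screening used in [84]; it illustrates the structure of the iteration rather than reproducing the ISIS algorithm itself.

import numpy as np
from sklearn.linear_model import LogisticRegression

def isis_like(X, y, n_screen, C=1.0, n_rounds=3):
    # Alternate (i) screening of the not-yet-selected covariates and
    # (ii) L1-penalized selection on the surviving set.
    n, d = X.shape
    active = np.array([], dtype=int)
    for _ in range(n_rounds):
        remaining = np.setdiff1d(np.arange(d), active)
        score = np.abs(X[:, remaining].T @ (y - y.mean()))          # crude marginal score
        candidates = remaining[np.argsort(-score)[:n_screen]]
        trial = np.union1d(active, candidates)
        fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X[:, trial], y)
        active = trial[np.abs(fit.coef_.ravel()) > 1e-8]            # keep nonzero coefficients
    return active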

6.4. Sparse Classification

Classification, a core task of supervised learning, is to predict the class label of a given covariate vector. In pharmacogenetic studies, one often wishes not only to predict the class labels well, but also to understand the molecular mechanisms underlying the effectiveness of drugs. Therefore, it is desirable to select a small group of molecules that have high classification power.

Sparse logistic regression (6.2) provides an effective approach to high dimensional classification. Other approaches include the support vector machine [85] and the AdaBoost algorithm [86, 87]. The former replaces the negative log-likelihood in (6.2) by the hinge loss L(θ, y) = (1 − θy)_+ and the latter replaces it by the exponential loss L(θ, y) = exp(−θy). In both cases, the response Y is recoded as ±1.
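To make the ±1 coding and the loss functions concrete, a minimal sketch of the three surrogate losses (with the logistic loss included for comparison) is as follows.

import numpy as np

# Margin-based losses for labels y in {-1, +1}; theta = x'beta is the linear score.
def logistic_loss(theta, y):
    return np.log1p(np.exp(-y * theta))        # negative log-likelihood (sparse logistic regression)

def hinge_loss(theta, y):
    return np.maximum(1.0 - y * theta, 0.0)    # support vector machine

def exponential_loss(theta, y):
    return np.exp(-y * theta)                  # AdaBoost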

Figure 4: (a) and (b): The histogram and normal QQ plot of the marginal expression levels of the gene MECPS. We see that the data are not exactly Gaussian distributed. (c) The estimated gene network of the Arabidopsis data. The within-pathway links are denoted by solid lines and the between-pathway links by dashed lines. Adapted from [17].

When data from both classes follow the normal distributions in (5.1), the optimal classifier is Fisher's discriminant rule, which assigns a new data point x to class "X" if

(µ_X − µ_Y)^T (Σ∗)^{-1} (x − (µ_X + µ_Y)/2) > 0, (6.7)

where Σ∗ is the common covariance matrix in (5.1). When the dimensionality d is large, we can estimate Σ∗ or its inverse (Σ∗)^{-1} by the regularized estimators introduced in §3 or §4, depending on the sparsity assumption. See, for example, [88, 89, 90, 91].
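A minimal sketch of a regularized version of rule (6.7) is given below. It uses the Ledoit-Wolf shrinkage estimator from scikit-learn purely as a convenient stand-in for the regularized covariance and precision estimators of §3 and §4; the function name is ours.

import numpy as np
from sklearn.covariance import LedoitWolf

def regularized_fisher_rule(X_class_x, X_class_y, x_new):
    # Fisher's rule (6.7) with a regularized estimate of the common covariance matrix.
    # Returns True if the new point is assigned to class "X".
    mu_x, mu_y = X_class_x.mean(axis=0), X_class_y.mean(axis=0)
    pooled = np.vstack([X_class_x - mu_x, X_class_y - mu_y])   # centered pooled sample
    precision = LedoitWolf().fit(pooled).precision_            # regularized (Sigma*)^{-1}
    score = (mu_x - mu_y) @ precision @ (x_new - (mu_x + mu_y) / 2.0)
    return score > 0

When the inverse covariance is believed to be sparse, a sparse precision estimate from §4 (e.g., CLIME or TIGER) could be plugged in instead of the shrinkage estimator.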

7. Applications

In this section, we apply the aforementioned methods to estimate large gene networks and to select significantly differentially expressed genes when test statistics are dependent.

7.1. Network Estimation

As explained in §4, an important application of large inverse covariance matrix estimation is to reconstruct the undirected graph of a high dimensional distribution based on observational data. In this section, we apply the TIGER method to a microarray dataset to illustrate the main idea.

This dataset includes 118 gene expression arrays from Arabidopsis thaliana and originally appeared in [92]. Our analysis focuses on the expression profiles of 39 genes involved in two isoprenoid metabolic pathways: 16 from the mevalonate (MVA) pathway located in the cytoplasm, 18 from the plastidial (MEP) pathway located in the chloroplast, and 5 located in the mitochondria. While the MVA and MEP pathways generally operate independently, crosstalk is known to happen [92]. Our goal is to reconstruct the gene network by estimating the undirected graph, with special interest in the crosstalk.

We first examine whether the data actually satisfy the Gaussian distribution assumption. In Figure 4(a) and Figure 4(b), we plot the histogram and normal QQ plot of the expression levels of a gene named MECPS. From the histogram, we see that the distribution is left-skewed compared to the Gaussian distribution. From the normal QQ plot, we see that the empirical distribution has a heavier tail than the Gaussian. To apply the TIGER method to these data, we first need to transform the data so that their distribution is closer to Gaussian. For this, we Gaussianize the expression values of each gene by converting them to the corresponding normal scores. This is automatically done by the huge.npn function in the R package huge [93].
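A rank-based normal-score transform can be written in a few lines of Python; this is only a simple stand-in for huge.npn, which applies a slightly different (truncated) version of the same idea.

import numpy as np
from scipy.stats import norm, rankdata

def normal_score_transform(X):
    # Gaussianize each column by mapping its ranks to standard normal quantiles.
    n, d = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(d):
        u = rankdata(X[:, j]) / (n + 1.0)   # ranks mapped into (0, 1)
        Z[:, j] = norm.ppf(u)               # corresponding normal scores
    return Z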

We apply the TIGER method to the transformed data using the default tuning parameter ζ = √(2/π). The estimated network is shown in Figure 4(c). We see that the estimated network is very sparse, with only 44 edges. We draw the within-pathway connections using solid lines and the between-pathway connections using dashed lines.

Figure 5: The estimated false discovery proportion and the estimated number of false discoveries as functions of the number of total discoveries for d = 3,226 genes, where the number of factors is chosen to be k = 2, 3, 4, 5. (a) Results from the POET-PFA method; (b) Results from the dependence-adjusted method. Adapted from [69].

Our result is consistent with previous investigations, which suggest that the connections from genes AACT1 and HMGR2 to gene MECPS indicate a primary source of the crosstalk between the MEP and MVA pathways; these edges are present in the estimated network. MECPS is clearly a hub gene for this pathway.

For the MEP pathway, the genes DXPS2, DXR, MCT, CMK, HDR, and MECPS are connected as in the true metabolic pathway. Similarly, for the MVA pathway, the genes AACT2, HMGR2, MK, MPDC1, MPDC2, FPPS1 and FPPS2 are closely connected. Our analysis suggests 11 cross-pathway links, which is consistent with the previous investigation in [92]. This result suggests that there might exist rich inter-pathway crosstalk.

7.2. Selecting Significantly Expressed Genes under Dependency

We now apply the POET-PFA method of §5.5 and the dependence-adjusted method of §5.4 to analyze a microarray dataset. The data contain expression profiles from a well-known breast cancer study [94, 95]. This study involves 15 women with two genetic mutations: BRCA1 (7 subjects) and BRCA2 (8 subjects). These two genetic mutations are known to increase the lifetime risk of hereditary breast cancer. We want to find a set of genes that are associated with these genetic mutations, which would allow us to identify cases of hereditary breast cancer on the basis of gene-expression profiles.

We make two assumptions: (i) a large proportion of the genes are not differentially expressed; (ii) the gene expression follows an approximate k-factor model. Both assumptions have gained increasing popularity among biologists in the past decade, since it has been widely acknowledged that gene activities are usually driven by a small number of latent variables and that the genetic mutations involve only a small number of genes. See [65, 96] for more details.

We first apply the POET-PFA method to obtain a consistent FDP estimator FDP(t) for a given threshold value t and a fixed number of factors k.

Let V(t) be the estimated number of false rejections, which is the numerator of (5.9). The results of our analysis are summarized in Figure 5(a), which has two subfigures plotting (R(t), FDP(t)) and (R(t), V(t)) for k = 2, 3, 4, 5. We see that FDP(t) is close to zero when R(t) is below 200, suggesting that the hypotheses rejected in this range are very likely to be true discoveries. In addition, when 1,000 hypotheses, almost 1/3 of the total number, have been rejected, the estimated FDPs are still as low as 25%. Finally, it is worth noting that although our procedure seems robust under different choices of the number of factors, the estimated FDP tends to be somewhat smaller with a larger number of factors.

We also apply the dependence-adjusted procedure to the data. The results are shown in Figure 5(b). Compared with Figure 5(a), both the estimated FDP and the estimated number of false rejections are smaller.

The gene sets selected by the conventional two-sample t-tests and by the factor-adjusted tests are different (not presented here). This gives pharmaco-scientists an opportunity to discover important genes, proteins, and SNPs that are not highly ranked by the conventional methods. In fact, the factor-adjusted method is more powerful than the conventional two-sample t-tests when the test statistics are dependent.

8. Discussion and Future Directions

This paper is written in the context that massive amounts of pharmacogenomic data, combined with quantitative statistical approaches, are beginning to shed light on the road towards personalized medicine. While the so-called Big data movement has received a lot of attention, and pharmacogenomic applications are prime examples, we highlight three characteristics of modern pharmacogenomic data that deserve equal attention: (1) Complex Data - the data distribution cannot be characterized by simple parametric models; (2) Noisy Data - the data are usually aggregated from numerous sources, and we must deal with large biological and measurement heterogeneity with possible outliers, measurement errors, and missing data; (3) Dependent Data - the data exhibit complicated correlations with relatively weak signals. These problems are now so ubiquitous, with new variations accruing at such an alarming rate, that one might refer to them simply as the modern data problem. The large amounts of data are usually obtained using high-throughput technologies and inevitably contain measurement errors.

Figure 6: Transelliptical family. (a) The Venn diagram illustrating the relationships of the distribution families. (b) The perspective plot of a transelliptical density. (c) The hierarchical latent variable representation of the transelliptical graphical model, with the latent variables grey-colored. Here the first layer is composed of the observed X_j, and the second and third layers are composed of the latent variables Z_j and Y_j. The solid undirected lines in the third layer encode the conditional independence graph of Y_1, . . . , Y_d. (Adapted from a manuscript that is under review.)

Measurement errors distort statistical conclusions and reduce correlations with clinical outcomes. Spurious correlations arise inevitably in high dimensional data and induce so-called endogenous covariates. Endogeneity arises when some collected variables are correlated with the residual noise of a response, namely, when the part of the response that cannot be explained by the relevant molecules is correlated with irrelevant molecules. Like measurement errors, endogeneity causes model selection inconsistency, leading to erroneous scientific conclusions.

To handle the challenges of modern pharmacogenomic data, we need statistical methods that simultaneously address scalability, complexity, noise, and dependence. However, general methodological development for pharmacogenomic data analysis, mirroring numerous other scientific settings, has lagged behind the rapid development of new technologies and new datasets. For example, most existing methods assume the data are independently sampled from a parametric distribution (e.g., Gaussian), and most of them use Pearson's sample covariance matrix as the sufficient statistic and thus are not robust to outliers and possible data contamination. To bridge these gaps, new statistical methodology is urgently needed. In the following subsections, we discuss several recent efforts that aim to address these problems.

8.1. Handling Complex Data with Semiparametric Modeling

To deal with complex non-Gaussian data, [97] proposed a semiparametric modeling framework based on the transelliptical distribution family.

The transelliptical family is a semiparametric extension of the elliptical family. Let X ∈ R^d be a random vector with mean µ and correlation matrix Σ. We say that X follows an elliptical distribution if its density can be written in the form p(x) = |Σ|^{-1/2} g((x − µ)^T Σ^{-1} (x − µ)). The elliptical family contains many multivariate distributions, including the multivariate Gaussian, multivariate t, Cauchy, logistic, and Kotz distributions. We say that X = (X_1, . . . , X_d)^T follows a transelliptical distribution, denoted by X ∼ TE(Σ, g; f), if there exists a set of increasing functions { f_j }_{j=1}^{d} such that f(X) = ( f_1(X_1), . . . , f_d(X_d))^T follows an elliptical distribution.

Figure 6(a) illustrates the relationships between the transelliptical, elliptical [98], and nonparanormal families [99, 100, 101]. Both the nonparanormal and elliptical distributions are proper subsets of the transelliptical family, and they share the Gaussian family as a common subset. Figure 6(b) visualizes a 2-dimensional transelliptical density. Clearly the transelliptical family is much richer than the Gaussian family.

Similar to the Gaussian graphical model, we can construct the transelliptical graphical model based on the transelliptical family (more details can be found in [97]). To understand the semantics of a transelliptical graph, we have proved that a transelliptical distribution must admit a three-layer hierarchical latent variable representation, as illustrated in Figure 6(c). The observed vector, denoted by X = (X_1, . . . , X_d)^T and presented in the first layer, has a transelliptical distribution.

A latent random vector Z = (Z_1, . . . , Z_d)^T in the second layer is elliptically distributed. Variables in the first and second layers are related through the transformation Z_j = f_j(X_j), with f_j an unknown monotone function. The latent vector Z can be further represented by a third-layer latent random vector Y = (Y_1, . . . , Y_d)^T, which is multivariate Gaussian with a correlation matrix Σ (called the latent correlation matrix) and an inverse correlation matrix Θ = Σ^{-1} (called the latent inverse correlation matrix). We define the transelliptical graph G = (V, E) with node set V = {1, . . . , d} and edge set E encoding the nonzero entries of Θ. The graph G can be interpreted for the variables in the different layers: (i) for the observed variables in the first layer, the absence of an edge between two variables means the absence of a certain rank-based association of the pair given the other variables; (ii) for the latent variables in the second layer, the absence of an edge means the absence of conditional Pearson correlation of the pair; (iii) for the third-layer variables, the absence of an edge means conditional independence of the pair. Compared with the Gaussian graphical model, the transelliptical graphical model has a richer structure with more relaxed modeling assumptions.
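To illustrate the three-layer representation, the following sketch simulates transelliptical data by generating latent Gaussian variables, applying a stochastic scaling to obtain an elliptical (here multivariate t) vector, and then applying increasing marginal transformations. The particular transformations (exponential and cubic) and the degrees of freedom are illustrative choices, not part of the model in [97]; at least two dimensions are assumed.

import numpy as np

def sample_transelliptical(n, Sigma, df=5, seed=0):
    # Layer 3 -> Layer 2 -> Layer 1 of the hierarchical representation in Figure 6(c).
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    Y = rng.multivariate_normal(np.zeros(d), Sigma, size=n)      # latent Gaussian layer
    scale = np.sqrt(df / rng.chisquare(df, size=n))[:, None]
    Z = Y * scale                                                # elliptical layer (multivariate t)
    X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3] +
                        [Z[:, j] for j in range(2, d)])          # observed layer: monotone marginals
    return X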

8.2. Handling Noisy and Dependent Data with Robust Methods

To handle noisy data, we need to develop robust statistical methods. As an illustrative example, we introduce a rank-based method using the aforementioned transelliptical graphical model. Recalling that estimating the transelliptical graph is equivalent to estimating the latent inverse correlation matrix, [97] developed a rank-based estimator using Kendall's tau statistics. This estimator has been shown to have good theoretical properties and to be simultaneously robust to outliers, missing values, and data dependence.

More specifically, let X̃ = (X̃_1, . . . , X̃_d)^T be an independent copy of a random vector X ∈ R^d. The population Kendall's tau statistic is

τ_{jk} = Corr(sgn(X_j − X̃_j), sgn(X_k − X̃_k)).

Let x_1, . . . , x_n ∈ R^d be n observed data points with x_i = (x_{i1}, . . . , x_{id})^T. The sample version of Kendall's tau statistic is

τ̂_{jk} = [2 / (n(n − 1))] ∑_{1≤s<t≤n} sgn(x_{sj} − x_{tj}) sgn(x_{sk} − x_{tk}).

This is a monotone-transformation-invariant measure of association between the empirical realizations of the two random variables X_j and X_k. Let S = (s_{jk}) ∈ R^{d×d} with s_{jk} = sin(π τ̂_{jk} / 2) · I(j ≠ k) + I(j = k), where I(·) is the indicator function. A rank-based estimator Θ̂ can be obtained by plugging S into a sparse inverse covariance matrix estimation algorithm such as CLIME [16] or TIGER [17]. Such a rank-based estimator is robust and easy to compute.
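A minimal sketch of this pipeline is given below: compute the pairwise Kendall's tau matrix, apply the sine transform, and plug the result into a sparse precision estimator. The graphical lasso from scikit-learn is used here only as a stand-in for CLIME or TIGER, and the eigenvalue clipping is a practical device to guarantee a positive definite input; neither choice is part of the estimator in [97].

import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

def rank_based_precision(X, alpha=0.1):
    # Rank-based estimate of the latent inverse correlation matrix.
    n, d = X.shape
    S = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi * tau / 2.0)   # s_jk = sin(pi * tau_jk / 2)
    w, V = np.linalg.eigh(S)                                # clip eigenvalues to keep S positive definite
    S_pd = (V * np.clip(w, 1e-3, None)) @ V.T
    _, precision = graphical_lasso(S_pd, alpha=alpha)       # stand-in for CLIME / TIGER
    return precision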

Besides the above rank-based method, there are many other ways to develop robust estimators. In summary, statistical analysis of Big pharmacogenomic data is both promising and challenging. On the one hand, significant progress has been made towards developing new statistical methods and theory to explore and predict massive amounts of high dimensional data. On the other hand, many new challenges and open problems remain to be solved.

Acknowledgements

We thank Rongling Wu for his helpful comments and discussions. Jianqing Fan is supported by NSF Grant DMS-1206464 and NIH Grants R01-GM100474 and R01-GM072611. Han Liu is supported by NSF Grant III-1116730 and an NIH sub-award from Johns Hopkins University.

References

[1] W. E. Evans, M. V. Relling, Pharmacogenomics: translating functional genomics into rational therapeutics, Science 286 (5439) (1999) 487–491.
[2] A. J. Wood, W. E. Evans, H. L. McLeod, Pharmacogenomics - drug disposition, drug targets, and side effects, New England Journal of Medicine 348 (6) (2003) 538–549.
[3] K. Jain, Applications of biochip and microarray systems in pharmacogenomics, Pharmacogenomics 1 (3) (2000) 289–307.
[4] P. J. Mishra, J. R. Bertino, MicroRNA polymorphisms: the future of pharmacogenomics, molecular epidemiology and individualized medicine, Pharmacogenomics 10 (3) (2009) 399–416.
[5] B. R. Winkelmann, W. Marz, B. O. Boehm, R. Zotz, J. Hager, P. Hellstern, J. Senges, Rationale and design of the LURIC study - a resource for functional genomics, pharmacogenomics and long-term prognosis of cardiovascular disease, Pharmacogenomics 2 (1) (2001) 1–73.
[6] H. E. Wheeler, M. L. Maitland, M. E. Dolan, N. J. Cox, M. J. Ratain, Cancer pharmacogenomics: strategies and challenges, Nature Reviews Genetics (2013) 23–34.
[7] S. Ross, S. S. Anand, P. Joseph, G. Pare, Promises and challenges of pharmacogenetics: an overview of study design, methodological and statistical issues, JRSM Cardiovascular Disease 1 (1) (2012) 1–7.
[8] R. Wu, M. Lin, Statistical and computational pharmacogenomics, Chapman & Hall/CRC, 2008.
[9] B. J. Grady, M. D. Ritchie, Statistical optimization of pharmacogenomics association studies: key considerations from study design to analysis, Current Pharmacogenomics and Personalized Medicine 9 (1) (2011) 41–66.
[10] S.-J. Wang, R. T. O'Neill, H. J. Hung, Statistical considerations in evaluating pharmacogenomics-based clinical effect for confirmatory trials, Clinical Trials 7 (5) (2010) 525–536.
[11] S. D. Turner, D. C. Crawford, M. D. Ritchie, Methods for optimizing statistical analyses in pharmacogenomics research, Expert Review of Clinical Pharmacology 2 (5) (2009) 559–570.
[12] E. Topic, Pharmacogenomics and personalized medicine, The 7th EFCC Continuous Postgraduate Course in Clinical Chemistry: New trends in diagnosis, monitoring and management using molecular diagnosis methods.
[13] P. Bickel, E. Levina, Covariance regularization by thresholding, Annals of Statistics 36 (6) (2008) 2577–2604.
[14] J. Fan, Y. Liao, M. Mincheva, High dimensional covariance matrix estimation in approximate factor models, Annals of Statistics 39 (6) (2011) 3320–3356.
[15] J. Fan, Y. Liao, M. Mincheva, Large covariance estimation by thresholding principal orthogonal complements (with discussion), Journal of the Royal Statistical Society: Series B, to appear.
[16] T. Cai, W. Liu, X. Luo, A constrained ℓ1 minimization approach to sparse precision matrix estimation, Journal of the American Statistical Association 106 (494) (2011) 594–607.
[17] H. Liu, L. Wang, TIGER: a tuning-insensitive approach for optimally estimating Gaussian graphical models, Tech. rep., Massachusetts Institute of Technology (2012).
[18] J. Fan, X. Han, W. Gu, Estimating false discovery proportion under arbitrary covariance dependence (with discussion), Journal of the American Statistical Association 107 (3) (2012) 1019–1048.

[19] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 (456) (2001) 1348–1360.
[20] J. Fan, J. Lv, Sure independence screening for ultrahigh dimensional feature space (with discussion), Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (5) (2008) 849–911.
[21] J. S. Yap, J. Fan, R. Wu, Nonparametric modeling of longitudinal covariance structure in functional mapping of quantitative trait loci, Biometrics 65 (4) (2009) 1068–1077.
[22] S. Geman, A limit theorem for the norm of random matrices, Annals of Probability 8 (2) (1980) 252–261.
[23] Y. Q. Yin, Z. D. Bai, P. R. Krishnaiah, On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix, Probability Theory and Related Fields 78 (1988) 509–521.
[24] W. Wu, M. Pourahmadi, Nonparametric estimation of large covariance matrices of longitudinal data, Biometrika 90 (1) (2003) 831–844.
[25] P. Bickel, E. Levina, Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations, Bernoulli 10 (6) (2004) 989–1010.
[26] J. Fan, Y. Fan, J. Lv, High dimensional covariance matrix estimation using a factor model, Journal of Econometrics 147 (1) (2008) 186–197.
[27] P. Bickel, E. Levina, Regularized estimation of large covariance matrices, Annals of Statistics 36 (1) (2008) 199–227.
[28] C. Lam, J. Fan, Sparsistency and rates of convergence in large covariance matrix estimation, Annals of Statistics 37 (2009) 42–54.
[29] T. Cai, W. Liu, Adaptive thresholding for sparse covariance matrix estimation, Journal of the American Statistical Association 106 (494) (2011) 672–684.
[30] R. Furrer, T. Bengtsson, Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants, Journal of Multivariate Analysis 98 (2) (2007) 227–255.
[31] J. Huang, N. Liu, M. Pourahmadi, L. Liu, Covariance matrix selection and estimation via penalised normal likelihood, Biometrika 93 (1) (2006) 85–98.
[32] E. Levina, A. Rothman, J. Zhu, Sparse estimation of large covariance matrices via a nested lasso penalty, The Annals of Applied Statistics (2008) 245–263.
[33] A. Rothman, E. Levina, J. Zhu, A new approach to Cholesky-based covariance regularization in high dimensions, Biometrika 97 (3) (2010) 539–550.
[34] T. Cai, C. Zhang, H. Zhou, Optimal rates of convergence for covariance matrix estimation, Annals of Statistics 38 (4) (2010) 2118–2144.

[35] D. L. Donoho, J. M. Johnstone, Ideal spatial adaptation by wavelet shrinkage, Biometrika 81 (3) (1994) 425–455.
[36] A. Rothman, E. Levina, J. Zhu, Generalized thresholding of large covariance matrices, Journal of the American Statistical Association 104 (485) (2009) 177–186.
[37] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B 58 (1) (1996) 267–288.
[38] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1998) 33–61.
[39] C. Zhang, Nearly unbiased variable selection under minimax concave penalty, Annals of Statistics 38 (2) (2010) 894–942.
[40] T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning: data mining, inference, and prediction (2nd ed.), Springer New York, 2009.
[41] A. Antoniadis, J. Fan, Regularization of wavelet approximations, Journal of the American Statistical Association 96 (455) (2001) 939–967.
[42] H. Liu, L. Wang, T. Zhao, Sparse covariance estimation with eigenvalue constraints, Journal of Computational and Graphical Statistics, to appear.
[43] L. Xue, S. Ma, H. Zou, Positive definite ℓ1 penalized estimation of large covariance matrices, Journal of the American Statistical Association 107 (500) (2012) 1480–1491.
[44] A. Rothman, Positive definite estimators of large covariance matrices, Biometrika, to appear.
[45] A. Dempster, Covariance selection, Biometrics 28 (1972) 157–175.
[46] N. Meinshausen, P. Buhlmann, High dimensional graphs and variable selection with the lasso, Annals of Statistics 34 (3) (2006) 1436–1462.
[47] J. Friedman, T. Hastie, R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (3) (2008) 432–441.

[48] A. d'Aspremont, O. Banerjee, L. El Ghaoui, First-order methods for sparse covariance selection, SIAM Journal on Matrix Analysis and Applications 30 (1) (2008) 56–66.
[49] E. Candes, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, The Annals of Statistics 35 (6) (2007) 2313–2351.
[50] M. Yuan, High dimensional inverse covariance matrix estimation via linear programming, Journal of Machine Learning Research 11 (8) (2010) 2261–2286.
[51] T. Sun, C.-H. Zhang, Sparse matrix inversion with scaled lasso, Tech. rep., Department of Statistics, Rutgers University (2012).
[52] W. Liu, X. Luo, High-dimensional sparse precision matrix estimation via sparse column inverse operator, arXiv:1203.3896.
[53] A. Belloni, V. Chernozhukov, L. Wang, Square-root lasso: pivotal recovery of sparse signals via conic programming, Biometrika 98 (2012) 791–806.
[54] B. Efron, R. Tibshirani, J. D. Storey, V. Tusher, Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 96 (456) (2001) 1151–1160.
[55] S. Dudoit, J. Fridlyand, T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97 (457) (2002) 77–87.
[56] J. D. Storey, R. Tibshirani, Statistical significance for genomewide studies, Proceedings of the National Academy of Sciences 100 (16) (2003) 9440–9445.
[57] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Methodological) (1995) 289–300.
[58] C. Genovese, L. Wasserman, A stochastic process approach to false discovery control, The Annals of Statistics 32 (3) (2004) 1035–1061.
[59] E. Lehmann, J. P. Romano, J. P. Shaffer, On optimality of stepdown and stepup multiple test procedures, Annals of Statistics (2005) 1084–1108.
[60] B. Efron, Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, Vol. 1, Cambridge University Press, 2010.
[61] Y. Benjamini, D. Yekutieli, The control of the false discovery rate in multiple testing under dependency, Annals of Statistics (2001) 1165–1188.
[62] J. D. Storey, J. E. Taylor, D. Siegmund, Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66 (1) (2003) 187–205.
[63] S. Clarke, P. Hall, Robustness of multiple testing procedures against dependence, The Annals of Statistics 37 (1) (2009) 332–358.
[64] B. Efron, Correlation and large-scale simultaneous significance testing, Journal of the American Statistical Association 102 (477) (2007) 93–103.
[65] J. T. Leek, J. D. Storey, A general framework for multiple testing dependence, Proceedings of the National Academy of Sciences 105 (48) (2008) 18718–18723.
[66] B. Efron, Correlated z-values and the accuracy of large-scale statistical estimates, Journal of the American Statistical Association 105 (491) (2010) 1042–1055.
[67] A. Schwartzman, X. Lin, The effect of correlation in false discovery rate estimation, Biometrika 98 (1) (2011) 199–214.
[68] J. Fan, R. Tang, X. Shi, Partial consistency with sparse incidental parameters.
[69] J. Fan, X. Han, Estimation of false discovery proportion with unknown dependence, Manuscript.
[70] J. Fan, J. Lv, A selective overview of variable selection in high dimensional feature space, Statistica Sinica 20 (1) (2010) 101–148.
[71] P. Buhlmann, S. Van De Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 2011.
[72] S. A. Van de Geer, High-dimensional generalized linear models and the lasso, The Annals of Statistics 36 (2) (2008) 614–645.
[73] J. Fan, J. Lv, Nonconcave penalized likelihood with NP-dimensionality, IEEE Transactions on Information Theory 57 (8) (2011) 5467–5484.
[74] H. Zou, R. Li, One-step sparse estimates in nonconcave penalized likelihood models, Annals of Statistics 36 (4) (2008) 1509–1533.
[75] J. Friedman, T. Hastie, H. Hofling, R. Tibshirani, Pathwise coordinate optimization, The Annals of Applied Statistics 1 (2) (2007) 302–332.

[76] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics 57 (11) (2004) 1413–1457.
[77] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (1) (2009) 183–202.

[78] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2) (2004) 407–499.
[79] J. Fan, R. Song, Sure independence screening in generalized linear models with NP-dimensionality, The Annals of Statistics 38 (6) (2010) 3567–3604.
[80] P. Hall, H. Miller, Using generalized correlation to effect variable selection in very high dimensional problems, Journal of Computational and Graphical Statistics 18 (3) (2009) 533–550.
[81] J. Fan, Y. Feng, R. Song, Nonparametric independence screening in sparse ultra-high-dimensional additive models, Journal of the American Statistical Association 106 (494) (2011) 544–557.
[82] R. Li, W. Zhong, L. Zhu, Feature screening via distance correlation learning, Journal of the American Statistical Association 107 (499) (2012) 1129–1139.
[83] G. Li, H. Peng, J. Zhang, L. Zhu, Robust rank correlation based screening, The Annals of Statistics 40 (3) (2012) 1846–1877.
[84] J. Fan, R. Samworth, Y. Wu, Ultrahigh dimensional feature selection: beyond the linear model, The Journal of Machine Learning Research 10 (2009) 2013–2038.
[85] V. Vapnik, S. E. Golowich, A. Smola, Support vector method for function approximation, regression estimation, and signal processing, Advances in Neural Information Processing Systems (1997) 281–287.
[86] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: Computational Learning Theory, Springer, 1995, pp. 23–37.
[87] L. Breiman, Arcing classifier (with discussion and a rejoinder by the author), The Annals of Statistics 26 (3) (1998) 801–849.
[88] D. M. Witten, R. Tibshirani, Covariance-regularized regression and classification for high dimensional problems, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3) (2009) 615–636.
[89] J. Shao, Y. Wang, X. Deng, S. Wang, Sparse linear discriminant analysis by thresholding for high dimensional data, The Annals of Statistics 39 (2) (2011) 1241–1265.
[90] T. Cai, W. Liu, A direct estimation approach to sparse linear discriminant analysis, Journal of the American Statistical Association 106 (496) (2011) 1566–1577.
[91] J. Fan, Y. Feng, X. Tong, A road to classification in high dimensional space: the regularized optimal affine discriminant, Journal of the Royal Statistical Society: Series B (2012) 745–771.

[92] A. Wille, P. Zimmermann, E. Vranova, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelic, P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, P. Buhlmann, Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana, Genome Biology 5 (11) (2004) R92.
[93] T. Zhao, H. Liu, K. Roeder, J. Lafferty, L. Wasserman, The huge package for high-dimensional undirected graph estimation in R, Journal of Machine Learning Research 13 (2012) 1059–1062.
[94] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, et al., Gene-expression profiles in hereditary breast cancer, New England Journal of Medicine 344 (8) (2001) 539–548.
[95] B. Efron, Correlation and large-scale simultaneous significance testing, Journal of the American Statistical Association 102 (477) (2007) 93–103.
[96] K. H. Desai, J. D. Storey, Cross-dimensional inference of dependent high-dimensional data, Journal of the American Statistical Association 107 (497) (2012) 135–151.
[97] H. Liu, F. Han, C. Zhang, Transelliptical graphical models, in: Advances in Neural Information Processing Systems 25, 2012, pp. 809–817.
[98] K. Fang, S. Kotz, K. Ng, Symmetric multivariate and related distributions, Chapman & Hall, 1990.
[99] H. Liu, J. Lafferty, L. Wasserman, The nonparanormal: semiparametric estimation of high dimensional undirected graphs, Journal of Machine Learning Research 10 (2009) 2295–2328.
[100] H. Liu, F. Han, M. Yuan, J. Lafferty, L. Wasserman, High dimensional semiparametric Gaussian copula graphical models, Annals of Statistics 40 (4) (2012) 2293–2326.
[101] L. Xue, H. Zou, Regularized rank-based estimation of high-dimensional nonparanormal graphical models, Annals of Statistics 40 (5) (2012) 2541–2571.
