
Focus Article

Leveraging for big data regression

Ping Ma∗ and Xiaoxiao Sun

Rapid advance in science and technology in the past decade brings an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as a unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324

Keywords: big data; subsampling; estimation; least squares; leverage

INTRODUCTION

Rapid advance in science and technology in the past decade brings an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to tackle much larger and more complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical and computing tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing technologies still lags far behind the exponential growth of databases. The cutting edge computing technologies of today and the foreseeable future are high performance computing (HPC) platforms, such as supercomputers and cloud computing.

∗Correspondence to: [email protected]

Department of Statistics, University of Georgia, Athens, GA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

Researchers have proposed the use of GPUs (graphics processing units),1 and explorations of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale. Supercomputers are precious resources and are not designed to be used routinely by everyone. The selection of projects to run on a supercomputer is very stringent. For example, resource allocation on the recently deployed Blue Waters supercomputer at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/) requires a rigorous National Science Foundation (NSF) review process and supports only a handful of projects. While clouds have large storage capabilities, ‘cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution’.2 While cloud facilities like MapReduce do offer some advantages, the overall issues of data transfer and usability of today’s clouds will indeed limit what we can achieve; the tight coupling between computing and storage is absent. A recently conducted comparison3 found that even high-end GPUs are sometimes outperformed by general-purpose multi-core implementations because of the time required to transfer data.

‘One option is to invent algorithms that make better use of a fixed amount of computing power’.2

In this article, we review an emerging family of statistical methods, called leveraging methods, that are developed for achieving such a goal. Leveraging methods are designed under a subsampling framework, in which we sample a small proportion of the data (subsample) from the full sample, and then perform intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods lies in effectively constructing nonuniform sampling probabilities so that influential data points are sampled with high probabilities. Traditionally, ‘subsampling’ has been used to refer to the ‘m-out-of-n’ bootstrap, whose primary motivation is to make approximate inference owing to the difficulty or intractability of deriving analytical expressions.4,5 However, the general motivation of the leveraging methods is different from that of traditional subsampling. In particular, (1) leveraging methods are used to achieve feasible computation even if simple analytic results are available, (2) leveraging methods enable visualization of the data when visualization of the full sample is impossible, and (3) as we will see later, leveraging methods usually use unequal sampling probabilities for subsampling data, whereas traditional subsampling almost exclusively uses equal sampling probabilities. Leveraging methods stand as a unique development in big data analytics and allow pervasive access to massive amounts of information without resorting to HPC and cloud computing.

LINEAR REGRESSION MODEL AND LEVERAGING

In this article, we focus on the linear regression setting. The first effective leveraging method in regression was developed in Refs 6 and 7. The method uses the sample leverage scores to construct a nonuniform sampling probability. Thus, the resulting sample tends to concentrate more on important or influential data points for the ordinary least squares (OLS).6,7 The estimate based on such a subsample has been shown to provide an excellent approximation to the OLS estimate based on the full data (when p is small and n is large). Ma et al.8,9 studied the statistical properties of the basic leveraging method in Refs 6 and 7 and proposed two novel methods to improve the basic leveraging method.

Linear Model and Least Squares Problem

Given n observations (xiT, yi), where i = 1, … , n, we consider a linear model,

y = X𝛽 + 𝜖, (1)

where y = (y1, … , yn)T is the n × 1 response vector, X = (x1, … , xn)T is the n × p predictor or design matrix, 𝛽 is a p × 1 coefficient vector, and 𝜖 is zero mean random noise with constant variance. The unknown coefficient 𝛽 can be estimated via the OLS, which is to solve

𝛽ols = argmin𝛽 ||y − X𝛽||²,  (2)

where || · || represents the Euclidean norm. When matrix X is of full rank, the OLS estimate 𝛽ols of 𝛽, the minimizer of (2), has the following expression,

𝛽ols = (XTX)−1XTy.  (3)

When matrix X is not of full rank, the inverse of XTX in expression (3) is replaced by a generalized inverse. The predicted response vector is

ŷ = Hy,  (4)

where the hat matrix H = X(XTX)−1XT. The ith diagonal element of the hat matrix H, hii = xiT(XTX)−1xi, is called the leverage score of the ith observation. It is well known that as hii approaches 1, the predicted response ŷi of the ith observation gets close to yi. Thus the leverage score has been regarded as the most important influence index, indicating how influential the ith observation is to the least squares estimator.
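To make the definition concrete, the following minimal sketch (our own illustration in Python/NumPy, not code from the article) computes the leverage scores as the diagonal of the hat matrix for a small simulated design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.standard_normal((n, p))

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverage scores h_ii.
# Forming the full n x n hat matrix is only feasible for modest n; the
# Computation section below discusses the SVD route used in practice.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Leverage scores lie in [0, 1] and sum to p = rank(X).
print(h.min(), h.max(), h.sum())
```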

Leveraging

When the sample size n is super large, the computation of the OLS estimator using the full sample may become extremely expensive. For example, if p = √n, the computation of the OLS estimator is O(n²), which may be infeasible for super large n. Alternatively, one may opt to use leveraging methods, which are presented below.

There are two types of leveraging methods, weighted leveraging methods and unweighted leveraging methods. Let us look at the weighted leveraging methods first.

Algorithm 1. Weighted Leveraging

• Step 1. Taking a random subsample of size r from the data. Construct sampling probability 𝜋 = {𝜋1, … , 𝜋n} for the data points. Draw a random subsample (with replacement) of size r ≪ n, denoted as (X*, y*), from the full sample according to the probability 𝜋. Record the corresponding sampling probability matrix Φ* = diag{𝜋*k}.

• Step 2. Solving a weighted least squares (WLS) problem using the subsample. Estimate 𝛽 via solving a WLS problem on the subsample to get the estimate 𝛽, i.e., solve

argmin𝛽 ||Φ*−1∕2y* − Φ*−1∕2X*𝛽||².  (5)
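A minimal sketch of Algorithm 1 in Python/NumPy (our own illustration, not code from the article): given sampling probabilities 𝜋, it draws the subsample with replacement and solves the weighted least squares problem (5) on the subsample.

```python
import numpy as np

def weighted_leveraging(X, y, r, pi, rng=None):
    """Sketch of Algorithm 1: draw r rows with replacement according to pi,
    then solve the weighted least squares problem (5) on the subsample."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=pi)
    X_star, y_star, pi_star = X[idx], y[idx], pi[idx]
    # Rescaling each subsampled row by 1/sqrt(pi_k*) implements Phi*^{-1/2}.
    w = 1.0 / np.sqrt(pi_star)
    beta_tilde, *_ = np.linalg.lstsq(X_star * w[:, None], y_star * w, rcond=None)
    return beta_tilde
```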

In the weighted leveraging methods (and the unweighted leveraging methods to be reviewed below), the key component is to construct the sampling probability 𝜋. An easy choice is 𝜋i = 1/n for each data point, i.e., uniform sampling probability. The resulting leveraging estimator is denoted as 𝛽UNIF. The leveraging algorithm with uniform subsampling probability is very simple to implement but performs poorly in many cases, see Figure 1. The first nonuniform weighted leveraging algorithm was developed by Drineas et al.6,7 In their weighted leveraging algorithm, they construct the sampling probability 𝜋i using the (normalized) sample leverage score as 𝜋i = hii/∑i hii. The resulting leveraging estimator is denoted as 𝛽BLEV, which yields impressive results: a preconditioned version of this leveraging method ‘beats LAPACK’s direct dense least squares solver by a large margin on essentially any dense tall matrix’.10
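As a hedged usage example (ours), the two probability choices discussed above can be computed and plugged into the weighted_leveraging sketch from the previous section; the data-generating setup here is an arbitrary illustration, not the article's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 100_000, 10, 500
X = rng.standard_t(df=6, size=(n, p))
y = X @ np.ones(p) + rng.standard_normal(n)

# Uniform subsampling probabilities (gives beta_UNIF).
pi_unif = np.full(n, 1.0 / n)

# Leverage-based probabilities (gives beta_BLEV): pi_i = h_ii / sum(h_ii),
# with h_ii obtained from a thin SVD (see the Computation section below).
U, _, _ = np.linalg.svd(X, full_matrices=False)
h = np.sum(U**2, axis=1)
pi_blev = h / h.sum()

beta_unif = weighted_leveraging(X, y, r, pi_unif, rng)
beta_blev = weighted_leveraging(X, y, r, pi_blev, rng)
```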

An important feature of the weighted leveraging method is, in step 2, to use the sampling probability as weight in solving the WLS based on the subsample. The weight is analogous to the weight in the classic Hansen–Hurwitz estimator.11 Recall that in classic sampling methods, given data (y1, … , yn) with sampling probability (𝜋1, … , 𝜋n), taking a random sample of size r, denoted by (y*1, … , y*r), the Hansen–Hurwitz estimator is r⁻¹ ∑ᵢ₌₁ʳ y*i∕𝜋*i, where 𝜋*i is the corresponding sampling probability of y*i. It is well known that r⁻¹ ∑ᵢ₌₁ʳ y*i∕𝜋*i is an unbiased estimate of ∑ᵢ₌₁ⁿ yi. It was shown in Refs 8 and 9 that the weighted leveraging estimator is an approximately unbiased estimator of 𝛽ols.
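The following small simulation (our own, assuming NumPy; the data and probabilities are arbitrary) illustrates the unbiasedness of the Hansen–Hurwitz estimator under nonuniform sampling with replacement.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 1000, 50
y = rng.exponential(size=n)
pi = rng.dirichlet(np.ones(n))      # an arbitrary nonuniform sampling probability

estimates = []
for _ in range(10_000):
    idx = rng.choice(n, size=r, replace=True, p=pi)
    estimates.append(np.mean(y[idx] / pi[idx]))   # r^{-1} * sum of y_i* / pi_i*

# The average of the Hansen-Hurwitz estimates is close to the population total.
print(np.mean(estimates), y.sum())
```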

FIGURE 1 | A random sample of n = 500 was generated from yi = −xi + 𝜖i, where xi is t(6)-distributed and 𝜖i ~ N(0, 1). The true regression function is in the dotted black line, the data in black circles, and the OLS estimator using the full sample in the black solid line. (a) The uniform leveraging estimator is in the green dashed line; the uniform leveraging subsample is superimposed as green crosses. (b) The weighted leveraging estimator is in the red dot-dashed line; the points in the weighted leveraging subsample are superimposed as red crosses.

It was also noticed in Ref 8 that the variance of the weighted leveraging estimator may be inflated by the data points with tiny sampling probability. Subsequently, Ma et al.8,9 introduced a shrinkage leveraging method. In the shrinkage leveraging method, the sampling probability is shrunk toward the uniform sampling probability, i.e., 𝜋i = 𝜆 hii/∑hii + (1 − 𝜆)/n, where the tuning parameter 𝜆 is a constant in [0, 1]. The resulting estimator is denoted as 𝛽SLEV. It was demonstrated in Refs 8 and 9 that 𝛽SLEV has smaller mean squared error than 𝛽BLEV when 𝜆 is chosen between 0.8 and 0.95 in their simulation studies.
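In code, the shrinkage probabilities are simply a convex combination of the leverage-based and uniform probabilities; a minimal sketch (ours, assuming NumPy) follows, with 𝜆 = 0.9 as used in the real-data example later in this article.

```python
import numpy as np

def slev_probabilities(h, lam=0.9):
    """Shrinkage leveraging probabilities: lam * h_ii / sum(h) + (1 - lam) / n."""
    n = h.shape[0]
    return lam * h / h.sum() + (1.0 - lam) / n
```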

It is worth noting that the first leveraging method is designed primarily to approximate the full sample OLS estimate.6,7 Ma et al.8 substantially expand the scope of the leveraging methods by applying them to infer the true value of the coefficient 𝛽. Moreover, if inference of the true value of the coefficient 𝛽 is the major concern, one may use an unweighted leveraging method.8

Algorithm 2. Unweighted Leveraging

• Step 1. Taking a random sample of size r from the data. This step is the same as step 1 in Algorithm 1.

• Step 2. Solving an OLS problem using the subsample. Estimate 𝛽 by solving an OLS problem (instead of a WLS problem) on the subsample to get the estimate 𝛽u, i.e., solve

argmin𝛽 ||y* − X*𝛽||².  (6)
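A corresponding sketch of Algorithm 2 (again ours, in NumPy): the subsample is drawn exactly as in Algorithm 1, but the least squares problem (6) is solved without the 1/√𝜋 rescaling.

```python
import numpy as np

def unweighted_leveraging(X, y, r, pi, rng=None):
    """Sketch of Algorithm 2: the same subsampling step as Algorithm 1,
    followed by ordinary (unweighted) least squares on the subsample."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    beta_u, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta_u
```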


Although 𝛽u is an approximately biased estimator of the full sample OLS estimator 𝛽ols, Ma et al.8 show that 𝛽u is an unbiased estimator of the true value of the coefficient 𝛽. Simulation studies also indicate that the unweighted leveraging estimator 𝛽u has smaller variance than the corresponding weighted leveraging estimator 𝛽.

Analytic Framework for Leveraging Estimators

We use the mean squared error (MSE) to compare the performance of the leveraging methods. In particular, the MSE can be decomposed into bias and variance components. We start with deriving the bias and variance of the weighted leveraging estimator 𝛽 in Algorithm 1. It is instructive to note that, when studying the statistical properties of leveraging estimators, there are two layers of randomness (from sampling and subsampling).

From Algorithm 1, we can easily see

𝛽 = (X*TΦ*−1X*)−1X*TΦ*−1y*.

This expression is difficult to analyze as it involves both the random subsample (X*, y*) and its corresponding sampling probability matrix Φ*. Instead, we rewrite the weighted leveraging estimator in terms of the original data (X, y),

𝛽 = (XTWX)−1XTWy,

where W = diag(w1, w2, … , wn) is an n × n diagonal random matrix, wi = ki/(r𝜋i), and ki is the number of times that the ith data point appears in the random subsample (recall that step 1 in Algorithm 1 is to take the subsample with replacement).
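To see where these random weights come from, the sketch below (our own, assuming NumPy) draws the with-replacement counts ki, forms wi = ki/(r𝜋i), and checks that the weights average to one across repeated subsamples.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 200, 50
pi = rng.dirichlet(np.ones(n))

# The counts k_i of a with-replacement subsample follow Multinomial(r, pi);
# the random weights are w_i = k_i / (r * pi_i), so averaging over many
# subsamples should give weights close to 1 in every coordinate.
reps = 20_000
w_bar = np.zeros(n)
for _ in range(reps):
    k = rng.multinomial(r, pi)
    w_bar += k / (r * pi)
w_bar /= reps
print(w_bar.min(), w_bar.max())   # both close to 1
```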

Clearly, the estimator 𝛽 can be regarded as a function of the random weight vector w = (w1, w2, … , wn)T, denoted as 𝛽(w). As we take the random subsample with replacement, it is easy to see that w has a scaled multinomial distribution,

Pr{w1 = k1/(r𝜋1), w2 = k2/(r𝜋2), … , wn = kn/(r𝜋n)} = (r!/(k1! k2! ⋯ kn!)) 𝜋1^k1 𝜋2^k2 ⋯ 𝜋n^kn,

with mean Ew = 1. By setting w0, the vector around which we perform our Taylor series expansion, to be the all-ones vector, i.e., w0 = 1, 𝛽(w) can be expanded around the full sample OLS estimate 𝛽ols, i.e., 𝛽(1) = 𝛽ols. From this, we have the following lemma.

LEMMA 1. Let 𝛽 be the weighted leveraging estimator. Then, a Taylor expansion of 𝛽 around the point w0 = 1 yields

𝛽 = 𝛽ols + (XTX)−1XTdiag{e}(w − 1) + RW,  (7)

where e = y − X𝛽ols is the OLS residual vector, and RW = op(||w − w0||) is the Taylor expansion remainder.

Note that the detailed proofs of all lemmas and theorems can be found in Ref 8.

Given Lemma 1, we can establish the following lemma, which provides expressions for the conditional and unconditional expectations and variances of the weighted leveraging estimator.

LEMMA 2. The conditional expectation and conditional variance for the weighted leveraging estimator are given by:

E{𝛽|y} = 𝛽ols + E{RW};  (8)

Var{𝛽|y} = (XTX)−1XTdiag{e} diag{1∕(r𝝅)} diag{e} X(XTX)−1 + Var{RW},  (9)

where W specifies the probability distribution used in the sampling and rescaling steps. The unconditional expectation and unconditional variance for the weighted leveraging estimator are given by:

E{𝛽} = 𝛽0;  (10)

Var{𝛽} = 𝜎²(XTX)−1 + (𝜎²∕r)(XTX)−1XTdiag{(1 − hii)²∕𝜋i}X(XTX)−1 + Var{RW}.  (11)

Equation (8) states that, when the E{RW} term is negligible, i.e., when the linear approximation is valid, then, conditioning on the observed data y, the estimate 𝛽 is approximately unbiased, relative to the full sample OLS estimate 𝛽ols; and Eq. (10) states that the estimate 𝛽 is unbiased, relative to the true value 𝛽0 of the parameter vector 𝛽. That is, given a particular dataset (X, y), the conditional expectation result of Eq. (8) states that the leveraging estimators can approximate 𝛽ols well; and, as a statistical inference procedure for arbitrary datasets, the unconditional expectation result of Eq. (10) states that the leveraging estimators can infer 𝛽0 well.

Both the conditional variance of Eq. (9) and the (second term of the) unconditional variance of Eq. (11) are inversely proportional to the subsample size r; and both contain a sandwich-type expression, the middle of which depends on how the leverage scores interact with the sampling probabilities. Moreover, the first term of the unconditional variance, 𝜎²(XTX)−1, equals the variance of the OLS estimator; this implies, e.g., that the unconditional variance of Eq. (11) is larger than the variance of the OLS estimator, which is consistent with the Gauss–Markov theorem.

Now consider the unweighted leveraging estimator, which differs from the weighted leveraging estimator in that no weight is used for the least squares based on the subsample. Thus, we shall derive the bias and variance of the unweighted leveraging estimator 𝛽u. To do so, we first use a Taylor series expansion to get the following lemma.

LEMMA 3. Let 𝛽u be the unweighted leveraging estimator. Then, a Taylor expansion of 𝛽u around the point w0 = r𝜋 yields

𝛽u = 𝛽wls + (XTW0X)−1XTdiag{ew}(w − r𝜋) + Ru,  (12)

where 𝛽wls = (XTW0X)−1XTW0y is the full sample WLS estimator, ew = y − X𝛽wls is the LS residual vector, W0 = diag{r𝜋}, and Ru is the Taylor expansion remainder.

This lemma is analogous to Lemma 1. In particular, here we expand around the point w0 = r𝜋 since E{w} = r𝜋 when no reweighting takes place.

Given this Taylor expansion lemma, we can now establish the following lemma for the mean and variance of the unweighted leveraging estimator, both conditioned and unconditioned on the data y.

LEMMA 4. The conditional expectation and conditional variance for the unweighted leveraging estimator are given by:

E{𝛽u|y} = 𝛽wls + E{Ru};

Var{𝛽u|y} = (XTW0X)−1XTdiag{ew}W0 diag{ew}X(XTW0X)−1 + Var{Ru},

where W0 = diag{r𝜋}, and where 𝛽wls = (XTW0X)−1XTW0y is the full sample WLS estimator. The unconditional expectation and unconditional variance for the unweighted leveraging estimator are given by:

E{𝛽u} = 𝛽0;

Var{𝛽u} = 𝜎²(XTW0X)−1XTW0²X(XTW0X)−1 + 𝜎²(XTW0X)−1XTdiag{I − PX,W0}W0 diag{I − PX,W0}X(XTW0X)−1 + Var{Ru},  (13)

where PX,W0 = X(XTW0X)−1XTW0.

The two expectation results in this lemma state that: (1) when E{Ru} is negligible, then, conditioning on the observed data y, the estimator 𝛽u is approximately unbiased, relative to the full sample WLS estimator 𝛽wls; and (2) the estimator 𝛽u is unbiased, relative to the true value 𝛽0 of the parameter vector 𝛽.

As expected, when the leverage scores are all the same, the variance in Eq. (13) is the same as the variance of uniform random sampling. This is expected since, when reweighting with respect to the uniform distribution, one does not change the problem being solved, and thus the solutions to the weighted and unweighted OLS problems are identical. More generally, the variance is not inflated by very small leverage scores, as it is with the weighted leveraging estimator. For example, the conditional variance expression is also a sandwich-type expression, the center of which is W0 = diag{rhii/n}, which is not inflated by very small leverage scores.

Computation

The running time of the leveraging methods depends on both the time to construct the probability 𝜋 and the time to solve the least squares problem based on the subsample. Solving least squares based on a subsample of size r requires O(rp²) time. On the other hand, when constructing 𝜋 involves the leverage scores hii, the running time is dominated by the computation of those scores. The leverage score hii can be computed using the singular value decomposition (SVD) of X. The SVD of X can be written as

X = UΛVT,

where U is an n × p orthogonal matrix whose columns contain the left singular vectors of X, V is a p × p orthogonal matrix whose columns contain the right singular vectors of X, the p × p matrix Λ = diag{𝜆i}, and 𝜆i, i = 1, … , p, are the singular values of X. The hat matrix H can alternatively be expressed as H = UUT.


TABLE 1 | Mean (Standard Deviation) of Sample MSEs over 100 Replicates for Different Subsample Sizes r

Algorithm     r = 200              r = 500              r = 1000             r = 2000
Uniform       8.46e-5 (2.44e-5)    1.32e-5 (8.16e-7)    9.89e-6 (2.38e-7)    8.80e-6 (9.20e-8)
Weighted      7.72e-5 (2.64e-5)    1.24e-5 (6.21e-7)    9.61e-6 (2.47e-7)    8.80e-6 (1.60e-7)
Unweighted    7.03e-5 (2.36e-5)    1.18e-5 (5.14e-7)    9.28e-6 (1.99e-7)    8.51e-6 (6.88e-8)
Shrinkage     7.02e-5 (1.88e-5)    1.24e-5 (6.50e-7)    9.67e-6 (2.80e-7)    8.79e-6 (1.49e-7)

MSE, mean squared error.

Then, the leverage score of the ith observation can be expressed as

hii = ||ui||²,  (14)

where ui is the ith row of U. The exact computation of hii, for i = 1, … , n, in Eq. (14) requires O(np²) time.12
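In NumPy, this exact computation amounts to a thin SVD followed by squared row norms of U; the following helper is our own sketch, not code from the article.

```python
import numpy as np

def leverage_scores_svd(X):
    """Exact leverage scores via a thin SVD: h_ii = ||u_i||^2 (Eq. (14)),
    at O(n p^2) cost."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)
```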

The fast randomized algorithm from Ref 13 can be used to compute approximations to the leverage scores of X in o(np²) time. The algorithm of Ref 13 takes an arbitrary n × p matrix X as input and proceeds as follows.

• Generate an r1 × n random matrix Π1 and a p × r2 random matrix Π2.

• Let R be the R matrix from a QR decomposition of Π1X.

• Compute and return the leverage scores of the matrix XR−1Π2.

For appropriate choices of r1 and r2, if one chooses Π1 to be a Hadamard-based random projection matrix, then this algorithm runs in o(np²) time, and it returns 1 ± 𝜖 approximations to all the leverage scores of X.13 In addition, with a high-quality implementation of the Hadamard-based random projection, this algorithm runs faster than traditional deterministic algorithms based on LAPACK.10,14 Ma et al.8 used two variants of this fast algorithm of Ref 13: the binary fast algorithm, where (up to normalization) each element of Π1 and Π2 is generated i.i.d. from {−1, 1} with equal sampling probabilities, and the Gaussian fast algorithm, where each element of Π1 is generated i.i.d. from a Gaussian distribution with mean zero and variance 1/n and each element of Π2 is generated i.i.d. from a Gaussian distribution with mean zero and variance 1/p. Ma et al.8 showed that these two algorithms perform very well.
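A minimal sketch (ours, assuming NumPy) of the Gaussian variant described above; the choices of r1 and r2 are left to the caller and must satisfy the requirements of Ref 13 (in particular r1 ≥ p so that R is p × p).

```python
import numpy as np

def fast_leverage_scores(X, r1, r2, rng=None):
    """Sketch of the fast randomized approximation of leverage scores
    (Gaussian variant): Pi1 is r1 x n with N(0, 1/n) entries, Pi2 is p x r2
    with N(0, 1/p) entries; R comes from a QR decomposition of Pi1 X, and the
    approximate scores are the squared row norms of X R^{-1} Pi2."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(n)
    Pi2 = rng.standard_normal((p, r2)) / np.sqrt(p)
    _, R = np.linalg.qr(Pi1 @ X)                 # R is p x p (requires r1 >= p)
    B = np.linalg.solve(R.T, X.T).T              # B = X R^{-1}, no explicit inverse
    return np.sum((B @ Pi2) ** 2, axis=1)
```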

Real Example

The Human Connectome Project (HCP) aims to delineate human brain connectivity and function in the healthy adult human brain and to provide these data as a public resource for biomedical research. In the HCP, functional magnetic resonance imaging (fMRI) data were collected as follows. Scanners acquire slices of a three-dimensional (3D) functional image, resulting in blood oxygenation level dependent (BOLD) signal samples at different layers of the brain while subjects perform a certain task. In this example, fMRI BOLD intensities released in 201315 were used. The BOLD signals at approximately 0.2 million voxels in each subject's brain were taken at 176 consecutive time points during a task named emotion. Most conventional methods can only analyze single-subject imaging data due to the daunting computational cost. We stacked five subjects' BOLD signals together, resulting in a data matrix of 1,004,738 × 176, in which each row is a voxel and each column is a time point. We use the BOLD signal at the last time point as the response y and the remaining 175 BOLD measurements, with an additional intercept, as the predictor X, which is 1,004,738 × 176. We fit the linear regression model of (1) to the data.

We applied four leveraging methods to fit the linear regression model. In particular, uniform leveraging, weighted leveraging, unweighted leveraging, and shrinkage leveraging with 𝜆 = 0.9 were compared, with the subsample size ranging from r = 200 to 2000. The sample MSE = (y − ŷ)T(y − ŷ)∕n was calculated.

We repeated each leveraging method one hundred times at each subsample size. The means and standard deviations of the MSEs are reported in Table 1. From Table 1, one can clearly see that the unweighted leveraging outperforms the other methods in terms of sample MSE. The nonuniform leveraging methods are significantly better than uniform leveraging, especially when the subsample size is small.
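As a hedged sketch of how such a fit and its sample MSE might be computed, the snippet below (ours) reuses the hypothetical helpers sketched earlier in this article and substitutes a small random stand-in for the stacked BOLD matrix; none of the variable names or data come from the article.

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for the stacked BOLD matrix (voxels x time points); the data in the
# article is 1,004,738 x 176, here we simulate a much smaller matrix.
bold = rng.standard_normal((10_000, 176))

y = bold[:, -1]                                               # last time point
X = np.column_stack([np.ones(bold.shape[0]), bold[:, :-1]])   # intercept + 175 signals

h = leverage_scores_svd(X)                    # sketch from the Computation section
pi = slev_probabilities(h, lam=0.9)           # shrinkage probabilities, lambda = 0.9
beta = unweighted_leveraging(X, y, r=2000, pi=pi)

y_hat = X @ beta
sample_mse = (y - y_hat) @ (y - y_hat) / len(y)   # (y - y_hat)^T (y - y_hat) / n
```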

As for computing time, the OLS on the full sample takes 70.550 CPU seconds, whereas the leveraging methods need only about 28 CPU seconds. In particular, in step 1 of the weighted leveraging method, the SVD takes 22.277 CPU seconds, calculating the leverage scores takes 5.848 seconds, and sampling takes 0.008 seconds. Step 2 takes an additional 0.012 CPU seconds.


CONCLUSION

In this article, we reviewed the emerging family of leveraging methods in the context of linear regression. For linear regression, there are many methods available for computing the exact least squares estimator of the full sample using divide-and-conquer approaches. However, none of them is designed to visualize the big data. Without looking at the data, exploratory data analysis is impossible. In contrast, step 1 in the leveraging algorithms provides a random ‘sketch’ of the data. Thus, visualization of the subsample obtained in step 1 of the leveraging methods enables a surrogate visualization of the full data. While outliers are likely to be high leverage data points and thus likely to be selected into the subsample in leveraging methods, we suggest that model diagnostics may be performed after the leveraging estimates are calculated. Subsequently, outliers can be identified and removed. Finally, the leveraging estimates may be recalculated on the subsample without outliers.

ACKNOWLEDGMENTS

We thank the Tianming Liu Lab for providing the fMRI data. This research was partially supported by National Science Foundation grants DMS-1055815, 12228288, and 1222718.

REFERENCES

1. Suchard MA, Wang Q, Chan C, Frelinger J, Cron A, West M. Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. J Comput Graph Stat 2010, 19:419–438.

2. Schatz MC, Langmead B, Salzberg SL. Cloud computing and the DNA data race. Nat Biotechnol 2010, 28:691.

3. Pratas F, Trancoso P, Sousa L, Stamatakis A, Shi G, Kindratenko V. Fine-grain parallelism using multi-core, cell/BE, and GPU systems. Parallel Comput 2012, 38:365–390.

4. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979, 7:1–26.

5. Wu CFJ. Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat 1986, 14:1261–1295.

6. Drineas P, Mahoney MW, Muthukrishnan S. Sampling algorithms for l2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, Florida, 2006, 1127–1136.

7. Drineas P, Mahoney MW, Muthukrishnan S, Sarlós T. Faster least squares approximation. Numer Math 2010, 117:219–249.

8. Ma P, Mahoney M, Yu B. A statistical perspective on algorithmic leveraging. ArXiv e-prints, June 2013.

9. Ma P, Mahoney M, Yu B. A statistical perspective on algorithmic leveraging. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, 91–99.

10. Avron H, Maymounkov P, Toledo S. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM J Sci Comput 2010, 32:1217–1236.

11. Hansen M, Hurwitz W. On the theory of sampling from a finite population. Ann Math Stat 1943, 14:333–362.

12. Golub G, Van Loan C. Matrix Computations. Baltimore: Johns Hopkins University Press; 1996.

13. Drineas P, Magdon-Ismail M, Mahoney MW, Woodruff DP. Fast approximation of matrix coherence and statistical leverage. J Mach Learn Res 2012, 13:3475–3506.

14. Gittens A, Mahoney MW. Revisiting the Nyström method for improved large-scale machine learning. Technical Report, Preprint: arXiv:1303.1849, 2013.

15. Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K. The WU-Minn Human Connectome Project: an overview. Neuroimage 2013, 80:62–79.
