
Focus Article

Leveraging for big data regression

Ping Ma and Xiaoxiao Sun

Rapid advances in science and technology in the past decade have brought an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as a unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324

    Keywords: big data; subsampling; estimation; least squares; leverage

Correspondence to: pingma@uga.edu

Department of Statistics, University of Georgia, Athens, GA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Rapid advances in science and technology in the past decade have brought an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to tackle much larger and more complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical and computing tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing technologies still lags far behind the exponential growth of databases. The cutting-edge computing technologies of today and the foreseeable future are high performance computing (HPC) platforms, such as supercomputers and cloud computing. Researchers have proposed the use of GPUs (graphics processing units),1 and explorations of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale. Supercomputers are precious resources and are not designed to be used routinely by everyone. The selection of projects to run on a supercomputer is very stringent. For example, resource allocation on the recently deployed Blue Waters supercomputer at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/) requires a rigorous National Science Foundation (NSF) review process and supports only a handful of projects. While clouds have large storage capabilities, cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution.2 While cloud facilities like MapReduce do offer some advantages, the overall issues of data transfer and usability of today's clouds will indeed limit what we can achieve; the tight coupling between computing and storage is absent.


A recently conducted comparison3 found that even high-end GPUs are sometimes outperformed by general-purpose multi-core implementations because of the time required to transfer data.

One option is to invent algorithms that make better use of a fixed amount of computing power.2

In this article, we review an emerging family of statistical methods, called leveraging methods, that are developed for achieving such a goal. Leveraging methods are designed under a subsampling framework, in which we sample a small proportion of the data (subsample) from the full sample, and then perform intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods lies in effectively constructing nonuniform sampling probabilities so that influential data points are sampled with high probabilities. Traditionally, subsampling has been used to refer to m-out-of-n bootstrap, whose primary motivation is to make approximate inference owing to the difficulty or intractability of deriving analytical expressions.4,5 However, the general motivation of the leveraging method is different from that of traditional subsampling. In particular, (1) leveraging methods are used to achieve feasible computation even if simple analytic results are available, (2) leveraging methods enable visualization of the data when visualization of the full sample is impossible, and (3) as we will see later, leveraging methods usually use unequal sampling probabilities for subsampling data, whereas subsampling almost exclusively uses equal sampling probabilities. Leveraging methods stand as a unique development in big data analytics and allow pervasive access to massive amounts of information without resorting to HPC and cloud computing.

LINEAR REGRESSION MODEL AND LEVERAGING

In this article, we focus on the linear regression setting. The first effective leveraging method in regression was developed in Refs 6 and 7. The method uses the sample leverage scores to construct a nonuniform sampling probability. Thus, the resulting sample tends to concentrate more on important or influential data points for the ordinary least squares (OLS).6,7 The estimate based on such a subsample has been shown to provide an excellent approximation to the OLS based on full data (when p is small and n is large). Ma et al.8,9 studied the statistical properties of the basic leveraging method in Refs 6 and 7 and proposed two novel methods to improve the basic leveraging method.

Linear Model and Least Squares Problem

Given n observations $(x_i^T, y_i)$, where $i = 1, \ldots, n$, we consider a linear model,

$$y = X\beta + \epsilon, \qquad (1)$$

where $y = (y_1, \ldots, y_n)^T$ is the $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T$ is the $n \times p$ predictor or design matrix, $\beta$ is a $p \times 1$ coefficient vector, and $\epsilon$ is zero-mean random noise with constant variance. The unknown coefficient $\beta$ can be estimated via the OLS, which is to solve

$$\hat{\beta}_{ols} = \arg\min_{\beta} \|y - X\beta\|^2, \qquad (2)$$

where $\|\cdot\|$ represents the Euclidean norm. When matrix $X$ is of full rank, the OLS estimate $\hat{\beta}_{ols}$ of $\beta$, the minimizer of (2), has the following expression,

$$\hat{\beta}_{ols} = \left(X^T X\right)^{-1} X^T y. \qquad (3)$$
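For concreteness, the following minimal NumPy sketch (the function and variable names are ours, not from the article) computes the full-sample OLS estimate defined by (2) and (3):

```python
import numpy as np

def ols(X, y):
    """Full-sample OLS estimate: the minimizer of ||y - X beta||^2 in (2).

    np.linalg.lstsq delegates to a LAPACK least squares routine, which is
    numerically preferable to forming (X^T X)^{-1} X^T y from (3) directly.
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```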

When matrix $X$ is not of full rank, the inverse of $X^T X$ in expression (3) is replaced by a generalized inverse. The predicted response vector is

$$\hat{y} = Hy, \qquad (4)$$

where the hat matrix $H = X(X^T X)^{-1} X^T$. The $i$th diagonal element of the hat matrix $H$,

$$h_{ii} = x_i^T \left(X^T X\right)^{-1} x_i,$$

is called the leverage score of the $i$th observation. It is well known that as $h_{ii}$ approaches 1, the predicted response $\hat{y}_i$ of the $i$th observation gets close to $y_i$. Thus the leverage score has been regarded as the most important influence index, indicating how influential the $i$th observation is to the least squares estimator.
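Leverage scores can be computed without forming the $n \times n$ hat matrix. Below is a minimal sketch, assuming a dense, full-rank $X$ and using the identity $H = QQ^T$ for a thin QR factorization $X = QR$ (the helper name is ours):

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i.

    For full-rank X with thin QR factorization X = QR, the hat matrix is
    H = Q Q^T, so h_ii is the squared Euclidean norm of the ith row of Q.
    """
    Q, _ = np.linalg.qr(X)           # Q is n x p with orthonormal columns
    return np.sum(Q**2, axis=1)

# The scores sum to p (the trace of H) for a full-rank design:
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
assert np.isclose(leverage_scores(X).sum(), X.shape[1])
```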

Leveraging

When the sample size n is super large, the computation of the OLS estimator using the full sample may become extremely expensive. For example, if $p = \sqrt{n}$, the computation of the OLS estimator is $O(n^2)$, which may be infeasible for super large $n$. Alternatively, one may opt to use leveraging methods, which are presented below.

There are two types of leveraging methods, weighted leveraging methods and unweighted leveraging methods. Let us look at the weighted leveraging methods first.


    Algorithm 1. Weighted Leveraging

Step 1. Taking a random subsample of size r from the data. Construct sampling probabilities $\pi = \{\pi_1, \ldots, \pi_n\}$ for the data points. Draw a random subsample (with replacement) of size $r \ll n$, denoted as $(X^*, y^*)$, from the full sample according to the probabilities $\pi$. Record the corresponding sampling probability matrix $\Phi^* = \mathrm{diag}\{\pi_k^*\}$.

Step 2. Solving a weighted least squares (WLS) problem using the subsample. Obtain the least squares estimate using the subsample. Estimate $\beta$ via solving a WLS problem on the subsample to get the estimate $\tilde{\beta}$, i.e., solve

$$\arg\min_{\beta} \|\Phi^{*-1/2} y^* - \Phi^{*-1/2} X^* \beta\|^2. \qquad (5)$$

In the weighted leveraging methods (and unweighted leveraging methods to be reviewed below), the key component is to construct the sampling probabilities $\pi$. An easy choice is $\pi_i = 1/n$ for each data point, i.e., uniform sampling probability. The resulting leveraging estimator is denoted as UNIF. The leveraging algorithm with uniform subsampling probability is very simple to implement but performs poorly in many cases; see Figure 1. The first nonuniform weighted leveraging algorithm was developed by Drineas et al.6,7 In their weighted leveraging algorithm, they construct the sampling probability $\pi_i$ using the (normalized) sample leverage score as $\pi_i = h_{ii}/\sum_i h_{ii}$. The resulting leveraging estimator is denoted as BLEV, which yields impressive results: a preconditioned version of this leveraging method beats LAPACK's direct dense least squares solver by a large margin on essentially any dense tall matrix.10
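To make the two steps concrete, here is a minimal NumPy sketch of Algorithm 1 with BLEV probabilities; the function name and the QR route to the leverage scores are our own choices rather than part of the cited algorithms:

```python
import numpy as np

def weighted_leveraging(X, y, r, seed=None):
    """Sketch of Algorithm 1 with pi_i = h_ii / sum_j h_jj (BLEV).

    Step 1: draw r indices with replacement according to pi.
    Step 2: solve the WLS problem (5); multiplying the sampled rows by
    1/sqrt(pi_i) implements the Phi^{*-1/2} rescaling.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(X)                  # thin QR; h_ii = ||q_i||^2
    h = np.sum(Q**2, axis=1)
    pi = h / h.sum()                        # BLEV sampling probabilities
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    w = 1.0 / np.sqrt(pi[idx])
    beta_tilde, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx],
                                     rcond=None)
    return beta_tilde
```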

An important feature of the weighted leveraging method is, in step 2, to use the sampling probability as a weight in solving the WLS based on the subsample. The weight is analogous to the weight in the classic Hansen–Hurwitz estimator.11 Recall that in classic sampling methods, given data $(y_1, \ldots, y_n)$ with sampling probabilities $(\pi_1, \ldots, \pi_n)$, taking a random sample of size r, denoted by $(y_1^*, \ldots, y_r^*)$, the Hansen–Hurwitz estimator is $r^{-1} \sum_{i=1}^{r} y_i^*/\pi_i^*$, where $\pi_i^*$ is the corresponding sampling probability of $y_i^*$. It is well known that $r^{-1} \sum_{i=1}^{r} y_i^*/\pi_i^*$ is an unbiased estimate of $\sum_{i=1}^{n} y_i$. It was shown in Refs 8 and 9 that the weighted leveraging estimator is an approximately unbiased estimator of $\hat{\beta}_{ols}$.
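The unbiasedness of the Hansen–Hurwitz estimator is easy to verify numerically; the following sketch uses arbitrary illustrative probabilities and sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, reps = 1000, 50, 20000
y = rng.standard_normal(n)
pi = np.abs(y) + 0.1                  # any strictly positive probabilities
pi /= pi.sum()

# Each draw yields the Hansen-Hurwitz estimate r^{-1} sum_i y_i^* / pi_i^*,
# which is unbiased for the population total sum_i y_i.
estimates = [(y[idx] / pi[idx]).mean()
             for idx in (rng.choice(n, size=r, replace=True, p=pi)
                         for _ in range(reps))]
print(np.mean(estimates), y.sum())    # the two values should be close
```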

FIGURE 1 | A random sample of $n = 500$ was generated from $y_i = x_i + \epsilon_i$, where $x_i$ is t(6)-distributed and $\epsilon_i \sim N(0, 1)$. The true regression function is shown as the dotted black line, the data as black circles, and the OLS estimator using the full sample as the black solid line. (a) The uniform leveraging estimator is shown as the green dashed line, with the uniform leveraging subsample superimposed as green crosses. (b) The weighted leveraging estimator is shown as the red dot-dashed line, with the points in the weighted leveraging subsample superimposed as red crosses.
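For readers who wish to reproduce a figure like this, the data-generating step is straightforward; the seed below is our own choice:

```python
import numpy as np

# n = 500 points with x_i ~ t(6) and y_i = x_i + eps_i, eps_i ~ N(0, 1),
# matching the setup described in the caption of Figure 1.
rng = np.random.default_rng(0)
n = 500
x = rng.standard_t(df=6, size=n)
y = x + rng.standard_normal(n)
X = x[:, None]      # n x 1 design matrix, usable with the sketches above
```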

It was also noticed in Ref 8 that the variance of the weighted leveraging estimator may be inflated by data points with tiny sampling probabilities. Subsequently, Ma et al.8,9 introduced a shrinkage leveraging method. In the shrinkage leveraging method, the sampling probability is shrunk toward the uniform sampling probability, i.e., $\pi_i = \alpha \, h_{ii}/\sum_i h_{ii} + (1 - \alpha)\frac{1}{n}$, where the tuning parameter $\alpha$ is a constant in [0, 1]. The resulting estimator is denoted as SLEV. It was demonstrated in Refs 8 and 9 that SLEV has smaller mean squared error than BLEV when $\alpha$ is chosen between 0.8 and 0.95 in their simulation studies.
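As a sketch, the SLEV probabilities can be computed in one line from the leverage scores; the function name and default $\alpha$ are our choices, with $\alpha$ in the 0.8-0.95 range suggested above:

```python
import numpy as np

def slev_probabilities(h, alpha=0.9):
    """Shrinkage leveraging (SLEV) probabilities
    pi_i = alpha * h_ii / sum_j h_jj + (1 - alpha) / n, alpha in [0, 1].
    The result sums to 1 and can replace the BLEV probabilities above."""
    h = np.asarray(h, dtype=float)
    return alpha * h / h.sum() + (1.0 - alpha) / h.size
```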

It is worth noting that the first leveraging method is designed primarily to approximate the full sample OLS estimate.6,7 Ma et al.8 substantially expanded the scope of the leveraging methods by applying them to infer the true value of the coefficient $\beta_0$. Moreover, if inference of the true value of the coefficient $\beta_0$ is the major concern, one may use an unweighted leveraging method.8

    Algorithm 2. Unweighted Leveraging

Step 1. Taking a random sample of size r from the data. This step is the same as step 1 in Algorithm 1.

Step 2. Solving an OLS problem using the subsample. Obtain the least squares estimate using the subsample. Estimate $\beta$ by solving an OLS problem (instead of a WLS problem) on the subsample to get the estimate $\tilde{\beta}_u$, i.e., solve

$$\arg\min_{\beta} \|y^* - X^* \beta\|^2. \qquad (6)$$
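A minimal sketch of Algorithm 2 follows; it reuses sampling probabilities computed as before, and the function name is ours:

```python
import numpy as np

def unweighted_leveraging(X, y, r, pi, seed=None):
    """Sketch of Algorithm 2: sample exactly as in Algorithm 1, then solve
    a plain OLS problem (6) on the subsample -- no 1/sqrt(pi_i) rescaling."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    beta_u, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta_u
```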


Although $\tilde{\beta}_u$ is an approximately biased estimator of the full sample OLS estimator $\hat{\beta}_{ols}$, Ma et al.8 show that $\tilde{\beta}_u$ is an unbiased estimator of the true value of the coefficient $\beta_0$. Simulation studies also indicate that the unweighted leveraging estimator $\tilde{\beta}_u$ has smaller variance than the corresponding weighted leveraging estimator $\tilde{\beta}$.

Analytic Framework for Leveraging Estimators

We use the mean squared error (MSE) to compare the performance of the leveraging methods. In particular, the MSE can be decomposed into bias and variance components. We start with deriving the bias and variance of the weighted leveraging estimator in Algorithm 1. It is instructive to note that when studying the statistical properties of leveraging estimators there are two layers of randomness (from sampling and subsampling).

From Algorithm 1, we can easily see that

$$\tilde{\beta} = \left(X^{*T} \Phi^{*-1} X^*\right)^{-1} X^{*T} \Phi^{*-1} y^*.$$

This expression is difficult to analyze as it involves both the random subsample $(X^*, y^*)$ and the corresponding sampling probabilities $\pi^*$. Instead, we rewrite the weighted leveraging estimator in terms of the original data $(X, y)$,

$$\tilde{\beta} = \left(X^T W X\right)^{-1} X^T W y,$$

where $W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$ is an $n \times n$ diagonal random matrix, $w_i = \frac{k_i}{r \pi_i}$, and $k_i$ is the number of times that the $i$th data point appears in the random subsample (recall that step 1 in Algorithm 1 takes the subsample with replacement).

Clearly, the estimator $\tilde{\beta}$ can be regarded as a function of the random weight vector $w = (w_1, w_2, \ldots, w_n)^T$, denoted as $\tilde{\beta}(w)$. As we take the random subsample with replacement, it is easy to see that $w$ has a scaled multinomial distribution,

$$\Pr\left\{w_1 = \frac{k_1}{r\pi_1}, w_2 = \frac{k_2}{r\pi_2}, \ldots, w_n = \frac{k_n}{r\pi_n}\right\} = \frac{r!}{k_1! k_2! \cdots k_n!} \pi_1^{k_1} \pi_2^{k_2} \cdots \pi_n^{k_n},$$

with mean $Ew = 1$. By setting $w_0$, the vector around which we perform our Taylor series expansion, to be the all-ones vector, i.e., $w_0 = 1$, $\tilde{\beta}(w)$ can be expanded around the full sample OLS estimate $\hat{\beta}_{ols}$, i.e., $\tilde{\beta}(1) = \hat{\beta}_{ols}$. From this, we have the following lemma.
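Before stating the lemma, the claim that $Ew = 1$ is quick to check by simulation; the probabilities and subsample size below are arbitrary illustrations:

```python
import numpy as np

# (k_1, ..., k_n) ~ Multinomial(r, pi) gives E[k_i] = r * pi_i, hence
# E[w_i] = E[k_i / (r * pi_i)] = 1 for every i.
rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.3, 0.4])
r = 10
k = rng.multinomial(r, pi, size=100_000)   # 100,000 draws of the count vector
w = k / (r * pi)                           # broadcasts across draws
print(w.mean(axis=0))                      # each entry is close to 1.0
```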

LEMMA 1. Let $\tilde{\beta}$ be the weighted leveraging estimator. Then, a Taylor expansion of $\tilde{\beta}$ around the point $w_0 = 1$ yields

$$\tilde{\beta} = \hat{\beta}_{ols} + \left(X^T X\right)^{-1} X^T \mathrm{diag}\{\hat{e}\}(w - 1) + R_W, \qquad (7)$$

where $\hat{e} = y - X\hat{\beta}_{ols}$ is the OLS residual vector, and $R_W = o_p(\|w - w_0\|)$ is the Taylor expansion remainder.

Note that the detailed proofs of all lemmas and theorems can be found in Ref 8.

Given Lemma 1, we can establish the following lemma, which provides expressions for the conditional and unconditional expectations and variances of the weighted leveraging estimator.

LEMMA 2. The conditional expectation and conditional variance of the weighted leveraging estimator are given by:

$$E\{\tilde{\beta} \mid y\} = \hat{\beta}_{ols} + E\{R_W\}; \qquad (8)$$

$$\mathrm{Var}\{\tilde{\beta} \mid y\} = \left(X^T X\right)^{-1} X^T \mathrm{diag}\{\hat{e}\}\, \mathrm{diag}\left\{\frac{1}{r\pi}\right\} \mathrm{diag}\{\hat{e}\}\, X \left(X^T X\right)^{-1} + \mathrm{Var}\{R_W\}, \qquad (9)$$

where $W$ specifies the probability distribution used in the sampling and rescaling steps. The unconditional expectation and unconditional variance of the weighted leveraging estimator are given by:

$$E\{\tilde{\beta}\} = \beta_0; \qquad (10)$$

$$\mathrm{Var}\{\tilde{\beta}\} = \sigma^2 \left(X^T X\right)^{-1} + \frac{\sigma^2}{r} \left(X^T\ldots\right.$$