
  • Focus Article

Leveraging for big data regression

Ping Ma and Xiaoxiao Sun

Rapid advance in science and technology in the past decade brings an extraordinary amount of data, offering researchers an unprecedented opportunity to tackle complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing resources still lags far behind the exponential growth of databases. To facilitate scientific discoveries using current computing resources, one may use an emerging family of statistical methods, called leveraging. Leveraging methods are designed under a subsampling framework, in which one samples a small proportion of the data (subsample) from the full sample, and then performs intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. These methods stand as a unique development of their type in big data analytics and allow pervasive access to massive amounts of information without resorting to high performance computing and cloud computing. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014. doi: 10.1002/wics.1324

    Keywords: big data; subsampling; estimation; least squares; leverage

    INTRODUCTION

Rapid advance in science and technology in the past decade brings an extraordinary amount of data that were inaccessible just a decade ago, offering researchers an unprecedented opportunity to tackle much larger and more complex research challenges. The opportunity, however, has not yet been fully utilized, because effective and efficient statistical and computing tools for analyzing super-large datasets are still lacking. One major challenge is that the advance of computing technologies still lags far behind the exponential growth of databases. The cutting-edge computing technologies of today and the foreseeable future are high performance computing (HPC) platforms, such as supercomputers and cloud computing. Researchers have proposed the use of GPUs (graphics processing units),1 and explorations

    Correspondence to: pingma@uga.edu

    Department of Statistics, University of Georgia, Athens, GA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

of FPGA (field-programmable gate array)-based accelerators are ongoing in both academia and industry (for example, with TimeLogic and Convey Computers). However, none of these technologies by themselves can address the big data problem accurately and at scale. Supercomputers are precious resources and are not designed to be used routinely by everyone. The selection of projects to run on a supercomputer is very stringent. For example, resource allocation on the recently deployed Blue Waters supercomputer at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/) requires a rigorous National Science Foundation (NSF) review process and supports only a handful of projects. While clouds have large storage capabilities, cloud computing is not a panacea: it poses problems for developers and users of cloud software, requires large data transfers over precious low-bandwidth Internet uplinks, raises new privacy and security issues, and is an inefficient solution.2 While cloud facilities like MapReduce do offer some advantages, the overall issues of data transfer and usability of today's clouds will indeed limit what we can achieve; the




tight coupling between computing and storage is absent. A recently conducted comparison3 found that even high-end GPUs are sometimes outperformed by general-purpose multi-core implementations because of the time required to transfer data.

One option is to invent algorithms that make better use of a fixed amount of computing power.2

In this article, we review an emerging family of statistical methods, called leveraging methods, that are developed for achieving such a goal. Leveraging methods are designed under a subsampling framework, in which we sample a small proportion of the data (subsample) from the full sample, and then perform intended computations for the full sample using the small subsample as a surrogate. The key to the success of the leveraging methods is to effectively construct nonuniform sampling probabilities so that influential data points are sampled with high probabilities. Traditionally, subsampling has been used to refer to the m-out-of-n bootstrap, whose primary motivation is to make approximate inference owing to the difficulty or intractability of deriving analytical expressions.4,5 However, the general motivation of the leveraging methods is different from that of traditional subsampling. In particular, (1) leveraging methods are used to achieve feasible computation even if simple analytic results are available, (2) leveraging methods enable visualization of the data when visualization of the full sample is impossible, and (3) as we will see later, leveraging methods usually use unequal sampling probabilities for subsampling data, whereas traditional subsampling almost exclusively uses equal sampling probabilities. Leveraging methods stand as a unique development in big data analytics and allow pervasive access to massive amounts of information without resorting to HPC and cloud computing.

LINEAR REGRESSION MODEL AND LEVERAGING

In this article, we focus on the linear regression setting. The first effective leveraging method in regression was developed in Refs 6 and 7. The method uses the sample leverage scores to construct a nonuniform sampling probability. Thus, the resulting sample tends to concentrate more on important or influential data points for the ordinary least squares (OLS).6,7 The estimate based on such a subsample has been shown to provide an excellent approximation to the OLS estimate based on the full data (when p is small and n is large). Ma et al.8,9 studied the statistical properties of the basic leveraging method in Refs 6 and 7 and proposed two novel methods to improve the basic leveraging method.

Linear Model and Least Squares Problem

Given $n$ observations $(x_i^T, y_i)$, where $i = 1, \ldots, n$, we consider a linear model,

$$y = X\beta + \epsilon, \qquad (1)$$

where $y = (y_1, \ldots, y_n)^T$ is the $n \times 1$ response vector, $X = (x_1, \ldots, x_n)^T$ is the $n \times p$ predictor or design matrix, $\beta$ is a $p \times 1$ coefficient vector, and $\epsilon$ is zero-mean random noise with constant variance. The unknown coefficient $\beta$ can be estimated via the OLS, which is to solve

$$\hat{\beta}_{ols} = \arg\min_{\beta} \|y - X\beta\|^2, \qquad (2)$$

where $\|\cdot\|$ represents the Euclidean norm. When matrix $X$ is of full rank, the OLS estimate $\hat{\beta}_{ols}$ of $\beta$, the minimizer of (2), has the following expression,

$$\hat{\beta}_{ols} = \left(X^T X\right)^{-1} X^T y. \qquad (3)$$

When matrix $X$ is not of full rank, the inverse of $X^T X$ in expression (3) is replaced by a generalized inverse. The predicted response vector is

$$\hat{y} = H y, \qquad (4)$$

where the hat matrix $H = X\left(X^T X\right)^{-1} X^T$. The $i$th diagonal element of the hat matrix $H$, $h_{ii} = x_i^T \left(X^T X\right)^{-1} x_i$, is called the leverage score of the $i$th observation. It is well known that as $h_{ii}$ approaches 1, the predicted response $\hat{y}_i$ of the $i$th observation gets close to $y_i$. Thus the leverage score has been regarded as the most important influence index, indicating how influential the $i$th observation is to the least squares estimator.
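For concreteness, below is a minimal sketch of how the leverage scores $h_{ii}$ can be computed without forming the full hat matrix. The function name and the SVD-based route are illustrative choices, not prescribed by the article.

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i, computed from a thin SVD.

    With X = U S V^T, the hat matrix is H = U U^T, so its diagonal equals the
    row-wise sum of squares of U. Dropping near-zero singular values covers the
    rank-deficient (generalized inverse) case mentioned above.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > 1e-12 * s.max()]     # keep directions with nonzero singular values
    return np.sum(U**2, axis=1)       # diag(U U^T), i.e., the leverage scores

# Quick check on simulated data: the leverage scores sum to rank(X) = p
rng = np.random.default_rng(0)
n, p = 10_000, 5
X = rng.standard_normal((n, p))
h = leverage_scores(X)
assert np.isclose(h.sum(), p)
```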

Leveraging

When sample size $n$ is super large, the computation of the OLS estimator using the full sample may become extremely expensive. For example, if $p = \sqrt{n}$, the computation of the OLS estimator is $O(n^2)$, which may be infeasible for super large $n$. Alternatively, one may opt to use leveraging methods, which are presented below.

There are two types of leveraging methods, weighted leveraging methods and unweighted leveraging methods. Let us look at the weighted leveraging methods first.



    Algorithm 1. Weighted Leveraging

Step 1. Taking a random subsample of size $r$ from the data. Construct sampling probabilities $\pi = \{\pi_1, \ldots, \pi_n\}$ for the data points. Draw a random subsample (with replacement) of size $r \ll n$, denoted as $(X^*, y^*)$, from the full sample according to the probabilities $\pi$. Record the corresponding sampling probability matrix $\Phi^* = \mathrm{diag}\{\pi_k^*\}$.

Step 2. Solving a weighted least squares (WLS) problem using the subsample. Estimate $\beta$ by solving a WLS problem on the subsample to get the estimate $\tilde{\beta}$, i.e., solve

$$\arg\min_{\beta} \left\|\Phi^{*-1/2} y^* - \Phi^{*-1/2} X^* \beta\right\|^2. \qquad (5)$$
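As a concrete illustration, here is a minimal sketch of Algorithm 1 for a given vector of sampling probabilities; the helper name `weighted_leveraging_ols` and the use of NumPy are assumptions for illustration, not part of the article.

```python
import numpy as np

def weighted_leveraging_ols(X, y, pi, r, seed=None):
    """Algorithm 1 sketch: subsample r rows with probabilities pi, then solve (5)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=pi)   # Step 1: draw the subsample
    w = 1.0 / np.sqrt(pi[idx])                        # diagonal of Phi*^{-1/2}
    X_star = w[:, None] * X[idx]                      # Phi*^{-1/2} X*
    y_star = w * y[idx]                               # Phi*^{-1/2} y*
    beta_tilde, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)  # Step 2: WLS solve
    return beta_tilde
```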

In the weighted leveraging methods (and the unweighted leveraging methods to be reviewed below), the key component is to construct the sampling probabilities $\pi$. An easy choice is $\pi_i = 1/n$ for each data point, i.e., uniform sampling probabilities. The resulting leveraging estimator is denoted as UNIF. The leveraging algorithm with uniform subsampling probabilities is very simple to implement but performs poorly in many cases; see Figure 1. The first nonuniform weighted leveraging algorithm was developed by Drineas et al.6,7 In their weighted leveraging algorithm, they construct the sampling probabilities $\pi_i$ using the (normalized) sample leverage scores as $\pi_i = h_{ii} / \sum_i h_{ii}$. The resulting leveraging estimator is denoted as BLEV, which yields impressive results: a preconditioned version of this leveraging method beats LAPACK's direct dense least squares solver by a large margin on essentially any dense tall matrix.10
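Continuing the sketches above (and reusing the hypothetical `leverage_scores` and `weighted_leveraging_ols` helpers), the two probability choices just described could be compared roughly as follows; the simulated data and subsample size are invented for illustration only.

```python
rng = np.random.default_rng(1)
n, p, r = 100_000, 10, 500
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

pi_unif = np.full(n, 1.0 / n)     # UNIF: uniform subsampling probabilities
h = leverage_scores(X)
pi_blev = h / h.sum()             # BLEV: pi_i = h_ii / sum_j h_jj

beta_unif = weighted_leveraging_ols(X, y, pi_unif, r, seed=2)
beta_blev = weighted_leveraging_ols(X, y, pi_blev, r, seed=2)
# Both approximate the full-data OLS estimate; BLEV concentrates on influential rows.
```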

An important feature of the weighted leveraging method is, in step 2, to use the sampling probabilities as weights when solving the least squares problem on the subsample.
