distributed nonparametric and semiparametric regression on...

Research ArticleDistributed Nonparametric and Semiparametric Regression onSPARK for Big Data Forecasting

Jelena Fiosina andMaksims Fiosins

Clausthal University of Technology Clausthal-Zellerfeld Germany

Correspondence should be addressed to Jelena Fiosina jelenafiosinagmailcom

Received 22 July 2016 Accepted 22 November 2016 Published 8 March 2017

Academic Editor Francesco Carlo Morabito

Copyright copy 2017 Jelena Fiosina and Maksims Fiosins This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Forecasting in big datasets is a common but complicated task which cannot be executed using the well-known parametric linearregression However nonparametric and semiparametric methods which enable forecasting by building nonlinear data modelsare computationally intensive and lack sufficient scalability to cope with big datasets to extract successful results in a reasonabletimeWe present distributed parallel versions of some nonparametric and semiparametric regressionmodels We usedMapReduceparadigm and describe the algorithms in terms of SPARK data structures to parallelize the calculations The forecasting accuracyof the proposed algorithms is compared with the linear regression model which is the only forecasting model currently havingparallel distributed realization within the SPARK framework to address big data problems The advantages of the parallelization ofthe algorithm are also provided We validate our models conducting various numerical experiments evaluating the goodness offit analyzing how increasing dataset size influences time consumption and analyzing time consumption by varying the degree ofparallelism (number of workers) in the distributed realization

1 Introduction

The most current methods of data analysis data miningand machine learning should deal with big databases CloudComputing technologies can be successfully applied to par-allelize standard data mining techniques in order to makeworking with massive amounts of data feasible [1] For thispurpose standard algorithms should often be redesignedfor parallel environment to distribute computations amongmultiple computation nodes

One such approach is to use Apache Hadoop whichincludes MapReduce for job distribution [2] and distributedfile system (HDFS) for data sharing among nodes

Recently a new and efficient framework called ApacheSPARK [3] was built on top of Hadoop which allows moreefficient execution of distributed jobs and therefore is verypromising for big data analysis problems [4] HoweverSPARK is currently in the development stage and the numberof standard data analysis libraries is limited

R software is a popular instrument for data analystsIt provides several possibilities for parallel data processing

through the add-on packages [5] It is possible also to useHadoop and SPARK inside of R using SPARKR This is an Rpackage that provides a lightweight front-end to use ApacheSPARK within R SPARKR is still in the developing stage andsupports only some features of SPARK but has a big potentialfor the future of data science [6]

There exist also alternative parallelization approachessuch as Message Passing Interface (MPI) [7] However in thepresent paper we will concentrate on SPARK because of itsspeed simplicity and scalability [8]

In this study we consider regression-based forecastingfor the case where the data has a nonlinear structurewhich is common in real-world datasets This implies thatlinear regression cannot make accurate forecasts and thuswe resort to nonparametric and semiparametric regressionmethods which do not require linearity and are more robustto outliers However themain disadvantage of thesemethodsis that they are very time-consuming and therefore theterm ldquobig datardquo for such methods starts much earlier thanwith parametrical approaches In the case of big datasets

HindawiApplied Computational Intelligence and So ComputingVolume 2017 Article ID 5134962 13 pageshttpsdoiorg10115520175134962

2 Applied Computational Intelligence and Soft Computing

traditional nonparallel realizations are not capable of pro-cessing all the available data This makes it imperative toadapt to existing techniques and to develop new ones thatovercome this disadvantage The distributed parallel SPARKframework gives us the possibility of addressing this dif-ficulty and increasing the scalability of nonparametric andsemiparametric regression methods allowing us to deal withbigger datasets

There are some approaches in the current literature toaddress nonparametric or semiparametric regression modelsfor parallel processing of big datasets [9] for example usingR add-on packages MPI Our study examines a novel fastparallel and distributed realization of the algorithms basedon the modern version of Apache SPARK which is a prom-ising tool for the efficient realization of different machinelearning and data mining algorithms [3]

The main objective of this study is to enable a paralleldistributed version of nonparametric and semiparametricregression models particularly kernel-density-based andpartial linear models to be applied on big data To realizethis a SPARK MapReduce based algorithm has been devel-oped which splits the data and performs various algorithmprocesses in parallel in the map phase and then combines thesolutions in the reduce phase to merge the results

More specifically the contribution of this study is (i) todesign novel distributed parallel kernel density regressionand partial linear regression algorithms over the SPARKMapReduce paradigm for big data and (ii) to validate thealgorithms analyzing their accuracy scalability and speed-up by means of numerical experiments

The remainder of this paper is organized as followsSection 2 reviews the traditional regression models to beanalyzed Section 3 reviews the existent distributed compu-tation frameworks for big datasets In Section 4 we proposeparallel versions of kernel-density-based and partial linearregression model algorithms based on SPARK MapReduceparadigm In Section 5 we present the experimental setupand in Section 6 we discuss the experimental framework andanalysis Section 7 concludes the paper and discusses futureresearch opportunities

2 Background Regression Models

21 Linear Multivariate Regression We start with linearregression which is the only regression model realized inthe current version of SPARK to compare the results of theproposed methods

Let us first consider the classical multivariate linearregression model 119864(119884 | X) = X120573 [10 11]

Y = X120573 + 120576 (1)

where 119899 is a number of observations 119889 is the number offactors Y119899times1 is a vector of dependent variables 120573119889times1 is avector of unknown parameters 120576119899times1 is a vector of randomerrors andX119899times119889 is amatrix of explanatory variablesThe rowsof the matrix X correspond to observations and the columnscorrespond to factors We suppose that 120576119894 are mutuallyindependent and have zero expectation and equal variances

(1) while not converged do(2) for all 119895 isin 0 119889 do(3) 120573119895 larr 120573119895 minus 120572 120597120597120573119895 119869(XY)(4) end for(5) end while

Algorithm 1 Stochastic Gradient Descent algorithm

The well-known least square estimator (LSE) of 120573 is

= (X119879X)minus1 X119879Y (2)

Further let (x119894 119910119894)119899119894=1 be the observations sampled fromthe distribution of (X 119884) After the estimation of the param-eters 120573 we can make a forecast for a certain 119896th (future) timemoment as 119864(119884119896) = x119896 where x119896 is a vector of observedvalues of explanatory variables for the 119896th time moment

For big data it is a problem to perform the matrix opera-tions in (2) For this purpose other optimization techniquescan be used One effective option is to use the StochasticGradient Descent algorithm [12] which is realized in SPARKThe generalized cost function to be minimized is119869 (120573XY) = 12119899 (X120573 minus Y)119879 (X120573 minus Y) (3)

Algorithm 1 presents the Stochastic Gradient Descentalgorithm where 120572 is a learning rate parameter

Algorithm 1 can be executed iteratively (incrementally)that is easy to parallelize

22 Kernel Density Regression The first alternative to thelinear regression model we want to consider is the kerneldensity estimation-based regression which is of big impor-tance in the case when data have a nonlinear structure Thisnonparametric approach for estimating a regression curvehas four main purposes First it provides a versatile methodfor exploring a general relationship between two variablesSecond it can predict observations without a reference toa fixed parametric model Third it provides a tool forfinding spurious observations by studying the influence ofisolated points Fourth it constitutes a flexible method ofsubstitution or interpolation between adjacent 119883-values formissing values [13]

Let us consider a nonparametric regression model119898(119909) = 119864(119884 | 119883 = 119909) [14] with the same 119884 119883 and 120576 asfor a linear model 119910 = 119898 (119909) + 120576 (4)

The Nadaraya-Watson kernel estimator [15 16] of119898(119909) is119898119899 (119909) = sum119899119894=1119870((119909 minus 119909119894) ℎ) 119910119894sum119899119894=1119870((119909 minus 119909119894) ℎ)= sum119899119894=1119870ℎ (119909 minus 119909119894) 119910119894sum119899119894=1119870ℎ (119909 minus 119909119894) (5)

Applied Computational Intelligence and Soft Computing 3

where119870(119906) is the kernel function of 119877119889 which is a nonnega-tive real-valued integrable function satisfying the followingtwo requirements intinfin

minusinfin119870(119906) 119889119906 = 1 and 119870(minus119906) = 119870(119906)

for all values of 119906 ℎ gt 0 is a smoothing parameter calledbandwidth119870ℎ(119906) = (1ℎ)119870(119906ℎ) We can see that each valueof the historical forecast 119910119894 is taken with some weight ofthe corresponding independent variable value of the sameobservation 119909119894

In a multidimensional case 119864(119884 | X) = 119898(X) the kernelfunction119870H(x) = |H|minus1119870(Hminus1x) whereH is a 119889times119889matrix ofsmoothing parameters The choice of the bandwidth matrixH is the most important factor affecting the accuracy of theestimator since it controls the orientation and amount ofsmoothing induced The simplest choice is H = ℎI119889 whereℎ is a unidimensional smoothing parameter and I119889 is 119889 times 119889identitymatrixThenwe have the same amount of smoothingapplied in all coordinate directions Another relatively easyto manage choice is to take the bandwidth matrix equal toa diagonal matrix H = diag(ℎ1 ℎ2 ℎ119889) which allows fordifferent amounts of smoothing in each of the coordinatesWe implemented the latter in our experiments The multidi-mensional kernel function 119870(u) = 119870(1199061 1199062 119906119889) then iseasy to present with univariate kernel functions as 119870(u) =119870(1199061) sdot119870(1199062) sdot sdot119870(119906119889) We used the Gaussian kernel in ourexperiments Then (5) for the multidimensional case can berewritten as119898119899 (x119896) = sum119899119894=1119870H (x119896 minus x119894) 119910119894sum119899119894=1119870H (x119896 minus x119894)= sum119899119894=1prod119889119895=1119870H (x119896119895 minus x119894119895) 119910119894sum119899119894=1prod119889119895=1119870H (x119896119895 minus x119894119895) (6)

An important problem in a kernel density estimationis the selection of the appropriate bandwidth ℎ It has aninfluence on the structure of the neighborhood in (5) thebigger the value of ℎ selected the higher the significantinfluence that 119909119894 points have on the estimator 119898119899(x119896) In themultidimensional caseH also regulates the balance betweenfactors The most popular heuristic methods of bandwidthselection are plug-in and resampling methods Howeverthose heuristics require a substantial amount of computa-tions which is not possible for big datasets For our casewe used a well-known rule-of-thumb approach proposedby Silverman [17] which works for Gaussian kernels In thecase of no correlation between explanatory variables thereis a simple and useful formula for bandwidth selection withScottrsquos rule [18] ℎ119895 = 119899minus1(119889+4)120590119895 119895 = 1 2 119889 where 120590119895 is avariance of the 119895th factor

It is known [14] that the curse of dimensionality is oneof the major problems that arises when using nonparametricmultivariate regression techniques For the practitioner anadditional problem is that for more than two regressorsgraphical illustrations or interpretations of the results arehard or even impossible Truly multivariate regression mod-els are often far too flexible and general for making a detailedinference

A very suitable property of the kernel function is itsadditive natureThis property makes the kernel function easy

to use for distributed data [13] Unfortunately such modelseven with current parallelization possibilities remain verytime-consuming

23 Partial Linear Models Another alternative to the linearregressionmodel is a semiparametric partial linear regressionmodel (PLM) Currently several efforts have been allocatedto developing methods that reduce the complexity of highdimensional regression problems [14] The models alloweasier interpretation of the effect of each variable and may bepreferable to a completely nonparametric model These referto the reduction of dimensionality and provide an allowancefor partly parametric modelling On the other hand PLMsaremore flexible than the standard linearmodel because theycombine both parametric and nonparametric componentsIt is assumed that the response variable Y depends onvariable U in a linear way but is nonlinearly related to otherindependent variables T [19] The resulting models can begrouped together as so-called semiparametric models PLMof regression consists of two additive components a linearand a nonparametric part PLM of regression is given as119864 (119884 | UT) = U120573 + 119898 (T) (7)

where 120573119901times1 is a finite dimensional vector of parameters of alinear regression part and 119898(sdot) is a smooth function Herewe assume the decomposition of the explanatory variables Xinto two vectors U and T The vector U denotes a 119901-variaterandom vector which typically covers categorical explanatoryvariables or variables that are known to influence the indexin a linear way The vector T is a 119902-variate random vectorof continuous explanatory variables that is to be modelledin a nonparametric way so 119901 + 119902 = 119889 Economic theory orintuition should ideally guide the inclusion of the regressorsin U or T respectively

An algorithm for the estimation of the PLMs was pro-posed in [13] which is based on the likelihood estimator andknown as generalized Speckman [20] estimatorWe reformu-lated this algorithm (Algorithm 2) in terms of functions anddata structures which can be easily parallelizable in Section 4119875119871119872 function is the primary function of Algorithm 2which takes as parameters the training dataset [YUT]bandwidth vector h and a test dataset [U1015840T1015840] The first stepin the estimation is to execute the function 119878119898119900119900119905ℎ119872119886119905119903119894119909to compute a smoother matrix S based on the trainingdata of the nonlinear parts T and h This helps us toobtain the smoother matrix which transforms the vectorof observations into fitted values Next we estimate thelinear coefficients of the linear part of the model with the119871119894119899119890119886119903119862119900119890119891119891119894119888119894119890119899119905119904 function First we take into accountthe influence of the nonlinear part on the linear part ofthe independent variable U larr (I minus S)U and on thedependent variable Y larr (Iminus S)Y Then we use the ordinaryequation (2) or Algorithm 1 to obtain the linear coefficientsWith these coefficients we calculate the linear part of theforecast Next we calculate the nonlinear part of the forecastusing Nadaraya-Watson estimator (6) Here we recalculatea smoother matrix taking into account the test-set data ofthe nonlinear part We take also into account the influence


(1) function SmoothMatrix(TT1015840 h)(2) 119899 larr 119903119900119908119904(T) 1198991015840 larr 119903119900119908119904(T1015840)(3) for all 119894 isin 1 119899 119895 isin 1 1198991015840 do(4) W 119882119894119895 larr 119870119867(T119894 minus T1015840119895) S 119878119894119895 larr119882119894119895sum119899119894=1119882119894119895(5) end for(6) return S(7) end function(8)(9) function LinearCoefficients(S [YUT])(10) U larr (I minus S)U Y larr (I minus S)Y(11) return 120573larr (U119879U)minus1U119879Y(12) end function(13)(14) function KernelPart(120573 [YUT] hT1015840)(15) S larr 119878119898119900119900119905ℎ119872119886119905119903119894119909(TT1015840 h)(16) return 119898(T1015840) larr S(Y minus U120573)(17) end function(18)(19) function PLM([YUT] h [U1015840T1015840])(20) Smooth matrix S larr 119878119898119900119900119905ℎ119872119886119905119903119894119909(TT h)(21) 120573larr 119871119894119899119890119886119903119862119900119890119891119891119894119888119894119890119899119905119904(S [YUT])(22) 119898(T1015840) larr 119870119890119903119899119890119897119875119886119903119905(120573 [YUT] hT1015840)(23) return Y1015840 larr U1015840120573 + 119898(T1015840)(24) end function

Algorithm 2 PLM estimation training set [YUT] test set [Y1015840U1015840T1015840]of the linear part Y minus U120573 Finally we sum the linear andnonlinear part of the forecast to obtain the final result Y1015840 larrU1015840120573 + 119898(T1015840)3 Background MapReduce Hadoopand SPARK

MapReduce [2] is one of the most popular programmingmodels to deal with big data It was proposed by Google in2004 and was designed for processing huge amounts of datausing a cluster of machines The MapReduce paradigm iscomposed of two phases map and reduce In general termsin the map phase the input dataset is processed in parallelproducing some intermediate results Then the reduce phasecombines these results in some way to form the final out-put The most popular implementation of the MapReduceprogramming model is Apache Hadoop an open-sourceframework that allows the processing and managementof large datasets in a distributed computing environmentHadoop works on top of the Hadoop Distributed File System(HDFS) which replicates the data files inmany storage nodesfacilitates rapid data transfer rates among nodes and allowsthe system to continue operating uninterruptedly in case of anode failure

Another Apache project that also uses MapReduce as aprogramming model but with much richer APIs in JavaScala Python and R is SPARK [3] SPARK is intended toenhance not replace the Hadoop stack SPARK is more thana distributed computational framework originally developedin the UC Berkeley AMP Lab for large-scale data processingthat improves the efficiency by the use of intensivememory It

also provides several prebuilt components empowering usersto implement applications faster and easier SPARKusesmoreRAM instead of network and disk IO and is relatively fast ascompared to Hadoop MapReduce

Froman architecture perspective Apache SPARK is basedon two key concepts Resilient Distributed Datasets (RDDs)and directed acyclic graph (DAG) execution engine Withregard to datasets SPARK supports two types of RDDsparallelized collections that are based on existing Scalacollections and Hadoop datasets that are created from thefiles stored by the HDFS RDDs support two kinds of opera-tions transformations and actions Transformations createnew datasets from the input (eg map or filter operationsare transformations) whereas actions return a value afterexecuting calculations on the dataset (eg reduce or countoperations are actions) The DAG engine helps to eliminatethe MapReduce multistage execution model and offers sig-nificant performance improvements

RDDs [21] are distributed memory abstractions thatallow programmers to perform in-memory computations onlarge clusters while retaining the fault tolerance of data flowmodels like MapReduce RDDs are motivated by two typesof applications that current data flow systems handle ineffi-ciently iterative algorithms which are common in graphapplications and machine learning and interactive datamining tools In both cases keeping data in memory canimprove performance by an order of magnitude To achievefault tolerance efficiently RDDs provide a highly restrictedform of shared memory they are read-only datasets that canonly be constructed through bulk operations on other RDDsThis in-memory processing is a faster process as there is


Spark master(driver)

Cluster manager

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode

Distributed storage platform

middot middot middot


Figure 1 SPARK distributed architecture

no time spent in moving the dataprocesses in and out ofthe disk whereas MapReduce requires a substantial amountof time to perform these IO operations thereby increasinglatency

SPARK uses a masterworker architecture There is adriver that talks to a single coordinator called themaster thatmanages workers in which executors run (see Figure 1)

SPARK uses HDFS and has high-level libraries for streamprocessing machine learning and graph processing suchas MLlib [22] For this study linear regression realizationincluded in MLlib [23] was used to evaluate the proposeddistributed PLM algorithm

In this paper Apache SPARK is used to implementthe proposed distributed PML algorithm as described inSection 4

4 Distributed Parallel Partial LinearRegression Model

41 Distributed Partial Linear Model Estimation In thissection we continue to discuss kernel-density-based andpartial linear models (PLMs) described in Section 2 Nextwe consider kernel-density-based regression as a specificcase of PLM when the parametric linear part is emptyThe general realization of PLM algorithm allows us to con-duct experiments with nonparametric kernel-density-basedregression We discuss how to distribute the computationsof Algorithm 2 to increase the speed and scalability ofthe approach As we can see from the previous subsectionPLM presupposes several matrix operations which requiresa substantial amount of computational demands [24] andtherefore is not feasible for application to big datasets Inthis section we focus our attention on (1) how to organizematrix computations in a maximally parallel and effectivemanner and (2) how to parallelize the overall computationprocess of the PLM estimation SPARK provides us withvarious types of parallel structures Our first task is toselect the appropriate SPARK data structures to facilitate thecomputation process Next we develop algorithms whichexecute a PLM estimation (Algorithm 2) using MapReduceand SPARK principles

Driver Worker iPartial linear model training

Preprocessesand sharesAccesses RDD

Smoother matrix

Shares RDD

12

3

456

7

8

9

1011

12

Collects and

Accesses RDD

Computes and shares

Initializes standardparallel computing of 훽

Accesses RDD 훽

Computes 훽 parallely

[퐘(i) 퐔(i) 퐓(i)]

broadcasts [퐘 퐔 퐓]

RDD [퐘(i) 퐔(i) 퐓(i)]

Receives [퐘 퐔 퐓]퐖(i)

RDD 퐔(i) = (퐈 minus 퐒(i))퐔(i)

퐘(i) = (퐈 minus 퐒(i))퐘(i)

and collects 퐘 퐔

퐈 minus 퐒(i)

sumW(i)⟨j⟩

퐒(i)kj

= 퐖(i)kj

sumW(i)⟨j⟩

Figure 2 PLM parallel training

Traditional methods of data analysis especially basedon matrix computations must be adapted for running on acluster as we cannot readily reuse linear algebra algorithmsthat are available for single-machine situations A key idea isto distribute operations separating algorithms into portionsthat require matrix operations versus vector operationsSince matrices are often quadratically larger than vectors areasonable assumption is that vectors are stored in memoryon a single machine while for matrices it is not reasonableor feasible Instead the matrices are stored in blocks (rectan-gular pieces or some rowscolumns) and the correspondingalgorithms for block matrices are implemented The mostchallenging of tasks here is the rebuilding of algorithms fromsingle-core modes of computation to operate on distributedmatrices in parallel [24]

Similar to a linear regression forecasting process PLMforecasting because of its parametric component assumestraining and estimating steps Distributed architectures ofHadoop and SPARK assume one driver computer and severalworker computers which perform computations in parallelWe take this architecture into account while developing ouralgorithms

Let us describe the training procedurewith the purpose ofcomputing the 120573 parameter presented in Figure 2 This partwas very computationally intensive and could not be calcu-lated without appropriate parallelization of computations

First the driver computer reads the training data[YUT] from a file or database divides the data intovarious RDD partitions and distributes among the workers[Y(119894)U(119894)T(119894)] Next the training process involves the follow-ing steps

(1) Each worker makes an initial preprocessing andtransformation of [Y(119894)U(119894)T(119894)] which include scal-ing and formatting


Driver

Partial linear model forecasting

Broadcasts the

and a new observationof k records

Receives

Smoother vectorComputes and

Computes

Smoother

Computes

1

2

3

4

5

6

8

7

9

10

1112

13

Collects and

Worker i

coefficients 훽

[퐮㰀 퐭㰀]

Accesses m(퐭㰀)(i)


ith RDD part

훽 [퐮㰀 퐭㰀]퐖(i)

vector 퐒(i)

퐔(i)훽 퐘(i) minus 퐔(i)훽

m(퐭㰀)(i) = 퐒(i)(퐘(i) minus 퐔(i)훽)and shares RDD m(t㰀)(i)

broadcasts sumi sumW(i)

퐒(i)j = 퐖(i)

j sumi sumW(i)

sumi sumW(i)Receives

sumW(i)Receives

shares RDD sumW(i)

y㰀 = u㰀훽 + sumim(t㰀)(i)

Figure 3 PLM parallel forecasting

(2) The driver program accesses all preprocessed data(3) The driver program collects all the preprocessed data

and broadcasts them to all the workers(4) Then each worker makes its own copy of the training

data [YUT](5) A smoother matrixW is computed in parallel Since

the rows ofW are divided into partitions each workerthen computes its partW(119894)

(6) Each worker computes the sum of elements in eachrow within its partition W(119894) sum119895W(119894)119895 which is thensaved as a RDD

(7) We do not actually need the normalized matrix Sitself which could be computed as W(119894)119895 sum119895W(119894)119895 but only (I minus S) as RDD which is the part directlycomputed by each worker (I minus S(119894))

(8) Using the 119894th part ofmatrix (IminusS) wemultiply it fromthe right side with the corresponding elements ofmatrixU Thus we obtain the 119894th part of transformedmatrix U(119894) as a RDD The 119894th part of transformedmatrix Y(119894) as RDD is computed analogously

(9) Finally the driver program accesses and collectsmatrices U and Y

(10) The driver program initializes computing of the stan-dard linear regression algorithm LinearRegression-ModelWithSGD which is realized in SPARK

(11) The parameters 120573 are computed in parallel

(12) Then 120573 are accessed by the driver computer

Let us now describe the forecasting process (Figure 3)which occurs after the parametric part of the PLM hascompleted its estimation Note that the kernel part of themodel is nonparametric and is directly computed for eachforecasted data point

Since the forecasting is performed after the trainingprocess then the preprocessed training set is already dis-tributed among the workers and is available as a RDD ThePLMdistributed forecasting procedure includes the followingsteps

(1) The driver program broadcasts the estimated linearpart coefficients 120573 to the workers

(2) Receiving new observation [u1015840 t1015840] the driver pro-gram broadcasts it to all the workers

(3) Each worker receives its copy of [u1015840 t1015840] and 120573(4) We compute a smoother matrix W which is in a

vector form This computation is also partitionedamong the workers (see (5)) so the 119894th workercomputes W(119894) as particular columns ofW

(5) Each worker computes partial sums of the rows of thematrixW(119894) elements and shares the partial sums withthe driver program

(6) The driver program accesses the partial sums andperforms reducing step to obtain the final sumsum119894W(119894)

(7) The driver program broadcasts the final sum to all theworkers

(8) Each worker receives the final sum sum119894W(119894)(9) Each worker computes its columns of the smoother

matrix S(119894)

(10) Based on the RDD part [Y(119894)U(119894)T(119894)] (known fromthe training step) each worker computes the linearpart of the forecast for training data It was necessaryto identify the kernel part of the forecast performingsubtraction of the linear part from the total forecastthe values of which were known from the trainingset

(11) Each worker computes the kernel part of the forecast119898(t1015840)(119894) as RDD and shares with the driver pro-gram

(12) The driver program accesses the partial kernel fore-casts119898(t1015840)(119894)

(13) The driver program performs the reducing step tomake the final forecast combining the accessed kernelparts and computing the linear part y1015840 = u1015840120573 +sum119894119898(t1015840)(119894)


42 Case Studies We illustrate the distributed execution ofPLM algorithm with a simple numerical example First thedriver computer reads the training data [YUT]

[YUT] =((

Y U T12341 22 32 12 2

1 22 32 11 1))

(8)

We suppose that the algorithm is being executed inparallel on two workers The driver accesses [YUT] andcreates the corresponding RDDs

First data preprocessing takes place This job is sharedbetween the workers in a standard way and the result isreturned to the driverWe do not change our data at this stagefor illustrative purposes

The goal of the training process is to compute thecoefficients 120573 of the linear part First the matrices [YUT]are divided into the parts by rows[YUT] = (Y(1)U(1)T(1)

Y(2)U(2)T(2))=((((((

Y(1) U(1) T(1)1 1 2 1 22 2 3 2 3Y(1) U(2) T(2)3 2 1 2 14 2 2 1 1

))))))

(9)

Each worker 119894 accesses its part of data [Y(119894)U(119894)T(119894)] andit has access to the whole data

Each worker computes its part of the smoother matrixS(119894) in particular

S = (S(1)S(2)

) = (119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(1)T)119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(2)T)) (10)

The first worker obtains the elements of smoother matrixS(1) (it gets directly matrix I minus S(119894)) the correspondingelements of matrix W(1) should be computed first Forexample to get the element W(1)13 according to the function119878119898119900119900119905ℎ119872119886119905119903119894119909

W(1)13 = 119870119867 (T(1)1 minus T3) = 119870119867 ((1 2) minus (2 1))= 119870119867 ((minus1 1)) = 119870 ((minus2 2)) = 0007 (11)

Then matrixW for both workers is the following

W = (W(1)W(2)

) = (0399 0007 0007 00540007 0399 0000 00000007 0000 0399 00540054 0000 0054 0399) (12)

To get the element (I minus S(1))13 knowing matrix W(1)according to the function SmoothMatrix we compute(I minus S(1))

13= minus 0007(0399 + 0007 + 0007 + 0005)= minus0016 (13)

Then matrix (I minus S) for both workers is the following(I minus S) = ((I minus S)(1)(I minus S)(2))= ( 0147 minus0016 minus0016 minus0115minus0018 0018 minus0000 minus0000minus0016 minus0000 0133 minus0117minus0106 minus0000 minus0107 0213 ) (14)

Then the corresponding smoothed matrices U and Ycomputed by both workers are

U = (U(1)U(2)

) = (minus0147 minus00000018 00190016 minus01340106 0106 ) Y = (Y(1)

Y(2)) = (minus03930018minus00850426 )

(15)

Matrices U(1) and U(2) are sent to the driver and collectedinto the matrix U but Y(1) and Y(2) are collected into thevector Y

The driver program calls the standard procedure ofregression coefficient calculation which is shared betweenworkers in a standard way The resulting coefficients 120573 arecollected on the driver

120573 = (120573012057311205732) = (minus000227531041 ) (16)

At the forecasting stage each worker receives a set ofpoints for forecasting Now we illustrate the algorithm for asingle point [u1015840 t1015840] = [(2 1) (1 1)] and two workers

Each worker 119894 accesses its part of training data[Y(119894)U(119894)T(119894)] and computes then the elements of matrixS(119894) = 119878119898119900119900119905ℎ119872119886119905119903119894119909(T(119894) t1015840) First smoother matrix W(119894)should be computed For example worker 1 computes theelement

W(1)11 = 119870119867 ((1 2) minus (2 1)) = 119870119867 ((minus1 1))= 119870 ((minus2 2)) = 0007 (17)

So workers 1 and 2 obtain matrices

W = (W(1)W(2)

) = (0007 00000399 0054) (18)


Now each worker computes partial sums (so for worker1W(1)11 +W(1)12 = 0007 + 0000 = 0007 for worker 2 0399 +0054 = 0453) and shares them to the driver program Thedriver computes the total sum 0007+0453 = 046 and sendsit to the workers

Now each worker computes matrix S(119894) For exampleworker 1 computes the element S(1)11 = 0007046 = 0016

Then matrices S(119894) for both workers are

S = (S(1)S(2)

) = (0016 00000867 0117) (19)

Now each worker computesU(119894)120573 and the correspondingdifference Y minus U(119894)120573

U120573 = (U(1)120573U(2)120573) = (483863655759)

Y minus U120573 = (Y(1) minus U(1)120573Y(2) minus U(2)120573

) = (minus383minus663minus355minus359) (20)

Each worker computes the smoothed value of the kernelpart of the forecast119898(119894) (t1015840) = S(119894) (Y(119894) minus U(119894)120573) (21)

so 119898(1)(t1015840) = minus00628 and 119898(2)(t1015840) = minus349 These values aresharedwith the driver program It computes their sum whichis considered as a kernel part of the prediction119898(t1015840) = minus356

The driver program computes the linear part of theprediction u1015840120573 = 379 and the final forecast y1015840 = u1015840120573 +119898(t1015840) = 0236

In the next sections we evaluate the performance of theproposed algorithms by various numerical experiments

5 Experimental Setup

51 Performance Evaluation Metrics

511 Parameters In our study we divided the availabledatasets into two parts training dataset and test dataset Thetraining dataset was used to train the model and the testdataset was used to check the accuracy of the resultsThe sizesof the training and test datasets were the important param-eters as they influenced accuracy scalability and speed-upmetrics

Another important parameterwas the level of parallelismwhich we considered in the number of cores used We variedthis parameter between 1 (no parallelism) and 48 cores

We also considered processing (learning) time as animportant parameter to compare the methods In most ofour experiments we varied one parameter and fixed theremaining parameters

We conducted three kinds of experiments

512 Accuracy Experiments In the accuracy experiments weevaluated the goodness of fit that depended on the sizes oftraining and test datasets As the accuracy criterion we usedthe coefficient of determination 1198772 which is usually a qualitymetric for regression model It is defined as a relative part ofthe sum of squares explained by the regression and can becalculated as

1198772 = 1 minus 119878119878res119878119878tot = 1 minus sum119894 (119910119894 minus 119910119894)2sum119894 (119910119894 minus 119910)2 (22)

where 119910 = sum119899119894=1 119910119894119899 and 119910119894 is the estimation (prediction) of119910119894 calculated by the regression modelNote that we calculated 1198772 using test dataset Thus the

results were more reliable because we trained our model onone piece of data (training set) but checked the accuracy ofthe model on the other one (test set)

513 Scalability Experiments Next we tested the proposedmethods on big data We analyzed how increasing the size ofthe dataset influences the time consumption of the algorithmFirst we discussed how the execution time changed with theincreasing of the size of the dataset Then we analyzed howmethods accuracy depends on the execution (training) timeunder different conditions Scalability was measured with thefixed number of cores

514 Speed-Up Experiments Finally we analyzed the rela-tionship between time consumption and the degree of par-allelism (number of cores) in the distributed realization Wevaried the various parallelization degrees and compared thespeed of work of various methods The speed was measuredwith the fixed accuracy and scale

52 Hardware and Software Used The experiments were car-ried out on a cluster of 16 computing nodes Each one of thesecomputing nodes had the following features processors 2xIntel Xeon CPU E5-2620 cores 4 per node (8 threads) clockspeed 200GHz cache 15MB network QDR InfiniBand(40Gbps) hard drive 2TB RAM 16GB per node

Both Hadoop master processes the NameNode and theJobTracker were hosted in the master node The former con-trolled the HDFS coordinating the slave machines by meansof their respective DataNode processes while the latter wasin charge of the TaskTrackers of each computing node whichexecuted the MapReduce framework We used a standaloneSPARK cluster which followed a similar configuration asthe master process was located on the master node and theworker processes were executed on the slave machines Bothframeworks shared the underlying HDFS file system

These are the details of the software used for the exper-iments MapReduce implementation Hadoop version 260SPARK version 152 operating system CentOS 66

53 Datasets We used three datasets synthetic data airlinesdata and Hanover traffic data The main characteristics ofthese datasets are summarized in Table 1 Table 2 presentssample records for each dataset


Table 1 Characteristics of the datasets

Dataset Number of records Number of factorsSynthetic data 10000 2Traffic data 6500 7Airlines delays 120000000 (13000) 29 + 22 (10)

Synthetic Data We started with synthetic data in orderto check how the partially linear regression performs onpartially linear data and to compute the basic performancecharacteristics that depended on training and test set sizesWe took the following model119910 = 051199091 + 1199092 sin (1199092) + 120576 (23)

where 1199091 and 1199092 were independently uniformly distributedin the interval [0 1000] and 120576 sim 119873(0 50) The dependenceon 1199092 was clearly nonlinear and strictly fluctuating

Hanover Traffic Data In [25] we simulated a traffic networkin the southern part of Hanover Germany on the base ofreal traffic data In this study we used this simulation dataas a dataset The network contained three parallel and fiveperpendicular streets which formed 15 intersections with aflow of approximately 5000 vehicles per hour For our exper-iments 6 variables were used 119910 travel time (min) 1199091 lengthof the route (km) 1199092 average speed in the system (kmh)1199093 average number of stops in the system (unitsmin) 1199094congestion level of the flow (vehh) 1199095 traffic lights in theroute (units) and 1199096 left turns in the route (units)

For the Hanover traffic data we solved the problem oftravel time predictions We did not filter the dataset andinstead used the data in the original form Principally theanalysis of outliers allows deleting suspicious observationsto obtain more accurate results however some importantinformation can be lost [26]

Taking different subsets of variables we found that thefollowing variable assignment to linear and kernel parts ofPLM is optimal

(i) Linear part U 1199094 1199096(ii) Kernel part T 1199091 1199092 1199093 1199095

Airlines DataThe data consists of flight arrival and departuredetails for all commercial flights within the USA fromOctober 1987 to April 2008 [27] This is a large dataset thereare nearly 120 million records in total taking up 12 gigabytesEvery row in the dataset includes 29 variables This data wasused in various studies for analyzing the efficiency ofmethodsproposed for big data processing including regression [28]In order to complement this data and to obtain a betterprediction we added weather average daily informationincluding daily temperatures (minmax) wind speed snowconditions and precipitation which is freely available fromthe site httpswwwwundergroundcom This provided anadditional 22 factors for each of the 731 daysThe airlines andweather datasets were joined on the corresponding date

For this data we aimed to solve the problem of the depar-ture delay prediction which corresponded to the DepDelaycolumn of data

For test purposes we selected the Salt Lake City Interna-tional Airport (SLC)Our experiments showed that construc-tion of a model for several airports resulted in poor resultsthis required special clustering [29] and could be the subjectof future research

Therefore we selected two years (1995 and 1996) for ouranalysis This yielded approximately 170000 records

An additional factor which was added to this data wasthe number of flights 30 minutes before and 30 minutes afterthe specific flight

The initial analysis showed the heterogeneous structureof this data Clustering showed that two variables are themost important for separating the data departure time andtravel distanceA very promising clusterwith goodpredictionpotential was short late flights (DepTimege 2145 andDistancele 1000Km) They could provide relatively good predictionshowever the distribution of delays did not significantly differfrom the original dataset This cluster contained approxi-mately 13000 records

In subsequent experiments we used subsets from thisdata in order to demonstrate the influence of dataset sizes onthe prediction quality

For the PLM algorithm the first important issue washow to divide the variables for the linear and kernel partsof the regression Taking different subsets of variables andincludingexcluding variables one by one we found 10 mostsignificant variables and the following optimal subsets

(i) Linear part U distance of the flight (Distance)average visibility distance (MeanVis) average windspeed (MeanWind) average precipitation (Precipita-tionmm)wind direction (WindDirDegrees) numberof flights in 30-minute interval (Num) and departuretime (DepTime)

(ii) Kernel part T day of the week (DayOfWeek) thun-derstorm (0 or 1) and destination (Dest)

Table 2 presents data fragments of 10 random recordsfrom each training dataset

6 Experimental Framework and Analysis

61 Accuracy Experiments We compared the goodness-of-fitmetric (1198772) of the partially linear linear and kernel regressionfor the datasets by varying the size of the training datasetNote that the accuracy did not depend on the size of the testset (which was taken equal to 1000)The results are presentedin Figures 4 5 and 6

We observed very different results For synthetic dataone variable was linear by definition and another one wasnonlinear the partially linear model was preferable as itproduced 1198772 with a value of 095 against 012 for both linearand kernel models For the airlines dataset partially linearmodel was also preferable but the difference was not sosignificant In contrast for the Hanover data the kernelmodel was preferable compared with the partially linear one


Table 2 Fragments of datasets

(a) Synthetic dataset119910 1199091 1199092231 087 126145 127 minus036minus05 047 minus006minus19 minus094 151minus151 033 172minus009 minus01 minus171017 024 16418 007 077minus05 045 minus022076 094 132

(b) Hanover dataset119910 travel time 1199091 length 1199092 speed 1199093 stops 1199094 congestion 1199095 tr lights 1199096 left turns256 210751 3030 210 4243 225 0284 234974 2236 489 8556 289 4162 124851 1933 927 8591 81 1448 234680 2058 839 8660 289 1248 35267 1933 927 8591 25 1327 90730 2354 396 8695 100 04435 109329 2201 544 8866 169 0294 34835 2368 381 8933 25 01255 123662 1897 1065 8521 81 15115 35723 1996 766 8485 25 1

(c) Airlines dataset

DepDelay DayOfWeek Distance MeanVis MeanWind Thunderstorm Precipitationmm WindDirDegrees Num Dest DepTime0 3 588 24 14 0 127 333 18 SNA 215063 7 546 22 13 0 178 153 2 GEG 2256143 5 919 24 18 0 914 308 27 MCI 2203minus4 6 599 22 16 0 1168 161 23 SFO 21474 6 368 24 14 0 762 151 22 LAS 215919 5 188 20 35 0 102 170 23 IDA 220425 7 291 23 19 0 0 128 22 BOI 22001 6 585 17 10 0 66 144 25 SJC 21510 3 507 28 13 0 991 353 24 PHX 215038 2 590 28 23 0 0 176 7 LAX 2243

starting from some point One possible explanation is that thedimension of traffic data was less than airlines data so thekernel model worked better when it had sufficient data

62 Scalability Experiments We compared the speed (execu-tion time) of the partially linear linear and kernel regressionfor the datasets by varying the size of training and testdatasets Note that both sizes of training and test datasetsinfluenced the execution timeThe results for airlines data arepresented in Figures 7 and 8 Other datasets produced similarresults

For the partially linear regression we could see quadraticrelationship between execution time and the size of the

training set and a linear dependence of execution timefrom the size of the test set These results could be easilyinterpreted because the computation of U and Y (steps (9)and (10) of PLM training Figure 2) required a smoothermatrix of size 119899times119899 where 119899 is the size of the training datasetOn the other hand the test set participated in the forecastingstep only and the complexity had a linear relationshipwith itssize For kernel and linear regressions the relationship withexecution time was linear for both training and test set sizes

Next we demonstrated how much resources (executiontime) should be spent to reach some quality level (1198772) forthe partially linear regressionmodelThe results for syntheticdata are presented in Figure 9 This graph was constructed


2000 4000 6000 8000 10000Training set size

PLMLinearKernel

10

08

06

04

02

00

R2

Figure 4 Forecast quality dependence on training set size forsynthetic data


PLMLinearKernel

10

08

06

04

02

00

R2

Figure 5 Forecast quality dependence on training set size for trafficdata

by fixing the test set size to 900 executing the algorithmfor different training set sizes obtaining the correspondingexecution time and accuracy and plotting them on the graph

We could see relatively fast growth at the beginning butit slowed towards the end and successive execution timeinvestments did not increase the goodness of fit 119877263 Speed-Up Experiments Finally we examined how theexecution time changes with the number of available cores

2000 4000 6000 8000 10000

Training set sizePLMLinearKernel

10

08

06

04

02

00

R2

Figure 6 Forecast quality dependence on training set size forairlines data

2000 4000 6000 8000 10000

0

500

1000

1500

Training set sizePLM test set 1000PLM test set 500PLM test set 200

Linear test set 1000Kernel test set 1000

R2

Figure 7 Execution timedependence on training set size for airlinesdata

The results for the Hanover traffic data are presented inFigure 10 and the remaining datasets produced similar results

We could see that the execution time decreased until itreached a threshold of 5ndash10 cores and then slightly increased(this is true for the PLM and the kernel regression for thelinear regression the minimum was reached with 2 cores)This is explained by the fact that data transfer among alarge number of cores takes significantly more time thancomputations This meant that SPARK still needed a better


200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e

PLM training set 10000PLM training set 5000PLM training set 3000

Linear training set 10000Kernel training set 10000

Figure 8 Execution time dependence on test set size for airlinesdata

200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2

Figure 9 Forecast quality dependence on execution time of PLMalgorithm for synthetic data

optimization mechanism for parallel task execution A sim-ilar issue has been reported by other researchers in [30] forclustering techniques

7 Concluding Remarks

This paper presents distributed parallel versions of kernel-density-based and partially linear regression modelsintended to process big datasets in a reasonable time Thealgorithms deploy Apache SPARK which is a well-knowndistributed computation framework The algorithms weretested over three different datasets of approximately 10000

0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e

PLM training set 2500 test set 1000PLM training set 2500 test set 500PLM training set 1000 test set 500Kernel training set 2500 test set 1000Linear training set 2500 test set 1000

Figure 10 PLM algorithm execution time depending on thenumber of processing cores for traffic data

records which cannot be processed by a single computerin a reasonable time Thus it was tested over a cluster of 16nodesWe conducted three kinds of experiments First in theAccuracy Experiments the results indicated that they couldnot be well forecasted using the linear regression model Weconducted the experiments with a more common situationwhen the structure of the data is not linear or partially linearFor all the datasets (non)semiparametric models (kerneland PLM) showed better results taking as an efficiencycriterion coefficient of determination 1198772 As discussedkernel regression experiences problem with increasing thedimensionality because it is difficult to find the points in theneighborhood of the specified point in big dimensions In ourexperiments benchmark data and real-world data had manyvariables and we showed that semiparametric models gavemore accurate results We also conducted experiments byincreasing the training set to show that it resulted in increasedaccuracy Next in the Scalability Experiments we changedthe sizes of training and test sets with the aim of analyzingthe algorithmsrsquo computation time with the same fixed levelof parallelism (number of working nodescores) All theexperiments showed that the training set influenced the timenonlinearly (atmost quadratically) but the test set influencedthe time linearly Finally in the Speed-Up Experiments ourpurpose was to show the importance of parallel realizationof the algorithms to work with big datasets taking intoaccount the fact that nonparametric and semiparametricestimation methods are very computationally intensive Wedemonstrated the feasibility of processing datasets of varyingsizes that were otherwise not feasible to process with a singlemachine An interesting aspect was that for each combination


(dataset algorithm) we could find the optimal amount ofresources (number of cores) to minimize the algorithmsexecution time For example for the PLM regression of theairlines data with the training set size equal to 2500 and testset size equal to 500 the optimal number of cores was 5 andwith the same training set size and the test set size equal to1000 the optimal number of cores was 9 After this optimalpoint the execution time of the algorithm starts to increaseWe could explain this phenomenon as the distributionexpenses in this case were more than the award from the par-allel executionThus we could conclude that it is important tofind the optimal amount of the resources for each experiment

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

References

[1] C Long Ed Data Science and Big Data Analytics DiscoveringAnalyzing Visualizing and Presenting Data John Wiley andSons Inc New York NY USA 2015

[2] J Dean and S Ghemawat ldquoMap reduce a flexible data process-ing toolrdquo Communications of the ACM vol 53 no 1 pp 72ndash772010

[3] Spark spark cluster computing framework httpsparkapacheorg

[4] D Peralta S del Rıo S Ramırez-Gallego I Triguero J MBenitez and F Herrera ldquoEvolutionary feature selection forbig data classification a MapReduce approachrdquo MathematicalProblems in Engineering vol 2015 Article ID 246139 11 pages2015

[5] ldquoCRAN task view High-performance and parallel computingwith Rrdquo httpscranr-projectorgwebviewsHighPerform-anceComputinghtml

[6] SparkR(Ronspark) httpssparkapacheorgdocslatestsparkrhtml

[7] P D Michailidis and K G Margaritis ldquoAccelerating kerneldensity estimation on the GPU using the CUDA frameworkrdquoApplied Mathematical Sciences vol 7 no 29-32 pp 1447ndash14762013

[8] A Fernandez S del Rıo V Lopez et al ldquoBig data with cloudcomputing an insight on the computing environment mapre-duce and programming frameworksrdquo Wiley InterdisciplinaryReviews Data Mining and Knowledge Discovery vol 4 no 5pp 380ndash409 2014

[9] N E Helwig ldquoSemiparametric regression of big data in Rrdquo inProceedings of the CSE Big Data Workshop May 2014

[10] N Draper and H Smith Applied Regression Analysis JohnWiley amp Sons New York NY USA 1986

[11] M S Srivastava Methods of multivariate statistics Wiley Seriesin Probability and Statistics Wiley-Interscience [John Wiley ampSons] NY USA 2002

[12] L Bottou ldquoStochastic gradient descent tricksrdquo Lecture Notes inComputer Science (including subseries Lecture Notes in ArtificialIntelligence and Lecture Notes in Bioinformatics) vol 7700 pp421ndash436 2012

[13] W Hardle Applied Nonparametric Regression Cambridge Uni-versity Press Cambridge UK 2002

[14] W Hardle M Muller S Sperlich and A Werwatz Nonpara-metric and Semiparametric Models Springer Series in StatisticsSpringer-Verlag New York NY USA 2004

[15] E Nadaraya ldquoOn estimating regressionrdquo Theory of Probabilityand its Applications vol 9 no 1 pp 141ndash142 1964

[16] G S Watson ldquoSmooth regression analysisrdquo The Indian Journalof Statistics Series A vol 26 pp 359ndash372 1964

[17] B W Silverman Density Estimation for Statistics and DataAnalysis Chapman amp Hall 1986

[18] D W Scott Multivariate Density Estimation Theory Practiceand Visualization Wiley-Interscience 1992

[19] H Liang ldquoEstimation in partially linear models and numericalcomparisonsrdquoComputational Statistics ampData Analysis vol 50no 3 pp 675ndash687 2006

[20] P Speckman ldquoKernel smoothing in partial linear modelsrdquoJournal of the Royal Statistical Society B vol 50 no 3 pp 413ndash436 1988

[21] M Zaharia M Chowdhury T Das et al ldquoResilient distributeddatasets a fault-tolerant abstraction for in-memory clustercomputingrdquo in Proceedings of the 9th USENIX Conference onNetworked Systems Design and Implementation (NSDI rsquo12)USENIX Association Berkeley Calif USA 2012

[22] Mllib guide httpsparkapacheorgdocslatestmllib-guidehtml

[23] N Pentreath Machine Learning with Spark Packt PublishingLtd Birmingham UK 2015

[24] R B Zadeh X Meng A Ulanov et al ldquoMatrix computationsand optimisation in apache sparkrdquo in Proceedings of the 22ndACM SIGKDD International Conference on Knowledge Discov-ery and Data Mining (KDD rsquo16) 38 31 pages San FranciscoCalif USA August 2016

[25] M Fiosins J Fiosina J P Muller and J Gormer ldquoAgent-basedintegrated decision making for autonomous vehicles in Urbantrafficrdquo Advances in Intelligent and Soft Computing vol 88 pp173ndash178 2011

[26] J Fiosina and M Fiosins ldquoCooperative regression-based fore-casting in distributed traffic networksrdquo in Distributed NetworkIntelligence Security and Applications Q A Memon Edchapter 1 p 337 CRC Press Taylor amp Francis Group 2013

[27] Airline on-time performance data asa sections on statisti-cal computing statistical graphics httpstat-computingorgdataexpo2009

[28] Data science with hadoop Predicting airline delays part 2httpdehortonworkscomblogdata-science-hadoop-spark-scala-part-2

[29] J Fiosina M Fiosins and J P Muller ldquoBig data processing andmining for next generation intelligent transportation systemsrdquoJurnal Teknologi vol 63 no 3 2013

[30] J Dromard G Roudire and P Owezarski ldquoUnsupervised net-work anomaly detection in real-time on big datardquo in Communi-cations in Computer and Information Science ADBIS 2015 ShortPapers and Workshops BigDap DCSA GID MEBIS OAISSW4CH WISARD Poitiers France September 8ndash11 2015 Pro-ceedings vol 539 of Communications in Computer and Inform-ation Science pp 197ndash206 Springer Berlin Germany 2015

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



traditional nonparallel realizations are not capable of pro-cessing all the available data This makes it imperative toadapt to existing techniques and to develop new ones thatovercome this disadvantage The distributed parallel SPARKframework gives us the possibility of addressing this dif-ficulty and increasing the scalability of nonparametric andsemiparametric regression methods allowing us to deal withbigger datasets

There are some approaches in the current literature toaddress nonparametric or semiparametric regression modelsfor parallel processing of big datasets [9] for example usingR add-on packages MPI Our study examines a novel fastparallel and distributed realization of the algorithms basedon the modern version of Apache SPARK which is a prom-ising tool for the efficient realization of different machinelearning and data mining algorithms [3]

The main objective of this study is to enable a paralleldistributed version of nonparametric and semiparametricregression models particularly kernel-density-based andpartial linear models to be applied on big data To realizethis a SPARK MapReduce based algorithm has been devel-oped which splits the data and performs various algorithmprocesses in parallel in the map phase and then combines thesolutions in the reduce phase to merge the results

More specifically the contribution of this study is (i) todesign novel distributed parallel kernel density regressionand partial linear regression algorithms over the SPARKMapReduce paradigm for big data and (ii) to validate thealgorithms analyzing their accuracy scalability and speed-up by means of numerical experiments

The remainder of this paper is organized as followsSection 2 reviews the traditional regression models to beanalyzed Section 3 reviews the existent distributed compu-tation frameworks for big datasets In Section 4 we proposeparallel versions of kernel-density-based and partial linearregression model algorithms based on SPARK MapReduceparadigm In Section 5 we present the experimental setupand in Section 6 we discuss the experimental framework andanalysis Section 7 concludes the paper and discusses futureresearch opportunities

2 Background Regression Models

21 Linear Multivariate Regression We start with linearregression which is the only regression model realized inthe current version of SPARK to compare the results of theproposed methods

Let us first consider the classical multivariate linearregression model 119864(119884 | X) = X120573 [10 11]

Y = X120573 + 120576 (1)

where 119899 is a number of observations 119889 is the number offactors Y119899times1 is a vector of dependent variables 120573119889times1 is avector of unknown parameters 120576119899times1 is a vector of randomerrors andX119899times119889 is amatrix of explanatory variablesThe rowsof the matrix X correspond to observations and the columnscorrespond to factors We suppose that 120576119894 are mutuallyindependent and have zero expectation and equal variances

(1) while not converged do(2) for all 119895 isin 0 119889 do(3) 120573119895 larr 120573119895 minus 120572 120597120597120573119895 119869(XY)(4) end for(5) end while

Algorithm 1 Stochastic Gradient Descent algorithm

The well-known least square estimator (LSE) of 120573 is

= (X119879X)minus1 X119879Y (2)

Further let (x119894 119910119894)119899119894=1 be the observations sampled fromthe distribution of (X 119884) After the estimation of the param-eters 120573 we can make a forecast for a certain 119896th (future) timemoment as 119864(119884119896) = x119896 where x119896 is a vector of observedvalues of explanatory variables for the 119896th time moment

For big data it is a problem to perform the matrix opera-tions in (2) For this purpose other optimization techniquescan be used One effective option is to use the StochasticGradient Descent algorithm [12] which is realized in SPARKThe generalized cost function to be minimized is119869 (120573XY) = 12119899 (X120573 minus Y)119879 (X120573 minus Y) (3)

Algorithm 1 presents the Stochastic Gradient Descentalgorithm where 120572 is a learning rate parameter

Algorithm 1 can be executed iteratively (incrementally)that is easy to parallelize

22 Kernel Density Regression The first alternative to thelinear regression model we want to consider is the kerneldensity estimation-based regression which is of big impor-tance in the case when data have a nonlinear structure Thisnonparametric approach for estimating a regression curvehas four main purposes First it provides a versatile methodfor exploring a general relationship between two variablesSecond it can predict observations without a reference toa fixed parametric model Third it provides a tool forfinding spurious observations by studying the influence ofisolated points Fourth it constitutes a flexible method ofsubstitution or interpolation between adjacent 119883-values formissing values [13]

Let us consider a nonparametric regression model119898(119909) = 119864(119884 | 119883 = 119909) [14] with the same 119884 119883 and 120576 asfor a linear model 119910 = 119898 (119909) + 120576 (4)

The Nadaraya-Watson kernel estimator [15 16] of119898(119909) is119898119899 (119909) = sum119899119894=1119870((119909 minus 119909119894) ℎ) 119910119894sum119899119894=1119870((119909 minus 119909119894) ℎ)= sum119899119894=1119870ℎ (119909 minus 119909119894) 119910119894sum119899119894=1119870ℎ (119909 minus 119909119894) (5)























Cluster manager

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode













Smoother matrix

Shares RDD

12

3

456

7

8

9

1011

12

Collects and

Accesses RDD

Computes and shares


Accesses RDD 훽


[퐘(i) 퐔(i) 퐓(i)]







퐈 minus 퐒(i)

sumW(i)⟨j⟩

퐒(i)kj

= 퐖(i)kj

sumW(i)⟨j⟩








Driver


Broadcasts the


Receives


Computes

Smoother

Computes

1

2

3

4

5

6

8

7

9

10

1112

13

Collects and

Worker i

coefficients 훽

[퐮㰀 퐭㰀]



ith RDD part


vector 퐒(i)




퐒(i)j = 퐖(i)

j sumi sumW(i)


sumW(i)Receives

shares RDD sumW(i)
























matrix S(119894)







[YUT] =((

Y U T12341 22 32 12 2

1 22 32 11 1))

(8)




Y(2)U(2)T(2))=((((((

Y(1) U(1) T(1)1 1 2 1 22 2 3 2 3Y(1) U(2) T(2)3 2 1 2 14 2 2 1 1

))))))

(9)



S = (S(1)S(2)

) = (119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(1)T)119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(2)T)) (10)




W = (W(1)W(2)

) = (0399 0007 0007 00540007 0399 0000 00000007 0000 0399 00540054 0000 0054 0399) (12)


13= minus 0007(0399 + 0007 + 0007 + 0005)= minus0016 (13)



U = (U(1)U(2)


Y(2)) = (minus03930018minus00850426 )

(15)



120573 = (120573012057311205732) = (minus000227531041 ) (16)





W = (W(1)W(2)

) = (0007 00000399 0054) (18)





S = (S(1)S(2)

) = (0016 00000867 0117) (19)


U120573 = (U(1)120573U(2)120573) = (483863655759)



























































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in

























Cluster manager

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode

Sparkworker

Datanode













Smoother matrix

Shares RDD

12

3

456

7

8

9

1011

12

Collects and

Accesses RDD

Computes and shares


Accesses RDD 훽


[퐘(i) 퐔(i) 퐓(i)]







퐈 minus 퐒(i)

sumW(i)⟨j⟩

퐒(i)kj

= 퐖(i)kj

sumW(i)⟨j⟩








Driver


Broadcasts the


Receives


Computes

Smoother

Computes

1

2

3

4

5

6

8

7

9

10

1112

13

Collects and

Worker i

coefficients 훽

[퐮㰀 퐭㰀]



ith RDD part


vector 퐒(i)




퐒(i)j = 퐖(i)

j sumi sumW(i)


sumW(i)Receives

shares RDD sumW(i)
























matrix S(119894)







[YUT] =((

Y U T12341 22 32 12 2

1 22 32 11 1))

(8)




Y(2)U(2)T(2))=((((((

Y(1) U(1) T(1)1 1 2 1 22 2 3 2 3Y(1) U(2) T(2)3 2 1 2 14 2 2 1 1

))))))

(9)



S = (S(1)S(2)

) = (119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(1)T)119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(2)T)) (10)




W = (W(1)W(2)

) = (0399 0007 0007 00540007 0399 0000 00000007 0000 0399 00540054 0000 0054 0399) (12)


13= minus 0007(0399 + 0007 + 0007 + 0005)= minus0016 (13)



U = (U(1)U(2)


Y(2)) = (minus03930018minus00850426 )

(15)



120573 = (120573012057311205732) = (minus000227531041 ) (16)





W = (W(1)W(2)

) = (0007 00000399 0054) (18)





S = (S(1)S(2)

) = (0016 00000867 0117) (19)


U120573 = (U(1)120573U(2)120573) = (483863655759)



























































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




Driver


Broadcasts the


Receives


Computes

Smoother

Computes

1

2

3

4

5

6

8

7

9

10

1112

13

Collects and

Worker i

coefficients 훽

[퐮㰀 퐭㰀]



ith RDD part


vector 퐒(i)




퐒(i)j = 퐖(i)

j sumi sumW(i)


sumW(i)Receives

shares RDD sumW(i)
























matrix S(119894)







[YUT] =((

Y U T12341 22 32 12 2

1 22 32 11 1))

(8)




Y(2)U(2)T(2))=((((((

Y(1) U(1) T(1)1 1 2 1 22 2 3 2 3Y(1) U(2) T(2)3 2 1 2 14 2 2 1 1

))))))

(9)



S = (S(1)S(2)

) = (119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(1)T)119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(2)T)) (10)




W = (W(1)W(2)

) = (0399 0007 0007 00540007 0399 0000 00000007 0000 0399 00540054 0000 0054 0399) (12)


13= minus 0007(0399 + 0007 + 0007 + 0005)= minus0016 (13)



U = (U(1)U(2)


Y(2)) = (minus03930018minus00850426 )

(15)



120573 = (120573012057311205732) = (minus000227531041 ) (16)





W = (W(1)W(2)

) = (0007 00000399 0054) (18)





S = (S(1)S(2)

) = (0016 00000867 0117) (19)


U120573 = (U(1)120573U(2)120573) = (483863655759)



























































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





[YUT] =((

Y U T12341 22 32 12 2

1 22 32 11 1))

(8)




Y(2)U(2)T(2))=((((((

Y(1) U(1) T(1)1 1 2 1 22 2 3 2 3Y(1) U(2) T(2)3 2 1 2 14 2 2 1 1

))))))

(9)



S = (S(1)S(2)

) = (119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(1)T)119878119898119900119900119905ℎ119872119886119905119903119894119909 (T(2)T)) (10)




W = (W(1)W(2)

) = (0399 0007 0007 00540007 0399 0000 00000007 0000 0399 00540054 0000 0054 0399) (12)


13= minus 0007(0399 + 0007 + 0007 + 0005)= minus0016 (13)



U = (U(1)U(2)


Y(2)) = (minus03930018minus00850426 )

(15)



120573 = (120573012057311205732) = (minus000227531041 ) (16)





W = (W(1)W(2)

) = (0007 00000399 0054) (18)





S = (S(1)S(2)

) = (0016 00000867 0117) (19)


U120573 = (U(1)120573U(2)120573) = (483863655759)



























































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







S = (S(1)S(2)

) = (0016 00000867 0117) (19)


U120573 = (U(1)120573U(2)120573) = (483863655759)



























































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







































PLMLinearKernel

10

08

06

04

02

00

R2



PLMLinearKernel

10

08

06

04

02

00

R2




2000 4000 6000 8000 10000


10

08

06

04

02

00

R2


2000 4000 6000 8000 10000

0

500

1000

1500



R2





200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




200 400 600 800 1000

0

500

1000

1500

2000

2500

Test set size

Exec

utio

n tim

e




200 400 600 800 1000 1200 1400Execution time

10

09

08

07

06

R2





0 10 20 30 40 50

0

100

200

300

400

Number of cores

Exec

utio

n tim

e






Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





Competing Interests


References






































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



distributed nonparametric and semiparametric regression on...

Documents