
Using Numerical Libraries on Spark

Brian Spector, The Numerical Algorithms Group (NAG): experts in numerical algorithms and HPC services

London Spark Users Meetup, August 18th, 2015

How to use existing libraries on Spark

Call the algorithm with the data in-memory, if the data is small enough.

Sample: obtain estimates for the parameters from a subset of the data.

Reformulate the problem: MapReduce existing functions across the workers (see the sketch below).
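As a rough sketch of the third approach (the pattern the rest of the talk builds on), each partition can be handed whole to an existing serial routine and the partial results merged with a reduce. ExistingLibrary and PartialResult below are placeholder names for illustration, not a real API:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class LibraryOnSparkSketch {
        // ExistingLibrary.summarise / merge and PartialResult are hypothetical stand-ins
        // for any in-memory routine plus a way of combining two partial results.
        static PartialResult run(JavaRDD<double[]> rows) {
            return rows
                .mapPartitions(iter -> {
                    List<double[]> chunk = new ArrayList<>();
                    while (iter.hasNext()) chunk.add(iter.next());        // gather this partition
                    // one call to the existing compiled routine per partition
                    return Arrays.asList(ExistingLibrary.summarise(chunk));
                })
                .reduce((a, b) -> ExistingLibrary.merge(a, b));            // merge must be associative
        }
    }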

Outline

NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in the UK.

Supports and maintains the NAG Numerical Library: ~1,700 mathematical and statistical routines.

NAG's code is embedded in many vendor libraries (e.g. AMD, Intel).

Three strands: Numerical Library, HPC Services, Numerical Services.

NAG Library Full Contents

Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation & Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research

Outline (section divider; same items as above)

Efficient Algorithms

At NAG we have 40 years of algorithms: linear algebra, regression, optimization.

The data is all in-memory and you call a compiled library, e.g.

    void nag_1d_spline_interpolant(Integer m, const double x[], const double y[],
                                   Nag_Spline *spline, NagError *fail)

The algorithms take advantage of AVX/AVX2, multi-core, LAPACK/BLAS, MKL.

Efficient Algorithms

Modern programming languages and compilers (hopefully) take care of some efficiencies for us: Python with numpy/scipy, compiler flags (-O2).

Users worry less about efficient computing and more about solving their problem.

Big Data

…But what happens when it comes to Big Data?

    x = (double *) malloc(10000000 * 10000 * sizeof(double));

(That is 10,000,000 × 10,000 doubles, roughly 8 × 10^11 bytes, or about 800 GB: far more than fits in one machine's memory.)

Past efficient algorithms break down; we need different ways of solving the same problem.

How do we use our existing functions (libraries) on Spark? We must reformulate the problem.

An example… (Warning: Mathematics Approaching)

Linear Regression Example

General problem: given a set of input measurements $x_1, x_2, \ldots, x_p$ and an outcome measurement $y$, fit a linear model

$$y = XB + \varepsilon,\qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},\qquad
X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}.$$

Three ways of solving the same problem.

Linear Regression Example

Solution 1 (Optimal Solution):

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = QR^{*},
\qquad\text{where } R^{*} = \begin{pmatrix} R \\ 0 \end{pmatrix},$$

$R$ is $p \times p$ upper triangular and $Q$ is orthogonal. Then solve $RB = c_1$, where $c = Q^{T}y$ …

… lots of linear algebra, but we have an algorithm.

Linear Regression Example

Solution 2 (Normal Equations):

$$X^{T}XB = X^{T}y$$

If we have 3 independent variables:

$$X = \begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix},\qquad
y = \begin{pmatrix} 7 \\ \vdots \\ 8 \end{pmatrix}$$

Linear Regression Example

$$X^{T}X = \begin{pmatrix} 1 & \cdots & 4 \\ 2 & \cdots & 5 \\ 3 & \cdots & 6 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix}
= \begin{pmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ z_{31} & z_{32} & z_{33} \end{pmatrix},
\qquad B = (X^{T}X)^{-1}X^{T}y$$

This computation can be mapped out to the slave nodes (see the identity below).
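The identity that makes the map-out work: if the rows of $X$ and $y$ are split across $k$ partitions, the cross-product matrices simply add, so every worker can form its own small $p \times p$ contribution and only those small matrices need to be combined:

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix},\quad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix}
\;\Longrightarrow\;
X^{T}X = \sum_{i=1}^{k} X_i^{T}X_i,\qquad
X^{T}y = \sum_{i=1}^{k} X_i^{T}y_i.$$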

Linear Regression Example

Solution 3 (Optimization):

$$\min_{B} \; \lVert y - XB \rVert^{2}$$

Iterate over the data (a typical update is sketched below); the final answer is an approximation to the solution.

Constraints can be added to the variables (LASSO).
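For a sense of what "iterate over the data" means in practice, one standard update (a plain gradient step on the least-squares objective with step size $\eta$, the kind of iteration MLlib's SGD-based solvers perform) is

$$B^{(t+1)} = B^{(t)} - \eta\, X^{T}\!\left(XB^{(t)} - y\right),$$

where the gradient $X^{T}(XB^{(t)} - y)$ is a sum of per-row terms, so each iteration is itself a map over the data followed by a reduce.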

MLlib Algorithms

Machine Learning / Optimization: MLlib uses Solution 3 for many applications, e.g. classification, regression, Gaussian Mixture Models, k-means.

The slave nodes are used for callbacks to access the data: Map, Reduce, Repeat. (A minimal MLlib call is sketched below.)
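For comparison, a minimal sketch of the iterative MLlib route in the Java API of that era (Spark 1.x); the file path and iteration count are illustrative assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.regression.LinearRegressionModel;
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
    import org.apache.spark.mllib.util.MLUtils;

    public class MLlibRegressionSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("mllib-lr"));
            // LIBSVM-format file path is made up for the example.
            JavaRDD<LabeledPoint> data =
                MLUtils.loadLibSVMFile(sc.sc(), "hdfs:///data/points.libsvm").toJavaRDD().cache();
            // SGD iterates over the data: each iteration is a map (gradients) plus a reduce (sum).
            LinearRegressionModel model = LinearRegressionWithSGD.train(data.rdd(), 100);
            System.out.println("Weights: " + model.weights());
            sc.stop();
        }
    }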

MLlib Algorithms

[Diagram: the libraries live on the Master; the Slave nodes hold the data and are called back to on each iteration.]

Outline (section divider; same items as above)

Example on Spark

Use the Normal Equations (Solution 2) to compute a linear regression on a large dataset. The NAG Libraries were distributed across the Master and Slave nodes.

Steps (a driver-level sketch follows below):

1. Read in the data.
2. Call a routine to compute the sum-of-squares (SSQ) matrix on each partition (chunk); the default partition size is 64 MB.
3. Call a routine to aggregate the SSQ matrices together.
4. Call a routine to compute the optimal coefficients.
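Pulling the four steps together, a condensed driver sketch using the mapper and reducer classes shown later in the talk (ComputeSSQ, CombineData); the parsing lambda, the path, and the DataClass fields are assumptions for illustration:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class NormalEquationsDriver {
        public static void run(JavaSparkContext sc, String path) throws Exception {
            // Step 1: read the data, one LabeledPoint per row (parser is illustrative).
            JavaRDD<LabeledPoint> points = sc.textFile(path).map(line -> {
                String[] tok = line.split("\\s+");
                double[] feats = new double[tok.length - 1];
                for (int i = 1; i < tok.length; i++) feats[i - 1] = Double.parseDouble(tok[i]);
                return new LabeledPoint(Double.parseDouble(tok[0]), Vectors.dense(feats));
            }).cache();

            // Steps 2 and 3: per-partition SSQ summaries, then pairwise combination.
            DataClass combined = points.mapPartitions(new ComputeSSQ())
                                       .reduce(new CombineData());

            // Step 4: final NAG calls on the driver turn the aggregated SSQ matrix into
            // correlations / regression coefficients (g02bw etc., as on the later slide).
            // ... driver-side library calls go here ...
        }
    }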

Linear Regression Example

[Diagram: the Master and each slave node make a LIBRARY CALL on their part of the work.]

Linear Regression Example Data

Label      X1     X2     X3    X4
 0.0       18.0    0.0   22.2   1
 0.0       18.0    0.0   81.1   0
68.0509    42.3   12.1   23.2   1
86.5650    47.3   19.5   71.9   2
65.1515    47.3    7.4   16.8   0
78.5641    53.2   11.4   17.4   1
56.5567    34.9   10.7   68.4   0

The Data (in Spark)

rdd.parallelize(data).cache()   (this shorthand is spelled out below)

[Diagram: the labelled rows start on the Master and are distributed across Slave 1 and Slave 2.]
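Spelling the slide's shorthand out against the Java API, as a sketch; in practice a dataset of the sizes used later would be read from distributed storage with sc.textFile rather than parallelized from the driver:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class DistributeData {
        // 'data' is the driver-side list of rows; parallelize ships it to the cluster
        // and cache() keeps the resulting partitions in memory on the slaves.
        static JavaRDD<LabeledPoint> distribute(JavaSparkContext sc, List<LabeledPoint> data) {
            return sc.parallelize(data).cache();
        }
    }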

The Data (in Spark)

LIBRARY_CALL(data)

[Diagram: each slave applies the library call to its own partition of the rows.]

The Data (in Spark)

[Diagram: the same rows shown sitting on Slave 1 and Slave 2 as individual records, before being gathered into contiguous blocks.]

Getting Data into Contiguous Blocks

Spark Java transformations:

map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD.

flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD.

[Diagram: Slave 2's rows, before and after being packed into one block.]

Getting Data into Contiguous Blocks

Spark Java transformations:

foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD.

mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD. (A small stand-alone example follows below.)

[Diagram: Slave 2's rows again, now handed to the function one whole partition at a time.]
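A small self-contained illustration of why mapPartitions is the useful one here: it hands the function an iterator over a whole partition, so the elements can be copied into one contiguous primitive array and handed to a single library call, instead of being visited one element at a time as map would do. The per-partition "sum" below is just a stand-in for such a call:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class ContiguousBlockExample {
        // Pack each partition into one double[] and return one summary value per partition.
        static JavaRDD<Double> partitionSums(JavaRDD<Double> values) {
            return values.mapPartitions(iter -> {
                List<Double> buffer = new ArrayList<>();
                while (iter.hasNext()) buffer.add(iter.next());
                double[] block = new double[buffer.size()];           // contiguous block
                for (int i = 0; i < block.length; i++) block[i] = buffer.get(i);
                double sum = 0.0;                                     // stand-in for a library call
                for (double v : block) sum += v;
                return Arrays.asList(sum);                            // Spark 1.x: Iterable result
            });
        }
    }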

The Data (in Spark)

mapPartitions

[Diagram: after mapPartitions, each slave's partition has been turned into one contiguous block of rows.]

How it looks in Spark

    public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
        // Guard: keep each partition small enough for the NAG routine (the operators in
        // this check were lost from the slide; a per-partition element count is assumed).
        if (dataSize * numVars / datapoints.partitions().size() > 10000000)
            throw new NAGSparkException("Partitioned Data Further To Use NAG Routine");

        // Map each partition to its means/SSQ summary, then combine the summaries.
        DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ())
                                      .reduce(new CombineData());
        …
        // Final driver-side call: correlation matrix from the aggregated SSQ matrix.
        G02BW g02bw = new G02BW();
        g02bw.eval(numVars, finalpt.ssq, ifail);
        if (g02bw.getIFAIL() > 0) {
            System.out.println("Error with NAG (g02bw) IFAIL = " + g02bw.getIFAIL());
            throw new NAGSparkException("Error from g02bw");
        }
    }

ComputeSSQ (Mapper)

    static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
        @Override
        public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
            // Drain the partition into a list, then pack it column-major into one
            // contiguous array: features first, label as the last column.
            List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
            while (iter.hasNext()) mypoints.add(iter.next());
            int length = mypoints.size();
            int numvars = mypoints.get(0).features().size() + 1;
            double[] x = new double[length * numvars];
            for (int i = 0; i < length; i++) {
                for (int j = 0; j < numvars - 1; j++)
                    x[j * length + i] = mypoints.get(i).features().apply(j);
                x[(numvars - 1) * length + i] = mypoints.get(i).label();
            }
            // One NAG call per partition: means and sums-of-squares matrix for this chunk.
            // Argument order as shown on the slide; means, ssq and ifail are fields
            // declared elsewhere in the class.
            G02BU g02bu = new G02BU();
            g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
            return Arrays.asList(new DataClass(length, means, ssq));
        }
    }

CombineData (Reducer)

    static class CombineData implements Function2<DataClass, DataClass, DataClass> {
        @Override
        public DataClass call(DataClass data1, DataClass data2) throws Exception {
            // Merge two partial summaries: g02bz combines the means and SSQ matrices
            // of two data chunks into one (argument order as shown on the slide).
            G02BZ g02bz = new G02BZ();
            g02bz.eval("M", data1.means.length, data1.length, data1.means,
                       data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
            data1.length = (int) g02bz.getXSW();   // combined observation count (cast assumed)
            return data1;
        }
    }

Because combining means and sums of squares this way is associative and commutative, the class can safely be used as a Spark reducer.

The Data (in Spark)

[Diagram: the combined summary is brought back from the slaves to the Master for the final library calls.]

Linear Regression Example

Test 1: data ranging in size from 2 GB to 64 GB on an Amazon EC2 cluster, using an 8-slave xlarge cluster (16 GB RAM).

Test 2: varied the number of slave nodes from 2 to 16, with 16 GB of data, to see how the algorithms scale.

Cluster Utilization [chart]

Results of Linear Regression [chart]

Results of Scaling [chart]

NAG vs MLlib [chart]

Problems Encountered (1)

The Java Spark documentation, e.g. the return type needed here: return Arrays.asList(new DataClass(length, means, ssq));

Distributing the libraries: spark-ec2/copy-dir

Setting environment variables:

LD_LIBRARY_PATH needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (the NAG license file) can be set in code via sc.setExecutorEnv (see the sketch below).
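A minimal sketch of the license-file route, assuming the SparkConf/JavaSparkContext API; the license path is made up, and in the talk "sc" may refer to a different object:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NagLicenseSetup {
        public static JavaSparkContext create() {
            SparkConf conf = new SparkConf()
                .setAppName("NAG-on-Spark")
                // Ship the license location to every executor's environment.
                .setExecutorEnv("NAG_KUSARI_FILE", "/opt/nag/licenses/nag.key"); // illustrative path
            return new JavaSparkContext(conf);
        }
    }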

Problems Encountered (2)

The data problem: data locality, and the inverse of a singular matrix (if a column of the data is constant or a linear combination of the others, X^T X cannot be inverted).

[The example data table from earlier is shown again.]

Problems Encountered (3)

Debugging slave nodes: throw lots of exceptions.

Int sizes with Big Data and a numerical library: 32- vs 64-bit integers (a 32-bit int tops out at about 2.1 billion, which large row counts can exceed).

Some algorithms don't work.

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm.

Outline (section divider; same items as above)

MLlib Algorithms

Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation

Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression

Collaborative filtering: alternating least squares (ALS)

Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means

Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining: FP-growth

Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)

Other Exciting Happenings

DataFrames

H2O Sparkling Water

MLlib pipeline: better optimizers, improved APIs

SparkR: contains all the components for better numerical computations; use existing R code on Spark; free; huge open source community

How to use existing libraries on Spark (recap)

Call the algorithm with the data in-memory, if the data is small enough.

Sample: obtain estimates for the parameters.

Reformulate the problem: MapReduce existing functions across the slaves.

NAG and Apache Spark

Thanks! For more information:

Email: brian.spector@nag.com

Check out the NAG Blog: http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

Commercial in Confidence 2Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across workers

How to use existing libraries on Spark

Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers (a sketch follows after this slide)

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
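
To make the integer-size point concrete, here is a minimal sketch (not from the deck) of guarding the kind of flat array allocation that ComputeSSQ performs: Java arrays are indexed by 32-bit ints, so the row-by-column product has to be formed in 64 bits and checked first. The helper name is hypothetical.

// Hypothetical helper: allocate a flat column-major buffer safely
static double[] allocateColumnMajor(int rows, int cols) {
    long elements = (long) rows * (long) cols;   // form the product in 64-bit to avoid silent overflow
    if (elements > Integer.MAX_VALUE)
        throw new IllegalStateException("Partition too large for a single Java array: " + elements);
    return new double[(int) elements];
}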


NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline


Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms


DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free. Huge open source community.

Other Exciting Happenings


Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark


Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark


Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics
  summary statistics
  correlations
  stratified sampling
  hypothesis testing
  random data generation

Classification and regression
  linear models (SVMs, logistic regression, linear regression)
  naive Bayes
  decision trees
  ensembles of trees (Random Forests and Gradient-Boosted Trees)
  isotonic regression

Collaborative filtering
  alternating least squares (ALS)

Clustering
  k-means
  Gaussian mixture
  power iteration clustering (PIC)
  latent Dirichlet allocation (LDA)
  streaming k-means

Dimensionality reduction
  singular value decomposition (SVD)
  principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining
  FP-growth

Optimization (developer)
  stochastic gradient descent
  limited-memory BFGS (L-BFGS)

MLlib Algorithms
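
For comparison with the NAG route above, the MLlib path to the same regression is an iterative fit ("Solution 3" in the deck). A hedged sketch against the Spark 1.x Java API, assuming parsedData is the JavaRDD<LabeledPoint> built earlier:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

// Stochastic-gradient-descent fit; numIterations is an assumption to be tuned.
static LinearRegressionModel fitWithMLlib(JavaRDD<LabeledPoint> parsedData) {
    parsedData.cache();
    int numIterations = 100;
    return LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), numIterations);
}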


DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations; use existing R code on Spark

Free, with a huge open source community

Other Exciting Happenings


Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark


Thanks

For more information

Email: brianspector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30

static class CombineData implements Function2<DataClass, DataClass, DataClass> {

    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        // NAG g02bz: merge two sets of means and sums of squares/cross-products
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = g02bz.getXSW();   // combined row count (sum of weights)
        return data1;
    }
}

CombineData (Reducer)
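For a single unweighted variable, the pairwise update that g02bz performs when merging two blocks amounts to the standard result (cross-products combine analogously):

n = n_1 + n_2
\bar{x} = (n_1 \bar{x}_1 + n_2 \bar{x}_2) / n
S = S_1 + S_2 + (n_1 n_2 / n)\,(\bar{x}_1 - \bar{x}_2)^2

where S is the sum of squares about the mean. Because this update is associative and commutative, Spark's reduce may combine the partial DataClass results in any order.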

Commercial in Confidence 31

The Data (in Spark)

[Diagram: the per-partition summaries from Slave 1 and Slave 2 reduced back to the Master]

Commercial in Confidence 32

Test 1

Data ranges in size from 2 GB – 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33

Cluster Utilization

Commercial in Confidence 34

Results of Linear Regression

Commercial in Confidence 35

Results of Scaling

Commercial in Confidence 36

NAG vs MLlib

Commercial in Confidence 37

Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));

Distributing libraries

spark-ec2/copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (NAG license file) - Can be set in code via sc.setExecutorEnv

Problems Encountered (1)
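A hedged sketch of setting the licence variable from code: in the Java API the documented call is SparkConf.setExecutorEnv (the file path here is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("NAG on Spark")
        .setExecutorEnv("NAG_KUSARI_FILE", "/path/to/nag.licence");   // illustrative path
JavaSparkContext sc = new JavaSparkContext(conf);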

Commercial in Confidence 38

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
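A small guard against the 32-bit limit before handing a block to a routine that takes int sizes (a sketch; Integer.MAX_VALUE is also the hard ceiling for a Java array index):

long elements = (long) length * numvars;     // per-block element count
if (elements > Integer.MAX_VALUE) {
    throw new IllegalStateException(
            "Block of " + elements + " elements cannot be addressed with 32-bit ints");
}
int n = (int) elements;                      // narrow only after the check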

Commercial in Confidence 40

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms
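For the linear-regression entry above, the Spark 1.x MLlib call that a comparison such as the NAG vs MLlib timing would exercise looks roughly like this (a sketch; numIterations is a tuning choice, not a value from the slides):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

int numIterations = 100;                      // illustrative
LinearRegressionModel model =
        LinearRegressionWithSGD.train(JavaRDD.toRDD(datapoints), numIterations);
double intercept = model.intercept();
double[] weights = model.weights().toArray();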

Commercial in Confidence 42

DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free; huge open source community

Other Exciting Happenings

Commercial in Confidence 43

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44

Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark


Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

[Diagram: the same six rows partitioned across Slave 1 and Slave 2, with the per-partition results reduced back to the Master.]

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB - 64 GB on an Amazon EC2 cluster

Used an 8-slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how the algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));

Distributing libraries

spark-ec2/copy-dir

Setting environment variables

LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv

Problems Encountered (1)
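A minimal sketch of setting these up from code, assuming the standard SparkConf API; the paths and values are illustrative, and SparkConf.setExecutorEnv is used here in place of the sc.setExecutorEnv call mentioned above.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class NAGEnvSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("simpleNAGExample")
            .setMaster("local[2]")
            // Same effect as --conf spark.executor.extraLibraryPath=... on spark-submit;
            // the path is illustrative.
            .set("spark.executor.extraLibraryPath", "/opt/nag/lib")
            // Licence file for the executors; variable name from the slides, path illustrative.
            .setExecutorEnv("NAG_KUSARI_FILE", "/opt/nag/licence/nag.key");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("extraLibraryPath = " + conf.get("spark.executor.extraLibraryPath"));
        sc.stop();
    }
}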

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label    X1   X2   X3   X4
00       180  00   222  1
00       180  00   811  0
680509   423  121  232  1
865650   473  195  719  2
651515   473  74   168  0
785641   532  114  174  1
565567   349  107  684  0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
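The 32- vs 64-bit issue typically surfaces when a partition's packed array would need more than Integer.MAX_VALUE elements, since Java arrays are indexed by int. A small illustrative guard follows; the names are not from the presentation's code.

public class PartitionSizeCheck {

    static double[] allocateBlock(long rows, long cols) {
        long elements = rows * cols;               // do the arithmetic in 64 bits first
        if (elements > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Partition too large for a single Java array: " + elements
                + " elements; repartition the RDD into smaller pieces.");
        }
        return new double[(int) elements];         // safe to narrow to int now
    }

    public static void main(String[] args) {
        double[] ok = allocateBlock(1_000_000, 5);
        System.out.println("Allocated " + ok.length + " doubles");
        allocateBlock(1_000_000_000L, 10);          // throws: would overflow an int index
    }
}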

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free. Huge open source community.

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark
