Commercial in Confidence
Experts in numerical algorithms and HPC services
Using Numerical Libraries on Spark
Brian Spector
London Spark Users Meetup
August 18th 2015
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across workers
How to use existing libraries on Spark
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
The Numerical Algorithms Group
Founded in 1970 as a co-operative project out of academia in the UK
Support and maintain a Numerical Library of ~1,700 mathematical/statistical routines
NAG's code is embedded in many vendor libraries (e.g. AMD, Intel)
Numerical Library
HPC Services
Numerical Services
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation & Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
At NAG we have 40 years of algorithms
Linear Algebra, Regression, Optimization
Data is all in-memory; call a compiled library:
void nag_1d_spline_interpolant (Integer m, const double x[], const double y[], Nag_Spline *spline, NagError *fail)
Algorithms take advantage of
AVX/AVX2
Multi-core
LAPACK/BLAS
MKL
Efficient Algorithms
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python: numpy/scipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
…But what happens when it comes to Big Data?
x = (double *) malloc(10000000 * 10000 * sizeof(double));
Past efficient algorithms break down
Need different ways of solving the same problem
How do we use our existing functions (libraries) on Spark?
We must reformulate the problem
An example… (Warning: Mathematics Approaching)
Big Data
General Problem
Given a set of input measurements $x_1, x_2, \dots, x_p$ and an outcome measurement $y$, fit a linear model
$y = X B + \varepsilon$
where
$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$, $X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$
Three ways of solving the same problem
Linear Regression Example
Solution 1 (Optimal Solution)
$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = Q R^{*}$ where $R^{*} = \begin{pmatrix} R \\ 0 \end{pmatrix}$
$R$ is $p \times p$ upper triangular, $Q$ orthogonal
$R B = c_1$ where $c = Q^T y$ …
… lots of linear algebra,
but we have an algorithm
Linear Regression Example
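The step the slide elides can be filled in: since $Q$ is orthogonal, multiplying by $Q^T$ preserves norms, so with $c = Q^T y$ split as $c_1 \in \mathbb{R}^p$ on top of $c_2$:

```latex
\|y - XB\|^2 = \|Q^T y - R^{*} B\|^2
             = \left\| \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} - \begin{pmatrix} R \\ 0 \end{pmatrix} B \right\|^2
             = \|c_1 - R B\|^2 + \|c_2\|^2
```

The first term vanishes when $R B = c_1$, so solving that small triangular system gives the least-squares solution, and $\|c_2\|$ is the residual norm.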
Solution 2 (Normal Equations)
$X^T X B = X^T y$
If we have 3 independent variables:
$X = \begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix}$, $y = \begin{pmatrix} 7 \\ \vdots \\ 8 \end{pmatrix}$
Linear Regression Example
$X^T X = \begin{pmatrix} 1 & \cdots & 4 \\ 2 & \cdots & 5 \\ 3 & \cdots & 6 \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix} = \begin{pmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ z_{31} & z_{32} & z_{33} \end{pmatrix}$
$B = (X^T X)^{-1} X^T y$
This computation can be mapped out to slave nodes
Linear Regression Example
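The normal-equations recipe can be sketched end to end in a few lines of plain Java (a hypothetical illustration, not NAG code): accumulate the entries of X^T X and X^T y in one pass over the rows, then solve the small p-by-p system; here p = 2 and the 2x2 system is solved by Cramer's rule for brevity.

```java
// Minimal normal-equations sketch for p = 2 (no intercept, assumes X^T X nonsingular).
public class NormalEquationsSketch {
    public static double[] fit(double[][] X, double[] y) {
        double a = 0, b = 0, d = 0, u = 0, v = 0;   // XtX = [a b; b d], Xty = [u; v]
        for (int i = 0; i < X.length; i++) {
            a += X[i][0] * X[i][0];
            b += X[i][0] * X[i][1];
            d += X[i][1] * X[i][1];
            u += X[i][0] * y[i];
            v += X[i][1] * y[i];
        }
        double det = a * d - b * b;                 // Cramer's rule on the 2x2 system
        return new double[] { (u * d - v * b) / det, (a * v - b * u) / det };
    }

    public static void main(String[] args) {
        double[][] X = { {1, 1}, {2, 1}, {3, 2}, {4, 3} };
        double[] y = new double[X.length];
        for (int i = 0; i < X.length; i++) y[i] = 2 * X[i][0] + 3 * X[i][1]; // exact model
        double[] B = fit(X, y);
        System.out.println(B[0] + " " + B[1]);      // recovers coefficients 2 and 3
    }
}
```

The accumulation loop only touches one row at a time and its results are small sums, which is exactly why this part can be mapped across partitions and added up in a reducer.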
Solution 3 (Optimization)
$\min_B \|y - X B\|^2$
Iterates over the data
Final answer is an approximation to the solution
Can add constraints to variables (LASSO)
Linear Regression Example
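Solution 3's iterative idea can be sketched as plain gradient descent in Java. This is an illustration of the principle only, not MLlib's actual SGD/L-BFGS code: every step needs a full pass over the data, which is why MLlib distributes the gradient computation with map/reduce and repeats.

```java
// Gradient descent on (1/n) * ||X B - y||^2; returns an approximate solution.
public class GradientDescentSketch {
    public static double[] fit(double[][] X, double[] y, double lr, int iters) {
        int n = X.length, p = X[0].length;
        double[] B = new double[p];                        // start from zero
        for (int it = 0; it < iters; it++) {
            double[] grad = new double[p];
            for (int i = 0; i < n; i++) {                  // one full pass per step
                double r = -y[i];                          // residual X B - y for row i
                for (int j = 0; j < p; j++) r += X[i][j] * B[j];
                for (int j = 0; j < p; j++) grad[j] += 2 * r * X[i][j] / n;
            }
            for (int j = 0; j < p; j++) B[j] -= lr * grad[j];
        }
        return B;                                          // approximation, not exact
    }
}
```

Unlike the normal-equations route, the answer here depends on the learning rate and iteration count, matching the slide's point that the final answer is only an approximation.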
Machine Learning/Optimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
K-means
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
MLlib Algorithms
[Diagram: Master node holding the libraries, connected to slave nodes]
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Use Normal Equations (Solution 2) to compute Linear Regression on a large dataset
NAG Libraries were distributed across Master/Slave Nodes
Steps:
1. Read in data
2. Call routine to compute sum-of-squares matrix on partitions (chunks)
   Default partition size is 64 MB
3. Call routine to aggregate the SSQ matrices together
4. Call routine to compute optimal coefficients
Example on Spark
Linear Regression Example
[Diagram: Master and slave nodes each making a LIBRARY CALL]
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rdd = sc.parallelize(data).cache()
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
LIBRARY_CALL(data)
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Spark Java Transformations:
map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD
flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Spark Java Transformations:
foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD
mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
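Inside a mapPartitions callback, the work before the native call boils down to draining the partition's Iterator into one contiguous, column-major array that a compiled library can consume in a single call. A Spark-free sketch (hypothetical helper name, not NAG code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Buffer a partition's rows, then lay them out column-major:
// x[j * n + i] holds variable j of observation i, so each column is contiguous.
public class ContiguousBlockSketch {
    public static double[] toColumnMajor(Iterator<double[]> rows, int numvars) {
        List<double[]> buf = new ArrayList<>();
        while (rows.hasNext()) buf.add(rows.next());   // drain the iterator first
        int n = buf.size();
        double[] x = new double[n * numvars];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < numvars; j++)
                x[j * n + i] = buf.get(i)[j];
        return x;
    }
}
```

This is why mapPartitions fits better than map here: one library call per partition on a contiguous block, instead of one call per row.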
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partitioned Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    ...
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw) IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
static class ComputeSSQ implements
        FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (int) g02bz.getXSW();
        return data1;
    }
}
CombineData (Reducer)
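For a single variable, the pooling rule the reducer relies on (which g02bz generalizes to many variables and weights) is the standard parallel combination of counts, means and sums of squared deviations. An illustrative helper, not the NAG routine:

```java
// Combine two partitions' (count, mean, sum of squared deviations about the mean)
// into pooled statistics, without revisiting the raw data.
public class PoolStatsSketch {
    public static double[] combine(double n1, double m1, double ssq1,
                                   double n2, double m2, double ssq2) {
        double n = n1 + n2;
        double delta = m2 - m1;
        double mean = m1 + delta * n2 / n;                       // pooled (weighted) mean
        double ssq = ssq1 + ssq2 + delta * delta * n1 * n2 / n;  // pairwise update rule
        return new double[] { n, mean, ssq };
    }
}
```

Because the rule is associative, Spark's reduce can apply it in any combination order across partitions and still produce the full-dataset statistics.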
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Test 1
Data ranges in size from 2 GB - 64 GB on Amazon EC2 Cluster
Used an 8-slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv
Problems Encountered (1)
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32- vs 64-bit integers
Some algorithms don't work:
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
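The 32- vs 64-bit integer point is easy to reproduce in plain Java: the element count from the earlier malloc example overflows 32-bit arithmetic, which matters whenever a numerical library API takes int sizes.

```java
// An element count that fits comfortably in a long wraps around in 32-bit int math.
public class IntOverflowSketch {
    public static void main(String[] args) {
        int rows = 10_000_000, cols = 10_000;
        long correct = (long) rows * cols;   // 64-bit product: 100,000,000,000
        int wrapped = rows * cols;           // 32-bit product wraps to 1,215,752,192
        System.out.println(correct);
        System.out.println(wrapped);
    }
}
```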
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks!
For more information:
Email: brian.spector@nag.com
Check out The NAG Blog:
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 2Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across workers
How to use existing libraries on Spark
Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 5
The Numerical Algorithms Group
Founded in 1970 as a co-operative project out of academia in UK
Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines
NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)
Numerical Library
HPCServices
Numerical Services
Commercial in Confidence 6
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation amp Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 5
The Numerical Algorithms Group
Founded in 1970 as a co-operative project out of academia in UK
Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines
NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)
Numerical Library
HPCServices
Numerical Services
Commercial in Confidence 6
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation amp Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
General Problem
Given a set of input measurements x1, x2, …, xp and an outcome measurement y, fit a linear model
y = X B + ε
where y = (y1, …, yn)^T and X is the n-by-p matrix
X = [ x11 … x1p ]
    [  ⋮  ⋱  ⋮  ]
    [ xn1 … xnp ]
Three ways of solving the same problem
Linear Regression Example
Solution 1 (Optimal Solution)
Factor X = Q R*, where R* = [ R ]
                            [ 0 ]
R is p-by-p upper triangular and Q is orthogonal.
Then solve R B = c1, where c = Q^T y …
… lots of linear algebra,
but we have an algorithm
Linear Regression Example
Solution 2 (Normal Equations)
X^T X B = X^T y
If we have 3 independent variables:
X = [ 1 2 3 ]    y = [ 7 ]
    [ ⋮ ⋮ ⋮ ]        [ ⋮ ]
    [ 4 5 6 ]        [ 8 ]
Linear Regression Example
X^T X = [ 1 ⋯ 4 ] [ 1 2 3 ]   [ z11 z12 z13 ]
        [ 2 ⋯ 5 ] [ ⋮ ⋮ ⋮ ] = [ z21 z22 z23 ]
        [ 3 ⋯ 6 ] [ 4 5 6 ]   [ z31 z32 z33 ]
B = (X^T X)^(-1) X^T y
This computation can be mapped out to slave nodes
Linear Regression Example
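The reason the normal-equations approach distributes so well is that cross-product matrices are additive over row chunks: X^T X computed chunk by chunk and summed equals X^T X of the full data. A minimal sketch of that fact in plain Java (hypothetical class and method names, not NAG code):

```java
import java.util.Arrays;

// Demo: the cross-product matrix X^T X of the full data equals the sum
// of the cross-product matrices of its row chunks, which is why each
// slave can process just its own partition.
public class NormalEquationsDemo {

    // Dense p x p cross-product matrix X^T X for an n x p chunk.
    static double[][] crossProduct(double[][] x) {
        int p = x[0].length;
        double[][] c = new double[p][p];
        for (double[] row : x)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < p; k++)
                    c[j][k] += row[j] * row[k];
        return c;
    }

    // Element-wise sum of two p x p matrices (the "reduce" step).
    static double[][] add(double[][] a, double[][] b) {
        double[][] s = new double[a.length][a.length];
        for (int j = 0; j < a.length; j++)
            for (int k = 0; k < a.length; k++)
                s[j][k] = a[j][k] + b[j][k];
        return s;
    }

    public static void main(String[] args) {
        double[][] full   = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
        double[][] chunk1 = {{1, 2}, {3, 4}};   // partition on slave 1
        double[][] chunk2 = {{5, 6}, {7, 8}};   // partition on slave 2

        double[][] whole    = crossProduct(full);
        double[][] combined = add(crossProduct(chunk1), crossProduct(chunk2));

        if (!Arrays.deepEquals(whole, combined))
            throw new AssertionError("partial cross-products should sum to the full X^T X");
        System.out.println("X^T X = " + Arrays.deepToString(whole));
    }
}
```

Once the summed X^T X (and X^T y) are on the master, solving for B is a small p-by-p problem, however large n was.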
Solution 3 (Optimization)
min ||y − X B||^2
Iterates over the data
The final answer is an approximation to the solution
Can add constraints to the variables (LASSO)
Linear Regression Example
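A minimal sketch of the iterative idea, assuming plain gradient descent on the least-squares objective (hypothetical names; this is an illustration, not the MLlib implementation):

```java
// Sketch of Solution 3: take repeated gradient steps on
// f(B) = ||y - X B||^2 instead of solving in closed form.
// Each iteration is one pass over the data, and the answer
// is only an approximation that improves with iterations.
public class GradientDescentLS {

    public static void main(String[] args) {
        // Tiny consistent system with exact solution B = (1, 2): y = x1 + 2*x2
        double[][] x = {{1, 0}, {0, 1}, {1, 1}};
        double[]   y = {1, 2, 3};
        double[]   b = {0, 0};          // initial guess
        double     step = 0.1;

        for (int it = 0; it < 500; it++) {
            double[] grad = new double[2];
            for (int i = 0; i < x.length; i++) {   // one pass over the data
                double r = x[i][0] * b[0] + x[i][1] * b[1] - y[i];
                grad[0] += 2 * r * x[i][0];
                grad[1] += 2 * r * x[i][1];
            }
            b[0] -= step * grad[0];
            b[1] -= step * grad[1];
        }
        if (Math.abs(b[0] - 1) > 1e-6 || Math.abs(b[1] - 2) > 1e-6)
            throw new AssertionError("did not converge close enough");
        System.out.println("B ~ (" + b[0] + ", " + b[1] + ")");
    }
}
```

The trade-off versus Solutions 1 and 2 is visible here: no matrix factorization is needed, but convergence takes many passes and the result is only approximate.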
Machine Learning/Optimization
MLlib uses Solution 3 for many applications:
Classification
Regression
Gaussian Mixture Model
K-means
Slave nodes are used for callbacks to access data
Map, Reduce, Repeat
MLlib Algorithms
MLlib Algorithms
[diagram: Master node holding the libraries, connected to slave nodes]
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Use Normal Equations (Solution 2) to compute Linear Regression on a large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chunks)
Default partition size is 64 MB
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rdd.parallelize(data).cache()
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
LIBRARY_CALL(data)
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Spark Java Transformations:
map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD
flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Spark Java Transformations:
foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD
mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
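The packing step that mapPartitions enables can be sketched without Spark: drain the partition's Iterator, then copy the rows into one contiguous column-major array of the kind a compiled library expects (hypothetical class and method names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Demo: turn a partition's Iterator of rows into a single contiguous
// column-major double[], mirroring what a mapPartitions function can
// do before handing the block to a native-style library call.
public class PackPartition {

    static double[] toColumnMajor(Iterator<double[]> iter, int numvars) {
        List<double[]> rows = new ArrayList<double[]>();
        while (iter.hasNext()) rows.add(iter.next());   // drain the iterator
        int length = rows.size();
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++)
            for (int j = 0; j < numvars; j++)
                x[j * length + i] = rows.get(i)[j];     // column j is contiguous
        return x;
    }

    public static void main(String[] args) {
        List<double[]> partition = Arrays.asList(
            new double[]{1, 2, 3},
            new double[]{4, 5, 6});
        double[] packed = toColumnMajor(partition.iterator(), 3);
        // Column-major layout: column (1, 4), then (2, 5), then (3, 6).
        if (!Arrays.equals(packed, new double[]{1, 4, 2, 5, 3, 6}))
            throw new AssertionError("unexpected layout");
        System.out.println(Arrays.toString(packed));
    }
}
```

Because mapPartitions hands each task the whole partition at once, this copy happens once per partition rather than once per element.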
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partitioned Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    …
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw): IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                   data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (g02bz.getXSW());
        return data1;
    }
}
CombineData (Reducer)
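What the reducer has to do can be illustrated with plain statistics: counts, means, and sums of squares about the mean from two chunks can be merged exactly without revisiting the raw data. A minimal sketch (hypothetical names; the update is the standard pooled formula, not NAG code):

```java
// Demo: merge (count, mean, sum of squares about the mean) from two
// chunks into the statistics of the combined data -- the kind of
// exact update a combine/reduce routine performs.
public class CombineStats {

    // Returns {n, mean, ssq} for the union of the two chunks.
    static double[] combine(double n1, double m1, double s1,
                            double n2, double m2, double s2) {
        double n = n1 + n2;
        double d = m2 - m1;
        double mean = m1 + n2 * d / n;
        double ssq  = s1 + s2 + n1 * n2 * d * d / n;   // pooled update
        return new double[]{n, mean, ssq};
    }

    public static void main(String[] args) {
        // chunk 1 = {1, 2, 3}: mean 2, ssq 2;  chunk 2 = {4, 5}: mean 4.5, ssq 0.5
        double[] r = combine(3, 2.0, 2.0, 2, 4.5, 0.5);
        // full data {1, 2, 3, 4, 5}: mean 3, ssq 10
        if (r[0] != 5 || Math.abs(r[1] - 3.0) > 1e-12 || Math.abs(r[2] - 10.0) > 1e-12)
            throw new AssertionError("merge disagrees with full-data statistics");
        System.out.println("n=" + r[0] + " mean=" + r[1] + " ssq=" + r[2]);
    }
}
```

Because the merge is exact and associative, the reduce step can combine partitions in any order and still reproduce the full-data statistics.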
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Test 1
Data ranged in size from 2 GB to 64 GB on an Amazon EC2 cluster
Used an 8-slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 to 16
Used 16 GB of data to see how the algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq))
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv
Problems Encountered (1)
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
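One way the singular-matrix problem shows up: if one column of X is an exact multiple of another, X^T X has determinant zero and the normal equations cannot be solved by inversion. A minimal check (hypothetical names):

```java
// Demo: with collinear input columns (here x2 = 2 * x1) the
// cross-product matrix X^T X is singular, so (X^T X)^(-1) does
// not exist and Solution 2 breaks down.
public class SingularCheck {

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {2, 4}, {3, 6}};   // second column = 2 * first
        double[][] c = new double[2][2];            // X^T X
        for (double[] row : x)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++)
                    c[j][k] += row[j] * row[k];
        double det = c[0][0] * c[1][1] - c[0][1] * c[1][0];
        if (det != 0.0) throw new AssertionError("expected a singular X^T X");
        System.out.println("det(X^T X) = " + det + " -> inverse does not exist");
    }
}
```

In practice the data is rarely exactly collinear, but a nearly-singular X^T X is just as dangerous numerically, which is one argument for the QR route of Solution 1.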
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
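The integer-size problem is easy to reproduce: the 10,000,000 x 10,000 element count from the earlier malloc example does not fit in a 32-bit int, so sizes passed to a numerical library must be tracked in 64-bit integers. A minimal sketch (hypothetical names):

```java
// Demo: a Big Data element count overflows 32-bit int arithmetic;
// the same product in long arithmetic is correct.
public class IntOverflow {

    public static void main(String[] args) {
        int  bad  = 10000000 * 10000;     // 32-bit multiply wraps around
        long good = 10000000L * 10000L;   // 64-bit multiply is correct
        if (bad == good) throw new AssertionError("expected 32-bit overflow");
        System.out.println("int: " + bad + "  long: " + good);
    }
}
```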
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free
Huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks
For more information:
Email: brian.spector@nag.com
Check out The NAG Blog:
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 5
The Numerical Algorithms Group
Founded in 1970 as a co-operative project out of academia in UK
Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines
NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)
Numerical Library
HPCServices
Numerical Services
Commercial in Confidence 6
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation amp Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 5
The Numerical Algorithms Group
Founded in 1970 as a co-operative project out of academia in UK
Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines
NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)
Numerical Library
HPCServices
Numerical Services
Commercial in Confidence 6
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation amp Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations
map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD
flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
[Diagram: Slave 2's rows gathered into one contiguous block]
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations
foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD
mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
[Diagram: Slave 2's rows gathered into one contiguous block]
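Spark itself isn't needed to see why the per-partition transformations matter here. A minimal plain-Python sketch (the names `map_elements` and `map_partitions` are ours, not Spark's): `map` runs the function once per element, while `mapPartitions` runs it once per partition, so the cost of packing rows into one contiguous array for a library call is paid once per block rather than once per row.

```python
def map_elements(f, partitions):
    # Like RDD.map: apply f to every element of every partition.
    return [[f(row) for row in part] for part in partitions]

def map_partitions(f, partitions):
    # Like RDD.mapPartitions: apply f once to each partition's iterator.
    return [f(iter(part)) for part in partitions]

partitions = [
    [[68.0509, 42.3], [65.1515, 47.3]],   # "Slave 1"
    [[78.5641, 53.2]],                    # "Slave 2"
]

per_element = map_elements(len, partitions)                      # one result per row
per_partition = map_partitions(lambda it: sum(1 for _ in it),    # one result per block
                               partitions)
print(per_element)    # [[2, 2], [2]]
print(per_partition)  # [2, 1]
```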
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
[Diagram: mapPartitions turns each slave's iterator of rows into a single contiguous block on that slave]
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partitioned Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    ...
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw), IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
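The whole map/reduce flow above can be imitated without Spark or the NAG routines. A NumPy sketch of the same normal-equations idea (an illustrative stand-in, not NAG's implementation): each "slave" forms its local cross-products, the "reducer" sums them, and the master solves for the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1000)

# "Map": each partition computes its local cross-products X_i'X_i and X_i'y.
parts = np.array_split(np.arange(1000), 4)          # 4 "slaves"
partials = [(X[p].T @ X[p], X[p].T @ y[p]) for p in parts]

# "Reduce": cross-products are additive across partitions.
xtx = sum(a for a, _ in partials)
xty = sum(b for _, b in partials)

# Master solves the normal equations X'X B = X'y.
B = np.linalg.solve(xtx, xty)
B_direct = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(B, B_direct)   # same answer as solving on the full data
```

The key property is that the sums-of-squares matrix is additive over row blocks, which is exactly what makes Solution 2 map cleanly onto partitions.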
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
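The index arithmetic `x[j * length + i]` packs the partition column by column, since the NAG routine expects a column-major (Fortran-order) array. A NumPy stand-in for what the mapper produces (g02bu itself isn't called; the means and SSQ here are the standard definitions, and the sample rows are from the example table):

```python
import numpy as np

# Rows as they arrive from the partition iterator: (features, label).
rows = [([18.0, 0.0, 22.2, 1.0], 0.0),
        ([42.3, 12.1, 23.2, 1.0], 68.0509),
        ([47.3, 7.4, 16.8, 0.0], 65.1515)]

length, numvars = len(rows), len(rows[0][0]) + 1
x = np.empty(length * numvars)
for i, (features, label) in enumerate(rows):
    for j, v in enumerate(features):
        x[j * length + i] = v                 # column-major packing
    x[(numvars - 1) * length + i] = label     # label stored as the last column

cols = x.reshape(numvars, length).T           # recover the length-by-numvars matrix
means = cols.mean(axis=0)
ssq = (cols - means).T @ (cols - means)       # sums of squares and cross-products
```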
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                   data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (int) g02bz.getXSW();
        return data1;
    }
}
CombineData (Reducer)
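The reducer's job, merging two partitions' counts, means, and sums-of-squares matrices, can be checked against the standard pairwise-update formula (this is the textbook formula, not NAG's internal code for g02bz):

```python
import numpy as np

def combine(n1, mean1, ssq1, n2, mean2, ssq2):
    """Merge two partitions' counts, means, and centered cross-product matrices."""
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    # Pairwise update: the cross term corrects for the two groups' mean offset.
    ssq = ssq1 + ssq2 + np.outer(delta, delta) * n1 * n2 / n
    return n, mean, ssq

def stats(m):
    mu = m.mean(axis=0)
    return len(m), mu, (m - mu).T @ (m - mu)

rng = np.random.default_rng(1)
a, b = rng.normal(size=(50, 3)), rng.normal(size=(70, 3))

n, mean, ssq = combine(*stats(a), *stats(b))
full = np.vstack([a, b])
assert n == 120
assert np.allclose(mean, full.mean(axis=0))
assert np.allclose(ssq, (full - full.mean(axis=0)).T @ (full - full.mean(axis=0)))
```

Because the update is associative, Spark can apply it in any reduction order across partitions and still get the same final means and SSQ matrix.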
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
[Diagram: each slave's partial results are reduced and returned to the Master]
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB to 64 GB on an Amazon EC2 cluster
Used an 8-slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 to 16
Used 16 GB of data to see how the algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq))
Distributing libraries: spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label    X1    X2    X3    X4
 0.0000  18.0   0.0  22.2   1
 0.0000  18.0   0.0  81.1   0
68.0509  42.3  12.1  23.2   1
86.5650  47.3  19.5  71.9   2
65.1515  47.3   7.4  16.8   0
78.5641  53.2  11.4  17.4   1
56.5567  34.9  10.7  68.4   0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
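The 32- vs 64-bit integer issue is easy to reproduce: a row count and a column count that each fit comfortably in 32 bits can overflow when multiplied, e.g. inside a library's index arithmetic. A NumPy illustration with made-up sizes:

```python
import numpy as np

n_rows, n_cols = 100_000, 50_000              # made-up "big data" shape
total64 = np.int64(n_rows) * np.int64(n_cols)
assert total64 == 5_000_000_000               # fine with 64-bit ints

# With 32-bit ints (what a library built against 32-bit integers would use),
# the same product silently wraps around modulo 2**32:
with np.errstate(over="ignore"):
    total32 = np.int32(n_rows) * np.int32(n_cols)
assert total32 == 705_032_704                 # 5_000_000_000 - 2**32
```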
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email: brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 6
NAG Library Full Contents
Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation amp Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)
Algorithms take advantage of
AVXAVX2
Multi-core
LAPACKBLAS
MKL
Efficient Algorithms
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks
For more information
Email: brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
At NAG we have 40 years of algorithms
Linear Algebra Regression Optimization
Data is all in-memory; call a compiled library:
void nag_1d_spline_interpolant(Integer m, const double x[], const double y[], Nag_Spline *spline, NagError *fail)
Algorithms take advantage of
AVX/AVX2
Multi-core
LAPACK/BLAS
MKL
Efficient Algorithms
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python: numpy/scipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
…But what happens when it comes to Big Data?
x = (double *) malloc(10000000 * 10000 * sizeof(double));
Past efficient algorithms break down
Need different ways of solving the same problem
How do we use our existing functions (libraries) on Spark?
We must reformulate the problem
An example… (Warning: Mathematics Approaching)
Big Data
General Problem
Given a set of input measurements x_1, x_2, …, x_p and an
outcome measurement y, fit a linear model

    y = X B + ε

where y = (y_1, …, y_n)^T and X is the n-by-p matrix of observations (x_ij).

Three ways of solving the same problem
Linear Regression Example
Solution 1 (Optimal Solution)
Factor X = Q R*, where R* = [R; 0]

R is p-by-p upper triangular, Q orthogonal

Solve R B = c_1, where c = Q^T y …

… lots of linear algebra
but we have an algorithm
Linear Regression Example
Solution 2 (Normal Equations)
X^T X B = X^T y

If we have 3 independent variables:

    X = [1 2 3; … ; 4 5 6],   y = [7; … ; 8]
Linear Regression Example
X^T X = [1 … 4; 2 … 5; 3 … 6] · [1 2 3; … ; 4 5 6]
      = [z_11 z_12 z_13; z_21 z_22 z_23; z_31 z_32 z_33]

B = (X^T X)^(-1) X^T y

This computation can be mapped out to slave nodes
Linear Regression Example
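On a toy scale, Solution 2 can be sketched in a few lines of plain Java: accumulate X^T X and X^T y over the rows, then solve the 2x2 system by Cramer's rule. This is our illustration (names are ours, not a NAG routine):

```java
// Solve the 2-variable normal equations X^T X B = X^T y directly.
public class NormalEquationsDemo {
    // x: rows of the design matrix (2 columns); y: outcomes.
    static double[] solve(double[][] x, double[] y) {
        double a11 = 0, a12 = 0, a22 = 0, b1 = 0, b2 = 0;
        for (int i = 0; i < x.length; i++) {
            a11 += x[i][0] * x[i][0];   // accumulate X^T X
            a12 += x[i][0] * x[i][1];
            a22 += x[i][1] * x[i][1];
            b1  += x[i][0] * y[i];      // accumulate X^T y
            b2  += x[i][1] * y[i];
        }
        double det = a11 * a22 - a12 * a12;   // X^T X assumed nonsingular
        return new double[] { (b1 * a22 - b2 * a12) / det,
                              (a11 * b2 - a12 * b1) / det };
    }

    public static void main(String[] args) {
        double[][] x = { {1, 2}, {2, 1}, {3, 5}, {4, 3} };
        double[] y = { 8, 7, 21, 17 };        // exact model: B = (2, 3)
        double[] b = solve(x, y);
        System.out.println(b[0] + " " + b[1]);
    }
}
```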
Solution 3 (Optimization)
min ||y − X B||^2

Iterate over the data

Final answer is an approximation to the solution

Can add constraints to the variables (LASSO)
Linear Regression Example
Machine Learning / Optimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
K-means
Slave nodes are used for callbacks to access data
Map, Reduce, Repeat
MLlib Algorithms
MLlib Algorithms
[Diagram: Master (with libraries) coordinating slave nodes]
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chunks)
Default partition size is 64 MB
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
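The steps above can be simulated in a single process: each "partition" is reduced to a small cross-products summary (X^T X and X^T y), the summaries are added together, and the coefficients come from the combined summary. Only the small matrices travel between nodes, never the raw rows. A sketch under our own names (not the NAG g02 routines):

```java
import java.util.Arrays;
import java.util.List;

public class PartitionedSSQDemo {
    // Step 2: reduce one partition's rows {x1, x2, y} to a cross-products
    // summary, laid out as {a11, a12, a22, b1, b2}.
    static double[] summarize(List<double[]> rows) {
        double[] s = new double[5];
        for (double[] r : rows) {
            s[0] += r[0] * r[0];    // X^T X entries
            s[1] += r[0] * r[1];
            s[2] += r[1] * r[1];
            s[3] += r[0] * r[2];    // X^T y entries
            s[4] += r[1] * r[2];
        }
        return s;
    }

    // Step 3: summaries simply add, regardless of how rows were partitioned.
    static double[] combine(double[] a, double[] b) {
        double[] s = new double[5];
        for (int i = 0; i < 5; i++) s[i] = a[i] + b[i];
        return s;
    }

    // Step 4: solve the 2x2 normal equations from the combined summary.
    static double[] coefficients(double[] s) {
        double det = s[0] * s[2] - s[1] * s[1];
        return new double[] { (s[3] * s[2] - s[4] * s[1]) / det,
                              (s[0] * s[4] - s[1] * s[3]) / det };
    }

    public static void main(String[] args) {
        List<double[]> part1 = Arrays.asList(new double[]{1, 2, 8},
                                             new double[]{2, 1, 7});
        List<double[]> part2 = Arrays.asList(new double[]{3, 5, 21},
                                             new double[]{4, 3, 17});
        double[] b = coefficients(combine(summarize(part1), summarize(part2)));
        System.out.println(b[0] + " " + b[1]);   // recovers B = (2, 3)
    }
}
```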
Linear Regression Example
[Diagram: Master driving the three LIBRARY CALLs]
Linear Regression Example Data
Label      X1    X2    X3   X4
0.0       18.0   0.0  22.2   1
0.0       18.0   0.0  81.1   0
68.0509   42.3  12.1  23.2   1
86.5650   47.3  19.5  71.9   2
65.1515   47.3   7.4  16.8   0
78.5641   53.2  11.4  17.4   1
56.5567   34.9  10.7  68.4   0
The Data (in Spark)
[Diagram: the full table on the Master is partitioned and cached across Slave 1 and Slave 2]

rdd.parallelize(data).cache()
The Data (in Spark)
[Diagram: each slave calls the library on its local partition]

LIBRARY_CALL(data)
The Data (in Spark)
[Diagram: the partitions on Slave 1 and Slave 2, before and after]
Spark Java Transformations

map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD

flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
[Diagram: a partition's rows gathered into one contiguous block on Slave 2]
Spark Java Transformations

foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD

mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
[Diagram: a partition's rows gathered into one contiguous block on Slave 2]
The Data (in Spark)
[Diagram: mapPartitions applied to the partitions on Slave 1 and Slave 2]
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partition Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    …
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw): IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
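The packing the mapper performs, buffering the iterator and laying features out column-major with x[j * length + i] so a compiled library sees one contiguous Fortran-style block, can be illustrated without Spark or NAG; names here are ours:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ColumnMajorPack {
    // Buffer the partition's rows, then pack them column-major with the
    // label appended as the final column. Each row is {f0, ..., f(n-1), label}.
    static double[] pack(Iterator<double[]> iter, int numFeatures) {
        List<double[]> rows = new ArrayList<>();
        while (iter.hasNext()) rows.add(iter.next());
        int length = rows.size();
        double[] x = new double[length * (numFeatures + 1)];
        for (int i = 0; i < length; i++) {
            double[] row = rows.get(i);
            for (int j = 0; j < numFeatures; j++)
                x[j * length + i] = row[j];           // column j is contiguous
            x[numFeatures * length + i] = row[numFeatures];  // label column
        }
        return x;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(new double[]{1, 2, 9},
                                            new double[]{3, 4, 8});
        // Columns become contiguous: 1, 3 | 2, 4 | 9, 8
        System.out.println(Arrays.toString(pack(rows.iterator(), 2)));
    }
}
```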
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                   data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (g02bz.getXSW());
        return data1;
    }
}
CombineData (Reducer)
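The kind of pairwise update the reducer relies on can be sketched for a single variable: merging (count, mean, sum of squared deviations) summaries from two partitions reproduces the whole-data summary exactly, in any combination order. This is our illustration of the standard pairwise formula, not the g02bz interface:

```java
// Merge (count, mean, sum of squared deviations about the mean) summaries.
public class CombineStats {
    double n, mean, ssq;

    CombineStats(double n, double mean, double ssq) {
        this.n = n; this.mean = mean; this.ssq = ssq;
    }

    // Pairwise update: exact regardless of how the data was split.
    CombineStats combine(CombineStats o) {
        double total = n + o.n;
        double delta = o.mean - mean;
        double mergedMean = mean + delta * o.n / total;
        double mergedSsq = ssq + o.ssq + delta * delta * n * o.n / total;
        return new CombineStats(total, mergedMean, mergedSsq);
    }

    // Summarize one "partition" by folding in one observation at a time.
    static CombineStats summarize(double[] xs) {
        CombineStats s = new CombineStats(1, xs[0], 0);
        for (int i = 1; i < xs.length; i++)
            s = s.combine(new CombineStats(1, xs[i], 0));
        return s;
    }

    public static void main(String[] args) {
        CombineStats all = summarize(new double[]{1, 2, 3})
                .combine(summarize(new double[]{10, 20}));
        System.out.println(all.n + " " + all.mean + " " + all.ssq);
    }
}
```

Because the combine is associative, Spark's reduce can apply it in whatever order partitions finish.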
The Data (in Spark)
[Diagram: per-partition results on Slave 1 and Slave 2 reduced back to the Master]
Test 1
Data ranges in size from 2 GB to 64 GB on an Amazon EC2 cluster
Used an 8-slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 to 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence
Modern programming languages and compilers (hopefully) take care of some efficiencies for us
Python numpyscipy
Compiler flags (-O2)
Users worry less about efficient computing and more about solving their problem
Efficient Algorithms
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence
hellipBut what happens when it comes to Big Data
x = (double) malloc(10000000 10000 sizeof(double))
Past efficient algorithms break down
Need different ways of solving same problem
How do we use our existing functions (libraries) on Spark
We must reformulate the problem
An examplehellip (Warning Mathematics Approaching)
Big Data
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partition Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    …
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw) IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
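ComputeSSQ's indexing (x[j * length + i]) stores the partition column-major: all values of variable j sit contiguously, which is the layout Fortran-style routines such as g02bu expect. A stand-alone sketch of just that packing step (data values and names hypothetical):

```java
import java.util.Arrays;

public class ColumnMajorPack {
    // rows[i][j]: observation i, variable j; label[i] becomes the last column.
    static double[] pack(double[][] rows, double[] label) {
        int length = rows.length;
        int numvars = rows[0].length + 1;        // features plus the label
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = rows[i][j];  // column-major slot
            x[(numvars - 1) * length + i] = label[i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] rows = { {1.0, 2.0}, {3.0, 4.0} };
        double[] x = pack(rows, new double[]{9.0, 8.0});
        // Each column is contiguous: var 0, var 1, then the label column.
        System.out.println(Arrays.toString(x));  // [1.0, 3.0, 2.0, 4.0, 9.0, 8.0]
    }
}
```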
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                   data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (int) g02bz.getXSW();
        return data1;
    }
}
CombineData (Reducer)
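CombineData leans on g02bz to merge two partial summaries (counts, means, sums of squares about the mean). Setting the NAG interface aside, the per-variable arithmetic behind such a merge is the standard pairwise update, which is exact and safe to use inside reduce(); this sketch is ours, not the NAG routine:

```java
public class CombineStats {
    long n; double mean; double ssq;   // count, mean, sum of squared deviations

    CombineStats(long n, double mean, double ssq) {
        this.n = n; this.mean = mean; this.ssq = ssq;
    }

    // Pairwise update: merge two partitions' summaries without revisiting rows.
    static CombineStats combine(CombineStats a, CombineStats b) {
        long n = a.n + b.n;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.n / n;
        double ssq = a.ssq + b.ssq + delta * delta * a.n * b.n / (double) n;
        return new CombineStats(n, mean, ssq);
    }

    public static void main(String[] args) {
        // Partition 1 holds {1, 2}; partition 2 holds {3, 4, 5}.
        CombineStats p1 = new CombineStats(2, 1.5, 0.5);
        CombineStats p2 = new CombineStats(3, 4.0, 2.0);
        CombineStats all = combine(p1, p2);
        System.out.println(all.n + " " + all.mean + " " + all.ssq); // 5 3.0 10.0
    }
}
```

The merged values match a single pass over all five points, which is exactly why a reducer like this can be applied in any grouping order.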
The Data (in Spark)
[Diagram: the per-partition results from Slave 1 and Slave 2 are reduced back to the Master]
Test 1
Data ranged in size from 2 GB to 64 GB on an Amazon EC2 cluster
Used an 8-slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 to 16
Used 16 GB of data to see how the algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Java Spark documentation
return Arrays.asList(new DataClass(length, means, ssq));
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv
Problems Encountered (1)
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label     X1    X2    X3    X4
0.0       18.0  0.0   22.2  1
0.0       18.0  0.0   81.1  0
68.0509   42.3  12.1  23.2  1
86.5650   47.3  19.5  71.9  2
65.1515   47.3  7.4   16.8  0
78.5641   53.2  11.4  17.4  1
56.5567   34.9  10.7  68.4  0
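The "inverse of a singular matrix" failure bites whenever one column of X is (nearly) a linear combination of the others: then X^T X in the normal equations has no inverse. A two-column illustration (values hypothetical, helper name ours):

```java
public class SingularNormalEquations {
    // Gram matrix X^T X for a two-column X, reduced to its 2x2 determinant.
    static double gramDet(double[][] X) {
        double[][] g = new double[2][2];
        for (double[] row : X)
            for (int a = 0; a < 2; a++)
                for (int b = 0; b < 2; b++)
                    g[a][b] += row[a] * row[b];
        return g[0][0] * g[1][1] - g[0][1] * g[1][0];
    }

    public static void main(String[] args) {
        // Second column is exactly twice the first, so X has rank 1.
        double[][] X = { {1, 2}, {2, 4}, {3, 6} };
        System.out.println("det(X^T X) = " + gramDet(X)); // 0.0: no inverse exists
    }
}
```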
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32- vs 64-bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks
For more information:
Email: brian.spector@nag.com
Check out The NAG Blog:
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence
General Problem
Given a set of input measurements 11990911199092 hellip119909119901 and an
outcome measurement 119910 fit a linear model
119910 = 119883 119861 + ε
119910 =
1199101
⋮119910119901
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
Three ways of solving the same
problem
Linear Regression Example
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence
Solution 1 (Optimal Solution)
119883 =
11990911 ⋯ 1199091119901
⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901
= 119876119877lowast where 119877lowast =1198770
119877 is 119901 by 119901 Upper Triangular 119876 orthogonal
119877 119861 = 1198881 where 119888 = 119876119879119910 hellip
hellip lots of linear algebra
but we have an algorithm
Linear Regression Example
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence
Solution 2 (Normal Equations)
119883119879119883 119861 = 119883119879 119910
If we 3 independent variables
119883 =1 2 3⋮ ⋮ ⋮4 5 6
119910 =7⋮8
Linear Regression Example
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
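mapPartitions hands the library a whole partition at once, but the row-oriented points still have to be repacked into the contiguous, column-major array a Fortran-style routine expects (this is what the ComputeSSQ mapper on a later slide does). A minimal sketch of that repacking, with plain double[] rows standing in for Spark's LabeledPoint (class and method names here are illustrative, no Spark or NAG dependency):

```java
import java.util.Arrays;
import java.util.List;

// Pack `length` rows of (features..., label) into one column-major
// double[] of size length * numvars, so element (row i, column j)
// lands at x[j * length + i].
public class ColumnMajorPack {
    public static double[] pack(List<double[]> rows) {
        int length = rows.size();
        int numvars = rows.get(0).length;           // features + label
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++)
            for (int j = 0; j < numvars; j++)
                x[j * length + i] = rows.get(i)[j]; // column j, row i
        return x;
    }

    public static void main(String[] args) {
        // two rows (1,2) and (3,4): column 0 = {1,3}, column 1 = {2,4}
        double[] x = pack(Arrays.asList(new double[]{1, 2}, new double[]{3, 4}));
        System.out.println(Arrays.toString(x)); // [1.0, 3.0, 2.0, 4.0]
    }
}
```

Doing this once per partition, rather than once per element with map, is the reason the partition-level transformations matter here.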
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
    // guard on elements per partition (operator placement reconstructed from the slide)
    if (dataSize * numVars / datapoints.partitions().size() > 10000000)
        throw new NAGSparkException("Partition Data Further To Use NAG Routine");
    DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ()).reduce(new CombineData());
    ...
    G02BW g02bw = new G02BW();
    g02bw.eval(numVars, finalpt.ssq, ifail);
    if (g02bw.getIFAIL() > 0) {
        System.out.println("Error with NAG (g02bw) IFAIL = " + g02bw.getIFAIL());
        throw new NAGSparkException("Error from g02bw");
    }
}
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
        FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        // column-major array: features in the first numvars-1 columns, label last
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                   data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (int) g02bz.getXSW();
        return data1;
    }
}
CombineData (Reducer)
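The reducer leans on g02bz to merge two per-partition summaries into one. The arithmetic underneath is the standard pairwise update for means and sums of squares, which can be sketched for a single variable in plain Java (no NAG dependency; the (n, mean, ssq) triple mirrors the DataClass fields above but is otherwise illustrative):

```java
// Merge the summaries (n1, mean1, ssq1) and (n2, mean2, ssq2) of two
// disjoint blocks of one variable into the summary of their union:
//   mean = mean1 + delta * n2 / n
//   ssq  = ssq1 + ssq2 + delta^2 * n1 * n2 / n,  delta = mean2 - mean1
public class CombineStats {
    public static double[] combine(double n1, double mean1, double ssq1,
                                   double n2, double mean2, double ssq2) {
        double n = n1 + n2;
        double delta = mean2 - mean1;
        double mean = mean1 + delta * n2 / n;
        double ssq = ssq1 + ssq2 + delta * delta * n1 * n2 / n;
        return new double[]{n, mean, ssq};
    }

    public static void main(String[] args) {
        // block 1: {1, 2} -> n=2, mean=1.5, ssq=0.5
        // block 2: {3, 5} -> n=2, mean=4.0, ssq=2.0
        double[] r = combine(2, 1.5, 0.5, 2, 4.0, 2.0);
        // union {1, 2, 3, 5}: mean 2.75, ssq 8.75
        System.out.println(r[1] + " " + r[2]); // 2.75 8.75
    }
}
```

Because the update is associative, Spark is free to reduce the partition summaries in any order, which is exactly what makes this map/reduce decomposition of the normal-equations approach work.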
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB - 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq))
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - Can be set in code via sc.setExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
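The 32- vs 64-bit integer point above is easy to hit in practice: array sizes and index arithmetic like length * numvars silently wrap in Java's 32-bit int long before the data stops fitting in memory. A small illustrative sketch (the figures are made up to trigger the wrap):

```java
// Why 32-bit ints bite when sizing arrays for a numerical library on
// big data: the element count overflows int even though each factor fits.
public class IntOverflow {
    public static void main(String[] args) {
        int length = 600_000_000;           // rows handed to one library call
        int numvars = 5;
        int bad = length * numvars;         // wraps negative in 32-bit arithmetic
        long good = (long) length * numvars; // widen before multiplying
        System.out.println(bad + " vs " + good); // -1294967296 vs 3000000000
    }
}
```

Widening to long before the multiply (or checking Math.multiplyHigh-style bounds) is the usual fix, but a library whose interface takes 32-bit ints caps the problem size regardless.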
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations
Use existing R code on Spark
Free / Huge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email: brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence
119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6
1 2 3⋮ ⋮ ⋮4 5 6
=
11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133
119861 = 119883119879119883 minus1119883119879 119910
This computation can be map out to slave nodes
Linear Regression Example
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence
Solution 3 (Optimization)
min 119910 minus 119883 1198612
Iterative over the data
Final answer is an approximate to
the solution
Can add constraints to variables
(LASSO)
Linear Regression Example
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence
Machine LearningOptimization
MLlib uses solution 3 for many applications
Classification
Regression
Gaussian Mixture Model
Kmeans
Slave nodes are used for callbacks to access data
Map Reduce Repeat
MLlib Algorithms
Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence
MLlib Algorithms
MasterLibraries
Slaves Slaves
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
ComputeSSQ (Mapper)

static class ComputeSSQ implements
        FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
    @Override
    public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
        List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
        while (iter.hasNext()) mypoints.add(iter.next());
        int length = mypoints.size();
        int numvars = mypoints.get(0).features().size() + 1;
        double[] x = new double[length * numvars];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < numvars - 1; j++)
                x[j * length + i] = mypoints.get(i).features().apply(j);
            x[(numvars - 1) * length + i] = mypoints.get(i).label();
        }
        G02BU g02bu = new G02BU();
        g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
        return Arrays.asList(new DataClass(length, means, ssq));
    }
}
CombineData (Reducer)

static class CombineData implements Function2<DataClass, DataClass, DataClass> {
    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = (int) g02bz.getXSW();
        return data1;
    }
}
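The reducer above leans on a NAG routine (g02bz) to pool two partitions' partial results. The underlying update for combining counts, means, and sums of squares about the mean is standard; a hand-rolled single-variable version (illustrative only - the real code pools full matrices) looks like:

```java
public class PooledStats {
    final long n; final double mean; final double m2;  // m2 = sum of squares about the mean

    PooledStats(long n, double mean, double m2) { this.n = n; this.mean = mean; this.m2 = m2; }

    // Combine two partial summaries (pairwise update of Chan et al.).
    static PooledStats combine(PooledStats a, PooledStats b) {
        long n = a.n + b.n;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.n / n;
        double m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / (double) n;
        return new PooledStats(n, mean, m2);
    }

    // Summarise one block of raw values (Welford's online update).
    static PooledStats of(double[] xs) {
        long n = 0; double mean = 0, m2 = 0;
        for (double x : xs) {
            n++;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
        }
        return new PooledStats(n, mean, m2);
    }

    public static void main(String[] args) {
        PooledStats all = combine(of(new double[]{1, 2, 3}), of(new double[]{4, 5}));
        System.out.println(all.n + " " + all.mean + " " + all.m2);
    }
}
```

Because the combine step is associative, Spark's reduce can apply it in any order across partitions and still recover the summary of the full dataset.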
The Data (in Spark)
[Slide diagram: the per-partition summaries on Slave 1 and Slave 2 are reduced back to the Master]
Test 1
Data ranged in size from 2 GB to 64 GB on an Amazon EC2 cluster
Used an 8-slave xlarge cluster (16 GB RAM per node)
Test 2
Varied the number of slave nodes from 2 to 16
Used 16 GB of data to see how the algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv
Problems Encountered (1)
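Both settings can also be supplied from the driver program rather than on the command line. A minimal configuration sketch (the paths are placeholders, and the application would go on to build RDDs and call NAG routines as in the earlier slides):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class NagSparkSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("simpleNAGExample")
                // Native NAG libraries must be on each worker's library path.
                .set("spark.executor.extraLibraryPath", "/path/to/nag/libs")        // placeholder
                // NAG license file location, exported to every executor's environment.
                .setExecutorEnv("NAG_KUSARI_FILE", "/path/to/license/nag.key");     // placeholder
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and call NAG routines here ...
        sc.stop();
    }
}
```

This is a configuration fragment: it compiles only with Spark on the classpath and does nothing useful until the RDD pipeline is filled in.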
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label    X1    X2    X3    X4
0.0      18.0   0.0  22.2  1
0.0      18.0   0.0  81.1  0
68.0509  42.3  12.1  23.2  1
86.5650  47.3  19.5  71.9  2
65.1515  47.3   7.4  16.8  0
78.5641  53.2  11.4  17.4  1
56.5567  34.9  10.7  68.4  0
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
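The integer-width issue above is easy to hit: Java arrays are indexed by 32-bit int, so the product of two perfectly legal sizes silently overflows unless it is widened first. A minimal demonstration:

```java
public class IntOverflow {
    public static void main(String[] args) {
        int rows = 600_000_000;          // plausible "big data" row count
        int cols = 5;
        int bad = rows * cols;           // overflows: wraps to a negative value
        long good = (long) rows * cols;  // widen before multiplying
        System.out.println(bad);         // prints -1294967296
        System.out.println(good);        // prints 3000000000
    }
}
```

A 64-bit-integer build of a numerical library avoids this, but any int-typed length flowing from Java into the library call still needs checking against these limits.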
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free, with a huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks
For more information
Email: brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence
Use Normal Equations (Solution 2) to compute Linear Regression on large dataset
NAG Libraries were distributed across MasterSlave Nodes
Steps
1 Read in data
2 Call routine to compute sum-of-squares matrix on partitions (chucks)
Default partition size is 64mb
3 Call routine to aggregate SSQ matrix together
4 Call routine to compute optimal coefficients
Example on Spark
Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence
Linear Regression Example
Master
LIBRARY CALL
LIBRARY CALL
LIBRARY CALL
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44
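The "MapReduce existing functions" route only works when per-partition results can be combined exactly. For means and sums of squares about the mean (the g02bu/g02bz pattern used in the linear regression example), the pairwise update can be sketched in plain Java; this is an illustrative re-implementation of the combining step, not the NAG routine itself:

```java
public class CombineStats {
    // Per-partition summary: count, mean, and sum of squared deviations from the mean.
    static final class Summary {
        final long n;
        final double mean;
        final double ssq;
        Summary(long n, double mean, double ssq) {
            this.n = n; this.mean = mean; this.ssq = ssq;
        }
    }

    // Exact pairwise combination of two partition summaries
    // (the standard Chan et al. update; associative enough for a reduce).
    static Summary combine(Summary a, Summary b) {
        long n = a.n + b.n;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.n / n;
        double ssq = a.ssq + b.ssq + delta * delta * a.n * b.n / n;
        return new Summary(n, mean, ssq);
    }
}
```

For example, combining the summary of {1, 2, 3} (n=3, mean=2, ssq=2) with that of {4, 5} (n=2, mean=4.5, ssq=0.5) yields n=5, mean=3, ssq=10, which matches computing over {1, 2, 3, 4, 5} directly; this is exactly the property a Spark `mapPartitions` + `reduce` pipeline needs.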
Thanks
For more information
Email brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence
Linear Regression Example Data
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Master
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Slave 1
Slave 2
rddparallelize(data)cache()
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2LIBRARY_CALL(data)
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open-source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
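The "MapReduce existing functions across slaves" approach can be sketched without a cluster. The stand-alone Java below (no Spark dependency; all names hypothetical) mirrors the deck's mapper/reducer shape: run a summary routine on each partition, then combine the partial results.

```java
import java.util.List;

// Minimal sketch of the map-partitions-then-reduce pattern:
// per-partition summaries are combined into one global result.
public class PartitionReduceSketch {
    // Partial result produced by the "mapper" for one partition.
    record Partial(long count, double sum) {
        // The "reducer": merge two partial summaries.
        Partial combine(Partial other) {
            return new Partial(count + other.count, sum + other.sum);
        }
    }

    // The "mapper": summarize a whole partition at once,
    // as mapPartitions would on a Spark executor.
    static Partial summarize(List<Double> partition) {
        double s = 0.0;
        for (double v : partition) s += v;
        return new Partial(partition.size(), s);
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = List.of(
            List.of(1.0, 2.0, 3.0),   // data held by "Slave 1"
            List.of(4.0, 5.0));       // data held by "Slave 2"
        Partial total = partitions.stream()
            .map(PartitionReduceSketch::summarize)
            .reduce(Partial::combine)
            .orElseThrow();
        System.out.println(total.sum() / total.count()); // overall mean: 3.0
    }
}
```

The same shape carries over to Spark: `summarize` becomes the body of a `mapPartitions` call and `combine` becomes the function passed to `reduce`, which is exactly how the ComputeSSQ/CombineData pair wraps the NAG routines.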
Thanks
For more information
Email: brian.spector@nag.com
Check out The NAG Blog:
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence
Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a
function to all elements of this RDD
flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results
collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence
Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies
a function f to each partition of this RDD
mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD
Getting Data into Contiguous Blocks
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
680509 423 121 232 1
651515 473 74 168 0
785641 532 114 174 1
Slave 2
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
mapPartitions
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence
public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception
if(dataSize numVars datapointspartitions()size() gt 10000000)
throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)
DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())
hellip
G02BW g02bw = new G02BW()
g02bweval(numVars finalptssq ifail)
if(g02bwgetIFAIL() gt 0)
Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())
throw new NAGSparkException(Error from g02bw)
How it looks in Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Cluster Utilization
Results of Linear Regression
Results of Scaling
NAG vs MLlib
Java Spark documentation, e.g. for return Arrays.asList(new DataClass(length, means, ssq))
Distributing libraries
spark-ec2/copy-dir
Setting environment variables
LD_LIBRARY_PATH needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar
NAG_KUSARI_FILE (the NAG licence file) can be set in code via sc.setExecutorEnv
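Both settings can also be supplied entirely at submit time; Spark forwards any `spark.executorEnv.*` configuration as an environment variable on the executors. A sketch, where all paths and jar names are placeholders:

```shell
# Native NAG libraries must be visible to every executor, so the library
# path and the licence-file location are both passed on the command line.
spark-submit \
  --jars NAGJava.jar \
  --conf spark.executor.extraLibraryPath=/path/to/nag/libs/on/workers \
  --conf spark.executorEnv.NAG_KUSARI_FILE=/path/to/licence/nag.key \
  simpleNAGExample.jar
```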
Problems Encountered (1)
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
0.0      18.0  0.0   22.2  1
0.0      18.0  0.0   81.1  0
68.0509  42.3  12.1  23.2  1
86.5650  47.3  19.5  71.9  2
65.1515  47.3  7.4   16.8  0
78.5641  53.2  11.4  17.4  1
56.5567  34.9  10.7  68.4  0
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32- vs 64-bit integers
Some algorithms don't work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
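The integer-size point above bites quickly when a 32-bit library indexes big-data arrays: a product like `length * numvars` is evaluated in 32-bit `int` arithmetic and silently wraps long before the data stops fitting in memory. A small self-contained illustration:

```java
public class IntOverflow {
    public static void main(String[] args) {
        int length = 600_000_000;  // rows: fits comfortably in an int
        int numvars = 5;

        // 3,000,000,000 exceeds Integer.MAX_VALUE (2,147,483,647),
        // so the 32-bit product wraps around to a negative number.
        int bad = length * numvars;

        // Widening one operand to long before multiplying gives the true value.
        long good = (long) length * numvars;

        System.out.println(bad);   // prints -1294967296
        System.out.println(good);  // prints 3000000000
    }
}
```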
Problems Encountered (3)
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations
Use existing R code on Spark
Free
Huge open source community
Other Exciting Happenings
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Thanks
For more information:
Email: brian.spector@nag.com
Check out The NAG Blog:
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark
Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence
static class ComputeSSQ implements
FlatMapFunctionltIteratorltLabeledPointgt DataClassgt
Override
public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception
ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()
while(iterhasNext()) mypointsadd(iternext())
int length = mypointssize()
int numvars = mypointsget(0)features()size() + 1
double[] x = new double[lengthnumvars]
for(int i=0 iltlength i++)
for(int j=0 jltnumvars-1 j++)
x[(j) length + i] = mypointsget(i)features()apply(j)
x[(numvars-1) length + i] = mypointsget(i)label()
G02BU g02bu = new G02BU()
g02bueval(M U length numvars x length x 00 means ssq ifail)
return ArraysasList(new DataClass(lengthmeansssq))
ComputeSSQ (Mapper)
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence
static class CombineData implements Function2ltDataClassDataClassDataClassgt
Override
public DataClass call(DataClass data1 DataClass data2) throws Exception
G02BZ g02bz = new G02BZ()
g02bzeval(M data1meanslength data1length data1means
data1ssq data2length data2means data2ssq IFAIL)
data1length = (g02bzgetXSW())
return data1
CombineData (Reducer)
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence
The Data (in Spark)
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
Slave 1
Slave 2
Master
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence
Test 1
Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster
Used an 8 slave xlarge cluster (16 GB RAM)
Test 2
Varied the number of slave nodes from 2 - 16
Used 16 GB of data to see how algorithms scale
Linear Regression Example
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence
Cluster Utilization
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence
Results of Linear Regression
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence
Results of Scaling
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs logistic regression linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence
DataFrames
H2O Sparking Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations Use existing R code on Spark
FreeHuge open source community
Other Exciting Happenings
Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across slaves
How to use existing libraries on Spark
Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence
Thanks
For more information
Email brianspectornagcom
Check out The NAG Blog
httpblognagcom201502advanced-analytics-on-apache-sparkhtml
NAG and Apache Spark
Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence
NAG vs MLlib
Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence
Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))
Distributing libraries
spark-ec2copy-dir
Setting environment variables
LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar
NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv
Problems Encountered (1)
Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence
The data problem
Data locality
Inverse of singular matrix
Problems Encountered (2)
Label X1 X2 X3 X4
00 180 00 222 1
00 180 00 811 0
680509 423 121 232 1
865650 473 195 719 2
651515 473 74 168 0
785641 532 114 174 1
565567 349 107 684 0
Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence
Debugging slave nodes
Throw lots of exceptions
Int sizes with Big Data and a numerical library
32 vs 64 bit integers
Some algorithms donrsquot work
nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm
Problems Encountered (3)
Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence
NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms
Outline
Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means
Gaussian mixture
power iteration clustering (PIC)
latent Dirichlet allocation (LDA)
streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
Feature extraction and transformation
Frequent pattern mining
FP-growth
Optimization (developer)
stochastic gradient descent
limited-memory BFGS (L-BFGS)
MLlib Algorithms
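As a rough picture of the "developer" optimizers listed above, stochastic gradient descent updates the model from one sample at a time. A toy single-feature least-squares version in plain Java (illustrative only, not MLlib code):

```java
import java.util.Random;

// Minimal stochastic gradient descent for least squares, the style of
// optimizer behind several MLlib linear models. Illustrative sketch.
public class SgdSketch {
    // Fit y ~ w*x by minimizing squared error, one random sample per step
    public static double fit(double[] x, double[] y, double lr, int steps) {
        double w = 0.0;
        Random rng = new Random(42);                    // fixed seed
        for (int s = 0; s < steps; s++) {
            int i = rng.nextInt(x.length);              // pick one sample
            double grad = 2 * (w * x[i] - y[i]) * x[i]; // d/dw (w*x - y)^2
            w -= lr * grad;
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};                      // true slope: 2
        System.out.println(fit(x, y, 0.01, 5000));      // approximately 2.0
    }
}
```

Each update touches a single record, which is why SGD maps naturally onto data partitioned across a cluster.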
Commercial in Confidence 42 Numerical Excellence
DataFrames
H2O Sparkling Water
MLlib pipeline
Better optimizers
Improved APIs
SparkR
Contains all the components for better numerical computations; use existing R code on Spark
Free; huge open source community
Other Exciting Happenings
Commercial in Confidence 43 Numerical Excellence
Call algorithm with data in-memory
If data is small enough
Sample
Obtain estimates for parameters
Reformulate the problem
MapReduce existing functions across workers
How to use existing libraries on Spark
Commercial in Confidence 44 Numerical Excellence
Thanks
For more information
Email brian.spector@nag.com
Check out The NAG Blog
http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html
NAG and Apache Spark