
Using Numerical Libraries on Spark

Brian Spector, The Numerical Algorithms Group (NAG): experts in numerical algorithms and HPC services

London Spark Users Meetup, August 18th, 2015

How to use existing libraries on Spark

Call the algorithm with the data in-memory, if the data is small enough.

Sample: obtain estimates for the parameters from a subset of the data.

Reformulate the problem: MapReduce existing functions across the workers (see the sketch below).
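As a rough sketch of the third approach (the pattern the rest of the talk builds on), each partition can be handed whole to an existing serial routine and the partial results merged with a reduce. ExistingLibrary and PartialResult below are placeholder names for illustration, not a real API:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class LibraryOnSparkSketch {
        // ExistingLibrary.summarise / merge and PartialResult are hypothetical stand-ins
        // for any in-memory routine plus a way of combining two partial results.
        static PartialResult run(JavaRDD<double[]> rows) {
            return rows
                .mapPartitions(iter -> {
                    List<double[]> chunk = new ArrayList<>();
                    while (iter.hasNext()) chunk.add(iter.next());        // gather this partition
                    // one call to the existing compiled routine per partition
                    return Arrays.asList(ExistingLibrary.summarise(chunk));
                })
                .reduce((a, b) -> ExistingLibrary.merge(a, b));            // merge must be associative
        }
    }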

Outline

NAG Introduction
Numerical Computing
Linear Regression
Linear Regression Example on Spark
Timings
Problems Encountered
MLlib Algorithms

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in the UK.

Supports and maintains the NAG Numerical Library: ~1,700 mathematical and statistical routines.

NAG's code is embedded in many vendor libraries (e.g. AMD, Intel).

Three strands: Numerical Library, HPC Services, Numerical Services.

NAG Library Full Contents

Root Finding
Summation of Series
Quadrature
Ordinary Differential Equations
Partial Differential Equations
Numerical Differentiation
Integral Equations
Mesh Generation
Interpolation
Curve and Surface Fitting
Optimization
Approximations of Special Functions
Dense Linear Algebra
Sparse Linear Algebra
Correlation & Regression Analysis
Multivariate Methods
Analysis of Variance
Random Number Generators
Univariate Estimation
Nonparametric Statistics
Smoothing in Statistics
Contingency Table Analysis
Survival Analysis
Time Series Analysis
Operations Research

Outline (section divider; same items as above)

Efficient Algorithms

At NAG we have 40 years of algorithms: linear algebra, regression, optimization.

The data is all in-memory and you call a compiled library, e.g.

    void nag_1d_spline_interpolant(Integer m, const double x[], const double y[],
                                   Nag_Spline *spline, NagError *fail)

The algorithms take advantage of AVX/AVX2, multi-core, LAPACK/BLAS, MKL.

Efficient Algorithms

Modern programming languages and compilers (hopefully) take care of some efficiencies for us: Python with numpy/scipy, compiler flags (-O2).

Users worry less about efficient computing and more about solving their problem.

Big Data

…But what happens when it comes to Big Data?

    x = (double *) malloc(10000000 * 10000 * sizeof(double));

(That is 10,000,000 × 10,000 doubles, roughly 8 × 10^11 bytes, or about 800 GB: far more than fits in one machine's memory.)

Past efficient algorithms break down; we need different ways of solving the same problem.

How do we use our existing functions (libraries) on Spark? We must reformulate the problem.

An example… (Warning: Mathematics Approaching)

Linear Regression Example

General problem: given a set of input measurements $x_1, x_2, \ldots, x_p$ and an outcome measurement $y$, fit a linear model

$$y = XB + \varepsilon,\qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},\qquad
X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}.$$

Three ways of solving the same problem.

Linear Regression Example

Solution 1 (Optimal Solution):

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = QR^{*},
\qquad\text{where } R^{*} = \begin{pmatrix} R \\ 0 \end{pmatrix},$$

$R$ is $p \times p$ upper triangular and $Q$ is orthogonal. Then solve $RB = c_1$, where $c = Q^{T}y$ …

… lots of linear algebra, but we have an algorithm.

Linear Regression Example

Solution 2 (Normal Equations):

$$X^{T}XB = X^{T}y$$

If we have 3 independent variables:

$$X = \begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix},\qquad
y = \begin{pmatrix} 7 \\ \vdots \\ 8 \end{pmatrix}$$

Linear Regression Example

$$X^{T}X = \begin{pmatrix} 1 & \cdots & 4 \\ 2 & \cdots & 5 \\ 3 & \cdots & 6 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ \vdots & \vdots & \vdots \\ 4 & 5 & 6 \end{pmatrix}
= \begin{pmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ z_{31} & z_{32} & z_{33} \end{pmatrix},
\qquad B = (X^{T}X)^{-1}X^{T}y$$

This computation can be mapped out to the slave nodes (see the identity below).
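The identity that makes the map-out work: if the rows of $X$ and $y$ are split across $k$ partitions, the cross-product matrices simply add, so every worker can form its own small $p \times p$ contribution and only those small matrices need to be combined:

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix},\quad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_k \end{pmatrix}
\;\Longrightarrow\;
X^{T}X = \sum_{i=1}^{k} X_i^{T}X_i,\qquad
X^{T}y = \sum_{i=1}^{k} X_i^{T}y_i.$$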

Linear Regression Example

Solution 3 (Optimization):

$$\min_{B} \; \lVert y - XB \rVert^{2}$$

Iterate over the data (a typical update is sketched below); the final answer is an approximation to the solution.

Constraints can be added to the variables (LASSO).
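For a sense of what "iterate over the data" means in practice, one standard update (a plain gradient step on the least-squares objective with step size $\eta$, the kind of iteration MLlib's SGD-based solvers perform) is

$$B^{(t+1)} = B^{(t)} - \eta\, X^{T}\!\left(XB^{(t)} - y\right),$$

where the gradient $X^{T}(XB^{(t)} - y)$ is a sum of per-row terms, so each iteration is itself a map over the data followed by a reduce.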

MLlib Algorithms

Machine Learning / Optimization: MLlib uses Solution 3 for many applications, e.g. classification, regression, Gaussian Mixture Models, k-means.

The slave nodes are used for callbacks to access the data: Map, Reduce, Repeat. (A minimal MLlib call is sketched below.)
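For comparison, a minimal sketch of the iterative MLlib route in the Java API of that era (Spark 1.x); the file path and iteration count are illustrative assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.regression.LinearRegressionModel;
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
    import org.apache.spark.mllib.util.MLUtils;

    public class MLlibRegressionSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("mllib-lr"));
            // LIBSVM-format file path is made up for the example.
            JavaRDD<LabeledPoint> data =
                MLUtils.loadLibSVMFile(sc.sc(), "hdfs:///data/points.libsvm").toJavaRDD().cache();
            // SGD iterates over the data: each iteration is a map (gradients) plus a reduce (sum).
            LinearRegressionModel model = LinearRegressionWithSGD.train(data.rdd(), 100);
            System.out.println("Weights: " + model.weights());
            sc.stop();
        }
    }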

MLlib Algorithms

[Diagram: the libraries live on the Master; the Slave nodes hold the data and are called back to on each iteration.]

Outline (section divider; same items as above)

Example on Spark

Use the Normal Equations (Solution 2) to compute a linear regression on a large dataset. The NAG Libraries were distributed across the Master and Slave nodes.

Steps (a driver-level sketch follows below):

1. Read in the data.
2. Call a routine to compute the sum-of-squares (SSQ) matrix on each partition (chunk); the default partition size is 64 MB.
3. Call a routine to aggregate the SSQ matrices together.
4. Call a routine to compute the optimal coefficients.
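Pulling the four steps together, a condensed driver sketch using the mapper and reducer classes shown later in the talk (ComputeSSQ, CombineData); the parsing lambda, the path, and the DataClass fields are assumptions for illustration:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class NormalEquationsDriver {
        public static void run(JavaSparkContext sc, String path) throws Exception {
            // Step 1: read the data, one LabeledPoint per row (parser is illustrative).
            JavaRDD<LabeledPoint> points = sc.textFile(path).map(line -> {
                String[] tok = line.split("\\s+");
                double[] feats = new double[tok.length - 1];
                for (int i = 1; i < tok.length; i++) feats[i - 1] = Double.parseDouble(tok[i]);
                return new LabeledPoint(Double.parseDouble(tok[0]), Vectors.dense(feats));
            }).cache();

            // Steps 2 and 3: per-partition SSQ summaries, then pairwise combination.
            DataClass combined = points.mapPartitions(new ComputeSSQ())
                                       .reduce(new CombineData());

            // Step 4: final NAG calls on the driver turn the aggregated SSQ matrix into
            // correlations / regression coefficients (g02bw etc., as on the later slide).
            // ... driver-side library calls go here ...
        }
    }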

Linear Regression Example

[Diagram: the Master and each slave node make a LIBRARY CALL on their part of the work.]

Linear Regression Example Data

Label      X1     X2     X3    X4
 0.0       18.0    0.0   22.2   1
 0.0       18.0    0.0   81.1   0
68.0509    42.3   12.1   23.2   1
86.5650    47.3   19.5   71.9   2
65.1515    47.3    7.4   16.8   0
78.5641    53.2   11.4   17.4   1
56.5567    34.9   10.7   68.4   0

The Data (in Spark)

rdd.parallelize(data).cache()   (this shorthand is spelled out below)

[Diagram: the labelled rows start on the Master and are distributed across Slave 1 and Slave 2.]
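Spelling the slide's shorthand out against the Java API, as a sketch; in practice a dataset of the sizes used later would be read from distributed storage with sc.textFile rather than parallelized from the driver:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class DistributeData {
        // 'data' is the driver-side list of rows; parallelize ships it to the cluster
        // and cache() keeps the resulting partitions in memory on the slaves.
        static JavaRDD<LabeledPoint> distribute(JavaSparkContext sc, List<LabeledPoint> data) {
            return sc.parallelize(data).cache();
        }
    }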

The Data (in Spark)

LIBRARY_CALL(data)

[Diagram: each slave applies the library call to its own partition of the rows.]

The Data (in Spark)

[Diagram: the same rows shown sitting on Slave 1 and Slave 2 as individual records, before being gathered into contiguous blocks.]

Getting Data into Contiguous Blocks

Spark Java transformations:

map(Function<T,R> f) - Return a new RDD by applying a function to all elements of this RDD.

flatMap(FlatMapFunction<T,U> f) - Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD.

[Diagram: Slave 2's rows, before and after being packed into one block.]

Getting Data into Contiguous Blocks

Spark Java transformations:

foreachPartition(VoidFunction<java.util.Iterator<T>> f) - Applies a function f to each partition of this RDD.

mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f) - Return a new RDD by applying a function to each partition of this RDD. (A small stand-alone example follows below.)

[Diagram: Slave 2's rows again, now handed to the function one whole partition at a time.]
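A small self-contained illustration of why mapPartitions is the useful one here: it hands the function an iterator over a whole partition, so the elements can be copied into one contiguous primitive array and handed to a single library call, instead of being visited one element at a time as map would do. The per-partition "sum" below is just a stand-in for such a call:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    public class ContiguousBlockExample {
        // Pack each partition into one double[] and return one summary value per partition.
        static JavaRDD<Double> partitionSums(JavaRDD<Double> values) {
            return values.mapPartitions(iter -> {
                List<Double> buffer = new ArrayList<>();
                while (iter.hasNext()) buffer.add(iter.next());
                double[] block = new double[buffer.size()];           // contiguous block
                for (int i = 0; i < block.length; i++) block[i] = buffer.get(i);
                double sum = 0.0;                                     // stand-in for a library call
                for (double v : block) sum += v;
                return Arrays.asList(sum);                            // Spark 1.x: Iterable result
            });
        }
    }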

The Data (in Spark)

mapPartitions

[Diagram: after mapPartitions, each slave's partition has been turned into one contiguous block of rows.]

How it looks in Spark

    public void LabeledPointCorrelation(JavaRDD<LabeledPoint> datapoints) throws Exception {
        // Guard: keep each partition small enough for the NAG routine (the operators in
        // this check were lost from the slide; a per-partition element count is assumed).
        if (dataSize * numVars / datapoints.partitions().size() > 10000000)
            throw new NAGSparkException("Partitioned Data Further To Use NAG Routine");

        // Map each partition to its means/SSQ summary, then combine the summaries.
        DataClass finalpt = datapoints.mapPartitions(new ComputeSSQ())
                                      .reduce(new CombineData());
        …
        // Final driver-side call: correlation matrix from the aggregated SSQ matrix.
        G02BW g02bw = new G02BW();
        g02bw.eval(numVars, finalpt.ssq, ifail);
        if (g02bw.getIFAIL() > 0) {
            System.out.println("Error with NAG (g02bw) IFAIL = " + g02bw.getIFAIL());
            throw new NAGSparkException("Error from g02bw");
        }
    }

ComputeSSQ (Mapper)

    static class ComputeSSQ implements FlatMapFunction<Iterator<LabeledPoint>, DataClass> {
        @Override
        public Iterable<DataClass> call(Iterator<LabeledPoint> iter) throws Exception {
            // Drain the partition into a list, then pack it column-major into one
            // contiguous array: features first, label as the last column.
            List<LabeledPoint> mypoints = new ArrayList<LabeledPoint>();
            while (iter.hasNext()) mypoints.add(iter.next());
            int length = mypoints.size();
            int numvars = mypoints.get(0).features().size() + 1;
            double[] x = new double[length * numvars];
            for (int i = 0; i < length; i++) {
                for (int j = 0; j < numvars - 1; j++)
                    x[j * length + i] = mypoints.get(i).features().apply(j);
                x[(numvars - 1) * length + i] = mypoints.get(i).label();
            }
            // One NAG call per partition: means and sums-of-squares matrix for this chunk.
            // Argument order as shown on the slide; means, ssq and ifail are fields
            // declared elsewhere in the class.
            G02BU g02bu = new G02BU();
            g02bu.eval("M", "U", length, numvars, x, length, x, 0.0, means, ssq, ifail);
            return Arrays.asList(new DataClass(length, means, ssq));
        }
    }

CombineData (Reducer)

    static class CombineData implements Function2<DataClass, DataClass, DataClass> {
        @Override
        public DataClass call(DataClass data1, DataClass data2) throws Exception {
            // Merge two partial summaries: g02bz combines the means and SSQ matrices
            // of two data chunks into one (argument order as shown on the slide).
            G02BZ g02bz = new G02BZ();
            g02bz.eval("M", data1.means.length, data1.length, data1.means,
                       data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
            data1.length = (int) g02bz.getXSW();   // combined observation count (cast assumed)
            return data1;
        }
    }

Because combining means and sums of squares this way is associative and commutative, the class can safely be used as a Spark reducer.

The Data (in Spark)

[Diagram: the combined summary is brought back from the slaves to the Master for the final library calls.]

Linear Regression Example

Test 1: data ranging in size from 2 GB to 64 GB on an Amazon EC2 cluster, using an 8-slave xlarge cluster (16 GB RAM).

Test 2: varied the number of slave nodes from 2 to 16, with 16 GB of data, to see how the algorithms scale.

Cluster Utilization [chart]

Results of Linear Regression [chart]

Results of Scaling [chart]

NAG vs MLlib [chart]

Problems Encountered (1)

The Java Spark documentation, e.g. the return type needed here: return Arrays.asList(new DataClass(length, means, ssq));

Distributing the libraries: spark-ec2/copy-dir

Setting environment variables:

LD_LIBRARY_PATH needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (the NAG license file) can be set in code via sc.setExecutorEnv (see the sketch below).
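A minimal sketch of the license-file route, assuming the SparkConf/JavaSparkContext API; the license path is made up, and in the talk "sc" may refer to a different object:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NagLicenseSetup {
        public static JavaSparkContext create() {
            SparkConf conf = new SparkConf()
                .setAppName("NAG-on-Spark")
                // Ship the license location to every executor's environment.
                .setExecutorEnv("NAG_KUSARI_FILE", "/opt/nag/licenses/nag.key"); // illustrative path
            return new JavaSparkContext(conf);
        }
    }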

Problems Encountered (2)

The data problem: data locality, and the inverse of a singular matrix (if a column of the data is constant or a linear combination of the others, X^T X cannot be inverted).

[The example data table from earlier is shown again.]

Problems Encountered (3)

Debugging slave nodes: throw lots of exceptions.

Int sizes with Big Data and a numerical library: 32- vs 64-bit integers (a 32-bit int tops out at about 2.1 billion, which large row counts can exceed).

Some algorithms don't work.

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm.

Outline (section divider; same items as above)

MLlib Algorithms

Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation

Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression

Collaborative filtering: alternating least squares (ALS)

Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means

Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining: FP-growth

Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)

Other Exciting Happenings

DataFrames

H2O Sparkling Water

MLlib pipeline: better optimizers, improved APIs

SparkR: contains all the components for better numerical computations; use existing R code on Spark; free; huge open source community

How to use existing libraries on Spark (recap)

Call the algorithm with the data in-memory, if the data is small enough.

Sample: obtain estimates for the parameters.

Reformulate the problem: MapReduce existing functions across the slaves.

NAG and Apache Spark

Thanks! For more information:

Email: brian.spector@nag.com

Check out the NAG Blog: http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

Commercial in Confidence 2Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across workers

How to use existing libraries on Spark

Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 3Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 4Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 5

The Numerical Algorithms Group

Founded in 1970 as a co-operative project out of academia in UK

Support and Maintain Numerical Library ~1700 MathematicalStatistical Routines

NAGrsquos code is embedded in many vendor libraries (eg AMD Intel)

Numerical Library

HPCServices

Numerical Services

Commercial in Confidence 6

NAG Library Full Contents

Root Finding

Summation of Series

Quadrature

Ordinary Differential Equations

Partial Differential Equations

Numerical Differentiation

Integral Equations

Mesh Generation

Interpolation

Curve and Surface Fitting

Optimization

Approximations of Special Functions

Dense Linear Algebra

Sparse Linear Algebra

Correlation amp Regression Analysis

Multivariate Methods

Analysis of Variance

Random Number Generators

Univariate Estimation

Nonparametric Statistics

Smoothing in Statistics

Contingency Table Analysis

Survival Analysis

Time Series Analysis

Operations Research

Commercial in Confidence 7Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 8Numerical ExcellenceCommercial in Confidence

At NAG we have 40 years of algorithms

Linear Algebra Regression Optimization

Data is all in-memory and call a compiled libraryvoid nag_1d_spline_interpolant (Integer m const double x[] const double y[] Nag_Spline spline NagError fail)

Algorithms take advantage of

AVXAVX2

Multi-core

LAPACKBLAS

MKL

Efficient Algorithms

Commercial in Confidence 9Numerical ExcellenceCommercial in Confidence

Modern programming languages and compilers (hopefully) take care of some efficiencies for us

Python numpyscipy

Compiler flags (-O2)

Users worry less about efficient computing and more about solving their problem

Efficient Algorithms

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers (a sketch follows after this slide)

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
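
To make the integer-size point concrete, here is a minimal sketch (not from the deck) of guarding the kind of flat array allocation that ComputeSSQ performs: Java arrays are indexed by 32-bit ints, so the row-by-column product has to be formed in 64 bits and checked first. The helper name is hypothetical.

// Hypothetical helper: allocate a flat column-major buffer safely
static double[] allocateColumnMajor(int rows, int cols) {
    long elements = (long) rows * (long) cols;   // form the product in 64-bit to avoid silent overflow
    if (elements > Integer.MAX_VALUE)
        throw new IllegalStateException("Partition too large for a single Java array: " + elements);
    return new double[(int) elements];
}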


NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline


Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms


DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free. Huge open source community.

Other Exciting Happenings


Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark


Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark


Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 10Numerical ExcellenceCommercial in Confidence

hellipBut what happens when it comes to Big Data

x = (double) malloc(10000000 10000 sizeof(double))

Past efficient algorithms break down

Need different ways of solving same problem

How do we use our existing functions (libraries) on Spark

We must reformulate the problem

An examplehellip (Warning Mathematics Approaching)

Big Data

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 11Numerical ExcellenceCommercial in Confidence

General Problem

Given a set of input measurements 11990911199092 hellip119909119901 and an

outcome measurement 119910 fit a linear model

119910 = 119883 119861 + ε

119910 =

1199101

⋮119910119901

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

Three ways of solving the same

problem

Linear Regression Example

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 12Numerical ExcellenceCommercial in Confidence

Solution 1 (Optimal Solution)

119883 =

11990911 ⋯ 1199091119901

⋮ ⋱ ⋮1199091198991 ⋯ 119909119899119901

= 119876119877lowast where 119877lowast =1198770

119877 is 119901 by 119901 Upper Triangular 119876 orthogonal

119877 119861 = 1198881 where 119888 = 119876119879119910 hellip

hellip lots of linear algebra

but we have an algorithm

Linear Regression Example

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 13Numerical ExcellenceCommercial in Confidence

Solution 2 (Normal Equations)

119883119879119883 119861 = 119883119879 119910

If we 3 independent variables

119883 =1 2 3⋮ ⋮ ⋮4 5 6

119910 =7⋮8

Linear Regression Example

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics
  summary statistics
  correlations
  stratified sampling
  hypothesis testing
  random data generation

Classification and regression
  linear models (SVMs, logistic regression, linear regression)
  naive Bayes
  decision trees
  ensembles of trees (Random Forests and Gradient-Boosted Trees)
  isotonic regression

Collaborative filtering
  alternating least squares (ALS)

Clustering
  k-means
  Gaussian mixture
  power iteration clustering (PIC)
  latent Dirichlet allocation (LDA)
  streaming k-means

Dimensionality reduction
  singular value decomposition (SVD)
  principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining
  FP-growth

Optimization (developer)
  stochastic gradient descent
  limited-memory BFGS (L-BFGS)

MLlib Algorithms
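
For comparison with the NAG route above, the MLlib path to the same regression is an iterative fit ("Solution 3" in the deck). A hedged sketch against the Spark 1.x Java API, assuming parsedData is the JavaRDD<LabeledPoint> built earlier:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

// Stochastic-gradient-descent fit; numIterations is an assumption to be tuned.
static LinearRegressionModel fitWithMLlib(JavaRDD<LabeledPoint> parsedData) {
    parsedData.cache();
    int numIterations = 100;
    return LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), numIterations);
}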


DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations; use existing R code on Spark

Free, with a huge open source community

Other Exciting Happenings


Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark


Thanks

For more information

Email: brianspector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark

Commercial in Confidence 14Numerical ExcellenceCommercial in Confidence

119883119879119883 =1 ⋯ 42 ⋯ 53 ⋯ 6

1 2 3⋮ ⋮ ⋮4 5 6

=

11991111 11991112 1199111311991121 11991122 1199112311991131 11991132 11991133

119861 = 119883119879119883 minus1119883119879 119910

This computation can be map out to slave nodes

Linear Regression Example

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 15Numerical ExcellenceCommercial in Confidence

Solution 3 (Optimization)

min 119910 minus 119883 1198612

Iterative over the data

Final answer is an approximate to

the solution

Can add constraints to variables

(LASSO)

Linear Regression Example

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 16Numerical ExcellenceCommercial in Confidence

Machine LearningOptimization

MLlib uses solution 3 for many applications

Classification

Regression

Gaussian Mixture Model

Kmeans

Slave nodes are used for callbacks to access data

Map Reduce Repeat

MLlib Algorithms

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 17Numerical ExcellenceCommercial in Confidence

MLlib Algorithms

MasterLibraries

Slaves Slaves

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 18Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 19Numerical ExcellenceCommercial in Confidence

Use Normal Equations (Solution 2) to compute Linear Regression on large dataset

NAG Libraries were distributed across MasterSlave Nodes

Steps

1 Read in data

2 Call routine to compute sum-of-squares matrix on partitions (chucks)

Default partition size is 64mb

3 Call routine to aggregate SSQ matrix together

4 Call routine to compute optimal coefficients

Example on Spark

Commercial in Confidence 20Numerical ExcellenceCommercial in Confidence

Linear Regression Example

Master

LIBRARY CALL

LIBRARY CALL

LIBRARY CALL

Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30

static class CombineData implements Function2<DataClass, DataClass, DataClass> {

    @Override
    public DataClass call(DataClass data1, DataClass data2) throws Exception {
        // NAG g02bz: merge two sets of means and sums of squares/cross-products
        G02BZ g02bz = new G02BZ();
        g02bz.eval("M", data1.means.length, data1.length, data1.means,
                data1.ssq, data2.length, data2.means, data2.ssq, IFAIL);
        data1.length = g02bz.getXSW();   // combined row count (sum of weights)
        return data1;
    }
}

CombineData (Reducer)
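For a single unweighted variable, the pairwise update that g02bz performs when merging two blocks amounts to the standard result (cross-products combine analogously):

n = n_1 + n_2
\bar{x} = (n_1 \bar{x}_1 + n_2 \bar{x}_2) / n
S = S_1 + S_2 + (n_1 n_2 / n)\,(\bar{x}_1 - \bar{x}_2)^2

where S is the sum of squares about the mean. Because this update is associative and commutative, Spark's reduce may combine the partial DataClass results in any order.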

Commercial in Confidence 31

The Data (in Spark)

[Diagram: the per-partition summaries from Slave 1 and Slave 2 reduced back to the Master]

Commercial in Confidence 32

Test 1

Data ranges in size from 2 GB – 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33

Cluster Utilization

Commercial in Confidence 34

Results of Linear Regression

Commercial in Confidence 35

Results of Scaling

Commercial in Confidence 36

NAG vs MLlib

Commercial in Confidence 37

Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));

Distributing libraries

spark-ec2/copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (NAG license file) - Can be set in code via sc.setExecutorEnv

Problems Encountered (1)
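A hedged sketch of setting the licence variable from code: in the Java API the documented call is SparkConf.setExecutorEnv (the file path here is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("NAG on Spark")
        .setExecutorEnv("NAG_KUSARI_FILE", "/path/to/nag.licence");   // illustrative path
JavaSparkContext sc = new JavaSparkContext(conf);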

Commercial in Confidence 38

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
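A small guard against the 32-bit limit before handing a block to a routine that takes int sizes (a sketch; Integer.MAX_VALUE is also the hard ceiling for a Java array index):

long elements = (long) length * numvars;     // per-block element count
if (elements > Integer.MAX_VALUE) {
    throw new IllegalStateException(
            "Block of " + elements + " elements cannot be addressed with 32-bit ints");
}
int n = (int) elements;                      // narrow only after the check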

Commercial in Confidence 40

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms
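For the linear-regression entry above, the Spark 1.x MLlib call that a comparison such as the NAG vs MLlib timing would exercise looks roughly like this (a sketch; numIterations is a tuning choice, not a value from the slides):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

int numIterations = 100;                      // illustrative
LinearRegressionModel model =
        LinearRegressionWithSGD.train(JavaRDD.toRDD(datapoints), numIterations);
double intercept = model.intercept();
double[] weights = model.weights().toArray();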

Commercial in Confidence 42

DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free; huge open source community

Other Exciting Happenings

Commercial in Confidence 43

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44

Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark


Commercial in Confidence 21Numerical ExcellenceCommercial in Confidence

Linear Regression Example Data

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 22Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Master

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Slave 1

Slave 2

rddparallelize(data)cache()

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 23Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2LIBRARY_CALL(data)

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 24Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 25Numerical ExcellenceCommercial in Confidence

Spark Java Transformations map(FunctionltTRgt f) - Return a new RDD by applying a

function to all elements of this RDD

flatMap(FlatMapFunctionltTUgt f) - Return a new RDD by first applying a function to all elements of this RDD and then flattening the results

collectPartitions(int[] partitionIds) - Return an array that contains all of the elements in a specific partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

Master

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB ndash 64 GB on Amazon EC2 Cluster

Used an 8 slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation return ArraysasList(new DataClass(lengthmeansssq))

Distributing libraries

spark-ec2copy-dir

Setting environment variables

LD_LIBRARY_PATH - Needs to be set as you submit the job$ spark-submit --jars NAGJavajar --conf sparkexecutorextraLibraryPath= $Path-to-Libraries-on-Worker-Nodes simpleNAGExamplejar

NAG_KUSARI_FILE (NAG license file) Can be set in code via scsetExecutorEnv

Problems Encountered (1)

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label X1 X2 X3 X4

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

565567 349 107 684 0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms donrsquot work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparking Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations Use existing R code on Spark

FreeHuge open source community

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email brianspectornagcom

Check out The NAG Blog

httpblognagcom201502advanced-analytics-on-apache-sparkhtml

NAG and Apache Spark

Commercial in Confidence 26Numerical ExcellenceCommercial in Confidence

Spark Java Transformations foreachPartition(VoidFunctionltjavautilIteratorltTgtgt f) - Applies

a function f to each partition of this RDD

mapPartitions(FlatMapFunctionltjavautilIteratorltTgtUgt f) -Return a new RDD by applying a function to each partition of this RDD

Getting Data into Contiguous Blocks

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

680509 423 121 232 1

651515 473 74 168 0

785641 532 114 174 1

Slave 2

Commercial in Confidence 27Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

00 180 00 222 1

00 180 00 811 0

680509 423 121 232 1

865650 473 195 719 2

651515 473 74 168 0

785641 532 114 174 1

Slave 1

Slave 2

mapPartitions

Commercial in Confidence 28Numerical ExcellenceCommercial in Confidence

public void LabeledPointCorrelation(JavaRDDltLabeledPointgt datapoints) throws Exception

if(dataSize numVars datapointspartitions()size() gt 10000000)

throw new NAGSparkException(Partitioned Data Further To Use NAG Routine)

DataClass finalpt = datapointsmapPartitions(new ComputeSSQ())reduce(new CombineData())

hellip

G02BW g02bw = new G02BW()

g02bweval(numVars finalptssq ifail)

if(g02bwgetIFAIL() gt 0)

Systemoutprintln(Error with NAG (g02bw) IFAIL = + g02bwgetIFAIL())

throw new NAGSparkException(Error from g02bw)

How it looks in Spark

Commercial in Confidence 29Numerical ExcellenceCommercial in Confidence

static class ComputeSSQ implements

FlatMapFunctionltIteratorltLabeledPointgt DataClassgt

Override

public Iterablelt DataClass gt call(IteratorltLabeledPointgt iter) throws Exception

ListltLabeledPointgt mypoints= new ArrayListltLabeledPointgt()

while(iterhasNext()) mypointsadd(iternext())

int length = mypointssize()

int numvars = mypointsget(0)features()size() + 1

double[] x = new double[lengthnumvars]

for(int i=0 iltlength i++)

for(int j=0 jltnumvars-1 j++)

x[(j) length + i] = mypointsget(i)features()apply(j)

x[(numvars-1) length + i] = mypointsget(i)label()

G02BU g02bu = new G02BU()

g02bueval(M U length numvars x length x 00 means ssq ifail)

return ArraysasList(new DataClass(lengthmeansssq))

ComputeSSQ (Mapper)

Commercial in Confidence 30Numerical ExcellenceCommercial in Confidence

static class CombineData implements Function2ltDataClassDataClassDataClassgt

Override

public DataClass call(DataClass data1 DataClass data2) throws Exception

G02BZ g02bz = new G02BZ()

g02bzeval(M data1meanslength data1length data1means

data1ssq data2length data2means data2ssq IFAIL)

data1length = (g02bzgetXSW())

return data1

CombineData (Reducer)

Commercial in Confidence 31Numerical ExcellenceCommercial in Confidence

The Data (in Spark)

[Diagram: the same six rows partitioned across Slave 1 and Slave 2, with the per-partition results reduced back to the Master.]

Commercial in Confidence 32Numerical ExcellenceCommercial in Confidence

Test 1

Data ranges in size from 2 GB - 64 GB on an Amazon EC2 cluster

Used an 8-slave xlarge cluster (16 GB RAM)

Test 2

Varied the number of slave nodes from 2 - 16

Used 16 GB of data to see how the algorithms scale

Linear Regression Example

Commercial in Confidence 33Numerical ExcellenceCommercial in Confidence

Cluster Utilization

Commercial in Confidence 34Numerical ExcellenceCommercial in Confidence

Results of Linear Regression

Commercial in Confidence 35Numerical ExcellenceCommercial in Confidence

Results of Scaling

Commercial in Confidence 36Numerical ExcellenceCommercial in Confidence

NAG vs MLlib

Commercial in Confidence 37Numerical ExcellenceCommercial in Confidence

Java Spark Documentation: return Arrays.asList(new DataClass(length, means, ssq));

Distributing libraries

spark-ec2/copy-dir

Setting environment variables

LD_LIBRARY_PATH - needs to be set as you submit the job:
$ spark-submit --jars NAGJava.jar --conf spark.executor.extraLibraryPath=$Path-to-Libraries-on-Worker-Nodes simpleNAGExample.jar

NAG_KUSARI_FILE (NAG license file) - can be set in code via sc.setExecutorEnv

Problems Encountered (1)
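A minimal sketch of setting these up from code, assuming the standard SparkConf API; the paths and values are illustrative, and SparkConf.setExecutorEnv is used here in place of the sc.setExecutorEnv call mentioned above.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class NAGEnvSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("simpleNAGExample")
            .setMaster("local[2]")
            // Same effect as --conf spark.executor.extraLibraryPath=... on spark-submit;
            // the path is illustrative.
            .set("spark.executor.extraLibraryPath", "/opt/nag/lib")
            // Licence file for the executors; variable name from the slides, path illustrative.
            .setExecutorEnv("NAG_KUSARI_FILE", "/opt/nag/licence/nag.key");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("extraLibraryPath = " + conf.get("spark.executor.extraLibraryPath"));
        sc.stop();
    }
}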

Commercial in Confidence 38Numerical ExcellenceCommercial in Confidence

The data problem

Data locality

Inverse of singular matrix

Problems Encountered (2)

Label    X1   X2   X3   X4
00       180  00   222  1
00       180  00   811  0
680509   423  121  232  1
865650   473  195  719  2
651515   473  74   168  0
785641   532  114  174  1
565567   349  107  684  0

Commercial in Confidence 39Numerical ExcellenceCommercial in Confidence

Debugging slave nodes

Throw lots of exceptions

Int sizes with Big Data and a numerical library

32 vs 64 bit integers

Some algorithms don't work

nag_approx_quantiles_fixed (g01anc) finds approximate quantiles from a data stream of known size using an out-of-core algorithm

Problems Encountered (3)
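The 32- vs 64-bit issue typically surfaces when a partition's packed array would need more than Integer.MAX_VALUE elements, since Java arrays are indexed by int. A small illustrative guard follows; the names are not from the presentation's code.

public class PartitionSizeCheck {

    static double[] allocateBlock(long rows, long cols) {
        long elements = rows * cols;               // do the arithmetic in 64 bits first
        if (elements > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Partition too large for a single Java array: " + elements
                + " elements; repartition the RDD into smaller pieces.");
        }
        return new double[(int) elements];         // safe to narrow to int now
    }

    public static void main(String[] args) {
        double[] ok = allocateBlock(1_000_000, 5);
        System.out.println("Allocated " + ok.length + " doubles");
        allocateBlock(1_000_000_000L, 10);          // throws: would overflow an int index
    }
}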

Commercial in Confidence 40Numerical ExcellenceCommercial in Confidence

NAG Introduction

Numerical Computing

Linear Regression

Linear Regression Example on Spark

Timings

Problems Encountered

MLlib Algorithms

Outline

Commercial in Confidence 41Numerical ExcellenceCommercial in Confidence

Basic statistics

summary statistics

correlations

stratified sampling

hypothesis testing

random data generation

Classification and regression

linear models (SVMs logistic regression linear regression)

naive Bayes

decision trees

ensembles of trees (Random Forests and Gradient-Boosted Trees)

isotonic regression

Collaborative filtering

alternating least squares (ALS)

Clustering

k-means

Gaussian mixture

power iteration clustering (PIC)

latent Dirichlet allocation (LDA)

streaming k-means

Dimensionality reduction

singular value decomposition (SVD)

principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining

FP-growth

Optimization (developer)

stochastic gradient descent

limited-memory BFGS (L-BFGS)

MLlib Algorithms

Commercial in Confidence 42Numerical ExcellenceCommercial in Confidence

DataFrames

H2O Sparkling Water

MLlib pipeline

Better optimizers

Improved APIs

SparkR

Contains all the components for better numerical computations. Use existing R code on Spark.

Free. Huge open source community.

Other Exciting Happenings

Commercial in Confidence 43Numerical ExcellenceCommercial in Confidence

Call algorithm with data in-memory

If data is small enough

Sample

Obtain estimates for parameters

Reformulate the problem

MapReduce existing functions across slaves

How to use existing libraries on Spark

Commercial in Confidence 44Numerical ExcellenceCommercial in Confidence

Thanks

For more information

Email: brian.spector@nag.com

Check out The NAG Blog

http://blog.nag.com/2015/02/advanced-analytics-on-apache-spark.html

NAG and Apache Spark
