
Page 1: SystemML's Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs


© 2015 IBM Corporation


Matthias Boehm

IBM Research – Almaden

San Jose, CA, USA


Acknowledgements: A. V. Evfimievski, F. Makari Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda, M. W. Dusenberry, D. Eriksson, C. R. Kadner, J. Kim, N. Kokhlikyan, D. Kumar, M. Li, L. Resende, A. Singh, A. C. Surve, G. Weidner

Page 2: Outline

• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture

• Part II: SystemML’s Optimizer
– SystemML’s Compilation Chain

– Rewrites and Operator Selection

– Dynamic Recompilation

– Spark-Specific Optimizer Extensions

Open Source:
• Apache Incubator Project (11/2015)

• Website: http://systemml.apache.org/

• Source code: https://github.com/apache/incubator-systemml


[A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce, ICDE 2011]

[M. Boehm et al. SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull., 2014]

[M. Boehm et al. SystemML: Declarative Machine Learning on Spark, submitted]

Page 3: Case Study: An Automobile Manufacturer

• Goal: Design a model to predict car reacquisition

[Figure: warranty claims, repair history, and diagnostic readouts feed predictive features and labels into several ML algorithms that produce candidate models; challenges: class skew, low precision]

Result: 25x improvement in precision!

Page 4: Common Patterns Across Customers

• Algorithm customization
• Changes in feature set
• Changes in data size
• Quick iteration

Custom Analytics → Declarative Machine Learning

Page 5: Motivation: Large-Scale Machine Learning

• Need for Large-Scale ML
– Analyzing large datasets
– Advanced analytics / machine learning → gain value from collected data
– Specific workload characteristics → specialized systems for large-scale ML

• Landscape of Existing Work
– Classification by abstraction level (different target users)

– Distributed Systems w/ DSLs: Spark, Flink, REEF, GraphLab
– Large-Scale ML Libraries (fixed plan): MLlib, Mahout MR, MADlib, ORE, Rev R, HP Dist R, custom algorithms
– Declarative ML (fixed algorithm): SystemML, (Mahout Samsara, Tupleware, Cumulon, DMac); single-node: (R, Matlab, SAS)
– Declarative ML++ (fixed task): MLbase*, specific systems

Page 6: Declarative Machine Learning – Requirements

• Goal: Write ML algorithms independent of input data and cluster characteristics

• R1: Full flexibility
– Specify new / customize existing ML algorithms
→ ML DSL

• R2: Data independence
– Hide physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression)
→ Abstract data types and coarse-grained logical operations

• R3: Efficiency and scalability
– Very small to very large use cases
→ Automatic optimization and hybrid runtime plans

• R4: Specified algorithm semantics
– Understand, debug and control algorithm behavior
→ Optimization for performance only, not accuracy

Automatic optimization is key here!

Page 7: Simplified DML Examples

Linear Regression (Direct Solve via Normal Equations)

Logistic Regression (via Trust Region Method)

Spectrum of Use Cases (size, characteristics)

DML (Declarative Machine Learning Language)
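As a reference point, a minimal DML sketch of the direct-solve linear regression (consistent with the snippet on Page 12; $1–$4 are command-line script parameters):

X = read($1);
y = read($2);
lambda = 0.001;
A = t(X) %*% X + diag(matrix(1, ncol(X), 1)) * lambda;
b = t(X) %*% y;
beta = solve(A, b);
write(beta, $4);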


Page 8: High-Level SystemML Architecture

Layers: DML scripts → Language → Compiler → Runtime (this talk: the compiler)

Execution: in-memory single node (scale-up, since 2012) or Hadoop / Spark cluster (scale-out, since 2010 / since 2015)

Page 9: SystemML Architecture (APIs and runtime)

APIs: Command Line, JMLC, Spark MLContext, Spark ML APIs

Parser/Language

Compiler: High-Level Operators (HOPs), Low-Level Operators (LOPs), cost-based optimizations

Runtime: Control Program (runtime programs; CP / MR / Spark instructions), Recompiler, ParFor Optimizer/Runtime, Buffer Pool (Mem/FS IO, DFS IO), Generic MR Jobs, MatrixBlock Library (single/multi-threaded)

Page 10: Outline

• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture

• Part II: SystemML’s Optimizer
– SystemML’s Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions

Page 11: SystemML’s Compilation Chain

• Various optimization decisions at different compilation steps

[Figure: compilation chain from HOPs (high-level operators) over LOPs (low-level operators) to runtime programs, with hooks for DEBUG, EXPLAIN (hops/runtime), STATS, inter-procedural analysis (IPA), ParFor optimization, resource optimization, and global data flow (GDF) optimization]

[Matthias Boehm et al.: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull., 2014]

Page 12: Basic HOP and LOP DAG Compilation

• Example: Linreg DS

X = read($1);
y = read($2);
intercept = $3;
lambda = 0.001;
...
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
I = matrix(1, ncol(X), 1);
A = t(X) %*% X + diag(I)*lambda;
b = t(X) %*% y;
beta = solve(A, b);
...
write(beta, $4);

Scenario: X: 10^8 x 10^3 (10^11 cells), y: 10^8 x 1; cluster config: client memory 4 GB, map/reduce memory 2 GB

[Figure: HOP DAG (after rewrites) annotated with worst-case size/memory estimates (e.g., 800 GB for X, 800 MB for y, 8 MB for the t(X)%*%X result) and per-operator execution types (CP/MR); LOP DAG (after rewrites) with operators part(CP), mapmm(MR), tsmm(MR), ak+(MR), r’(CP) and map/reduce boundaries]

• Hybrid runtime plans:
– Size propagation over ML programs
– Worst-case sparsity / memory estimates (worked example below)
– Integrated CP / MR / Spark runtime
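To make these estimates concrete (assuming dense 8-byte doubles, as the worst-case estimator does): X with 10^8 x 10^3 cells occupies about 10^11 * 8 B ≈ 800 GB and y about 10^8 * 8 B ≈ 800 MB, while the output of t(X) %*% X is only 10^3 x 10^3 * 8 B = 8 MB. The two large matrix multiplies therefore exceed the 4 GB client memory and are compiled to MR operators, whereas the final solve(A, b) over 10^3 x 10^3 inputs runs in CP.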


Page 13: Static and Dynamic Rewrites

• Types of Rewrites
– Static: size-independent rewrites
– Dynamic: size-dependent rewrites

• Example Static Rewrites
– Common Subexpression Elimination
– Constant Folding
– Static Algebraic Simplification Rewrites
– Branch Removal
– Right/Left Indexing Vectorization
– For Loop Vectorization
– Checkpoint Injection (caching)
– Repartition Injection

• Example Dynamic Rewrites
– Matrix Multiplication Chain Optimization
– Dynamic Algebraic Simplification Rewrites

Cascading rewrite effect (enables other rewrites, IPA, operator selection); high performance impact (direct/indirect). A sketch of the cascading effect follows below.
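A minimal sketch of that cascading effect (assuming the intercept flag is bound to the literal 0): constant folding evaluates the predicate to FALSE, branch removal drops the entire if-block, and because X is no longer conditionally appended, its dimensions stay known for later size-dependent rewrites and operator selection:

X = read($1);
intercept = 0;
# constant folding turns (intercept == 1) into FALSE;
# branch removal then deletes the dead block below
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
print(ncol(X));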

Page 14: Example Algebraic Simplification Rewrites (1)

• Static simplification rewrites: size-independent patterns

Rewrite categories and static patterns (a DML sketch follows the table):

– Remove Unnecessary Operations:
t(t(X)), X/1, X*1, X-0, -(-X) → X
matrix(1,)/X → 1/X
sum(t(X)) → sum(X)
rand(,min=-1,max=1)*7 → rand(,min=-7,max=7)
-rand(,min=-2,max=1) → rand(,min=-1,max=2)
t(cbind(t(X),t(Y))) → rbind(X,Y)

– Simplify Bushy Binary:
X*(Y*(Z%*%v)) → (X*Y)*(Z%*%v)

– Binary to Unary:
X+X → 2*X
X*X → X^2
X-X*Y → X*(1-Y)
X*(1-X) → sprop(X)
1/(1+exp(-X)) → sigmoid(X)
X*(X>0) → selp(X)
(X-7)*(X!=0) → X -nz 7
(X!=0)*log(X) → log_nz(X)
aggregate(X,y,count) → aggregate(y,y,count)

– Simplify Permutation Matrix Construction:
outer(v,seq(1,N),"==") → rexpand(v,max=N,row)
table(seq(1,nrow(v)),v,N) → rexpand(v,max=N,row)

– Simplify Operation over Matrix Multiplication:
trace(X%*%Y) → sum(X*t(Y))
(X%*%Y)[7,3] → X[7,] %*% Y[,3]
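As a concrete illustration, a minimal DML sketch (hypothetical sizes): the binary-to-unary patterns rewrite hand-written expressions into fused unary operators automatically, with no change to the script:

X = rand(rows=1000, cols=10);
W = rand(rows=10, cols=1);
# 1/(1+exp(-Z)) is rewritten to sigmoid(Z)
P = 1 / (1 + exp(-(X %*% W)));
# P*(1-P) is rewritten to sprop(P), the sample proportion
V = P * (1 - P);
print(sum(V));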

Page 15: Example Algebraic Simplification Rewrites (2)

• Dynamic simplification rewrites: size-dependent patterns

Rewrite categories and dynamic patterns (a DML sketch follows the table):

– Remove / Simplify Unnecessary Indexing:
X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y)
X = Y[, 1] → X = Y iff dims(X)=dims(Y)
X[,1]=Y; X[,2]=Z → X=cbind(Y,Z) iff ncol(X)=2

– Fuse / Pushdown Operations:
t(rand(10, 1)) → rand(1, 10) iff nrow/ncol=1
sum(diag(X)) → trace(X) iff ncol(X)>1
diag(X)*7 → diag(X*7) iff ncol(X)=1
sum(X^2) → t(X)%*%X iff ncol(X)=1, → sumSq(X) iff ncol(X)>1

– Remove Empty / Unnecessary Operations:
X%*%Y → matrix(0,…) iff nnz(X)=0 | nnz(Y)=0
X*Y → matrix(0,…), X+Y → X, X-Y → X iff nnz(Y)=0
round(X) → matrix(0), t(X) → matrix(0) iff nnz(X)=0
X*(Y%*%matrix(1,)) → X*Y iff ncol(Y)=1

– Simplify Aggregates / Scalar Operations:
rowSums(X) → sum(X) iff nrow(X)=1, → X iff ncol(X)=1
rowSums(X*Y) → X%*%t(Y) iff nrow(Y)=1
X*Y → X*as.scalar(Y) iff nrow(Y)=1 & ncol(Y)=1

– Simplify Diag Matrix Multiplications:
diag(X)%*%Y → Y*X iff ncol(X)=1 & ncol(Y)>1
diag(X%*%Y) → rowSums(X*t(Y)) iff ncol(Y)>1
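A minimal sketch of why these patterns are size-dependent (hypothetical sizes): the same expression sum(X^2) compiles differently depending on the shape of its input, which may only become known at runtime:

v = rand(rows=1000000, cols=1);
M = rand(rows=1000, cols=1000);
print(sum(v^2));   # ncol=1: rewritten to t(v) %*% v
print(sum(M^2));   # ncol>1: rewritten to the fused unary aggregate sumSq(M)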

Page 16: Example Operator Selection: Matrix Multiplication

Example: t(X)%*%y

[Figure: HOP DAG r(t) → ba(+*) over X (MR) and y, compiled to a LOP plan with Partition(CP,col), MapMM(MR,left), Group(MR), Aggregate(MR,ak+), and Transform(CP,’)]

Exec type, MM operator, and pattern:

– CP:
MM: X %*% Y
MMChain: t(X) %*% (w * (X %*% v))
TSMM: t(X) %*% X
PMM: rmr(diag(v)) %*% X

– MR / Spark (* only Spark):
MapMM: X %*% Y
MapMMChain: t(X) %*% (w * (X %*% v))
TSMM: t(X) %*% X
ZipMM*: t(X) %*% Y
CPMM: X %*% Y
RMM: X %*% Y
PMM: rmr(diag(v)) %*% X

• Hop-Lop rewrites (a sketch of the transpose-MM rewrite follows below)
– Aggregation (w/o, singleblock/multiblock)
– Partitioning (w/o, CP/MR, col/rowblock)
– Empty block materialization in output
– Transpose-MM rewrite: t(X)%*%y → t(t(y)%*%X)
– CP degree of parallelism (multi-threaded mm)
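A minimal sketch of where the transpose-MM rewrite pays off (hypothetical sizes; assuming a tall-and-skinny X): instead of materializing the transpose of the large X, only the small 1 x 100 intermediate of t(y)%*%X is transposed:

X = rand(rows=10000000, cols=100);
y = rand(rows=10000000, cols=1);
# compiled as t(t(y) %*% X); X itself is never transposed
g = t(X) %*% y;
print(sum(g));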

Page 17: Example Fused Operators (1): MMChain

• Matrix multiplication chains: q = t(X) %*% (w * (X %*% v))
– Very common pattern
– MV ops are IO / memory-bandwidth bound
– Problem: the data dependency forces two passes over X

→ Fused mmchain operator
– Key observation: the values of D = w * (X %*% v) are row-aligned w.r.t. X
– Single-pass operation (map-side in MR/Spark, cache-conscious in CP/GPU)

[Figure: Step 1: D = w * (X v); Step 2a: q = t(X) D, or equivalently Step 2b: t(q) = t(D) X]

A sketch of the pattern in DML follows below.

[Arash Ashari et al.: On optimizing machine learning workloads via kernel fusion. PPoPP 2015]
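A minimal DML sketch of the pattern (hypothetical sizes); the script stays a plain sequence of operations, and the compiler fuses them into a single mmchain operator with one pass over X:

X = rand(rows=1000000, cols=100);
v = rand(rows=100, cols=1);
w = rand(rows=1000000, cols=1);
q = t(X) %*% (w * (X %*% v));   # fused mmchain
print(sum(q));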

Page 18: Example Fused Operators (2): WSLoss

• Weighted squared loss: wsl = sum(W * (X - L %*% t(R))^2)
– Common pattern for factorization algorithms
– W and X usually very sparse (< 0.001)
– Problem: the "outer" product L %*% t(R) creates three dense intermediates in the size of X

→ Fused wsloss operator
– Key observations: a sparse W* allows selective computation, and the full aggregate significantly reduces memory requirements

[Figure: wsl = sum(W * (X - L t(R))^2) evaluated in one fused pass]

A sketch of the pattern in DML follows below.
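A minimal DML sketch of the pattern (hypothetical rank 10; in a real factorization algorithm L and R would be updated each iteration): the loss is computed selectively over the nonzeros of W, so no dense intermediate of size(X) is materialized:

X = read($1);                      # sparse ratings matrix
W = (X != 0);                      # weights: observed cells only
L = rand(rows=nrow(X), cols=10);
R = rand(rows=ncol(X), cols=10);
wsl = sum(W * (X - L %*% t(R))^2); # fused wsloss
print(wsl);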

Page 19: Example Poisson NMF (PNMF)

• Example script

while( iter < max_iterations ) {
  iter = iter + 1;
  H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W));               # wdivmm (left)
  W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H));               # wdivmm (right)
  obj = as.scalar(colSums(W) %*% rowSums(H)) - sum(V * log(W%*%H)); # wcemm
  print("ITER=" + iter + " obj=" + obj);
}

• Notes
– Similar complex patterns in various factorization (ENMF, ALS) and deep learning algorithms (GloVe, SkipGram)
– Automatic optimization via dynamic simplification rewrites
– Different fused physical operators (cp, mr/spark map/red)
– Details: https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt

Page 20: Rewrites and Operator Selection in Action

• Example: use case Mlogreg, X: 10^8 x 10^3, K=1 (2 classes), 2 GB memory

• Applied rewrites (recall: cascading rewrite effect)
– Original DML snippet of the inner loop:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% (Q - P[, 1:K] * (rowSums(Q) %*% matrix(1, rows=1, cols=K)));

– After removing the unnecessary (1) matrix multiply and (2) unary aggregate:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% (Q - P[, 1:K] * Q);

– After simplifying the distributive binary operation:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% ((1 - P[, 1:K]) * Q);

– After simplifying the bushy binary operation:
HV = t(X) %*% (((1 - P[, 1:K]) * P[, 1:K]) * (X %*% ssX_V));

– After fusing the binary DAG to a unary operation (sample proportion):
HV = t(X) %*% (sprop(P[, 1:K]) * (X %*% ssX_V));

• Operator selection
– Exec type: MR, because the memory estimate > 800 GB
– MM type: MapMMChain, because of the XtwXv pattern and w = sprop(P[, 1:K]) < 2 GB
– CP partitioning of w into 32 MB chunks of rowblocks

Page 21: Dynamic Recompilation – Motivation

• Problem of unknown/changing sizes
– Unknown or changing sizes and sparsity of intermediates (across loop iterations / conditional control flow)
– These unknowns lead to very conservative fallback plans

• Example ML program scenarios
– Scripts w/ complex function call patterns
– Scripts w/ UDFs
– Data-dependent operators, e.g.:
Y = table( seq(1,nrow(X)), y );
grad = t(X) %*% (P - Y);
– Computed size expressions
– Changing dimensions or sparsity

→ Dynamic recompilation techniques as a robust fallback strategy
– Shares goals and challenges with adaptive query processing
– However, ML domain-specific techniques and rewrites

Example: Stepwise LinregDS

while( continue ) {
  parfor( i in 1:n ) {
    if( fixed[1,i]==0 ) {
      X = cbind(Xg, Xorig[,i]);
      AIC[1,i] = linregDS(X, y);
    }
  }
  # select and append best feature
}

Page 22: Dynamic Recompilation – Compiler and Runtime

• Optimizer recompilation decisions
– Split HOP DAGs for recompilation: prevent unknowns but keep DAGs as large as possible; we split after reads w/ unknown sizes and specific operators
– Mark HOP DAGs for recompilation: MR due to unknown sizes / sparsity

• Dynamic recompilation at runtime on recompilation hooks (last-level program blocks, predicates, recompile-once functions, specific MR jobs)
– Deep copy DAG (e.g., for non-reversible dynamic rewrites)
– Update DAG statistics (based on exact symbol table meta data)
– Dynamic rewrites (exact stats allow very aggressive rewrites)
– Recompute memory estimates (w/ unconditional scope of a single DAG)
– Generate runtime instructions (construct LOPs / instructions)

[Figure: recompiling grad = t(X) %*% (P - Y): with unknown sizes ([1Mx-1,-1]) the DAG compiles to conservative MR operators; after updating DAG statistics from the symbol table (X: 1Mx100, 99M nnz; P, Y: 1Mx7, 7M nnz), all operators run in CP]

A sketch of a data-dependent operator follows below.
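A minimal DML sketch of such a data-dependent operator (hypothetical file inputs): the number of columns of Y equals the largest label value in y, which is unknown until runtime, so the surrounding DAG is marked for recompilation:

X = read($1);    # features, e.g., 1M x 100
y = read($2);    # labels in 1..K, 1M x 1
P = read($3);    # predictions, 1M x K
Y = table(seq(1, nrow(X)), y);   # output size depends on the values in y
grad = t(X) %*% (P - Y);
write(grad, $4);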

Page 23: Spark-Specific Optimizations

• Spark-specific rewrites
– Automatic caching/checkpoint injection (MEM_DISK / MEM_DISK_SER)
– Automatic repartition injection

• Operator selection
– Spark exec type selection
– Transitive Spark exec type
– Physical operator selection

• Extended ParFor optimizer
– Deferred checkpoint/repartition injection
– Eager checkpointing/repartitioning
– Fair scheduling for concurrent jobs
– Local degree of parallelism

• Runtime optimizations
– Lazy Spark context creation
– Short-circuit read/collect

Example: checkpoint injection in LinregCG (chkpt X MEM_DISK, injected after the read because X is reused in every iteration):

X = read($1);
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X) %*% (X %*% p) + lambda*p;
  alpha = norm_r2 / (t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);

[Cluster config: Spark executors w/ 24 cores; memory fractions: 60% data, 20% shuffle, 20% tasks]

Page 24: Excursus: Spark Buffer Pool Integration

• Distributed matrix representation
– Binary block matrices (JavaPairRDD<MatrixIndexes, MatrixBlock>)
– Serialization: block formats (dense, sparse, ultra-sparse, empty)
– Hash-partitioning

[Figure: logical blocking vs physical blocking and partitioning, w/ block size Bc=1,000]

• Buffer pool integration
– Basic buffer pool integration
– Lineage tracking of RDDs/broadcasts
– Guarded RDD collect/parallelize
– Partitioned broadcast variables

Page 25: Partitioning-Preserving Operations on Spark

• Partitioning-preserving ops
– An op is partitioning-preserving if the keys are guaranteed unchanged
– 1) Implicit: use restrictive APIs (mapValues() vs mapToPair())
– 2) Explicit: partition computation w/ declaration of partitioning-preserving behavior (memory-efficiency via "lazy iterators")

• Partitioning-exploiting ops
– 1) Implicit: operations based on join, cogroup, etc.
– 2) Explicit: custom physical operators on the original keys (e.g., zipmm)

• Example: Multiclass SVM
– Vectors of size nrow(X) fit neither into the driver nor into a broadcast
– ncol(X) ≤ Bc

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;          # zipmm on original keys
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local); ...
  }
}

Injected ops: repart, chkpt X MEM_DISK; chkpt Y_local MEM_DISK; chkpt Xd, Xw MEM_DISK

Page 26: From SystemR to SystemML – A Comparison

• Similarities
– Declarative specification (fixed semantics): SQL vs DML
– Simplification rewrites (Starburst QGM rewrites vs static/dynamic rewrites)
– Operator selection (physical operators for join vs matrix multiply)
– Operator reordering (join enumeration vs matrix multiplication chain opt)
– Adaptive query processing (progressive reoptimization vs dynamic recompile)
– Physical layout (NSM/DSM/PAX page layouts vs dense/sparse block formats)
– Buffer pool (pull-based page cache vs anti-caching of in-memory variables)
– Advanced optimizations (source code gen, compression, GPUs, etc.)
– Cost model / stats (estimated time for IO/compute/latency; histograms vs dims/nnz)

• Differences
– Algebra (relational algebra vs linear algebra)
– Programs (query trees vs DAGs, conditional control flow, often iterative)
– Optimizations (algebra-specific semantics, rewrites, and constraints)
– Scale (10s-100s vs 10s-10,000s of operators)
– Data preparation (ETL vs feature engineering)
– Physical design, transaction processing, multi-tenancy, etc.

Page 27: Conclusions

• Takeaway message
– The right abstraction level matters → flexibility and independence
– Automatic optimization of declarative ML programs is important and challenging → efficiency and scalability

• Summary
– Declarative Machine Learning
– SystemML’s Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions

• Ongoing / future work
– Advanced special-purpose optimizers (e.g., parfor, global data flow opt)
– Optimizer/runtime support for next-gen runtime platforms (e.g., YARN, Spark)
– Low-level optimization techniques (e.g., compression, NUMA, source code gen)
– Integrated HW accelerators (e.g., GPUs, many cores)
– Benchmarking ML systems (data/workload characteristics; accuracy/runtime)

Page 28: SystemML is Open Source

• Apache Incubator Project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml

IBM Spark Technology Center, 425 Market St., San Francisco
• Growing pool of contributors
• Founding member of AMPLab
• Partnerships in the ecosystem