
Page 1: SystemML's Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs


© 2015 IBM Corporation


Matthias Boehm

IBM Research – Almaden

San Jose, CA, USA


Acknowledgements: A. V. Evfimievski, F. Makari Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda, M. W. Dusenberry, D. Eriksson, C. R. Kadner, J. Kim, N. Kokhlikyan, D. Kumar, M. Li, L. Resende, A. Singh, A. C. Surve, G. Weidner

Page 2: Outline

• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture

• Part II: SystemML’s Optimizer
– SystemML’s Compilation Chain

– Rewrites and Operator Selection

– Dynamic Recompilation

– Spark-Specific Optimizer Extensions

Open Source:
• Apache Incubator Project (11/2015)

• Website: http://systemml.apache.org/

• Source code: https://github.com/apache/incubator-systemml


[A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce, ICDE 2011]

[M. Boehm et al. SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull., 2014]

[M. Boehm et al. SystemML: Declarative Machine Learning on Spark, submitted]

Page 3: Case Study: An Automobile Manufacturer

• Goal: Design a model to predict car reacquisition

[Figure: warranty claims, repair history, and diagnostic readouts feed predictive features and labels into several ML algorithms that produce candidate models; challenges: class skew, low precision]

Result: 25x improvement in precision!

Page 4: Common Patterns Across Customers

• Algorithm customization
• Changes in feature set
• Changes in data size
• Quick iteration

Custom Analytics → Declarative Machine Learning

Page 5: Motivation: Large-Scale Machine Learning

• Need for Large-Scale ML
– Analyzing large datasets
– Advanced analytics / machine learning → gain value from collected data
– Specific workload characteristics → specialized systems for large-scale ML

• Landscape of Existing Work
– Classification by abstraction level (different target users)

– Distributed Systems w/ DSLs: Spark, Flink, REEF, GraphLab
– Large-Scale ML Libraries (fixed plan): MLlib, Mahout MR, MADlib, ORE, Rev R, HP Dist R, custom algorithms
– Declarative ML (fixed algorithm): SystemML, (Mahout Samsara, Tupleware, Cumulon, DMac); single-node: (R, Matlab, SAS)
– Declarative ML++ (fixed task): MLbase*, specific systems

Page 6: Declarative Machine Learning – Requirements

• Goal: Write ML algorithms independent of input data and cluster characteristics

• R1: Full flexibility
– Specify new / customize existing ML algorithms
→ ML DSL

• R2: Data independence
– Hide physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression)
→ Abstract data types and coarse-grained logical operations

• R3: Efficiency and scalability
– Very small to very large use cases
→ Automatic optimization and hybrid runtime plans

• R4: Specified algorithm semantics
– Understand, debug and control algorithm behavior
→ Optimization for performance only, not accuracy

Automatic optimization is key here!

Page 7: Simplified DML Examples

Linear Regression (Direct Solve via Normal Equations)

Logistic Regression (via Trust Region Method)

Spectrum of Use Cases (size, characteristics)

DML (Declarative Machine Learning Language)
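As a reference point, a minimal DML sketch of the direct-solve linear regression (consistent with the snippet on Page 12; $1–$4 are command-line script parameters):

X = read($1);
y = read($2);
lambda = 0.001;
A = t(X) %*% X + diag(matrix(1, ncol(X), 1)) * lambda;
b = t(X) %*% y;
beta = solve(A, b);
write(beta, $4);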


Page 8: High-Level SystemML Architecture

Layers: DML scripts → Language → Compiler → Runtime (this talk: the compiler)

Execution: in-memory single node (scale-up, since 2012) or Hadoop / Spark cluster (scale-out, since 2010 / since 2015)

Page 9: SystemML Architecture (APIs and runtime)

APIs: Command Line, JMLC, Spark MLContext, Spark ML APIs

Parser/Language

Compiler: High-Level Operators (HOPs), Low-Level Operators (LOPs), cost-based optimizations

Runtime: Control Program (runtime programs; CP / MR / Spark instructions), Recompiler, ParFor Optimizer/Runtime, Buffer Pool (Mem/FS IO, DFS IO), Generic MR Jobs, MatrixBlock Library (single/multi-threaded)

Page 10: Outline

• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture

• Part II: SystemML’s Optimizer
– SystemML’s Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions

Page 11: SystemML’s Compilation Chain

• Various optimization decisions at different compilation steps

[Figure: compilation chain from HOPs (high-level operators) over LOPs (low-level operators) to runtime programs, with hooks for DEBUG, EXPLAIN (hops/runtime), STATS, inter-procedural analysis (IPA), ParFor optimization, resource optimization, and global data flow (GDF) optimization]

[Matthias Boehm et al.: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull., 2014]

Page 12: Basic HOP and LOP DAG Compilation

• Example: Linreg DS

X = read($1);
y = read($2);
intercept = $3;
lambda = 0.001;
...
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
I = matrix(1, ncol(X), 1);
A = t(X) %*% X + diag(I)*lambda;
b = t(X) %*% y;
beta = solve(A, b);
...
write(beta, $4);

Scenario: X: 10^8 x 10^3 (10^11 cells), y: 10^8 x 1; cluster config: client memory 4 GB, map/reduce memory 2 GB

[Figure: HOP DAG (after rewrites) annotated with worst-case size/memory estimates (e.g., 800 GB for X, 800 MB for y, 8 MB for the t(X)%*%X result) and per-operator execution types (CP/MR); LOP DAG (after rewrites) with operators part(CP), mapmm(MR), tsmm(MR), ak+(MR), r’(CP) and map/reduce boundaries]

• Hybrid runtime plans:
– Size propagation over ML programs
– Worst-case sparsity / memory estimates (worked example below)
– Integrated CP / MR / Spark runtime
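To make these estimates concrete (assuming dense 8-byte doubles, as the worst-case estimator does): X with 10^8 x 10^3 cells occupies about 10^11 * 8 B ≈ 800 GB and y about 10^8 * 8 B ≈ 800 MB, while the output of t(X) %*% X is only 10^3 x 10^3 * 8 B = 8 MB. The two large matrix multiplies therefore exceed the 4 GB client memory and are compiled to MR operators, whereas the final solve(A, b) over 10^3 x 10^3 inputs runs in CP.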


Page 13: Static and Dynamic Rewrites

• Types of Rewrites
– Static: size-independent rewrites
– Dynamic: size-dependent rewrites

• Example Static Rewrites
– Common Subexpression Elimination
– Constant Folding
– Static Algebraic Simplification Rewrites
– Branch Removal
– Right/Left Indexing Vectorization
– For Loop Vectorization
– Checkpoint Injection (caching)
– Repartition Injection

• Example Dynamic Rewrites
– Matrix Multiplication Chain Optimization
– Dynamic Algebraic Simplification Rewrites

Cascading rewrite effect (enables other rewrites, IPA, operator selection); high performance impact (direct/indirect). A sketch of the cascading effect follows below.
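A minimal sketch of that cascading effect (assuming the intercept flag is bound to the literal 0): constant folding evaluates the predicate to FALSE, branch removal drops the entire if-block, and because X is no longer conditionally appended, its dimensions stay known for later size-dependent rewrites and operator selection:

X = read($1);
intercept = 0;
# constant folding turns (intercept == 1) into FALSE;
# branch removal then deletes the dead block below
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
print(ncol(X));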

Page 14: Example Algebraic Simplification Rewrites (1)

• Static simplification rewrites: size-independent patterns

Rewrite categories and static patterns (a DML sketch follows the table):

– Remove Unnecessary Operations:
t(t(X)), X/1, X*1, X-0, -(-X) → X
matrix(1,)/X → 1/X
sum(t(X)) → sum(X)
rand(,min=-1,max=1)*7 → rand(,min=-7,max=7)
-rand(,min=-2,max=1) → rand(,min=-1,max=2)
t(cbind(t(X),t(Y))) → rbind(X,Y)

– Simplify Bushy Binary:
X*(Y*(Z%*%v)) → (X*Y)*(Z%*%v)

– Binary to Unary:
X+X → 2*X
X*X → X^2
X-X*Y → X*(1-Y)
X*(1-X) → sprop(X)
1/(1+exp(-X)) → sigmoid(X)
X*(X>0) → selp(X)
(X-7)*(X!=0) → X -nz 7
(X!=0)*log(X) → log_nz(X)
aggregate(X,y,count) → aggregate(y,y,count)

– Simplify Permutation Matrix Construction:
outer(v,seq(1,N),"==") → rexpand(v,max=N,row)
table(seq(1,nrow(v)),v,N) → rexpand(v,max=N,row)

– Simplify Operation over Matrix Multiplication:
trace(X%*%Y) → sum(X*t(Y))
(X%*%Y)[7,3] → X[7,] %*% Y[,3]
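As a concrete illustration, a minimal DML sketch (hypothetical sizes): the binary-to-unary patterns rewrite hand-written expressions into fused unary operators automatically, with no change to the script:

X = rand(rows=1000, cols=10);
W = rand(rows=10, cols=1);
# 1/(1+exp(-Z)) is rewritten to sigmoid(Z)
P = 1 / (1 + exp(-(X %*% W)));
# P*(1-P) is rewritten to sprop(P), the sample proportion
V = P * (1 - P);
print(sum(V));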

Page 15: Example Algebraic Simplification Rewrites (2)

• Dynamic simplification rewrites: size-dependent patterns

Rewrite categories and dynamic patterns (a DML sketch follows the table):

– Remove / Simplify Unnecessary Indexing:
X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y)
X = Y[, 1] → X = Y iff dims(X)=dims(Y)
X[,1]=Y; X[,2]=Z → X=cbind(Y,Z) iff ncol(X)=2

– Fuse / Pushdown Operations:
t(rand(10, 1)) → rand(1, 10) iff nrow/ncol=1
sum(diag(X)) → trace(X) iff ncol(X)>1
diag(X)*7 → diag(X*7) iff ncol(X)=1
sum(X^2) → t(X)%*%X iff ncol(X)=1, → sumSq(X) iff ncol(X)>1

– Remove Empty / Unnecessary Operations:
X%*%Y → matrix(0,…) iff nnz(X)=0 | nnz(Y)=0
X*Y → matrix(0,…), X+Y → X, X-Y → X iff nnz(Y)=0
round(X) → matrix(0), t(X) → matrix(0) iff nnz(X)=0
X*(Y%*%matrix(1,)) → X*Y iff ncol(Y)=1

– Simplify Aggregates / Scalar Operations:
rowSums(X) → sum(X) iff nrow(X)=1, → X iff ncol(X)=1
rowSums(X*Y) → X%*%t(Y) iff nrow(Y)=1
X*Y → X*as.scalar(Y) iff nrow(Y)=1 & ncol(Y)=1

– Simplify Diag Matrix Multiplications:
diag(X)%*%Y → Y*X iff ncol(X)=1 & ncol(Y)>1
diag(X%*%Y) → rowSums(X*t(Y)) iff ncol(Y)>1
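A minimal sketch of why these patterns are size-dependent (hypothetical sizes): the same expression sum(X^2) compiles differently depending on the shape of its input, which may only become known at runtime:

v = rand(rows=1000000, cols=1);
M = rand(rows=1000, cols=1000);
print(sum(v^2));   # ncol=1: rewritten to t(v) %*% v
print(sum(M^2));   # ncol>1: rewritten to the fused unary aggregate sumSq(M)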

Page 16: Example Operator Selection: Matrix Multiplication

Example: t(X)%*%y

[Figure: HOP DAG r(t) → ba(+*) over X (MR) and y, compiled to a LOP plan with Partition(CP,col), MapMM(MR,left), Group(MR), Aggregate(MR,ak+), and Transform(CP,’)]

Exec type, MM operator, and pattern:

– CP:
MM: X %*% Y
MMChain: t(X) %*% (w * (X %*% v))
TSMM: t(X) %*% X
PMM: rmr(diag(v)) %*% X

– MR / Spark (* only Spark):
MapMM: X %*% Y
MapMMChain: t(X) %*% (w * (X %*% v))
TSMM: t(X) %*% X
ZipMM*: t(X) %*% Y
CPMM: X %*% Y
RMM: X %*% Y
PMM: rmr(diag(v)) %*% X

• Hop-Lop rewrites (a sketch of the transpose-MM rewrite follows below)
– Aggregation (w/o, singleblock/multiblock)
– Partitioning (w/o, CP/MR, col/rowblock)
– Empty block materialization in output
– Transpose-MM rewrite: t(X)%*%y → t(t(y)%*%X)
– CP degree of parallelism (multi-threaded mm)
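A minimal sketch of where the transpose-MM rewrite pays off (hypothetical sizes; assuming a tall-and-skinny X): instead of materializing the transpose of the large X, only the small 1 x 100 intermediate of t(y)%*%X is transposed:

X = rand(rows=10000000, cols=100);
y = rand(rows=10000000, cols=1);
# compiled as t(t(y) %*% X); X itself is never transposed
g = t(X) %*% y;
print(sum(g));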

Page 17: Example Fused Operators (1): MMChain

• Matrix multiplication chains: q = t(X) %*% (w * (X %*% v))
– Very common pattern
– MV ops are IO / memory-bandwidth bound
– Problem: the data dependency forces two passes over X

→ Fused mmchain operator
– Key observation: the values of D = w * (X %*% v) are row-aligned w.r.t. X
– Single-pass operation (map-side in MR/Spark, cache-conscious in CP/GPU)

[Figure: Step 1: D = w * (X v); Step 2a: q = t(X) D, or equivalently Step 2b: t(q) = t(D) X]

A sketch of the pattern in DML follows below.

[Arash Ashari et al.: On optimizing machine learning workloads via kernel fusion. PPoPP 2015]
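A minimal DML sketch of the pattern (hypothetical sizes); the script stays a plain sequence of operations, and the compiler fuses them into a single mmchain operator with one pass over X:

X = rand(rows=1000000, cols=100);
v = rand(rows=100, cols=1);
w = rand(rows=1000000, cols=1);
q = t(X) %*% (w * (X %*% v));   # fused mmchain
print(sum(q));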

Page 18: Example Fused Operators (2): WSLoss

• Weighted squared loss: wsl = sum(W * (X - L %*% t(R))^2)
– Common pattern for factorization algorithms
– W and X usually very sparse (< 0.001)
– Problem: the "outer" product L %*% t(R) creates three dense intermediates in the size of X

→ Fused wsloss operator
– Key observations: a sparse W* allows selective computation, and the full aggregate significantly reduces memory requirements

[Figure: wsl = sum(W * (X - L t(R))^2) evaluated in one fused pass]

A sketch of the pattern in DML follows below.
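A minimal DML sketch of the pattern (hypothetical rank 10; in a real factorization algorithm L and R would be updated each iteration): the loss is computed selectively over the nonzeros of W, so no dense intermediate of size(X) is materialized:

X = read($1);                      # sparse ratings matrix
W = (X != 0);                      # weights: observed cells only
L = rand(rows=nrow(X), cols=10);
R = rand(rows=ncol(X), cols=10);
wsl = sum(W * (X - L %*% t(R))^2); # fused wsloss
print(wsl);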

Page 19: Example Poisson NMF (PNMF)

• Example script

while( iter < max_iterations ) {
  iter = iter + 1;
  H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W));               # wdivmm (left)
  W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H));               # wdivmm (right)
  obj = as.scalar(colSums(W) %*% rowSums(H)) - sum(V * log(W%*%H)); # wcemm
  print("ITER=" + iter + " obj=" + obj);
}

• Notes
– Similar complex patterns in various factorization (ENMF, ALS) and deep learning algorithms (GloVe, SkipGram)
– Automatic optimization via dynamic simplification rewrites
– Different fused physical operators (cp, mr/spark map/red)
– Details: https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt

Page 20: Rewrites and Operator Selection in Action

• Example: use case Mlogreg, X: 10^8 x 10^3, K=1 (2 classes), 2 GB memory

• Applied rewrites (recall: cascading rewrite effect)
– Original DML snippet of the inner loop:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% (Q - P[, 1:K] * (rowSums(Q) %*% matrix(1, rows=1, cols=K)));

– After removing the unnecessary (1) matrix multiply and (2) unary aggregate:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% (Q - P[, 1:K] * Q);

– After simplifying the distributive binary operation:
Q = P[, 1:K] * (X %*% ssX_V);
HV = t(X) %*% ((1 - P[, 1:K]) * Q);

– After simplifying the bushy binary operation:
HV = t(X) %*% (((1 - P[, 1:K]) * P[, 1:K]) * (X %*% ssX_V));

– After fusing the binary DAG to a unary operation (sample proportion):
HV = t(X) %*% (sprop(P[, 1:K]) * (X %*% ssX_V));

• Operator selection
– Exec type: MR, because the memory estimate > 800 GB
– MM type: MapMMChain, because of the XtwXv pattern and w = sprop(P[, 1:K]) < 2 GB
– CP partitioning of w into 32 MB chunks of rowblocks

Page 21: Dynamic Recompilation – Motivation

• Problem of unknown/changing sizes
– Unknown or changing sizes and sparsity of intermediates (across loop iterations / conditional control flow)
– These unknowns lead to very conservative fallback plans

• Example ML program scenarios
– Scripts w/ complex function call patterns
– Scripts w/ UDFs
– Data-dependent operators, e.g.:
Y = table( seq(1,nrow(X)), y );
grad = t(X) %*% (P - Y);
– Computed size expressions
– Changing dimensions or sparsity

→ Dynamic recompilation techniques as a robust fallback strategy
– Shares goals and challenges with adaptive query processing
– However, ML domain-specific techniques and rewrites

Example: Stepwise LinregDS

while( continue ) {
  parfor( i in 1:n ) {
    if( fixed[1,i]==0 ) {
      X = cbind(Xg, Xorig[,i]);
      AIC[1,i] = linregDS(X, y);
    }
  }
  # select and append best feature
}

Page 22: Dynamic Recompilation – Compiler and Runtime

• Optimizer recompilation decisions
– Split HOP DAGs for recompilation: prevent unknowns but keep DAGs as large as possible; we split after reads w/ unknown sizes and specific operators
– Mark HOP DAGs for recompilation: MR due to unknown sizes / sparsity

• Dynamic recompilation at runtime on recompilation hooks (last-level program blocks, predicates, recompile-once functions, specific MR jobs)
– Deep copy DAG (e.g., for non-reversible dynamic rewrites)
– Update DAG statistics (based on exact symbol table meta data)
– Dynamic rewrites (exact stats allow very aggressive rewrites)
– Recompute memory estimates (w/ unconditional scope of a single DAG)
– Generate runtime instructions (construct LOPs / instructions)

[Figure: recompiling grad = t(X) %*% (P - Y): with unknown sizes ([1Mx-1,-1]) the DAG compiles to conservative MR operators; after updating DAG statistics from the symbol table (X: 1Mx100, 99M nnz; P, Y: 1Mx7, 7M nnz), all operators run in CP]

A sketch of a data-dependent operator follows below.
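A minimal DML sketch of such a data-dependent operator (hypothetical file inputs): the number of columns of Y equals the largest label value in y, which is unknown until runtime, so the surrounding DAG is marked for recompilation:

X = read($1);    # features, e.g., 1M x 100
y = read($2);    # labels in 1..K, 1M x 1
P = read($3);    # predictions, 1M x K
Y = table(seq(1, nrow(X)), y);   # output size depends on the values in y
grad = t(X) %*% (P - Y);
write(grad, $4);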

Page 23: Spark-Specific Optimizations

• Spark-specific rewrites
– Automatic caching/checkpoint injection (MEM_DISK / MEM_DISK_SER)
– Automatic repartition injection

• Operator selection
– Spark exec type selection
– Transitive Spark exec type
– Physical operator selection

• Extended ParFor optimizer
– Deferred checkpoint/repartition injection
– Eager checkpointing/repartitioning
– Fair scheduling for concurrent jobs
– Local degree of parallelism

• Runtime optimizations
– Lazy Spark context creation
– Short-circuit read/collect

Example: checkpoint injection in LinregCG (chkpt X MEM_DISK, injected after the read because X is reused in every iteration):

X = read($1);
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X) %*% (X %*% p) + lambda*p;
  alpha = norm_r2 / (t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);

[Cluster config: Spark executors w/ 24 cores; memory fractions: 60% data, 20% shuffle, 20% tasks]

Page 24: Excursus: Spark Buffer Pool Integration

• Distributed matrix representation
– Binary block matrices (JavaPairRDD<MatrixIndexes, MatrixBlock>)
– Serialization: block formats (dense, sparse, ultra-sparse, empty)
– Hash-partitioning

[Figure: logical blocking vs physical blocking and partitioning, w/ block size Bc=1,000]

• Buffer pool integration
– Basic buffer pool integration
– Lineage tracking of RDDs/broadcasts
– Guarded RDD collect/parallelize
– Partitioned broadcast variables

Page 25: Partitioning-Preserving Operations on Spark

• Partitioning-preserving ops
– An op is partitioning-preserving if the keys are guaranteed unchanged
– 1) Implicit: use restrictive APIs (mapValues() vs mapToPair())
– 2) Explicit: partition computation w/ declaration of partitioning-preserving behavior (memory-efficiency via "lazy iterators")

• Partitioning-exploiting ops
– 1) Implicit: operations based on join, cogroup, etc.
– 2) Explicit: custom physical operators on the original keys (e.g., zipmm)

• Example: Multiclass SVM
– Vectors of size nrow(X) fit neither into the driver nor into a broadcast
– ncol(X) ≤ Bc

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;          # zipmm on original keys
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local); ...
  }
}

Injected ops: repart, chkpt X MEM_DISK; chkpt Y_local MEM_DISK; chkpt Xd, Xw MEM_DISK

Page 26: From SystemR to SystemML – A Comparison

• Similarities
– Declarative specification (fixed semantics): SQL vs DML
– Simplification rewrites (Starburst QGM rewrites vs static/dynamic rewrites)
– Operator selection (physical operators for join vs matrix multiply)
– Operator reordering (join enumeration vs matrix multiplication chain opt)
– Adaptive query processing (progressive reoptimization vs dynamic recompile)
– Physical layout (NSM/DSM/PAX page layouts vs dense/sparse block formats)
– Buffer pool (pull-based page cache vs anti-caching of in-memory variables)
– Advanced optimizations (source code gen, compression, GPUs, etc.)
– Cost model / stats (estimated time for IO/compute/latency; histograms vs dims/nnz)

• Differences
– Algebra (relational algebra vs linear algebra)
– Programs (query trees vs DAGs, conditional control flow, often iterative)
– Optimizations (algebra-specific semantics, rewrites, and constraints)
– Scale (10s-100s vs 10s-10,000s of operators)
– Data preparation (ETL vs feature engineering)
– Physical design, transaction processing, multi-tenancy, etc.

Page 27: Conclusions

• Takeaway message
– The right abstraction level matters → flexibility and independence
– Automatic optimization of declarative ML programs is important and challenging → efficiency and scalability

• Summary
– Declarative Machine Learning
– SystemML’s Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions

• Ongoing / future work
– Advanced special-purpose optimizers (e.g., parfor, global data flow opt)
– Optimizer/runtime support for next-gen runtime platforms (e.g., YARN, Spark)
– Low-level optimization techniques (e.g., compression, NUMA, source code gen)
– Integrated HW accelerators (e.g., GPUs, many cores)
– Benchmarking ML systems (data/workload characteristics; accuracy/runtime)

Page 28: SystemML is Open Source

• Apache Incubator Project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml

IBM Spark Technology Center, 425 Market St., San Francisco
• Growing pool of contributors
• Founding member of AMPLab
• Partnerships in the ecosystem