Post on 13-Feb-2017

TRANSCRIPT
SystemML’s Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs
© 2015 IBM Corporation
Matthias Boehm
IBM Research – Almaden
San Jose, CA, USA
IBM Research
Acknowledgements: A. V. Evfimievski, F. Makari Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda,
M. W. Dusenberry, D. Eriksson, C. R. Kadner, J. Kim, N. Kokhlikyan, D. Kumar, M. Li, L. Resende, A. Singh, A. C. Surve, G. Weidner
Outline
• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture
• Part II: SystemML's Optimizer
– SystemML's Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions

Open Source:
• Apache Incubator Project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml
[A. Ghoting et al.: SystemML: Declarative Machine Learning on MapReduce. ICDE 2011]
[M. Boehm et al.: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 2014]
[M. Boehm et al.: SystemML: Declarative Machine Learning on Spark. (submitted)]
Case Study: An Automobile Manufacturer
• Goal: Design a model to predict car reacquisition
• Data sources: warranty claims, repair history, diagnostic readouts
• Pipeline (figure): features + labels → machine learning algorithms → models
• Challenges: class skew, low precision
• Result: 25x improvement in precision!
Common Patterns Across Customers
• Algorithm customization
• Changes in feature set
• Changes in data size
• Quick iteration
→ Custom analytics via declarative machine learning
Motivation: Large-Scale Machine Learning
• Need for large-scale ML
– Analyzing large datasets
– Advanced analytics / machine learning → gain value from collected data
– Specific workload characteristics → specialized systems for large-scale ML
• Landscape of existing work: classification by abstraction level (different target users)
– Distributed systems w/ DSLs: Spark, Flink, REEF, GraphLab
– Large-scale ML libraries (fixed plan): MLlib, Mahout MR, MADlib, ORE, Rev R, HP Dist R; custom algorithms (R, Matlab, SAS)
– Declarative ML (fixed algorithm): SystemML, (Mahout Samsara, Tupleware, Cumulon, DMac)
– Declarative ML++ (fixed task): MLbase*, specific systems
Declarative Machine Learning – Requirements
• Goal: Write ML algorithms independent of input data and cluster characteristics
• R1: Full flexibility – specify new / customize existing ML algorithms
→ ML DSL
• R2: Data independence – hide the physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression)
→ Abstract data types and coarse-grained logical operations
• R3: Efficiency and scalability – very small to very large use cases
→ Automatic optimization and hybrid runtime plans
• R4: Specified algorithm semantics – understand, debug, and control algorithm behavior
→ Optimization for performance only, not accuracy

Automatic optimization is key here!
Simplified DML Examples
• DML (Declarative Machine Learning Language)
• Linear Regression (direct solve via normal equations)
• Logistic Regression (via trust region method)
• Spectrum of use cases (size, characteristics)
High-Level SystemML Architecture
• Language: DML scripts (DML – Declarative Machine Learning Language)
• Compiler (this talk)
• Runtime backends:
– Hadoop cluster (scale-out, since 2010)
– In-memory single node (scale-up, since 2012)
– Spark cluster (scale-out, since 2015)
SystemML Architecture (APIs and runtime)
• APIs: Command line, JMLC, Spark MLContext, Spark ML APIs
• Parser/Language
• Compiler (cost-based optimizations):
– High-Level Operators (HOPs)
– Low-Level Operators (LOPs)
• Runtime:
– Control program, runtime programs, recompiler
– CP / MR / Spark instructions, generic MR jobs
– Buffer pool, Mem/FS IO, DFS IO
– ParFor optimizer/runtime
– MatrixBlock library (single/multi-threaded)
Outline
• Part I: SystemML Overview
– Declarative Machine Learning
– SystemML Architecture
• Part II: SystemML's Optimizer
– SystemML's Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions
SystemML's Compilation Chain
• Various optimization decisions at different compilation steps: IPA (inter-procedural analysis), ParFor optimizer, resource optimizer, GDF (global data flow) optimizer
• Compilation goes from the parsed script to HOPs (high-level operators) to LOPs (low-level operators) to runtime instructions
• Introspection hooks: DEBUG, EXPLAIN (hops / runtime), STATS
[Matthias Boehm et al.: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 2014]
Basic HOP and LOP DAG Compilation
• Example: Linreg DS (direct solve)

X = read($1);
y = read($2);
intercept = $3;
lambda = 0.001;
...
if( intercept == 1 ) {
  ones = matrix(1, nrow(X), 1);
  X = append(X, ones);
}
I = matrix(1, ncol(X), 1);
A = t(X) %*% X + diag(I)*lambda;
b = t(X) %*% y;
beta = solve(A, b);
...
write(beta, $4);

• Scenario: X: 10^8 x 10^3 (10^11 cells, ~800 GB dense); y: 10^8 x 1 (10^8 cells, ~800 MB)
• Cluster config: client mem 4 GB, map/reduce mem 2 GB
• HOP DAG (after rewrites): operators such as r(t), ba(+*), b(+), b(solve), r(diag), dg(rand), write, annotated with dimensions/sparsity and memory estimates (ranging from 8 KB for small vector results up to 1.6 TB for intermediates over X); based on these estimates each operator is assigned a CP or MR execution type
• LOP DAG (after rewrites): part(CP) and r'(CP) over X and y, mapmm(MR) and tsmm(MR) with ak+(MR) aggregation across map/reduce phases
• Hybrid runtime plans:
– Size propagation over ML programs
– Worst-case sparsity / memory estimates
– Integrated CP / MR / Spark runtime
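To make the Linreg DS script above concrete, here is a minimal NumPy sketch of the same computation (illustration only, not SystemML code; the data is toy-scale and `linreg_ds` is a name chosen for this sketch):

```python
import numpy as np

def linreg_ds(X, y, intercept=0, lmbda=0.001):
    """NumPy sketch of the Linreg DS (normal equations) script."""
    if intercept == 1:
        # X = append(X, ones)
        X = np.hstack([X, np.ones((X.shape[0], 1))])
    I = np.ones(X.shape[1])
    A = X.T @ X + np.diag(I) * lmbda   # A = t(X) %*% X + diag(I)*lambda
    b = X.T @ y                        # b = t(X) %*% y
    return np.linalg.solve(A, b)       # beta = solve(A, b)

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                      # noise-free, so beta is recovered exactly
beta = linreg_ds(X, y, lmbda=0.0)
assert np.allclose(beta, beta_true)
```

The interesting part in SystemML is not this script but how t(X) %*% X and t(X) %*% y over a 800 GB X are compiled into distributed tsmm/mapmm operators.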
Static and Dynamic Rewrites
• Types of rewrites:
– Static: size-independent rewrites
– Dynamic: size-dependent rewrites
• Example static rewrites:
– Common subexpression elimination
– Constant folding
– Static algebraic simplification rewrites
– Branch removal
– Right/left indexing vectorization
– For loop vectorization
– Checkpoint injection (caching)
– Repartition injection
• Example dynamic rewrites:
– Matrix multiplication chain optimization
– Dynamic algebraic simplification rewrites
• Cascading rewrite effect (enables other rewrites, IPA, operator selection)
• High performance impact (direct/indirect)
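The matrix multiplication chain optimization listed under dynamic rewrites is, at its core, the classic dynamic program over operand dimensions; a minimal sketch (the actual SystemML optimizer additionally accounts for sparsity and operator types):

```python
# Textbook O(n^3) matrix-chain DP: dims[i], dims[i+1] give the shape of
# matrix i; returns the minimum number of scalar multiplications.
def mm_chain_cost(dims):
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]

# t(X) %*% X %*% v with X: 10^8 x 10^3: left-to-right costs ~1e14 scalar
# multiplications, while multiplying X %*% v first costs only 2e11.
optimal = mm_chain_cost([1000, 100_000_000, 1000, 1])
assert optimal == 200_000_000_000
```

This is why the rewrite is dynamic: the optimal parenthesization depends on the (possibly only dynamically known) matrix dimensions.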
Example Algebraic Simplification Rewrites (1)
• Static simplification rewrites: size-independent patterns

Remove Unnecessary Operations:
– t(t(X)), X/1, X*1, X-0, -(-X) → X
– matrix(1,)/X → 1/X;  sum(t(X)) → sum(X)
– rand(,min=-1,max=1)*7 → rand(,min=-7,max=7)
– -rand(,min=-2,max=1) → rand(,min=-1,max=2)
– t(cbind(t(X),t(Y))) → rbind(X,Y)
Simplify Bushy Binary:
– (X*(Y*(Z%*%v))) → (X*Y)*(Z%*%v)
Binary to Unary:
– X+X → 2*X;  X*X → X^2;  X-X*Y → X*(1-Y)
– X*(1-X) → sprop(X);  1/(1+exp(-X)) → sigmoid(X)
– X*(X>0) → selp(X);  (X-7)*(X!=0) → X -nz 7
– (X!=0)*log(X) → log_nz(X)
– aggregate(X,y,count) → aggregate(y,y,count)
Simplify Permutation Matrix Construction:
– outer(v,seq(1,N),"==") → rexpand(v,max=N,row)
– table(seq(1,nrow(v)),v,N) → rexpand(v,max=N,row)
Simplify Operation over Matrix Multiplication:
– trace(X%*%Y) → sum(X*t(Y))
– (X%*%Y)[7,3] → X[7,] %*% Y[,3]
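Several of these static patterns are exact algebraic identities and easy to spot-check numerically; a small NumPy sketch (illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
Y = rng.standard_normal((4, 5))

# trace(X %*% Y) -> sum(X * t(Y)): avoids materializing X %*% Y entirely
assert np.isclose(np.trace(X @ Y), np.sum(X * Y.T))

# (X %*% Y)[i,j] -> X[i,] %*% Y[,j]: a full matrix multiply becomes one dot product
assert np.isclose((X @ Y)[2, 3], X[2, :] @ Y[:, 3])

# X + X -> 2*X and X*X -> X^2 replace a binary op over two inputs
# with a unary op over one input
assert np.allclose(X + X, 2 * X)
assert np.allclose(X * X, X ** 2)
```

The payoff of the binary-to-unary rewrites is not arithmetic savings per se, but fewer operator inputs and intermediates, which matters for the distributed runtimes.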
Example Algebraic Simplification Rewrites (2)
• Dynamic simplification rewrites: size-dependent patterns

Remove / Simplify Unnecessary Indexing:
– X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y)
– X = Y[, 1] → X = Y iff dims(X)=dims(Y)
– X[,1]=Y; X[,2]=Z → X=cbind(Y,Z) iff ncol(X)=2
Fuse / Pushdown Operations:
– t(rand(10, 1)) → rand(1, 10) iff nrow/ncol=1
– sum(diag(X)) → trace(X) iff ncol(X)>1
– diag(X)*7 → diag(X*7) iff ncol(X)=1
– sum(X^2) → t(X)%*%X / sumSq(X) iff ncol(X)=1 / >1
Remove Empty / Unnecessary Operations:
– X%*%Y → matrix(0,…) iff nnz(X)=0 | nnz(Y)=0
– X*Y → matrix(0,…);  X+Y → X;  X-Y → X iff nnz(Y)=0
– round(X) → matrix(0);  t(X) → matrix(0) iff nnz(X)=0
– X*(Y%*%matrix(1,)) → X*Y iff ncol(Y)=1
Simplify Aggregates / Scalar Operations:
– rowSums(X) → sum(X) → X iff nrow(X)=1, ncol(X)=1
– rowSums(X*Y) → X%*%t(Y) iff nrow(Y)=1
– X*Y → X*as.scalar(Y) iff nrow(Y)=1 & ncol(Y)=1
Simplify Diag Matrix Multiplications:
– diag(X)%*%Y → Y*X iff ncol(X)=1 & ncol(Y)>1
– diag(X%*%Y) → rowSums(X*t(Y)) iff ncol(Y)>1
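A quick NumPy spot-check of three of the dynamic patterns (illustration only; the `iff` size conditions are exactly what makes these rewrites dynamic rather than static):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((4, 6))
x = rng.standard_normal((6, 1))   # column vector, ncol = 1

# diag(X %*% Y) -> rowSums(X * t(Y)): O(n^2) instead of O(n^3),
# and no n x n intermediate is ever materialized
assert np.allclose(np.diag(X @ Y), np.sum(X * Y.T, axis=1))

# diag(x) %*% Y -> Y * x iff ncol(x)=1: elementwise row scaling
# instead of constructing an n x n diagonal matrix
Y2 = rng.standard_normal((6, 3))
assert np.allclose(np.diagflat(x) @ Y2, Y2 * x)

# sum(x^2) -> t(x) %*% x iff ncol(x)=1
assert np.isclose(np.sum(x ** 2), (x.T @ x).item())
```

Because these rewrites are only valid under known sizes/sparsity, they pair naturally with the dynamic recompilation discussed below.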
Example Operator Selection: Matrix Multiplication

Exec Type: CP
– MM: X %*% Y
– MMChain: t(X) %*% (w * (X %*% v))
– TSMM: t(X) %*% X
– PMM: rmr(diag(v)) %*% X
Exec Type: MR / Spark (* only Spark)
– MapMM: X %*% Y
– MapMMChain: t(X) %*% (w * (X %*% v))
– TSMM: t(X) %*% X
– ZipMM *: t(X) %*% Y
– CPMM: X %*% Y
– RMM: X %*% Y
– PMM: rmr(diag(v)) %*% X

• Example: t(X)%*%y compiled to the LOP chain Transform(CP,'), Partition(CP,col), MapMM(MR,left), Group(MR), Aggregate(MR,ak+)
• Hop-Lop rewrites:
– Aggregation (w/o, singleblock/multiblock)
– Partitioning (w/o, CP/MR, col/rowblock)
– Empty block materialization in output
– Transpose-MM rewrite: t(X)%*%y → t(t(y)%*%X)
– CP degree of parallelism (multi-threaded mm)
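The transpose-MM rewrite deserves a closer look: it trades the transpose of the large matrix X for a transpose of the small result. A NumPy spot-check of the identity (illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))   # stands in for a tall-skinny X
y = rng.standard_normal((8, 1))

# Transpose-MM rewrite: t(X) %*% y -> t(t(y) %*% X).
# Only the 1 x ncol(X) row vector t(y) %*% X is transposed, so the
# large t(X) never has to be materialized or shuffled.
assert np.allclose(X.T @ y, (y.T @ X).T)
```

At the scale of the running example (X: 10^8 x 10^3), the difference is transposing an 800 GB matrix versus an 8 KB vector.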
Example Fused Operators (1): MMChain
• Matrix multiplication chains: q = t(X) %*% (w * (X %*% v))
– Very common pattern
– MV ops are IO / memory-bandwidth bound
– Problem: the data dependency forces two passes over X (step 1: D = w * (X %*% v); step 2: q = t(X) %*% D, or equivalently t(q) = t(D) %*% X)
• → Fused mmchain operator
– Key observation: the values of D are row-aligned w.r.t. X
– Single-pass operation (map-side in MR/Spark, cache-conscious in CP/GPU)
[Arash Ashari et al.: On optimizing machine learning workloads via kernel fusion. PPoPP 2015]
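A minimal sketch of the fused single-pass computation in NumPy (illustration only; the real operator works on cached blocks rather than single rows):

```python
import numpy as np

def mmchain_fused(X, v, w):
    """Single-pass sketch of q = t(X) %*% (w * (X %*% v)).
    Because d_i = w_i * (x_i . v) is row-aligned with X, each row x_i
    can be consumed once for both steps of the chain."""
    q = np.zeros(X.shape[1])
    for i in range(X.shape[0]):
        d_i = w[i] * (X[i] @ v)   # step 1 for row i
        q += d_i * X[i]           # step 2 for row i, same pass over X
    return q

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
v = rng.standard_normal(4)
w = rng.standard_normal(50)
# Matches the unfused two-pass formulation
assert np.allclose(mmchain_fused(X, v, w), X.T @ (w * (X @ v)))
```

For an IO-bound operation, halving the number of passes over X roughly halves the runtime.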
Example Fused Operators (2): WSLoss
• Weighted squared loss: wsl = sum(W * (X – L %*% t(R))^2)
– Common pattern for factorization algorithms
– W and X usually very sparse (sparsity < 0.001)
– Problem: the "outer" product L %*% t(R) creates three dense intermediates in the size of X
• → Fused wsloss operator
– Key observations: the sparse W* allows selective computation, and the full aggregate significantly reduces memory requirements
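A sketch of the sparsity-exploiting computation in NumPy (illustration only; `wsloss_fused` and its COO-style inputs are names chosen for this sketch):

```python
import numpy as np

def wsloss_fused(w_rows, w_cols, w_vals, X, L, R):
    """Sketch of wsl = sum(W * (X - L %*% t(R))^2) exploiting sparse W:
    only cells where W is nonzero are computed, so the dense
    L %*% t(R) intermediate in the size of X is never materialized."""
    wsl = 0.0
    for i, j, w in zip(w_rows, w_cols, w_vals):
        wsl += w * (X[i, j] - L[i] @ R[j]) ** 2
    return wsl

rng = np.random.default_rng(4)
m, n, k = 20, 15, 3
L = rng.standard_normal((m, k))
R = rng.standard_normal((n, k))
X = rng.standard_normal((m, n))
W = (rng.random((m, n)) < 0.1).astype(float)   # sparse weight matrix
rows, cols = np.nonzero(W)

fused = wsloss_fused(rows, cols, W[rows, cols], X, L, R)
dense = np.sum(W * (X - L @ R.T) ** 2)          # unfused, three dense intermediates
assert np.isclose(fused, dense)
```

With sparsity below 0.001, the fused form touches fewer than 0.1% of the cells the dense form materializes.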
Example: Poisson NMF (PNMF)
• Example script:

while( iter < max_iterations ) {
  iter = iter + 1;
  H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W));
  W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H));
  obj = as.scalar(colSums(W) %*% rowSums(H))
        - sum(V * log(W%*%H));
  print("ITER=" + iter + " obj=" + obj);
}

• Compiled fused operators: wdivmm (left), wdivmm (right), wcemm
• Notes:
– Similar complex patterns in various factorization (ENMF, ALS) and deep learning algorithms (GloVe, SkipGram)
– Automatic optimization via dynamic simplification rewrites
– Different fused physical operators (CP, MR/Spark map/red)
– Details: https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt
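The multiplicative updates in the script above can be sketched directly in NumPy (illustration only; shapes, initialization, and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 30, 20, 4
V = rng.random((m, n)) + 0.1   # positive data, V ~ W %*% H
W = rng.random((m, k)) + 0.1
H = rng.random((k, n)) + 0.1

def pnmf_obj(V, W, H):
    # obj = as.scalar(colSums(W) %*% rowSums(H)) - sum(V * log(W %*% H))
    return W.sum(axis=0) @ H.sum(axis=1) - np.sum(V * np.log(W @ H))

obj0 = pnmf_obj(V, W, H)
for it in range(20):
    # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W))
    H = (H * (W.T @ (V / (W @ H)))) / W.sum(axis=0)[:, None]
    # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
    W = (W * ((V / (W @ H)) @ H.T)) / H.sum(axis=1)[None, :]
obj = pnmf_obj(V, W, H)
assert obj < obj0   # the multiplicative updates decrease the objective
```

In SystemML the three quantities W.T @ (V/(W@H)), (V/(W@H)) @ H.T, and sum(V*log(W@H)) are exactly what the fused wdivmm (left/right) and wcemm operators compute without materializing the dense m x n intermediates.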
Rewrites and Operator Selection in Action
• Example: use case Mlogreg, X: 10^8 x 10^3, K=1 (2 classes), 2 GB mem
• Applied rewrites:
– Original DML snippet of the inner loop:
  Q = P[, 1:K] * (X %*% ssX_V);
  HV = t(X) %*% (Q - P[, 1:K] * (rowSums(Q) %*% matrix(1, rows=1, cols=K)));
– After removing the unnecessary (1) matrix multiply and (2) unary aggregate:
  Q = P[, 1:K] * (X %*% ssX_V);
  HV = t(X) %*% (Q - P[, 1:K] * Q);
– After simplifying the distributive binary operation:
  Q = P[, 1:K] * (X %*% ssX_V);
  HV = t(X) %*% ((1 - P[, 1:K]) * Q);
– After simplifying the bushy binary operation:
  HV = t(X) %*% (((1 - P[, 1:K]) * P[, 1:K]) * (X %*% ssX_V));
– After fusing the binary DAG to a unary operation (sample proportion):
  HV = t(X) %*% (sprop(P[, 1:K]) * (X %*% ssX_V));
(Recall: cascading rewrite effect)
• Operator selection:
– Exec type: MR, because the memory estimate > 800 GB
– MM type: MapMMChain, because of the XtwXv pattern and w = sprop(P[, 1:K]) < 2 GB
– CP partitioning of w into 32 MB chunks of rowblocks
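The rewrite chain preserves semantics at every step; for K=1 this is easy to check numerically (NumPy sketch, illustration only; `P1` stands in for P[, 1:K]):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 5))
ssX_V = rng.standard_normal((5, 1))
P1 = rng.random((40, 1))          # stands in for P[, 1:K] with K=1

def sprop(x):
    """Sample proportion: x * (1 - x), the fused unary operator."""
    return x * (1 - x)

# Original inner-loop expression (K=1, so rowSums(Q) %*% matrix(1,1,1) = Q)
Q = P1 * (X @ ssX_V)
hv_orig = X.T @ (Q - P1 * (np.sum(Q, axis=1, keepdims=True) @ np.ones((1, 1))))

# Final expression after the cascade of rewrites
hv_final = X.T @ (sprop(P1) * (X @ ssX_V))

assert np.allclose(hv_orig, hv_final)
```

The final form also exposes the t(X) %*% (w * (X %*% v)) pattern, which is what enables the MapMMChain selection in the next step.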
Dynamic Recompilation – Motivation
• Problem of unknown/changing sizes
– Unknown or changing sizes and sparsity of intermediates (across loop iterations / conditional control flow)
– These unknowns lead to very conservative fallback plans
• Example ML program scenarios:
– Scripts w/ complex function call patterns
– Scripts w/ UDFs
– Data-dependent operators, e.g.:
  Y = table( seq(1,nrow(X)), y );
  grad = t(X) %*% (P - Y);
– Computed size expressions
– Changing dimensions or sparsity
• Ex: Stepwise LinregDS

while( continue ) {
  parfor( i in 1:n ) {
    if( fixed[1,i]==0 ) {
      X = cbind(Xg, Xorig[,i])
      AIC[1,i] = linregDS(X,y)
    }
  }
  #select and append best feature
}

• → Dynamic recompilation techniques as a robust fallback strategy
– Shares goals and challenges with adaptive query processing
– However, ML domain-specific techniques and rewrites
Dynamic Recompilation – Compiler and Runtime
• Optimizer recompilation decisions:
– Split HOP DAGs for recompilation: prevent unknowns but keep DAGs as large as possible; we split after reads w/ unknown sizes and after specific operators
– Mark HOP DAGs for recompilation: MR due to unknown sizes / sparsity
• Dynamic recompilation at runtime on recompilation hooks (last-level program blocks, predicates, recompile-once functions, specific MR jobs):
– Deep copy DAG (e.g., for non-reversible dynamic rewrites)
– Update DAG statistics (based on exact symbol table metadata)
– Dynamic rewrites (exact stats allow very aggressive rewrites)
– Recompute memory estimates (w/ unconditional scope of a single DAG)
– Generate runtime instructions (construct LOPs / instructions)
• Example (figure): with X: 1Mx100 (99M nnz), P: 1Mx7 (7M), Y: 1Mx7 (7M) known at runtime, operators such as b(-), r(t), and ba(+*) initially compiled to MR with unknown sizes [1Mx-1,-1] are recompiled to CP with exact statistics [1Mx7,7M], [100x7,-1]
Spark-Specific Optimizations
• Spark-specific rewrites:
– Automatic caching/checkpoint injection (MEM_DISK / MEM_DISK_SER)
– Automatic repartition injection
• Operator selection:
– Spark exec type selection
– Transitive Spark exec type
– Physical operator selection
• Extended ParFor optimizer:
– Deferred checkpoint/repartition injection
– Eager checkpointing/repartitioning
– Fair scheduling for concurrent jobs
– Local degree of parallelism
• Runtime optimizations:
– Lazy Spark context creation
– Short-circuit read/collect
• Ex: Checkpoint injection, LinregCG (injected: chkpt X MEM_DISK):

X = read($1);
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X)%*%(X%*%p) + lambda*p;
  alpha = norm_r2 / (t(p)%*%q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);

• Spark exec (24 cores), figure: 60% data, 20% shuffle, 20% tasks
Excursus: Spark Buffer Pool Integration
• Distributed matrix representation:
– Binary block matrices (JavaPairRDD&lt;MatrixIndexes, MatrixBlock&gt;)
– Serialization: block formats (dense, sparse, ultra-sparse, empty)
– Hash partitioning
– Logical blocking / physical blocking and partitioning (w/ Bc=1,000)
• Buffer pool integration:
– Basic buffer pool integration
– Lineage tracking for RDDs/broadcasts
– Guarded RDD collect/parallelize
– Partitioned broadcast variables
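As a small illustration of binary blocking with Bc=1,000, here is a sketch of mapping a matrix cell to its (MatrixIndexes, in-block offset) pair (Python sketch; the 1-based block indexing mirrors SystemML's convention, the rest is an assumption of this sketch):

```python
BC = 1000  # block size Bc

def cell_to_block(i, j, bc=BC):
    """Map 0-based cell coordinates (i, j) to
    ((block_row, block_col), (row_offset, col_offset)) with 1-based
    block indexes, as used for the (MatrixIndexes, MatrixBlock) keying."""
    return ((i // bc + 1, j // bc + 1), (i % bc, j % bc))

assert cell_to_block(0, 0) == ((1, 1), (0, 0))
assert cell_to_block(999, 1000) == ((1, 2), (999, 0))
assert cell_to_block(123_456, 789) == ((124, 1), (456, 789))
```

Hash-partitioning these block keys is what makes the partitioning-preserving and partitioning-exploiting operators on the next slide possible.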
Partitioning-Preserving Operations on Spark
• Partitioning-preserving ops:
– An op is partitioning-preserving if the key is guaranteed not to change
– 1) Implicit: use restrictive APIs (mapValues() vs mapToPair())
– 2) Explicit: partition computation w/ declaration of partitioning-preserving (memory efficiency via "lazy iterators")
• Partitioning-exploiting ops:
– 1) Implicit: operations based on join, cogroup, etc.
– 2) Explicit: custom physical operators on original keys (e.g., zipmm)
• Example: Multiclass SVM
– Vectors of length nrow(X) fit neither into the driver nor into a broadcast
– ncol(X) ≤ Bc
– Injected: repart, chkpt X MEM_DISK; chkpt Y_local MEM_DISK; chkpt Xd, Xw MEM_DISK; g_old computed via zipmm

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1
  g_old = t(X) %*% Y_local
  ...
  while( continue ) {
    Xd = X %*% s
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local) ...
  }
}
From System R to SystemML – A Comparison
• Similarities:
– Declarative specification (fixed semantics): SQL vs DML
– Simplification rewrites (Starburst QGM rewrites vs static/dynamic rewrites)
– Operator selection (physical operators for join vs matrix multiply)
– Operator reordering (join enumeration vs matrix multiplication chain opt)
– Adaptive query processing (progressive reoptimization vs dynamic recompilation)
– Physical layout (NSM/DSM/PAX page layouts vs dense/sparse block formats)
– Buffer pool (pull-based page cache vs anti-caching of in-memory variables)
– Advanced optimizations (source code gen, compression, GPUs, etc.)
– Cost model / stats (est. time for IO/compute/latency; histograms vs dims/nnz)
• Differences:
– Algebra (relational algebra vs linear algebra)
– Programs (query trees vs DAGs, conditional control flow, often iterative)
– Optimizations (algebra-specific semantics, rewrites, and constraints)
– Scale (10s-100s vs 10s-10,000s of operators)
– Data preparation (ETL vs feature engineering)
– Physical design, transaction processing, multi-tenancy, etc.
Conclusions
• Takeaway message:
– The right abstraction level matters → flexibility and independence
– Automatic optimization of declarative ML programs is important and challenging → efficiency and scalability
• Summary:
– Declarative Machine Learning
– SystemML's Compilation Chain
– Rewrites and Operator Selection
– Dynamic Recompilation
– Spark-Specific Optimizer Extensions
• Ongoing / future work:
– Advanced special-purpose optimizers (e.g., parfor, global data flow opt)
– Optimizer/runtime support for next-gen runtime platforms (e.g., YARN, Spark)
– Low-level optimization techniques (e.g., compression, NUMA, source code gen)
– Integrated HW accelerators (e.g., GPUs, many cores)
– Benchmarking ML systems (data/workload characteristics; accuracy/runtime)
SystemML is Open Source:
• Apache Incubator Project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml

IBM Spark Technology Center, 425 Market St., San Francisco
• Growing pool of contributors
• Founding member of AMPLab
• Partnerships in the ecosystem