TRANSCRIPT
TAL BEN-NUN
ML ↔ HPC: Optimizing Optimizers for Optimization
Workshop on the Convergence of ML & HPC @ ASPLOS 2020 (Zoom)
WITH CONTRIBUTIONS FROM DAN ALISTARH, NIKOLI DRYDEN, YOSUKE OYAMA, CEDRIC RENGGLI, AND OTHERS AT SPCL
Stochastic Performance
20 TB/night
Source: OpenAI
A brief intro to supervised deep learning
[Figure: true label l(x), a one-hot vector: 1.00 for Cat, 0.00 for Dog, Airplane, Truck, Horse, Banana]
labeled samples $x \in X \subset \mathcal{D}$, network $f(x): X \rightarrow Y$, label domain $Y$
network structure (fixed), weights $w$ (learned)
$w^* = \operatorname{argmin}_{w \in \mathbb{R}^d}\, \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(w, x)\right]$
true label $l(x)$
[Figure: network output $f(x)$, predicted probabilities: 0.54 Cat, 0.28 Dog, 0.02 Airplane, 0.07 Truck, 0.33 Horse, 0.02 Banana]
layer-wise parameter update
$\ell_{ce}(w, x) = -\sum_i l(x)_i \cdot \log\!\left(\frac{e^{f(x)_i}}{\sum_k e^{f(x)_k}}\right)$
$\ell_{sq}(w, x) = \left\| f(x) - l(x) \right\|^2$
Scale: labeled samples require ≥ TBs of random access; weights range from 100 MiB to 32 GiB and beyond (30k to billions of parameters).
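To make the cross-entropy loss above concrete, here is a minimal numpy sketch (not from the talk) of the softmax cross-entropy and one SGD step for a linear model; the toy dimensions, learning rate, and random data are illustrative assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(w, x, label, num_classes):
    """l_ce(w, x) = -sum_i l(x)_i * log softmax(f(x))_i for a linear model f(x) = w @ x."""
    logits = w @ x
    probs = softmax(logits)
    onehot = np.eye(num_classes)[label]
    loss = -np.sum(onehot * np.log(probs))
    grad_w = np.outer(probs - onehot, x)   # gradient of the loss w.r.t. w (softmax + CE)
    return loss, grad_w

# toy example: 6 classes (Cat, Dog, ...), 8 input features
rng = np.random.default_rng(0)
w = rng.standard_normal((6, 8)) * 0.1
x, y = rng.standard_normal(8), 0           # one sample whose true label is "Cat"
loss, grad = cross_entropy(w, x, y, 6)
w -= 0.1 * grad                             # one SGD step with eta = 0.1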
Trends in deep learning: hardware and multi-node
The field is moving fast – trying everything imaginable – survey results from 252 papers in the area of parallel deep learning
[Charts: Hardware used; Shared vs. distributed memory]
Deep Learning is largely on distributed memory today!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
Trends in distributed deep learning: node count and communication
Deep Learning research is converging to MPI!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
[Chart: Communication mode]
Computational Principles
Data
Example: Options for computing convolutional layers
▪ Direct: slide the kernel over the input and accumulate (figure: a 4×4 input convolved with a 3×3 kernel).
▪ Indirect, im2col: reshape the $N \times C_{in} \times H \times W$ input so that every $C_{in} \cdot K_y \cdot K_x$ patch becomes a row of an $(N \cdot H' \cdot W') \times (C_{in} \cdot K_y \cdot K_x)$ matrix, multiply it with the reshaped $C_{out} \times (C_{in} \cdot K_y \cdot K_x)$ weights (GEMM), then col2im back to $C_{out} \times H' \times W'$.
▪ Indirect, FFT: $\mathcal{F}^{-1}\!\left(\mathcal{F}(x) \times \hat{w}\right)$ with $\hat{w} = \mathcal{F}(w)$, i.e., transform input and kernel, multiply element-wise in the frequency domain, and transform back.
▪ Indirect, Winograd $F(m, r)$: $A^T \left[ (G\, w\, G^T) \odot (B^T x\, B) \right] A$, an element-wise product of $m' \times m'$ tiles ($r \times r$ kernel, $m \times m$ output tile, $m' = m + r - 1$), followed by channel-wise summation.
K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006. M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14. A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16.
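A minimal numpy sketch of the im2col approach for a single image and a single input/output channel; the valid padding, stride 1, and toy shapes are my assumptions for illustration, not the slide's implementation.

import numpy as np

def im2col(x, kh, kw):
    """Gather all kh x kw patches of x into rows of a matrix (valid padding, stride 1)."""
    h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (out_h, out_w)

def conv2d_im2col(x, k):
    """Convolution (as correlation) expressed as a GEMM after im2col."""
    cols, (out_h, out_w) = im2col(x, *k.shape)
    return (cols @ k.ravel()).reshape(out_h, out_w)   # "col2im" is just a reshape here

x = np.arange(16.0).reshape(4, 4)      # 4x4 input, as in the figure
k = np.ones((3, 3))                    # 3x3 kernel
print(conv2d_im2col(x, k))             # 2x2 output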
Operator Design
Separable convolution
[Figure: a full convolution factored into a sequence of smaller convolutions]
A.G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv 2017. F.N. Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," ICLR 2017.
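An illustrative numpy sketch of a depthwise separable convolution in the MobileNets spirit: a per-channel (depthwise) spatial convolution followed by a 1×1 (pointwise) convolution across channels. All shapes and random inputs are assumptions for illustration only.

import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (C_in, H, W); depthwise_k: (C_in, k, k); pointwise_w: (C_out, C_in)."""
    c_in, h, w = x.shape
    k = depthwise_k.shape[-1]
    oh, ow = h - k + 1, w - k + 1
    # depthwise: every input channel is convolved with its own k x k filter
    dw = np.zeros((c_in, oh, ow))
    for c in range(c_in):
        for i in range(oh):
            for j in range(ow):
                dw[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * depthwise_k[c])
    # pointwise: a 1x1 convolution mixes channels at each spatial location
    return np.einsum('oc,chw->ohw', pointwise_w, dw)

x = np.random.rand(3, 8, 8)
out = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(16, 3))
print(out.shape)   # (16, 6, 6); cost ~ C_in*k*k + C_in*C_out per pixel instead of C_in*C_out*k*k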
Transformers – Multi-Head Attention
A. Vaswani et al. “Attention Is All You Need,” NeurIPS 2017.
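A compact numpy sketch of scaled dot-product multi-head attention in the spirit of Vaswani et al.; the projection names (wq, wk, wv, wo), sizes, and random inputs are placeholder assumptions, not the paper's code.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    """x: (seq_len, d_model); wq/wk/wv/wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    def split(t):                                   # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ v                            # attention-weighted values
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ wo

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))
w = [rng.standard_normal((64, 64)) / 8 for _ in range(4)]
print(multi_head_attention(x, *w, num_heads=8).shape)   # (10, 64)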
DNN Compilers
▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping
TensorFlow XLA · Facebook Glow / TorchScript JIT · TVM Stack · Intel nGraph
Partitioning Computation?
Data Parallelism
Minibatch Stochastic Gradient Descent (SGD)
[Figure: predicted probabilities $f(x)$ (0.54 Cat, 0.28 Dog, 0.02 Airplane, 0.07 Truck, 0.03 Horse, 0.04 Bicycle, ...) compared with the one-hot true label (1.00 Cat) over the classes Cat, Dog, Airplane, Truck, Horse, Bicycle]
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
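A toy minibatch SGD loop for the linear softmax model sketched earlier; batch size, learning rate, step count, and the random data are illustrative assumptions, not from the talk.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def minibatch_sgd(w, data, labels, batch_size=32, eta=0.1, steps=100, num_classes=6):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.choice(len(data), batch_size, replace=False)   # sample a minibatch
        x, y = data[idx], labels[idx]
        probs = softmax(x @ w.T)                                  # forward pass
        onehot = np.eye(num_classes)[y]
        grad = (probs - onehot).T @ x / batch_size                # gradient averaged over the batch
        w -= eta * grad                                           # parameter update
    return w

rng = np.random.default_rng(1)
data, labels = rng.standard_normal((1000, 8)), rng.integers(0, 6, 1000)
w = minibatch_sgd(np.zeros((6, 8)), data, labels)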
Partitioning Computation?
Data Parallelism · Model Parallelism (Channel/Filter, Spatial) · Pipeline Parallelism (Layer)
[Figure: pipeline schedule of microbatches 1-3 across Proc 1-3, with idle-time bubbles]
Model parallelism:
▪ Parameters/domain can be distributed across processors
▪ Good for: large inputs, wide networks
▪ Complex communication per layer
▪ Performance hinges on the implementation
Pipeline parallelism:
▪ Parameters can be distributed across processors
▪ Good for: deep models, few activations
▪ Sparse communication pattern (only between pipeline stages)
▪ A consistent model introduces an idle-time "bubble"
Data parallelism:
▪ Simple and efficient solution, easy to implement
▪ Duplicates parameters at all processors
▪ Affects generalization (effectively larger minibatches); a minimal gradient-allreduce sketch follows below
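As referenced above, a minimal data-parallel sketch: each rank computes a gradient on its local shard of the minibatch and the gradients are averaged with an allreduce. This assumes mpi4py and an MPI implementation are available and uses a placeholder local_gradient function; it illustrates the pattern, not the talk's implementation.

import numpy as np
from mpi4py import MPI   # assumption: mpi4py + an MPI library are installed

comm = MPI.COMM_WORLD
P = comm.Get_size()

def local_gradient(w, shard):
    """Placeholder: gradient of the softmax loss on this rank's shard (x, one-hot labels)."""
    x, onehot = shard
    z = x @ w.T
    probs = np.exp(z - z.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs - onehot).T @ x / len(x)

def data_parallel_step(w, shard, eta=0.1):
    g_local = local_gradient(w, shard)
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)   # sum gradients across all P ranks
    w -= eta * g_global / P                          # every rank applies the same averaged update
    return w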
Hybrid parallelism
A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014. J. Dean et al.: Large scale distributed deep networks, NIPS'12. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019.
▪ Layers/parameters can be distributed across processors
▪ Can distribute minibatch
▪ Often specific to layer-types (e.g., distribute fc layers but handle conv layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!
[Figure: hybrid scheme combining model parallelism, data parallelism, and layer (pipeline) parallelism]
Training is not just Training
K. Osawa et al., "Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs," arXiv 2018. T. Karras et al., "Progressive Growing of GANs for Improved Quality, Stability, and Variation," arXiv 2017.
Imbalanced workload over time · Nontrivial gradient aggregation · Data/compute redistribution
Hyperparameter and Architecture search
Reinforcement Learning [1] Evolutionary Algorithms [4]
▪ Meta-optimization of hyper-parameters (momentum) and DNN architecture
▪ Using Reinforcement Learning [1] (explore/exploit different configurations)
▪ Genetic Algorithms with modified (specialized) mutations [2]
▪ Particle Swarm Optimization [3] and other meta-heuristics
▪ Multi-level optimization
[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017. [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018. [3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17. [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18. [5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18. [6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19.
Model-Based Optimization [5,6]
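To give a flavor of the meta-optimizers listed above, here is a generic toy evolutionary search over two hyperparameters (learning rate and momentum). The objective, mutation scheme, and population size are arbitrary assumptions; this is not an implementation of any of the cited methods.

import random

def evaluate(config):
    """Placeholder objective: stands in for 'train a model and return validation accuracy'."""
    lr, momentum = config
    return -(lr - 0.01) ** 2 - (momentum - 0.9) ** 2   # peak at lr=0.01, momentum=0.9

def mutate(config, scale=0.5):
    lr, momentum = config
    return (lr * random.uniform(1 - scale, 1 + scale),
            min(0.999, max(0.0, momentum + random.uniform(-0.05, 0.05))))

population = [(random.uniform(1e-4, 1e-1), random.uniform(0.5, 0.99)) for _ in range(16)]
for generation in range(20):
    scored = sorted(population, key=evaluate, reverse=True)
    parents = scored[:4]                                   # keep the fittest configurations
    population = parents + [mutate(random.choice(parents)) for _ in range(12)]
print("best config:", max(population, key=evaluate))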
Updating parameters in distributed data parallelism
Central: sharded parameter server, $w' = u(w, \nabla w)$; training agents send $\nabla w$ and receive $w$.
Cost (sharded over $s$ servers): $T = 2L + 2P\gamma \frac{m}{s} G$
Decentral: collective allreduce of $w$ among the training agents.
Cost: $T = 2L \log_2 P + 2\gamma m G \frac{P-1}{P}$
Design space: collective operations · topologies · neighborhood collectives · RMA?
Hierarchical Parameter Server. S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16.
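A small sketch that simply evaluates the two communication cost models above for sample parameters. The interpretation (L latency, gamma*m message size in bytes, G time per byte) and all numeric values are placeholder assumptions.

# Cost models from the slide (times in arbitrary units):
#   allreduce:             T = 2*L*log2(P) + 2*gamma*m*G*(P-1)/P
#   sharded param server:  T = 2*L + 2*P*gamma*(m/s)*G
from math import log2

def t_allreduce(P, L, gamma, m, G):
    return 2 * L * log2(P) + 2 * gamma * m * G * (P - 1) / P

def t_param_server(P, L, gamma, m, G, s):
    return 2 * L + 2 * P * gamma * (m / s) * G

L, gamma, G = 1e-6, 4, 1e-9   # assumed: 1 us latency, 4 bytes per value, 1 ns per byte
m = 25_000_000                 # e.g., 25M gradient values
for P in (4, 16, 64, 256):
    print(P, t_allreduce(P, L, gamma, m, G), t_param_server(P, L, gamma, m, G, s=16))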
Parameter (and Model) consistency: centralized
▪ Trades off "statistical performance" for "hardware performance"
[Figure: sharded parameter server, $w' = u(w, \nabla w)$; training agents exchange $\nabla w$ and $w$]
Variants: Synchronous · Stale-Synchronous / Bounded Asynchronous · Asynchronous
[Figure: parameter-server timelines from $w^{(0)}$ to $w^{(T)}$ for agents 1..m: synchronous (all agents wait at each synchronization point), stale-synchronous (agents diverge up to a maximum staleness), asynchronous (agents update whenever ready)]
▪ Parameter exchange frequency can be controlled, while still attaining convergence:
Consistency spectrum (consistent → inconsistent): Synchronous SGD · Stale-Synchronous SGD · Model Averaging (e.g., elastic) · Asynchronous SGD (HOGWILD!) · Ensemble Learning
J. Dean et al.: Large scale distributed deep networks, NIPS'12. F. Niu et al.: Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11.
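A toy single-process simulation of the consistency variants above: with bound tau = 0 it behaves roughly like synchronous SGD, a finite tau gives stale-synchronous SGD, and a very large tau is effectively asynchronous. The agent model, quadratic objective, and noise are made-up assumptions, not any of the cited systems.

import numpy as np

def stale_synchronous_sgd(num_agents=4, tau=2, steps=400, eta=0.1):
    """Agents update a shared w; no agent may run more than `tau` steps ahead of the slowest."""
    rng = np.random.default_rng(0)
    w = np.zeros(2)
    clock = [0] * num_agents                     # per-agent iteration counters
    for _ in range(steps):
        a = rng.integers(num_agents)             # a random agent becomes ready
        if clock[a] - min(clock) > tau:          # staleness bound exceeded: this agent waits
            continue
        grad = 2 * (w - np.array([1.0, -1.0]))   # toy objective ||w - (1, -1)||^2
        grad += 0.1 * rng.standard_normal(2)     # gradient noise
        w -= eta * grad                          # apply the (possibly stale) update
        clock[a] += 1
    return w, clock

print(stale_synchronous_sgd(tau=0))   # effectively synchronous
print(stale_synchronous_sgd(tau=3))   # bounded asynchrony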
Parameter (and Model) consistency: decentralized
▪ Parameter exchange frequency can be controlled, while still attaining convergence:
▪ Using Gossip Algorithms [Jin et al. 2016] or Partial Collectives [Li et al. 2020]
Variants: Synchronous · Stale-Synchronous / Bounded Asynchronous · Asynchronous
[Figure: timelines from $w^{(0)}$ to $w^{(T)}$ with collective allreduce among training agents: synchronous (allreduce every step), stale-synchronous (merges within a maximum staleness), asynchronous/gossip (agents r and k merge with different partners at different rates)]
Peter H. Jin et al., "How to scale distributed deep learning?", NIPS MLSystems 2016. Shigang Li et al., "Taming unbalanced training workloads in deep learning with partial collective operations", PPoPP 2020.
Parameter consistency in deep learning
Elastic Average: using physical forces between the different versions of $w$:
$w^{(t+1,i)} = w^{(t,i)} - \eta \nabla w^{(t,i)} - \alpha \left( w^{(t,i)} - w_t \right)$
$w_{t+1} = (1 - \beta)\, w_t + \frac{\beta}{m} \sum_{i=1}^{m} w^{(t,i)}$
[Figure: timeline with agents 1..m taking local steps $w^{(1,i)}, w^{(2,i)}, \ldots, w^{(6,i)}$ between synchronizations with the parameter server, from $w^{(0)}$ to $w^{(T)}$]
S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
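A direct numpy transcription of the two elastic-averaging update rules above, applied to a toy quadratic; eta, alpha, beta, and the objective are placeholder assumptions.

import numpy as np

def easgd_step(agents_w, center_w, grads, eta=0.1, alpha=0.05, beta=0.4):
    """One elastic-averaging step: agents are pulled toward the center, the center toward the agents."""
    m = len(agents_w)
    new_agents = [w - eta * g - alpha * (w - center_w)              # w(t+1,i) = w(t,i) - eta*grad - alpha*(w(t,i) - w_t)
                  for w, g in zip(agents_w, grads)]
    new_center = (1 - beta) * center_w + beta / m * sum(agents_w)   # w_{t+1} = (1-beta)*w_t + beta/m * sum_i w(t,i)
    return new_agents, new_center

rng = np.random.default_rng(0)
agents = [rng.standard_normal(2) for _ in range(4)]
center = np.zeros(2)
for _ in range(100):
    grads = [2 * (w - np.array([1.0, -1.0])) for w in agents]   # toy loss ||w - (1, -1)||^2
    agents, center = easgd_step(agents, center, grads)
print(center)   # converges near (1, -1)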
Parameter consistency in deep learning
[Figure: ensemble learning, averaging the predicted class probabilities of several independently trained models over the classes Cat, Dog, Airplane, Truck, Horse, Bicycle]
T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
Communication optimizations
▪ Different options for how to optimize updates
▪ Send $\nabla w$, receive $w$
▪ Send FC factors $(o_{l-1}, o_l)$, compute $\nabla w$ on the parameter server; broadcast factors to avoid receiving the full $w$
▪ Use lossy compression when sending, accumulate the error locally!
▪ Quantization
▪ Quantize weight updates and potentially weights
▪ Main trick is stochastic rounding [1] – expectation is more accurate
▪ Enables low precision (half, quarter) to become standard
▪ TernGrad - ternary weights [2], 1-bit SGD [3], …
▪ Sparsification
▪ Do not send small weight updates, or only send the top-k [4]
▪ Accumulate omitted gradients locally (both tricks are sketched in code below)
[Figure: sharded parameter server, $w' = u(w, \nabla w)$, exchanging updates with training agents]
[1] S. Gupta et al. Deep Learning with Limited Numerical Precision, ICML'15. [2] F. Li and B. Liu. Ternary Weight Networks, arXiv 2016. [3] F. Seide et al. 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014. [4] C. Renggli et al. SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018.
source: ai.intel.com
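Hedged numpy sketches of two of the tricks above: stochastic rounding to a lower-precision grid, and top-k gradient sparsification with local accumulation of the omitted values. The grid spacing, k, and vector sizes are illustrative assumptions, not SparCML's implementation.

import numpy as np
rng = np.random.default_rng(0)

def stochastic_round(x, step=2 ** -8):
    """Round to a grid of spacing `step`; rounding up with probability equal to the
    fractional remainder keeps the expectation equal to x."""
    scaled = x / step
    lower = np.floor(scaled)
    return (lower + (rng.random(x.shape) < (scaled - lower))) * step

def topk_with_error_feedback(grad, residual, k):
    """Send only the k largest-magnitude entries; keep the rest in `residual` for later steps."""
    acc = grad + residual                        # add previously omitted gradient mass
    idx = np.argsort(np.abs(acc))[-k:]           # indices of the top-k entries
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                       # this is what actually gets communicated
    new_residual = acc - sparse                  # accumulate what was not sent
    return sparse, new_residual

g = rng.standard_normal(1000)
residual = np.zeros_like(g)
sparse, residual = topk_with_error_feedback(stochastic_round(g), residual, k=10)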
SparCML – Quantized sparse allreduce for decentral updates
[Figure: sparse allreduce combining $\nabla w_1, \nabla w_2, \nabla w_3, \nabla w_4$ in pairwise reduction stages]
C. Renggli et al. SparCML: High-Performance Sparse Communication for Machine Learning, SC 2019
Microsoft Speech Production Workload Results – 2 weeks → 2 days!
Six epochs, 60 million params
20 TB/night
Source: OpenAI
Opportunities
Neural Code Comprehension
[Figure: source code (C/C++, FORTRAN, Python, Java, CUDA, OpenCL, ...) is lowered by a compiler frontend to an SSA representation (LLVM IR), embedded into a learnable representation, and consumed by DNNs for high-level (downstream) tasks: malicious code detection (anti-virus), guided programming (IDE), code optimization and hardware mapping (compiler)]
Inspiration from Natural Language Processing (NLP)
The Naturalness of Software Hypothesis [1]
"Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools."
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton. A survey of machine learning for big code and naturalness. CoRR, abs/1709.06182, 2017.
Natural Language vs. Programming Languages

Natural language (Source: Wikipedia: Cat, https://en.wikipedia.org/wiki/Cat): "The domestic cat is a small, typically furry, carnivorous mammal. They are often called house cats when kept as indoor pets or simply cats when there is no need to distinguish them from other felids and felines. They are often valued by humans for companionship and for their ability to hunt. There are more than seventy cat breeds recognized by various cat registries."

Programming language:
int fibonacci(int n){
  if ((n==1)||(n==0))
    return(n);
  else
    return fibonacci(n-1) +
           fibonacci(n-2);
}

Read sequentially vs. distinct structural features
Local references vs. long-range dependencies:
...
int f = fibonacci(my_number);
...
Words come from a set vocabulary vs. a high rate of neologisms:
my_number   p   flower   fib   my_func
Semantically robust vs. semantically brittle:
0 1 1 2 3 5 8 13 21 34 ...   (correct)
0 1 2 4 8 16 32 64 128 256 ...   (a small change to the code yields an entirely different result)
Representations of code
[Figure: the fibonacci function shown as Source Code, Abstract Syntax Tree (AST), Static Single Assignment (SSA), and Assembly]

Static Single Assignment (LLVM IR):
define i32 @_Z9fibonaccii(i32 %n) #0 !dbg !4 {
  %1 = or i32 %n, 1, !dbg !16
  %2 = icmp eq i32 %1, 1, !dbg !16
  br i1 %2, label %9, label %3, !dbg !16

  %4 = add nsw i32 %n, -1, !dbg !18
  %5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19
  %6 = add nsw i32 %n, -2, !dbg !20
  %7 = tail call i32 @_Z9fibonaccii(i32 %6), !dbg !21
  %8 = add nsw i32 %7, %5, !dbg !22
  ret i32 %8, !dbg !23

  ret i32 %n, !dbg !23
}
Approaches: AST paths [Raychev et al. 2015] [Le et al. 2018] · DeepTune [Cummins et al. 2017] · code2vec [Alon et al. 2018] · inst2vec (LLVM) [Ben-Nun et al. 2018] · IR2Vec (LLVM) [Keerthy et al. 2019] · CDFG (LLVM) [Brauckmann et al. 2020] · ProGraML (LLVM, XLA) [WIP]
V. Raychev et al., "Predicting Program Properties from 'Big Code'", POPL 2015. C. Cummins et al., "End-to-end Deep Learning of Optimization Heuristics", PACT 2017. U. Alon et al., "code2vec: Learning Distributed Representations of Code", POPL 2018. T. Ben-Nun et al., "Neural Code Comprehension: A Learnable Representation of Code Semantics", NeurIPS 2018. Q. Le et al., "Deep learning at the shallow end: Malware classification for non-domain experts", DFRWS 2018. V. Keerthy et al., "IR2Vec: A Flow Analysis based Scalable Infrastructure for Program Encodings", arXiv 2019. A. Brauckmann et al., "Compiler-Based Graph Representations for Deep Learning Models of Code", CC 2020.
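To give a flavor of contextual embeddings for code (in the spirit of inst2vec, but not its actual pipeline), here is a tiny numpy sketch: normalized LLVM-IR-like statements are embedded by factorizing their co-occurrence matrix over a small context window. The statement list, normalization, and window size are made-up assumptions for illustration.

import numpy as np

# Toy "contextual flow": a normalized sequence of IR-like statements (identifiers abstracted away).
stmts = ["add i32", "icmp eq i32", "br i1", "add i32", "call i32", "add i32", "ret i32"]
vocab = sorted(set(stmts))
index = {s: i for i, s in enumerate(vocab)}

# Count how often statements appear near each other (window of +/- 2 in the sequence).
cooc = np.zeros((len(vocab), len(vocab)))
for i, s in enumerate(stmts):
    for j in range(max(0, i - 2), min(len(stmts), i + 3)):
        if i != j:
            cooc[index[s], index[stmts[j]]] += 1

# A low-rank factorization of the (log-scaled) co-occurrence matrix gives dense embeddings.
u, sigma, _ = np.linalg.svd(np.log1p(cooc))
embeddings = u[:, :3] * sigma[:3]          # 3-dimensional embedding per statement type
for s in vocab:
    print(s, np.round(embeddings[index[s]], 2))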
Representations of code
[Figure: comparison of the code2vec, inst2vec, and CDFG representations]
Self-Supervised Models of Code
ProGraML Overview
ProGraML on compiler tasks
Case Study: Algorithm Classification (104 Classes)
L. Mou et al., “Convolutional neural networks over tree structures for programming language processing”, AAAI 2016.
Case Study: Heterogeneous Device Mapping
CPU or GPU?
GPU thread-block size
Case Study: Branch Prediction
Summary – HPC → ML
▪ A supercomputing problem, amenable to established tools and tricks from HPC
▪ Concurrency is easy to attain, hard to program beyond data parallelism
▪ Main bottleneck in distributed training is communication – reduced by exploiting the robustness of SGD
▪ Co-design is prevalent
▪ Very different environment from traditional HPC
▪ Trade off accuracy for performance!
▪ Main objective is generalization
▪ A performance-centric view from HPC can be harmful for accuracy
https://www.arxiv.org/abs/1802.09941
Summary – ML → HPC
▪ Categorizing and understanding code is essential for various tasks
▪ Reasoning about code requires different tools than natural language
▪ Using classical compiler construction, structure can be recovered
▪ Results are promising for various classes of downstream tasks
https://arxiv.org/abs/1806.07336