
Page 1:

spcl.inf.ethz.ch · @spcl_eth

TAL BEN-NUN

ML ↔ HPC: Optimizing Optimizers for Optimization
Workshop on the Convergence of ML & HPC @ ASPLOS 2020 (Zoom)

WITH CONTRIBUTIONS FROM DAN ALISTARH, NIKOLI DRYDEN, YOSUKE OYAMA, CEDRIC RENGGLI, AND OTHERS AT SPCL

Stochastic Performance

Page 2:

20 TB/night

Source: OpenAI

Page 3:

A brief intro to supervised deep learning

[Figure: the network output f(x) assigns class probabilities (Cat 0.54, Dog 0.28, Airplane 0.02, Truck 0.07, Horse 0.33, Banana 0.02); the true label l(x) is the one-hot vector with Cat = 1.00.]

Labeled samples $x \in X \subset \mathcal{D}$, network $f(x): X \to Y$, label domain $Y$, true label $l(x)$
Network structure (fixed), weights $w$ (learned)

$w^{*} = \operatorname{argmin}_{w \in \mathbb{R}^{d}} \, \mathbb{E}_{x \sim \mathcal{D}}\,[\ell(w, x)]$

Layer-wise parameter update

Cross-entropy loss: $\ell_{ce}(w, x) = -\sum_{i} l(x)_{i} \cdot \log \frac{e^{f(x)_{i}}}{\sum_{k} e^{f(x)_{k}}}$

Squared loss: $\ell_{sq}(w, x) = \lVert f(x) - l(x) \rVert^{2}$

Page 4: A brief intro to supervised deep learning (continued)

(Same slide as the previous page, annotated with the scale of the problem:)

≥ TBs of random access

100 MiB-32 GiB and beyond; 30k-billions
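As a concrete illustration of the objective above, here is a minimal NumPy sketch (not from the talk) of the cross-entropy loss over softmax outputs and one SGD step on w; the linear classifier, class count, and learning rate are arbitrary stand-ins.

# Cross-entropy loss and one SGD step for a linear classifier f(x) = W x.
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, onehot_label):
    # l_ce(w, x) = -sum_i l(x)_i * log softmax(f(x))_i
    return -np.sum(onehot_label * np.log(softmax(scores) + 1e-12))

rng = np.random.default_rng(0)
num_classes, dim = 6, 32                  # e.g. {Cat, Dog, Airplane, Truck, Horse, Banana}
W = rng.standard_normal((num_classes, dim)) * 0.01   # weights w (learned)
x = rng.standard_normal(dim)                          # one labeled sample
label = np.eye(num_classes)[0]                        # true label l(x), one-hot

scores = W @ x                            # f(x)
loss = cross_entropy(scores, label)

# Gradient of the cross-entropy w.r.t. the scores is (softmax(f(x)) - l(x));
# backpropagate through the linear layer and take an SGD step.
grad_W = np.outer(softmax(scores) - label, x)
eta = 0.1
W -= eta * grad_W                         # layer-wise parameter update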

Page 5:

Trends in deep learning: hardware and multi-node

The field is moving fast – trying everything imaginable – survey results from 252 papers in the area of parallel deep learning

[Charts: hardware used; shared vs. distributed memory]

Deep Learning is largely on distributed memory today!

T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

Page 6:

Trends in distributed deep learning: node count and communication

Deep Learning research is converging to MPI!

The field is moving fast – trying everything imaginable – survey results from 252 papers in the area of parallel deep learning

T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

[Charts: node count over time; communication mode]

Page 7:

Computational Principles

Data


Page 9: Example: Options for computing convolutional layers

Direct: sliding-window evaluation of the convolution.
[Figure: numeric example of a direct convolution of a 4×4 input with a 3×3 kernel.]

Indirect:

FFT: $\mathcal{F}^{-1}\!\left(\mathcal{F}(x) \times \hat{w}\right)$ – convolution as an element-wise product in the frequency domain.

im2col: reshape the $C_{in} \cdot K_y \cdot K_x$ receptive fields of the $N \times C_{in} \times H \times W$ input into a $(C_{in} \cdot K_y \cdot K_x) \times (N \cdot H' \cdot W')$ matrix, reshape the weights to $C_{out} \times (C_{in} \cdot K_y \cdot K_x)$, then GEMM and col2im to obtain the $C_{out} \times H' \times W'$ output.

Winograd $F(m, r)$: $A^{T}\!\left[\,(G\,w\,G^{T}) \odot (B^{T}\,x\,B)\,\right] A$ with channel-wise summation – element-wise product of the transformed $r \times r$ filter and the $m' \times m'$ input tile, producing an $m \times m$ output tile, where $m' = m + r - 1$.

K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006
M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14
A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
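The im2col lowering can be made concrete with a short NumPy sketch (illustrative only, not the talk's or any library's implementation): receptive fields are unrolled into matrix columns so that the convolution becomes a single GEMM.

# im2col + GEMM convolution (single image, no padding, stride 1).
# Shapes follow the slide: input C_in x H x W, weights C_out x C_in x K_y x K_x.
import numpy as np

def im2col(x, ky, kx):
    c_in, h, w = x.shape
    h_out, w_out = h - ky + 1, w - kx + 1
    cols = np.empty((c_in * ky * kx, h_out * w_out))
    idx = 0
    for i in range(h_out):
        for j in range(w_out):
            cols[:, idx] = x[:, i:i + ky, j:j + kx].ravel()   # one receptive field per column
            idx += 1
    return cols, (h_out, w_out)

def conv2d_im2col(x, weights):
    c_out, c_in, ky, kx = weights.shape
    cols, (h_out, w_out) = im2col(x, ky, kx)
    # GEMM: (C_out x C_in*K_y*K_x) @ (C_in*K_y*K_x x H'*W')
    out = weights.reshape(c_out, -1) @ cols
    return out.reshape(c_out, h_out, w_out)   # "col2im" reduces to a reshape in this simple case

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))            # C_in=3, H=W=8
w = rng.standard_normal((16, 3, 3, 3))        # C_out=16, 3x3 kernels
y = conv2d_im2col(x, w)                       # shape (16, 6, 6)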

Page 10:

Operator Design

Separable convolution: [Figure: a full convolution (*) factored into a sequence of two smaller convolutions.]

A.G. Howard et al.: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv 2017.
F.N. Iandola et al.: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size", ICLR 2017.
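The MobileNets paper cited above factors convolutions into a depthwise and a pointwise stage; the transcript's figure does not say which separable variant is drawn, so the following NumPy sketch assumes the depthwise-separable form purely for illustration.

# Depthwise-separable convolution: per-channel KxK filters followed by a 1x1
# channel-mixing convolution (stride 1, no padding; illustrative only).
import numpy as np

def depthwise_separable_conv(x, depthwise_w, pointwise_w):
    c_in, h, w = x.shape
    k = depthwise_w.shape[-1]                 # depthwise_w: (C_in, K, K)
    h_out, w_out = h - k + 1, w - k + 1
    dw = np.empty((c_in, h_out, w_out))
    for c in range(c_in):                     # each channel convolved with its own KxK filter
        for i in range(h_out):
            for j in range(w_out):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * depthwise_w[c])
    # pointwise_w: (C_out, C_in); a 1x1 convolution mixes channels at each position
    return np.einsum('oc,chw->ohw', pointwise_w, dw)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
y = depthwise_separable_conv(x, rng.standard_normal((8, 3, 3)), rng.standard_normal((32, 8)))
# y.shape == (32, 14, 14); roughly K^2*C_in + C_in*C_out multiplies per output position
# instead of K^2*C_in*C_out for a full convolution.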

Page 11:

Transformers – Multi-Head Attention

A. Vaswani et al. “Attention Is All You Need,” NeurIPS 2017.
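The slide only names multi-head attention; as a reminder of the operator's core, here is a minimal NumPy sketch of scaled dot-product attention for a single head (shapes and values are arbitrary).

# Scaled dot-product attention for one head.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 10, 64, 64
Q, K, V = (rng.standard_normal((seq_len, d)) for d in (d_k, d_k, d_v))
out = scaled_dot_product_attention(Q, K, V)              # (seq_len, d_v)
# Multi-head attention runs h such heads on learned projections of Q, K, V in
# parallel and concatenates their outputs.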

Page 12: DNN Compilers

▪ Use techniques from compiler construction: DNN → Graph → IR → Transformations → HW Mapping

TensorFlow XLA · Facebook Glow · TorchScript JIT

Page 13: DNN Compilers (continued)

TensorFlow XLA · Facebook Glow · TorchScript JIT · TVM Stack · Intel nGraph

Page 14:

Partitioning Computation?


Data Parallelism

Page 15:

Minibatch Stochastic Gradient Descent (SGD)

[Figure: minibatch SGD training loop – the network output f(x) (Cat 0.54, Dog 0.28, Airplane 0.02, Truck 0.07, Horse 0.03, Bicycle 0.04, ...) is compared against the one-hot true label (Cat 1.00), and gradients of the loss drive layer-wise weight updates.]

T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019
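A compact NumPy sketch of the minibatch SGD loop on a softmax/cross-entropy model (synthetic data and a linear classifier as stand-ins; not the talk's code):

# Minibatch SGD: shuffle, take a batch, compute the average gradient, update.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, n_samples = 6, 32, 1024
X = rng.standard_normal((n_samples, dim))
labels = rng.integers(0, num_classes, n_samples)
W = np.zeros((num_classes, dim))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

batch_size, eta = 64, 0.1
for epoch in range(5):
    perm = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], labels[idx]
        probs = softmax(xb @ W.T)                 # forward pass f(x) for the minibatch
        probs[np.arange(len(idx)), yb] -= 1.0     # d loss / d scores = softmax - onehot
        grad_W = probs.T @ xb / len(idx)          # gradient averaged over the minibatch
        W -= eta * grad_W                         # SGD update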

Page 16:

Partitioning Computation?

Data Parallelism · Model Parallelism (Channel/Filter, Spatial) · Pipeline Parallelism (Layer)

[Figure: pipeline schedule across Proc 1-3, with minibatches 1-3 passing forward and backward through the stages and idle "bubble" slots between rounds.]

Page 17:

Partitioning Computation?

Data Parallelism
▪ Simple and efficient solution, easy to implement (see the allreduce sketch below)
▪ Duplicate parameters at all processors
▪ Affects generalization

Model Parallelism (Channel/Filter, Spatial)
▪ Parameters/domain can be distributed across processors
▪ Good for: large inputs, wide networks
▪ Complex communication per layer
▪ Performance hinges on implementation

Pipeline Parallelism (Layer)
▪ Parameters can be distributed across processors
▪ Good for: deep models, few activations
▪ Sparse communication pattern (only between pipeline stages)
▪ Consistent model introduces idle-time "bubble"
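A hedged sketch of the data-parallel scheme with mpi4py: every rank keeps a full copy of w, computes a gradient on its local data shard, and averages gradients with MPI_Allreduce. The model and gradient function are placeholders; only the communication pattern follows the slide.

# Data-parallel SGD with a gradient allreduce (run with e.g. `mpiexec -n 4 python train.py`).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

dim, eta = 1000, 0.01
w = np.zeros(dim)                         # every rank holds a full copy of the parameters
rng = np.random.default_rng(rank)         # each rank would read a different shard of the data

def local_gradient(w, rng):
    # Placeholder for forward/backward over this rank's local minibatch.
    return rng.standard_normal(dim) * 0.01 + w * 1e-3

for step in range(100):
    g_local = local_gradient(w, rng)
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)   # sum gradients across all ranks
    w -= eta * (g_global / size)                    # identical update on every rank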

Page 18:

Hybrid parallelism

A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014
J. Dean et al.: Large scale distributed deep networks, NIPS'12
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, CSUR 2019

▪ Layers/parameters can be distributed across processors

▪ Can distribute minibatch

▪ Often specific to layer-types (e.g., distribute fc layers but handle conv layers data-parallel)

▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!

[Figure: hybrid scheme combining Model Parallelism, Data Parallelism, and Layer (pipeline) Parallelism.]

Page 19:

Training is not just Training

K. Osawa et al.: "Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs", arXiv 2018.
T. Karras et al.: "Progressive Growing of GANs for Improved Quality, Stability, and Variation", arXiv 2017.

Imbalanced workload over time · Nontrivial gradient aggregation · Data/compute redistribution

Page 20:

Hyperparameter and Architecture search

[Figure panels: Reinforcement Learning [1] · Evolutionary Algorithms [4] · Model-Based Optimization [5,6]]

▪ Meta-optimization of hyper-parameters (momentum) and DNN architecture
▪ Using Reinforcement Learning [1] (explore/exploit different configurations)
▪ Genetic Algorithms with modified (specialized) mutations [2]
▪ Particle Swarm Optimization [3] and other meta-heuristics
▪ Multi-level optimization

[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017
[2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018
[3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17
[4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
[5] R. Luo et al.: Neural Architecture Optimization, NeurIPS'18
[6] H. Liu et al.: DARTS: Differentiable Architecture Search, ICLR'19


Page 22:

Updating parameters in distributed data parallelism

Central: parameter server (sharded), $w' = u(w, \nabla w)$ – training agents send $\nabla w$ and receive $w$:
$T = 2L + 2P\,\gamma\,(m/s)\,G$

Decentral: collective allreduce of $w$ among the training agents:
$T = 2L \log_2 P + 2\gamma m G\,(P-1)/P$

Collective operations · Topologies · Neighborhood collectives · RMA?

Hierarchical Parameter Server: S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. ICDM'16
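The two cost models can be compared directly; the Python snippet below just evaluates the slide's formulas with made-up parameter values (illustrative placeholders, not measurements).

# Evaluate the allreduce and parameter-server cost models from the slide.
from math import log2

def t_allreduce(L, gamma, m, G, P):
    # decentral: collective allreduce of w
    return 2 * L * log2(P) + 2 * gamma * m * G * (P - 1) / P

def t_param_server(L, gamma, m, G, P, s):
    # central: sharded parameter server with s shards
    return 2 * L + 2 * P * gamma * (m / s) * G

# Arbitrary placeholder values purely to exercise the formulas:
L, G, gamma = 1e-6, 1e-9, 4
m, P, s = 25_000_000, 64, 8
print(t_allreduce(L, gamma, m, G, P))         # ~0.197
print(t_param_server(L, gamma, m, G, P, s))   # ~1.6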

Page 23:

Parameter (and Model) consistency - centralized: parameter server (sharded), $w' = u(w, \nabla w)$

▪ Trades off "statistical performance" for "hardware performance"
▪ Parameter exchange frequency can be controlled, while still attaining convergence (see the staleness sketch below)

[Figure: parameter-server timelines from $w^{(0)}$ to $w^{(T)}$ for Synchronous (all agents synchronize every step), Stale-Synchronous / Bounded Asynchronous (agents' versions $w^{(t,i)}$ may diverge up to a maximum staleness), and Asynchronous training.]

Consistency spectrum (consistent → inconsistent): Synchronous SGD → Stale-Synchronous SGD → Model Averaging (e.g., elastic) → Asynchronous SGD (HOGWILD!) → Ensemble Learning

J. Dean et al.: Large scale distributed deep networks, NIPS'12.
F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11.
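A single-process toy simulation of the stale-synchronous regime (synthetic quadratic loss; the staleness bound and all constants are arbitrary), showing that updates computed on slightly outdated parameters can still make progress:

# Bounded-staleness SGD: gradients use a parameter version up to MAX_STALENESS steps old.
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
dim, eta, MAX_STALENESS = 10, 0.05, 3
target = rng.standard_normal(dim)
w = np.zeros(dim)

history = deque([w.copy()], maxlen=MAX_STALENESS + 1)    # recent parameter versions
for step in range(500):
    staleness = rng.integers(0, len(history))            # agent reads a (possibly old) version
    w_stale = history[-1 - staleness]
    grad = (w_stale - target) + 0.01 * rng.standard_normal(dim)
    w -= eta * grad                                       # update applied to the current w
    history.append(w.copy())
# With bounded staleness, w still converges toward `target`, just more noisily.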

Page 24:

Parameter (and Model) consistency - decentralized: collective allreduce of $w$ among the training agents

▪ Parameter exchange frequency can be controlled, while still attaining convergence
▪ Using Gossip Algorithms [Jin et al. 2016] or Partial Collectives [Li et al. 2020]

[Figure: allreduce-based timelines from $w^{(0)}$ to $w^{(T)}$ for Synchronous, Stale-Synchronous / Bounded Asynchronous (bounded maximum staleness), and Asynchronous training, where subsets of agents merge their versions $w^{(t,i)}$ without a global synchronization.]

Consistency spectrum (consistent → inconsistent): Synchronous SGD → Stale-Synchronous SGD → Model Averaging (e.g., elastic) → Asynchronous SGD (HOGWILD!) → Ensemble Learning

Peter H. Jin et al., "How to scale distributed deep learning?", NIPS MLSystems 2016
Shigang Li et al., "Taming unbalanced training workloads in deep learning with partial collective operations", PPoPP 2020

Page 25:

Parameter consistency in deep learning

Consistency spectrum (consistent → inconsistent): Synchronous SGD → Stale-Synchronous SGD → Model Averaging (e.g., elastic) → Asynchronous SGD (HOGWILD!) → Ensemble Learning

Elastic Averaging: using physical forces between different versions of $w$:

$w^{(t+1,i)} = w^{(t,i)} - \eta\,\nabla w^{(t,i)} - \alpha\,\bigl(w^{(t,i)} - w_t\bigr)$

$w_{t+1} = (1-\beta)\,w_t + \frac{\beta}{m}\sum_{i=1}^{m} w^{(t,i)}$

[Figure: parameter-server timeline from $w^{(0)}$ to $w^{(T)}$ in which agents 1..m take several local steps $w^{(1,i)}, \dots, w^{(6,i)}$ between synchronizations with the center variable.]

S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
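A single-process NumPy simulation of the elastic-averaging update above, with m simulated agents and a synthetic quadratic loss (an illustration of the update rule only, not a distributed implementation):

# Elastic Averaging SGD: local SGD plus a pull towards the center variable,
# and a moving-average update of the center.
import numpy as np

rng = np.random.default_rng(0)
m, dim = 4, 10
eta, alpha, beta = 0.05, 0.1, 0.4
target = rng.standard_normal(dim)             # optimum of the synthetic loss

w_local = [rng.standard_normal(dim) for _ in range(m)]   # w^{(t,i)}
w_center = np.zeros(dim)                                  # center variable w_t

def grad(w):
    return w - target + 0.01 * rng.standard_normal(dim)  # noisy quadratic gradient

for t in range(200):
    for i in range(m):
        # local step: SGD plus the elastic pull towards the center
        w_local[i] -= eta * grad(w_local[i]) + alpha * (w_local[i] - w_center)
    # center step: moving average of the local replicas
    w_center = (1 - beta) * w_center + (beta / m) * sum(w_local)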

Page 26:

Parameter consistency in deep learning

Consistency spectrum (consistent → inconsistent): Synchronous SGD → Stale-Synchronous SGD → Model Averaging (e.g., elastic) → Asynchronous SGD (HOGWILD!) → Ensemble Learning

[Figure: Ensemble Learning – the class probabilities of independently trained member networks are averaged ("Avg.") into the ensemble prediction over Cat, Dog, Airplane, Truck, Horse, Bicycle.]

T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
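Ensemble averaging itself is a one-liner; the sketch below averages synthetic member probabilities the way the figure's "Avg." box does (the numbers are made up):

# Average the class probabilities of independently trained ensemble members.
import numpy as np

member_probs = np.array([
    [0.60, 0.20, 0.05, 0.05, 0.08, 0.02],   # member 1: P(class | x)
    [0.48, 0.36, 0.01, 0.09, 0.04, 0.02],   # member 2
    [0.54, 0.28, 0.02, 0.07, 0.05, 0.04],   # member 3
])
ensemble = member_probs.mean(axis=0)         # the "Avg." box
prediction = int(np.argmax(ensemble))        # index 0, e.g. "Cat"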

Page 27:

Communication optimizations: parameter server (sharded), $w' = u(w, \nabla w)$

▪ Different options for how to optimize updates:
▪ Send $\nabla w$, receive $w$
▪ Send FC factors $(o_{l-1}, o_l)$ and compute $\nabla w$ on the parameter server; broadcast the factors to avoid receiving the full $w$
▪ Use lossy compression when sending, and accumulate the error locally!
▪ Quantization
▪ Quantize weight updates and potentially weights
▪ Main trick is stochastic rounding [1] – the expectation is more accurate; enables low precision (half, quarter) to become standard
▪ TernGrad – ternary weights [2], 1-bit SGD [3], ...
▪ Sparsification
▪ Do not send small weight updates, or only send the top-k [4]; accumulate omitted gradients locally (see the sketch below)

[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018

Source: ai.intel.com
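A hedged sketch of top-k gradient sparsification with local error accumulation (NumPy, single process; the exchange of the sparse updates, e.g. via a sparse allreduce, is omitted):

# Keep only the k largest-magnitude gradient entries; accumulate the rest locally.
import numpy as np

class TopKCompressor:
    def __init__(self, dim, k):
        self.residual = np.zeros(dim)   # omitted gradient mass, accumulated locally
        self.k = k

    def compress(self, grad):
        acc = grad + self.residual                            # add back what was not sent before
        idx = np.argpartition(np.abs(acc), -self.k)[-self.k:] # indices of the k largest entries
        sparse = np.zeros_like(acc)
        sparse[idx] = acc[idx]                                # values that will be communicated
        self.residual = acc - sparse                          # keep the rest for the next step
        return idx, acc[idx]

rng = np.random.default_rng(0)
dim, k = 1_000_000, 1_000                                     # send 0.1% of the entries
comp = TopKCompressor(dim, k)
grad = rng.standard_normal(dim)
indices, values = comp.compress(grad)                         # this (index, value) pair is what
                                                              # a sparse allreduce would exchange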

Page 28:

SparCML – Quantized sparse allreduce for decentral updates

[Figure: the sparse gradients $\nabla w_1, \nabla w_2, \nabla w_3, \nabla w_4$ are summed pairwise over successive rounds of the sparse allreduce.]

C. Renggli et al. SparCML: High-Performance Sparse Communication for Machine Learning, SC 2019

Microsoft Speech Production Workload Results – 2 weeks → 2 days!

Six epochs, 60 million params

Page 29:

20 TB/night

Source: OpenAI

Page 30:

Opportunities

Neural Code Comprehension

[Figure: pipeline from source code (C/C++, FORTRAN, Python, Java, CUDA, OpenCL, ...) through a compiler frontend to an SSA representation (LLVM IR), from which DNNs learn a representation used for high-level (downstream) tasks: malicious code detection (anti-virus), guided programming (IDE), code optimization and hardware mapping (compiler).]

Page 31:

Inspiration from Natural Language Processing (NLP)

The Naturalness of Software Hypothesis [1]

"Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools."

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton: A survey of machine learning for big code and naturalness. CoRR, abs/1709.06182, 2017.

Page 32:

Natural Language vs. Programming Languages

Natural language example: "The domestic cat is a small, typically furry, carnivorous mammal. They are often called house cats when kept as indoor pets or simply cats when there is no need to distinguish them from other felids and felines. They are often valued by humans for companionship and for their ability to hunt. There are more than seventy cat breeds recognized by various cat registries." (Source: Wikipedia: Cat, https://en.wikipedia.org/wiki/Cat)

Programming language example:

int fibonacci(int n) {
  if ((n == 1) || (n == 0))
    return n;
  else
    return fibonacci(n - 1) + fibonacci(n - 2);
}

Read sequentially | Distinct structural features

Page 33: Natural Language vs. Programming Languages (continued)

(Same comparison, with an additional row:)

Local references | Long range dependencies

...
int f = fibonacci(my_number);
...

Page 34: Natural Language vs. Programming Languages (continued)

(Same comparison, with an additional row:)

Words come from a set vocabulary | High rate of neologisms (identifiers such as my_number, p, flower, fib, my_func)

Page 35: Natural Language vs. Programming Languages (continued)

(Same comparison, with an additional row:)

Semantically robust | Semantically brittle

[Example: the expected output 0 1 1 2 3 5 8 13 21 34 ... vs. an incorrect output 0 1 2 4 8 16 32 64 128 256 ...]

Page 36:

Representations of code

Source Code · Abstract Syntax Tree (AST) · Static Single Assignment (SSA) · Assembly

[Figure: the fibonacci function as an AST – an if node whose branches return n and return fib(n-1) + fib(n-2).]

LLVM IR (SSA form) of fibonacci:

define i32 @_Z9fibonaccii(i32 %n) #0 !dbg !4 {
  %1 = or i32 %n, 1, !dbg !16
  %2 = icmp eq i32 %1, 1, !dbg !16
  br i1 %2, label %9, label %3, !dbg !16

  %4 = add nsw i32 %n, -1, !dbg !18
  %5 = tail call i32 @_Z9fibonaccii(i32 %4), !dbg !19
  %6 = add nsw i32 %n, -2, !dbg !20
  %7 = tail call i32 @_Z9fibonaccii(i32 %6), !dbg !21
  %8 = add nsw i32 %7, %5, !dbg !22
  ret i32 %8, !dbg !23

  ret i32 %n, !dbg !23
}

Page 37:

Representations of code

(Same representations as on the previous slide, annotated with learning approaches:)

AST paths [Raychev et al. 2015, Le et al. 2018] · code2vec [Alon et al. 2018] · DeepTune [Cummins et al. 2017] · inst2vec (LLVM) [Ben-Nun et al. 2018] · IR2Vec (LLVM) [Keerthy et al. 2019] · CDFG (LLVM) [Brauckmann et al. 2020] · ProGraML (LLVM, XLA) [WIP]

V. Raychev et al., "Predicting Program Properties from 'Big Code'", POPL 2015.
C. Cummins et al., "End-to-end Deep Learning of Optimization Heuristics", PACT 2017.
U. Alon et al., "code2vec: Learning Distributed Representations of Code", POPL 2018.
T. Ben-Nun et al., "Neural Code Comprehension: A Learnable Representation of Code Semantics", NeurIPS 2018.
Q. Le et al., "Deep learning at the shallow end: Malware classification for non-domain experts", DFRWS 2018.
V. Keerthy et al., "IR2Vec: A Flow Analysis based Scalable Infrastructure for Program Encodings", arXiv 2019.
A. Brauckmann et al., "Compiler-Based Graph Representations for Deep Learning Models of Code", CC 2020.

Page 38:

Representations of code

code2vec · inst2vec · CDFG

Page 39:

Self-Supervised Models of Code

Page 40:

ProGraML Overview


Page 43:

ProGraML on compiler tasks

Page 44:

Case Study: Algorithm Classification (104 Classes)

L. Mou et al., “Convolutional neural networks over tree structures for programming language processing”, AAAI 2016.

Page 45:

Case Study: Heterogeneous Device Mapping

CPU or GPU?

GPU thread-block size

Page 46:

Case Study: Branch Prediction

Page 47: @spcl eth AL EN UN Performance ML ↔ HPC: Optimizing ... · spcl.inf.ethz.ch @spcl_eth TAL BEN-NUN ML ↔ HPC: Optimizing Optimizers for Optimization Workshop on the Convergence

spcl.inf.ethz.ch

@spcl_eth

▪ A supercomputing problem - amenable to established tools and tricks from HPC

▪ Concurrency is easy to attain, hard to program beyond data-parallelism

▪ Main bottleneck in distributed is communication – reduction by using the robustness of SGD

▪ Co-design is prevalent

▪ Very different environment from traditional HPC

▪ Trade-off accuracy for performance!

▪ Main objective is generalization

▪ Performance-centric view in HPC can be harmful for accuracy

70

Summary – HPC → ML

https://www.arxiv.org/abs/1802.09941

Page 48: @spcl eth AL EN UN Performance ML ↔ HPC: Optimizing ... · spcl.inf.ethz.ch @spcl_eth TAL BEN-NUN ML ↔ HPC: Optimizing Optimizers for Optimization Workshop on the Convergence

spcl.inf.ethz.ch

@spcl_eth

▪ Categorizing and understanding code is essential for various tasks

▪ Reasoning about code requires different tools than natural languages

▪ Using classical compiler construction, structure can be recovered

▪ Results are promising for various classes of downstream tasks

71

Summary – ML → HPC

https://arxiv.org/abs/1806.07336