Scaling Deep Learning Algorithms on Extreme Scale Architectures
ABHINAV VISHNU
Team Lead, Scalable Machine Learning, Pacific Northwest National Laboratory
MVAPICH User Group (MUG) 2017


TRANSCRIPT

Page 1: Scaling Deep Learning Algorithms on Extreme Scale Architectures (mug.mvapich.cse.ohio-state.edu)

Scaling Deep Learning Algorithms on Extreme Scale Architectures

ABHINAV VISHNU
Team Lead, Scalable Machine Learning, Pacific Northwest National Laboratory
MVAPICH User Group (MUG) 2017

Page 2:

The rise of Deep Learning!

Feed-forward / Back-propagation

Several scientific applications have shown remarkable improvements in modeling/classification tasks!

Human accuracy!

Page 3:

Challenges in Using Deep Learning

• How to design the DNN topology?
• Which samples are important?
• How to handle unlabeled data?
• Supercomputers are typically used for simulation – are they effective for DL implementations?
• How much effort is required to use DL algorithms?
• Will it only reduce time-to-solution, or also improve the baseline performance of the model?

Page 4:

Vision for Machine/Deep Learning R&D

• Novel Machine Learning/Deep Learning Algorithms
• Extreme Scale ML/DL Algorithms
• MaTEx: Machine Learning Toolkit for Extreme Scale
• DL Applications: HEP, SLAC, Power Grid, HPC, Chemistry

Page 5:

Novel ML/DL Algorithms: Pruning Neurons

Which neurons are important? Adaptive Neuron Apoptosis for Accelerating DL Algorithms.

[Figure: (a) proposed adaptive pruning during the training phase, where the error decays; (b) state-of-the-art pruning after training, with the error fixed, requiring a separate re-training phase.]

Area under the ROC curve: 1) improved from 0.88 to 0.94; 2) 2.5x speedup in learning time; 3) 3x simpler model.

[Figure: speedup and parameter reduction relative to 20 cores without apoptosis, for Default, Conservative, Normal, and Aggressive pruning on 20, 40, and 80 cores; improvements range up to roughly 21x.]
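The idea above can be sketched in a few lines. This is a minimal, magnitude-based stand-in for neuron apoptosis: it drops hidden units whose total outgoing weight is below a cutoff, shrinking the layer. The published method adapts its criterion and schedule during training (the conservative/normal/aggressive settings in the figure), which this toy version does not reproduce; all names here are illustrative.

```python
def apoptosis(w_in, w_out, threshold):
    """Drop hidden neurons whose total outgoing weight magnitude is
    below `threshold`, returning the shrunken weight lists.

    w_in[i]  -- incoming weights of hidden neuron i
    w_out[i] -- outgoing weights of hidden neuron i
    """
    keep = [i for i, w in enumerate(w_out)
            if sum(abs(x) for x in w) >= threshold]
    return [w_in[i] for i in keep], [w_out[i] for i in keep]

# Three hidden neurons; neuron 1 barely contributes downstream.
w_in = [[0.2, 0.8], [1.5, -0.3], [0.0, 0.1]]
w_out = [[0.9], [0.01], [-1.2]]

w_in2, w_out2 = apoptosis(w_in, w_out, 0.05)
# Neuron 1 (|0.01| < 0.05) is removed; the model has one fewer unit.
```

In the actual training loop this would run periodically between epochs, followed by continued training of the smaller network.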

Page 6:

Novel ML/DL Algorithms: Neuro-genesis

Can you create neural network topologies semi-automatically? Generating Neural Networks from BluePrints.

[Figure: network topologies generated during training from a blueprint, with hidden-layer widths tapering from roughly 2000 and 1500 units down through progressively smaller layers.]
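As a purely hypothetical illustration of what "generating a topology from a blueprint" could look like: the slide does not spell out the BluePrints scheme, so the function below simply expands a tiny blueprint (starting width, depth, taper factor) into a concrete list of layer widths. The name and parameters are assumptions, not the paper's API.

```python
def layers_from_blueprint(width, depth, taper):
    """Expand a (width, depth, taper) blueprint into hidden-layer
    sizes, tapering geometrically and never going below one unit.

    Illustrative only: the actual BluePrints generation scheme is not
    described on the slide."""
    sizes = []
    w = width
    for _ in range(depth):
        sizes.append(max(1, int(w)))
        w *= taper
    return sizes

# One blueprint yields a concrete topology:
topology = layers_from_blueprint(2000, 4, 0.5)  # [2000, 1000, 500, 250]
```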

Page 7:

Novel ML/DL Algorithms: Sample Pruning

Which samples are important? YinYang Deep Learning for Large Scale Systems.

[Figure: training timeline split into eons; early epochs visit all batches (Batch0 … Batchn), while later epochs within an eon run on a pruned set of batches (Batch0 … Batchp).]
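One common way to decide which samples matter is to rank them by loss; the sketch below is a loss-ranking stand-in for the slide's YinYang sample pruning (whose exact selection criterion may differ). Samples the model already classifies easily contribute little gradient, so later epochs revisit only the "hard" ones.

```python
def prune_batch(samples, losses, keep_fraction):
    """Keep only the highest-loss fraction of samples for later
    epochs. A generic loss-based heuristic, not the paper's exact
    YinYang criterion."""
    ranked = sorted(zip(losses, samples), reverse=True)
    n = max(1, int(len(ranked) * keep_fraction))
    return [s for _, s in ranked[:n]]

samples = ["a", "b", "c", "d"]
losses = [0.9, 0.1, 0.5, 0.05]

hard = prune_batch(samples, losses, 0.5)
# Keeps the two highest-loss samples: "a" and "c".
```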

Page 8:

Scaling DL Algorithms Using Asynchronous Primitives

August 15, 2017

[Figure: an all-to-all reduction (MPI_Allreduce, NCCL allreduce) over the interconnect (NVLink, PCIe, InfiniBand). The master thread enqueues gradient buffers; an asynchronous thread dequeues them and issues MPI_Allreduce, with each reduction marked not started, in progress, or completed.]
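The enqueue/dequeue pattern in the figure can be sketched with a queue and a background thread. In a real MaTEx-style implementation the reduction would be an MPI_Allreduce (e.g. via mpi4py's `comm.Allreduce`); here a local stand-in function is used so the sketch runs without MPI, and all names are illustrative.

```python
import queue
import threading

def async_reducer(work_q, results, reduce_fn):
    """Drain the queue and reduce each gradient buffer.

    reduce_fn stands in for an allreduce across ranks; the master
    thread keeps computing while this thread overlaps communication."""
    while True:
        item = work_q.get()
        if item is None:  # sentinel: no more gradients this step
            break
        name, grad = item
        results[name] = reduce_fn(grad)

work_q = queue.Queue()
results = {}
# Stand-in "allreduce": identity average at world size 1.
reducer = threading.Thread(
    target=async_reducer,
    args=(work_q, results, lambda g: [x * 1.0 for x in g]))
reducer.start()

# Master thread enqueues gradient buffers as backprop produces them.
work_q.put(("layer0", [0.5, -1.0]))
work_q.put(("layer1", [2.0]))
work_q.put(None)
reducer.join()
# results now holds the reduced gradients per layer.
```

The design point is overlap: backprop for layer k can proceed while the reduction for layer k+1 is in flight on the asynchronous thread.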

Page 9:

Sample Results

[Figure: batches per second for AGD vs. BaG on 4–128 GPUs, and for SGD vs. AGD on 8–64 compute nodes; strong scaling on PIC with MVAPICH and weak scaling on SummitDev with IBM Spectrum MPI.]

Page 10:

What does Fault Tolerant Deep Learning Need from MPI?

MPI has been criticized heavily for its lack of fault-tolerance support. The candidates: 1) existing MPI implementations, 2) User-Level Fault Mitigation (ULFM), 3) the Reinit proposal. Which proposal is necessary and sufficient?

Code snippet of the original callback:

    // Original on_gradients_ready
    on_gradients_ready(float *buf) {
      // conduct in-place allreduce of gradients
      rc = MPI_Allreduce(..., ...);
      // average the gradients by communicator size
      ...
    }

Code snippet for fault-tolerant DL:

    // Fault tolerant on_gradients_ready
    on_gradients_ready(float *buf) {
      // conduct in-place allreduce of gradients
      rc = MPI_Allreduce(..., ...);
      while (rc != MPI_SUCCESS) {
        // shrink the communicator to a new comm.
        MPIX_Comm_shrink(origcomm, &newcomm);
        rc = MPI_Allreduce(..., ...);
      }
      // average the gradients by communicator size
      ...
    }

Page 11:

Impact of DL on Other Application Domains

• Computational Chemistry: can molecular structure predict molecular properties?
• Buildings, Power Grid: what DL techniques are useful for energy modeling of buildings?
• HPC, Faults: when do multi-bit faults result in application error?

Page 12:

MaTEx: Machine Learning Toolkit for Extreme Scale

August 14, 2017

1) Open-source software with users in academia, laboratories, and industry.
2) Supports GPU and CPU clusters and leadership computing facilities (LCFs) with high-end systems/interconnects.
3) Machine Learning Toolkit for Extreme Scale (MaTEx): github.com/matex-org/matex

Page 13:

Architectures Supported by MaTEx

GPU Arch.:    K20 (Gemini), K40, K80, P100
CPU Arch.:    Xeon (Sandy Bridge, Haswell), Intel Knights Landing, Power 8
Interconnect: InfiniBand, Ethernet, Omni-Path

"Comparing the Performance of NVIDIA DGX-1 and Intel KNL on Deep Learning Workloads," ParLearning'17, IPDPS'17

Page 14:

Demystifying Extreme Scale DL

Google TensorFlow stack: TF scripts (gRPC) on top of the TF runtime, data readers, and architectures. Not attractive for scientists!

MaTEx-TensorFlow stack: the same TF scripts on top of the TF runtime (with MPI changes), data readers, and architectures. Requires no TF-specific changes from users, and supports automatic distribution of the HDF5, CSV, and PNetCDF formats.
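"Automatic distribution" of a dataset boils down to giving each MPI rank its own shard of the samples. The sketch below shows one plausible even-split policy; MaTEx's actual readers handle the HDF5/CSV/PNetCDF parsing and may shard differently, so treat this as an assumption-laden illustration.

```python
def partition(num_samples, rank, world_size):
    """Contiguous shard of sample indices for one rank, splitting the
    remainder over the first ranks so shard sizes differ by at most 1.

    Illustrative policy only; MaTEx's readers may shard differently."""
    base, rem = divmod(num_samples, world_size)
    start = rank * base + min(rank, rem)
    stop = start + base + (1 if rank < rem else 0)
    return range(start, stop)

# Example: 10 samples over 4 ranks -> shards of sizes 3, 3, 2, 2.
shards = [partition(10, r, 4) for r in range(4)]
```

In an MPI program, `rank` and `world_size` would come from the communicator (e.g. `comm.Get_rank()` / `comm.Get_size()` in mpi4py), and each rank would read only its own index range from the file.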

Page 15:

Example Code Changes

MaTEx-TensorFlow code:

    import tensorflow as tf
    import numpy as np
    ...
    from datasets import DataSet
    ...
    image_net = DataSet(...)
    data = image_net.training_data
    labels = image_net.training_labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Original TF code:

    import tensorflow as tf
    import numpy as np
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Fig. 3: (Left) A sample MaTEx-TensorFlow script; (right) the original TensorFlow script. Notice that MaTEx-TensorFlow requires no TensorFlow-specific changes.

TABLE I: Hardware and software description. IB = InfiniBand. The proposed research extends Baseline-Caffe, incorporating architecture-specific optimizations provided by vendors.

    Name  CPU (#cores)    GPU  Network  MPI            cuDNN  CUDA  Nodes  #cores
    K40   Haswell (20)    K40  IB       OpenMPI 1.8.3  4      7.5   8      160
    SP    Ivybridge (20)  N/A  IB       OpenMPI 1.8.4  N/A    N/A   20     400

TABLE II: Datasets and neural networks used for performance evaluation.

    Dataset        Neural Network    Description     Training Samples  Validation Samples  Image Size  Classes
    ImageNet [33]  AlexNet [18]      Diverse Images  1281167           50000               256x256x3   1000
    ImageNet       GoogLeNet [34]    Diverse Images  1281167           50000               256x256x3   1000
    ImageNet       InceptionV3 [35]  Diverse Images  1281167           50000               256x256x3   1000
    ImageNet       ResNet50 [36]     Diverse Images  1281167           50000               256x256x3   1000

Fig. 4: Computation costs relative to AlexNet. Fig. 5: Number of parameters relative to AlexNet.

Excerpt from the paper: "It provides abstractions for both pair-wise and group communication and is capable of using high speed interconnects natively, making it particularly suitable to supercomputing environments. Among the toolkits that use MPI are Microsoft CNTK, the Machine Learning Toolkit for Extreme Scale (MaTEx) version of Caffe [37], [45], [46], [47], [48], [49], [50], and the multi-node version of Chainer. TensorFlow itself provides abstractions for building DL algorithms, including computational graph structures and automatic differentiation. Furthermore, it provides methods for ..."

"User-transparent Distributed TensorFlow," A. Vishnu et al., arXiv'17

Supports automatic distribution of the HDF5, CSV, and PNetCDF formats.

Page 16:

User-Transparent Distributed Keras

1) Distributed Keras with MPI is available at github.com/matex-org/matex.
2) Currently the only Keras implementation that does not require any MPI-specific changes to code.
3) Tested on NERSC architectures.

MaTEx-Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    dataset = tf.DataSet(...)
    data = dataset.training_data
    labels = dataset.training_labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...

Original Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...

Page 17:

Use-Case: SLAC Water/Ice Classification

Reducing the time to new science – from experiment to publication.

Typical experiment:
1) ~100 images/sec
2) ~100 TB of data
3) Problem further exacerbated for the upcoming LCLS-2 (up to 1M images/sec)
4) Several domains exhibit these characteristics

Typical problems:
1) Too many images – can we find the important ones?
2) Unknown whether the experiment is on the "right track": results are not known till post-hoc data analysis.
3) If the experiment succeeds: exorbitant time is spent (several man-days) on data cleaning/labeling, and several more man-days on manual data analysis (such as generating probability distribution functions).

Can we do better?

Page 18:

Sample Proof: Distinguishing Water from Ice

Dataset specification:
1) ~68 GB of data consisting of images with water and ice crystals.
2) Scientists spent 17 man-days labeling each image as representing water or ice.
3) Objective – can we reduce the labeling time while achieving very high accuracy? We take 4000 samples, label 1200 to 2800 of them using deep learning (convolutional + deep neural architectures), and measure accuracy on the remaining samples.
   Observation: with 2800 labeled samples, we classify ~97% of the remaining samples correctly.
4) Conclusion: a major reduction in labeling time with results matching human labeling – potential for a significant reduction in time to scientific discovery; labeling only "boundary" samples would further reduce the human effort.

[Figure: testing accuracy vs. time (in minutes) on the water/ice dataset, for training sets of 1203, 2005, 2807, and 3609 labeled samples; accuracy ranges roughly from 0.45 to 0.95.]
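The shape of this experiment (label a budgeted subset, train, score on the held-out rest) can be reproduced on toy data. The sketch below uses a nearest-centroid classifier on synthetic 1-D points rather than the study's convolutional networks on x-ray images, so the numbers are not comparable; it only demonstrates the labeling-budget methodology.

```python
import random

def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on labeled (value, label)
    pairs and return its accuracy on the held-out pairs."""
    cents = {}
    for label in (0, 1):
        vals = [x for x, y in train if y == label]
        cents[label] = sum(vals) / len(vals)
    correct = sum(
        1 for x, y in test
        if min(cents, key=lambda c: abs(x - cents[c])) == y)
    return correct / len(test)

rng = random.Random(0)
# Synthetic "water" (around 0.0) and "ice" (around 4.0) samples.
data = [(rng.gauss(0, 1), 0) for _ in range(2000)] + \
       [(rng.gauss(4, 1), 1) for _ in range(2000)]
rng.shuffle(data)

# Spend a 1200-sample labeling budget, evaluate on the remaining 2800.
acc = nearest_centroid_accuracy(data[:1200], data[1200:])
```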

Page 19:

Prototype for Semi-Supervised Learning

Model re-trained and recommendations changed.

Page 20:

Collaborators

Jeff Daily, Charles Siegel, Vinay Amatya, Leon Song, Ang Li, Garrett Goh, Malachi Schram, Joseph Manzano, Vikas Chandan, Thomas J. Lane (SLAC)

Page 21:

Thanks!

Contact: [email protected]
MaTEx webpage: https://github.com/matex-org/matex/
Publications: https://github.com/matex-org/matex/wiki/publications