
Recent progress on distributing deep learning

Viet-Trung Tran, KDE Lab

Department of Information Systems, School of Information and Communication Technology

1

Outline

•  State of the art
•  Overview of neural networks and deep learning
•  Deep learning driven factors
•  Scaling deep learning

2

3

4

5

6

Perceptron

7
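The slide itself is a figure; as a minimal illustration (my own sketch, not the presenter's code), a perceptron is just a weighted sum followed by a threshold, trained with the classic mistake-driven update rule:

```python
import numpy as np

# Minimal perceptron sketch (illustrative only): a threshold unit trained with
# the classic perceptron rule on a linearly separable toy problem (logical AND).
def train_perceptron(X, y, epochs=20, lr=0.1):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # Update only on mistakes: nudge the decision boundary toward the example.
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                     # AND is representable by one perceptron
w, b = train_perceptron(X, y)
print([1 if xi @ w + b > 0 else 0 for xi in X])   # expected: [0, 0, 0, 1]
```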

Feed forward neural network

8
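Again only as an illustrative sketch (the layer sizes, tanh activation, and random weights are assumptions, not from the slides), a feed-forward network is perceptron-like layers stacked and applied in sequence:

```python
import numpy as np

# Forward pass of a small feed-forward network (input -> hidden -> output).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)    # input dim 4 -> hidden dim 8
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)    # hidden dim 8 -> output dim 3

def forward(x):
    h = np.tanh(x @ W1 + b1)      # hidden layer: affine map + non-linearity
    return h @ W2 + b2            # output layer: raw scores (logits)

print(forward(rng.normal(size=4)))   # three output scores
```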

Training algorithm

•  While not done yet:
–  pick a random training case (x, y)
–  run the neural network on input x
–  modify the connections to make the prediction closer to y, following the gradient of the error w.r.t. the connections

9
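The loop above, written out as a hedged Python sketch; a linear model with squared error stands in for the neural network so the gradient can be written by hand (the data, learning rate, and step count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(100, 2))
Y = X @ true_w + 0.01 * rng.normal(size=100)

w, lr = np.zeros(2), 0.05
for step in range(2000):                  # "while not done yet"
    i = rng.integers(len(X))              # pick a random training case (x, y)
    x, y = X[i], Y[i]
    pred = x @ w                          # run the model on input x
    grad = (pred - y) * x                 # gradient of 0.5 * (pred - y)^2 w.r.t. w
    w -= lr * grad                        # move the prediction closer to y
print(w)                                  # approaches [2, -3]
```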

Parameter learning: back propagation of error

•  Calculate the total error at the top
•  Calculate the contributions to the error at each step, going backwards

10
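A worked sketch of the two bullets above for a tiny two-layer network (the sizes and data are arbitrary assumptions): the total error is computed at the top, then each layer's contribution is obtained with the chain rule going backwards:

```python
import numpy as np

# Hand-written backprop for a tiny 2-layer network with squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
target = np.array([1.0, 0.0])
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))

# Forward pass
h_pre = x @ W1
h = np.tanh(h_pre)
y = h @ W2
loss = 0.5 * np.sum((y - target) ** 2)     # total error "at the top"

# Backward pass: propagate error contributions layer by layer
d_y = y - target                           # dL/dy
d_W2 = np.outer(h, d_y)                    # dL/dW2
d_h = W2 @ d_y                             # dL/dh (error sent to the layer below)
d_hpre = d_h * (1 - np.tanh(h_pre) ** 2)   # back through the tanh
d_W1 = np.outer(x, d_hpre)                 # dL/dW1
print(loss, d_W1.shape, d_W2.shape)
```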

Stochastic gradient descent (SGD)

11
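For reference, the standard mini-batch SGD update in textbook notation (not taken from the slide), where η is the learning rate and B_t is a mini-batch of size m:

```latex
w_{t+1} \;=\; w_t \;-\; \eta \, \frac{1}{m} \sum_{(x_i,\, y_i) \in B_t} \nabla_w \, \ell(w_t;\, x_i, y_i)
```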

12

Fact

Anything humans can do in 0.1 sec, the right big 10-layer network can do too

13

DEEP LEARNING DRIVEN FACTORS

14

Big Data

Source: Eric P. Xing

15

Computing resources

16

"Modern" neural networks

•  Deeper but faster-training models
–  Deep belief networks
–  ConvNets
–  RNNs (LSTM, GRU)

17

SCALING DISTRIBUTED DEEP LEARNING

18

Growing Model Complexity

Source: Eric P. Xing

19

Objective: minimizing time to results

•  Experiment turnaround time
•  Make it fast rather than optimizing resource usage

20

Objective: improving results

•  Fact: increasing training examples, model parameters, or both, can drastically improve ultimate classification accuracy
–  D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
–  R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.

21

Scaling deep learning

•  Leverage GPU
•  Exploit many kinds of parallelism
–  Model parallelism
–  Data parallelism

22

Why scaling out

•  We can use a cluster of machines to train a modestly sized speech model to the same classification accuracy in less than 1/10th the time required on a GPU

23

Model parallelism

•  Parallelism in DistBelief

24

Model parallelism [cont'd]

•  Message passing during the upward and downward phases
•  Distributed computation
•  Performance gains are limited by communication costs

25
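A single-process sketch of the idea (my own illustration; in DistBelief the boundary values are real messages between machines, here the "send" is just a return value): the model is split into a lower and an upper partition, activations flow up, gradients flow down:

```python
import numpy as np

rng = np.random.default_rng(0)

class LowerWorker:                          # holds the first layer (8 -> 16)
    def __init__(self):
        self.W = 0.1 * rng.normal(size=(8, 16))
    def upward(self, x):                    # forward: compute and send activations up
        self.x, self.h = x, np.tanh(x @ self.W)
        return self.h
    def downward(self, d_h):                # backward: receive the boundary gradient
        d_pre = d_h * (1 - self.h ** 2)
        self.dW = np.outer(self.x, d_pre)

class UpperWorker:                          # holds the second layer (16 -> 4)
    def __init__(self):
        self.W = 0.1 * rng.normal(size=(16, 4))
    def upward(self, h):
        self.h = h
        return h @ self.W
    def downward(self, d_y):                # keep own gradient, send dL/dh downward
        self.dW = np.outer(self.h, d_y)
        return self.W @ d_y

lower, upper = LowerWorker(), UpperWorker()
y = upper.upward(lower.upward(rng.normal(size=8)))   # upward phase
d_h = upper.downward(y - np.zeros(4))                # downward phase (squared error, target 0)
lower.downward(d_h)
print(lower.dW.shape, upper.dW.shape)                # (8, 16) (16, 4)
```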

26

Source: Jeff Dean

Data parallelism: Downpour SGD

•  Divide the training data into a number of subsets
•  Run a copy of the model on each of these subsets
•  Before processing each mini-batch, a model replica:
–  asks the parameter server for up-to-date parameters
–  processes the mini-batch
–  sends back the gradients
•  To reduce communication overhead:
–  request parameters from the parameter servers only every n_fetch steps
–  push updates only every n_push steps
•  A model replica is therefore almost certainly working on a slightly out-of-date set of parameters

27
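A single-process sketch of the Downpour protocol described above (illustrative only, not the DistBelief code); n_fetch and n_push come from the slide, while the toy linear model, learning rate, and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w, self.lr = np.zeros(dim), lr
    def pull(self):
        return self.w.copy()                # replicas get a (soon stale) copy
    def push(self, grad):
        self.w -= self.lr * grad            # apply a replica's accumulated gradient

def run_replica(ps, X, Y, steps=200, n_fetch=5, n_push=5):
    w = ps.pull()                           # local, possibly out-of-date, parameters
    acc = np.zeros_like(w)
    for step in range(1, steps + 1):
        i = rng.integers(len(X))
        acc += (X[i] @ w - Y[i]) * X[i]     # squared-error gradient on one example
        if step % n_push == 0:
            ps.push(acc); acc[:] = 0        # send accumulated gradients back
        if step % n_fetch == 0:
            w = ps.pull()                   # refresh the stale local copy

true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3)); Y = X @ true_w
ps = ParameterServer(dim=3)
for _ in range(4):                          # 4 replicas (run sequentially in this sketch)
    run_replica(ps, X, Y)
print(np.round(ps.w, 2))                    # approaches true_w
```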

Sandblaster

•  Coordinator assigns each of the N model replicas a small portion of work, much smaller than 1/Nth of the total size of a batch
•  Assigns replicas new portions whenever they are free
•  Schedules multiple copies of the outstanding portions and uses the result from whichever model replica finishes first

28
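A toy discrete-event simulation of the coordinator policy described above (entirely illustrative: the portion counts, timings, and straggler-duplication heuristic are assumptions made for the sketch):

```python
import heapq, random

random.seed(0)
N_REPLICAS, N_PORTIONS = 4, 32
speed = {r: (3.0 if r == 0 else 1.0) for r in range(N_REPLICAS)}  # replica 0 is slow

fresh = list(range(N_PORTIONS))             # portions not yet handed out
outstanding, done = set(), set()
events, now = [], 0.0                       # min-heap of (finish_time, replica, portion)

def give_work(replica, time):
    if fresh:
        portion = fresh.pop()
    elif outstanding - done:                # no fresh work left: duplicate a straggler
        portion = random.choice(sorted(outstanding - done))
    else:
        return                              # nothing left to do; replica idles
    outstanding.add(portion)
    heapq.heappush(events, (time + speed[replica] * random.uniform(1, 2), replica, portion))

for r in range(N_REPLICAS):                 # hand out the first portions
    give_work(r, now)
while len(done) < N_PORTIONS:
    now, replica, portion = heapq.heappop(events)   # next copy to finish
    done.add(portion)                       # first finished copy wins; late duplicates are ignored
    give_work(replica, now)                 # the freed replica asks for more work

print(f"{N_PORTIONS} portions processed by {N_REPLICAS} replicas, finished at t = {now:.1f}")
```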

AllReduce – Baidu DeepImage 2015

•  Each worker computes gradients and maintains a subset of parameters

•  Every node fetches up-to-date parameters from all other nodes

•  Optimization: butterfly synchronization
–  requires log(N) steps
–  the last step performs a broadcast

29

Butterfly barrier

30
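A simulated butterfly (recursive-doubling) allreduce, to make the log(N) exchange pattern concrete. Note that this variant leaves the full sum on every node directly, whereas the description above finishes with a broadcast step; it is a sketch of the pattern, not Baidu's implementation:

```python
import numpy as np

# At step j, node i exchanges its partial sum with node i XOR 2^j and adds.
# After log2(N) rounds, every node holds the sum of all N gradients.
N, DIM = 8, 4
rng = np.random.default_rng(0)
grads = [rng.normal(size=DIM) for _ in range(N)]      # one gradient per node
partial = [g.copy() for g in grads]

step = 1
while step < N:                                       # log2(N) exchange rounds
    new = [None] * N
    for i in range(N):
        partner = i ^ step                            # rank differing in one bit
        new[i] = partial[i] + partial[partner]        # exchange-and-add
    partial, step = new, step * 2

expected = np.sum(grads, axis=0)
assert all(np.allclose(p, expected) for p in partial)
print("every node holds the full sum after", N.bit_length() - 1, "steps")
```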

Distributed Hogwild

•  Used by Caffe
•  Each node maintains a local replica of all parameters
•  In each iteration, a node computes gradients and applies updates locally
•  Updates are exchanged periodically

31
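A sketch of the pattern described above; periodic averaging of the local replicas stands in for the "exchange updates" step (an assumption made for illustration, not Caffe's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.5, -0.5, 2.0])
X = rng.normal(size=(600, 3)); Y = X @ true_w

N_NODES, EXCHANGE_EVERY, LR = 3, 20, 0.05
shards = np.array_split(np.arange(len(X)), N_NODES)   # each node's data shard
local = [np.zeros(3) for _ in range(N_NODES)]         # full local replica per node

for step in range(1, 301):
    for n in range(N_NODES):                          # each node works independently
        i = rng.choice(shards[n])
        grad = (X[i] @ local[n] - Y[i]) * X[i]
        local[n] -= LR * grad                         # purely local update
    if step % EXCHANGE_EVERY == 0:                    # periodic exchange of updates
        avg = np.mean(local, axis=0)
        local = [avg.copy() for _ in range(N_NODES)]

print(np.round(np.mean(local, axis=0), 2))            # close to true_w
```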

DISTRIBUTED DEEP LEARNING FRAMEWORK

32

Parameter server [OSDI 2014]

33

Apache Singa [2015]

•  National University of Singapore

34

Petuum CMU [ACML 2015]

35

Stale Synchronous Parallel (SSP)

36
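Since the slide shows only the figure, here is a minimal sketch of the SSP rule itself: a worker may run ahead of the slowest worker by at most a bounded number of clocks, so reads are stale but only boundedly so (single-process illustration; the locking/RPC machinery of Petuum is omitted):

```python
class SSPClock:
    def __init__(self, n_workers, staleness):
        self.clock = [0] * n_workers
        self.staleness = staleness

    def can_advance(self, worker):
        """Worker may start its next clock only if it is within `staleness` of the slowest."""
        return self.clock[worker] - min(self.clock) < self.staleness

    def tick(self, worker):
        self.clock[worker] += 1

ssp = SSPClock(n_workers=3, staleness=2)
order = [0, 0, 0, 1, 0, 2, 0]                 # worker 0 tries to run far ahead
for w in order:
    if ssp.can_advance(w):
        ssp.tick(w)                           # simulate doing one clock of work
    else:
        print(f"worker {w} blocks at clock {ssp.clock[w]} (waiting for stragglers)")
print(ssp.clock)
```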

Structure-Aware Parallelization (Strads engine)

37

38

TensorFlow

•  Data flow graph
•  The distributed version has just been released (based on gRPC)

39
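To illustrate the data-flow-graph execution model mentioned above (this is not the TensorFlow API, just the idea): nodes are ops, edges carry tensors, and any node can run as soon as its inputs are ready, which is what makes the graph easy to partition across devices and machines:

```python
import numpy as np

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

def evaluate(node, cache=None):
    cache = {} if cache is None else cache
    if node not in cache:                              # memoized post-order walk
        args = [evaluate(i, cache) for i in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]

# Graph for y = relu(x @ W + b)
x = Node(lambda: np.array([[1.0, -2.0]]))
W = Node(lambda: np.array([[0.5, 1.0], [2.0, -1.0]]))
b = Node(lambda: np.array([0.1, 0.2]))
matmul = Node(lambda a, c: a @ c, x, W)
add = Node(lambda a, c: a + c, matmul, b)
relu = Node(lambda a: np.maximum(a, 0.0), add)

print(evaluate(relu))                                  # -> [[0.  3.2]]
```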

Deep learning on Spark

•  Deeplearning4j
•  Adatao/Arimo scaling TensorFlow on Spark
•  Yahoo lab released CaffeOnSpark
•  Data parallelism

40

DEMO APPLICATIONS

41

Vietnamese OCR

•  Recognizes whole text lines rather than individual words or characters
•  Very good results with just a ~20 MB model and ~30 pages

42

Vietnamese predictive text model

•  ~20 MB plain text corpus
•  "Chú hoài linh đẹp trai. Chú hoài linh" (Uncle Hoai Linh is handsome. Uncle Hoai Linh)
•  "Chào buổi sáng" (Good morning)
•  "chị hát hay wa!! nghe thick a." (sis, you sing so well!! love listening to it)
•  "chị khởi my ơi e rất la hâm mộ" (sis Khoi My, I am such a big fan)
•  "làm gì bây giờ khi" (what to do now when)
•  "chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá" (uncle Hoai Linh is really handsome and uncle Tran Thanh is so handsome)
•  "chú hoài linh thật đẹp zai và chú Phánh" (uncle Hoai Linh is really handsome and uncle Phanh)

43

Vietnamese predictive text model [cont'd]

•  ~14 MB plain text corpus
•  "lịch sử ghi nhớ năm 1979" (history remembers the year 1979)
•  "tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt" (at the conference, comrade Pham Ngoc Thuy Vo Van Kiet)
•  "tại hội nghị, đồng chí Hồ Chí Minh nói" (at the conference, comrade Ho Chi Minh said)
•  "tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí Minh đã ngồi ở" (at the conference, comrade Vo Nguyen Giap and comrade Ho Chi Minh sat at)
•  "tại đại hội Đảng lần thứ nhất vào năm 1945," (at the first Party congress in 1945,)
•  "Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư Nguyễn Văn Linh" (Right from the early days, just as Professor Nguyen Van Linh remarked)

44

CONCLUSION

45

Principles of ML System Design

•  ACML 2015. How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning
–  How to distribute?
–  How to bridge computation and communication?
–  How to communicate?
–  What to communicate?

46

Thank you!

47

How to Go Really Big in AI: Strategies & Principles for Distributed Machine Learning

48
