TRANSCRIPT
Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
HopsFS: Next-generation HDFS
37x the number of files
16x the throughput
Scale Challenge Winner (2017)
* https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
** https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
Hops platform
Projects, Datasets, Users
HopsFS, HopsYARN, MySQL NDB Cluster
Spark, TensorFlow, Hive, Kafka, Flink
Jupyter, Zeppelin
Jobs, Grafana, ELK
REST API
Version 0.3.0 just released!
Python first
Conda Repo
Project Conda env
Search
Install/Remove
Python-3.6, pandas-1.4, NumPy-0.9
Environment usable by Spark/TensorFlow
Hops Python library: makes development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle
Find big datasets - Dela*
● Discover, share, and experiment with interesting datasets
● P2P network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth
*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
Scale-out level 1: Parallel hyperparameter search
Parallel hyperparameter search
from hops import util, tflauncher

def model(lr, dropout):
    …

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001], 'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)
tflauncher.launch(spark, model, args_dict_grid)
Starts 6 parallel experiments
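To make the six experiments concrete, here is a hypothetical re-implementation of what a grid expansion such as `util.grid_params` presumably does (the helper name `expand_grid` and the exact output shape are assumptions, not the Hops API): the cartesian product of three learning rates and two dropout values yields six argument dictionaries, one per parallel experiment.

```python
# Hypothetical sketch of grid expansion: the cartesian product of the
# parameter lists gives one dict per experiment (3 x 2 = 6 here).
from itertools import product

def expand_grid(args_dict):
    """Expand {param: [values]} into a list of single-valued dicts."""
    keys = sorted(args_dict)
    return [dict(zip(keys, combo))
            for combo in product(*(args_dict[k] for k in keys))]

grid = expand_grid({'learning_rate': [0.001, 0.0005, 0.0001],
                    'dropout': [0.45, 0.7]})
print(len(grid))  # 6 experiments to launch in parallel
```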
Scale-out level 2: Distributed training
TensorFlowOnSpark (TFoS) by Yahoo!
● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/workers executed inside Spark executors
● Uses Spark for resource allocation
  – Our version: exclusive GPU allocations
  – Parameter server(s) do not get GPU(s)
● Manages TensorBoard
Run TFoS
from tensorflowonspark import TFCluster, TFNode

def training_fun(argv, ctx):
    …..
    TFNode.start_cluster_server()
    …..

TFCluster.run(spark, training_fun, num_exec, num_ps, …)
Full conversion guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
Scale-out level: Master of the dark arts
Horovod
The parameter server (PS) architecture doesn't scale
From: https://github.com/uber/horovod
Horovod by Uber
● Based on previous work done by Baidu
● Organize workers in a ring
● Gradient updates distributed using all-reduce
● Synchronous protocol
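Some back-of-the-envelope arithmetic (mine, not from the talk) shows why the ring helps: each ring all-reduce worker sends roughly 2(N-1)/N of the gradient size no matter how many workers there are, while a lone parameter server must move a full gradient to and from every worker over one network link.

```python
# Illustrative traffic estimates, assuming N workers and a gradient of
# D bytes. These formulas are the standard ring all-reduce analysis,
# not numbers quoted in the talk.

def ring_bytes_per_worker(n_workers, grad_bytes):
    # (N-1) chunk sends of size D/N in reduce-scatter, plus (N-1) more
    # in all-gather: 2 * (N-1)/N * D, bounded by 2*D for any N.
    return 2 * (n_workers - 1) * grad_bytes / n_workers

def ps_bytes_at_server(n_workers, grad_bytes):
    # A single PS receives a gradient from, and returns a model to,
    # each of the N workers: 2 * N * D through one link.
    return 2 * n_workers * grad_bytes

D = 100 * 2**20  # a hypothetical 100 MB model
print(ring_bytes_per_worker(100, D) / 2**20)  # stays under 2x the model size
print(ps_bytes_at_server(100, D) / 2**20)     # grows linearly with workers
```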
All-Reduce: initial state (each GPU holds its own gradient chunks)
GPU1: a0 | b0 | c0
GPU2: a1 | b1 | c1
GPU3: a2 | b2 | c2
All-Reduce: reduce-scatter, step 1 (each GPU passes one chunk to its neighbor, which adds it in)
GPU1: a0 | b0 | c0 + c2
GPU2: a0 + a1 | b1 | c1
GPU3: a2 | b1 + b2 | c2
All-Reduce: reduce-scatter, step 2 (each GPU now owns one fully reduced chunk)
GPU1: a0 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b1 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c2
All-Reduce: all-gather, step 1 (fully reduced chunks circulate around the ring)
GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2 | c0 + c1 + c2
All-Reduce: all-gather, step 2 (every GPU now holds the full sum)
GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU2: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
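The walkthrough above can be simulated in plain Python. This is a framework-free sketch of ring all-reduce, not Horovod's implementation: after n-1 reduce-scatter steps and n-1 all-gather steps, every worker holds the element-wise sum, and each worker only ever talks to its ring neighbor.

```python
# Minimal synchronous ring all-reduce simulation mirroring the
# three-GPU slides: reduce-scatter, then all-gather.

def ring_allreduce(workers):
    """workers: n equal-length lists of numbers; returns n reduced copies."""
    n = len(workers)
    chunks = [list(w) for w in workers]  # chunk j of worker i = chunks[i][j]

    # Reduce-scatter: in step s, worker i sends chunk (i - s) mod n to
    # worker i+1, which adds it into its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n])
                 for i in range(n)]          # read everything first...
        for dst, c, val in sends:            # ...then apply (synchronous step)
            chunks[dst][c] += val

    # All-gather: worker i now owns fully reduced chunk (i+1) mod n;
    # circulate the finished chunks so every worker has all of them.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n])
                 for i in range(n)]
        for dst, c, val in sends:
            chunks[dst][c] = val
    return chunks

sums = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
# every worker ends with the element-wise sum [111, 222, 333]
```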
Hops AllReduce
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
Demo time!
Play with it → hops.io/?q=content/hopsworks-vagrant
Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop