Distributed Deep Learning on Hadoop Clusters Andy Feng & Jun Shi Yahoo! Inc.

Upload: hadoop-summit

Post on 06-Jan-2017

TRANSCRIPT

Page 1: Distributed Deep Learning on Hadoop Clusters

Distributed Deep Learning on Hadoop Clusters

Andy Feng & Jun Shi
Yahoo! Inc.

Page 2: Distributed Deep Learning on Hadoop Clusters

Our Talks @ Hadoop Summit

Storm on YARN (2013)
› http://bit.ly/1W02tZy

Spark on YARN (2014)
› http://bit.ly/1W03dxE

Machine Learning on Hadoop/Spark (2015)
› http://bit.ly/1NW3GvO

Page 3: Distributed Deep Learning on Hadoop Clusters

Agenda

• Why Deep Learning on Hadoop?

• CaffeOnSpark
  – Architecture
  – API: Scala + Python

• Demo
  – CaffeOnSpark + Python Notebook

Page 4: Distributed Deep Learning on Hadoop Clusters

Deep Learning

Page 5: Distributed Deep Learning on Hadoop Clusters

Use Case: Flickr Magic View flickr.com/cameraroll

Page 6: Distributed Deep Learning on Hadoop Clusters

Yahoo Use Case: Yahoo Weather

Beauty
› Computationally assessed

Relevant
› Location
› Time
› Cloudy
› Shower
› …

[Screenshots: Weather App, Yahoo Weather App]

Page 7: Distributed Deep Learning on Hadoop Clusters

Yahoo Vision Kit: Demo

Page 8: Distributed Deep Learning on Hadoop Clusters

Flickr DL/ML Pipeline

(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale

* 10 billion photos
* 7.5 million per day
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015

Page 9: Distributed Deep Learning on Hadoop Clusters

Deep Learning vs. Hadoop

Page 10: Distributed Deep Learning on Hadoop Clusters

Machine Learning & Deep Learning on Hadoop

Page 11: Distributed Deep Learning on Hadoop Clusters

Hadoop Cluster Enhanced

GPU servers added
› 4 Tesla K80 cards
  • 2 GK210 GPUs, 24GB memory

Network interface enhanced
› InfiniBand for direct access to GPU memory
› Ethernet for external communication
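
As a quick check on what each GPU server provides, the card counts above can be multiplied out. This is a small arithmetic sketch; it assumes the "2 GK210 GPUs, 24GB memory" figure is per K80 card, and the constant names are made up for illustration.

```python
# Per-server GPU capacity, derived from the figures on this slide.
# Assumption: 2 GPUs and 24 GB of memory are per Tesla K80 card.
CARDS_PER_SERVER = 4
GPUS_PER_CARD = 2
MEMORY_PER_CARD_GB = 24

gpus_per_server = CARDS_PER_SERVER * GPUS_PER_CARD
memory_per_gpu_gb = MEMORY_PER_CARD_GB // GPUS_PER_CARD
memory_per_server_gb = CARDS_PER_SERVER * MEMORY_PER_CARD_GB

print(f"GPUs per server:       {gpus_per_server}")
print(f"Memory per GPU (GB):   {memory_per_gpu_gb}")
print(f"GPU memory/server (GB): {memory_per_server_gb}")
```

Under that assumption, a single server exposes 8 GPUs with 12 GB each, which bounds how many devices one process on that server can be given.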

Page 12: Distributed Deep Learning on Hadoop Clusters

Deep Learning Frameworks

Caffe
› Available since Sept. 2013, 6.3k forks
› Popular in vision community & Yahoo

TensorFlow
› Released in Nov. 2015, 9.8k forks

Theano, Torch, DL4J, etc.

Page 13: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark Open Sourced

github.com/yahoo/CaffeOnSpark

Released in Feb. 2016
• Apache 2.0 license
• Distributed deep learning
  – GPU or CPU
  – Ethernet or InfiniBand
• Easily deployed on public cloud or private cloud

Page 14: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: Scalable Architecture

Page 15: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: 19x Speedup (est.)

[Chart: Top-5 Validation Error vs. Training latency (hours)]

Page 16: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: Deployment Options

• Single node
  – spark-submit --master local

• Multiple nodes
  – spark-submit --master URL -connection ethernet
    (e.g., EC2)
  – spark-submit --master URL -connection infiniband
    (e.g., Yahoo Hadoop cluster)

Page 17: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: DL Made Easy

Spark CLI

• spark-submit --num-executors #_Processes
    --class com.yahoo.ml.CaffeOnSpark caffe-on-spark.jar
    -devices #_gpus_per_proc
    -conf solver_config_file
    -model model_file
    -train | -test | -feature

Caffe Configuration

layer {
  name: "data"
  type: "MemoryData"
  source_class: "com.yahoo.ml.caffe.LMDB"
  memory_data_param {
    source: "hdfs:///mnist/trainingdata/"
    batch_size: 64
    channels: 1
    height: 28
    width: 28
  }
  …
}
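
The Spark CLI template on this slide can also be assembled programmatically, which makes the placeholders explicit. The following is a minimal sketch, not part of CaffeOnSpark: `caffe_on_spark_command` is a hypothetical helper, the file names in the usage line are made up, and writing `-connection` with a single dash (like the other CaffeOnSpark options) is an assumption. The class and jar names are copied from the slide.

```python
import shlex

def caffe_on_spark_command(master, num_executors, gpus_per_proc,
                           solver_config, model_file,
                           mode="train", connection=None):
    """Build a spark-submit argument list following the slide's template."""
    if mode not in ("train", "test", "feature"):
        raise ValueError(f"unknown mode: {mode}")
    argv = [
        "spark-submit",
        "--master", master,
        "--num-executors", str(num_executors),   # #_Processes
        "--class", "com.yahoo.ml.CaffeOnSpark",  # as written on the slide
        "caffe-on-spark.jar",
        "-devices", str(gpus_per_proc),          # #_gpus_per_proc
        "-conf", solver_config,                  # solver_config_file
        "-model", model_file,                    # model_file
        f"-{mode}",                              # -train | -test | -feature
    ]
    if connection is not None:
        if connection not in ("ethernet", "infiniband"):
            raise ValueError(f"unknown connection: {connection}")
        # Assumption: single-dash flag, appended after the mode flag.
        argv += ["-connection", connection]
    return argv

# Single-node run with hypothetical file names.
cmd = caffe_on_spark_command("local", 1, 1, "solver.prototxt", "mnist.model")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.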

Page 18: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: One Program (Scala) http://bit.ly/21ZY1c2

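// Package name assumed from the LMDB class path in the Caffe configuration
import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}
import org.apache.spark.ml.classification.LogisticRegression

val cos = new CaffeOnSpark(ctx)  // ctx: an existing SparkContext
val conf = new Config(ctx, args).init()

// Deep Learning
// (1) training DL model
val dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)
// (2) extract features via DL
val lr_raw_source = DataSource.getSource(conf, false)
val ext_df = cos.features(lr_raw_source)

// Non-deep Learning
// (3) apply ML
val lr_input_df = ext_df
  .withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
val lr_model = lr.fit(lr_input_df)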

Page 19: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: One Notebook (Python) http://bit.ly/1REZ0cN

Page 20: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: UI & Logs

Page 21: Distributed Deep Learning on Hadoop Clusters

Demo: CaffeOnSpark on EC2

https://github.com/yahoo/CaffeOnSpark/wiki
› Get started on EC2
› Python for CaffeOnSpark

Page 22: Distributed Deep Learning on Hadoop Clusters

CaffeOnSpark: What’s Next?

• Validation within training
• Enhanced data layer
• RNN and LSTM
• Java API
• Asynchronous distributed training

Page 23: Distributed Deep Learning on Hadoop Clusters

Related Work: SparkNet & DL4J

REPEAT
1) [driver] sc.broadcast(model) to executors
2) [executor] apply DL training against a mini-batch of dataset to update models locally
3) [driver] aggregate(models) to produce a new model

[Diagram: Driver]
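
The broadcast, local-train, aggregate loop on this slide amounts to synchronous model averaging. A minimal sketch in plain Python follows. It assumes a toy one-dimensional linear model, noiseless synthetic data, and a list comprehension standing in for Spark executors. `local_sgd`, `average`, `train`, and `make_partitions` are hypothetical helpers for illustration, not SparkNet or DL4J APIs.

```python
import random

def local_sgd(model, batch, lr=0.05):
    """[executor] Update a copy of the broadcast model on one mini-batch."""
    w, b = model
    for x, y in batch:
        err = (w * x + b) - y      # prediction error for y ~ w*x + b
        w -= lr * err * x
        b -= lr * err
    return (w, b)

def average(models):
    """[driver] Aggregate executor models by element-wise averaging."""
    n = len(models)
    return (sum(m[0] for m in models) / n,
            sum(m[1] for m in models) / n)

def train(partitions, rounds=200, batch_size=8, seed=0):
    rng = random.Random(seed)
    model = (0.0, 0.0)
    for _ in range(rounds):                           # REPEAT
        # 1) broadcast: every executor starts from the same model
        # 2) each executor trains on a mini-batch of its own partition
        local_models = [local_sgd(model, rng.sample(part, batch_size))
                        for part in partitions]
        # 3) the driver aggregates the local models into a new model
        model = average(local_models)
    return model

def make_partitions(num_executors=4, rows=50, seed=1):
    """Synthetic data for y = 3x + 1, split across executors."""
    rng = random.Random(seed)
    return [[(x, 3.0 * x + 1.0)
             for x in (rng.uniform(-1, 1) for _ in range(rows))]
            for _ in range(num_executors)]

w, b = train(make_partitions())
print(f"w={w:.4f}, b={b:.4f}")  # should approach w=3, b=1
```

In this scheme every round passes the full model through the driver twice (broadcast and aggregate), so the driver's bandwidth is shared by all executors in each iteration.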

Page 24: Distributed Deep Learning on Hadoop Clusters

Summary

Yahoo Hadoop clusters enhanced for deep learning
› GPU nodes + CPU nodes
› InfiniBand network for fast communication

CaffeOnSpark open sourced
› Empower Flickr and other Yahoo services
  • In production since Q3 2015
  • Reduced training latency, and improved accuracy
› Scalable deep learning made easy
  • spark-submit on your Spark cluster

Page 25: Distributed Deep Learning on Hadoop Clusters

Thank You!

Repo: github.com/yahoo/CaffeOnSpark
Email: [email protected]