Spark Based Distributed Deep Learning Framework for Big Data Applications

Thesis Topic: Spark based Distributed Deep Learning Framework for Big Data Applications SMCC Lab Social Media Cloud Computing Research Center Prof Lee Han-Ku

Upload: humoyun-ahmedov

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Spark Based Distributed Deep Learning Framework For Big Data Applications

Thesis Topic: Spark based Distributed Deep Learning Framework for Big Data Applications

SMCC Lab: Social Media Cloud Computing Research Center

Prof Lee Han-Ku

Page 2: Spark Based Distributed Deep Learning Framework For Big Data Applications

Outline

I Motivation
II Introduction
III Challenges in Distributed Computing
IV Apache Spark
V Deep Learning in Big Data
VI Proposed System
VII Experiments and Results
Conclusion

Page 3: Spark Based Distributed Deep Learning Framework For Big Data Applications

Motivation

Page 4: Spark Based Distributed Deep Learning Framework For Big Data Applications

Problem

Page 5: Spark Based Distributed Deep Learning Framework For Big Data Applications

Solution: Cluster

Data Parallelism (partitioning data)

Page 6: Spark Based Distributed Deep Learning Framework For Big Data Applications

Wait a minute!?

D << N

D (dimension/number of features) = 1,300

N (size of training data) = 5,000,000

Page 7: Spark Based Distributed Deep Learning Framework For Big Data Applications

What if the feature size is almost as large as the dataset?

D ~ N

D = 1,134,000

N = 5,000,000

Page 8: Spark Based Distributed Deep Learning Framework For Big Data Applications

Further solution

Model Parallelism

CPU 1 CPU 2 CPU 3 CPU 4
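As a rough illustration of the two partitioning strategies on the slides above, the sketch below splits a toy dataset by rows (data parallelism) and a toy weight matrix by columns (model parallelism). The helper names `split_rows` and `split_columns`, the toy sizes, and the four-worker count are assumptions made for this example; they are not taken from the proposed system.

```python
from typing import List

Matrix = List[List[float]]

def split_rows(data: Matrix, workers: int) -> List[Matrix]:
    """Data parallelism: each worker gets a contiguous slice of the N examples."""
    size = (len(data) + workers - 1) // workers  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def split_columns(weights: Matrix, workers: int) -> List[Matrix]:
    """Model parallelism: each worker holds a slice of the D feature columns."""
    dims = len(weights[0])
    size = (dims + workers - 1) // workers
    return [[row[i:i + size] for row in weights] for i in range(0, dims, size)]

# Toy sizes standing in for the large N and D on the slides (hypothetical data).
examples = [[float(n + d) for d in range(8)] for n in range(12)]  # N = 12, D = 8
weights = [[0.1 * d for d in range(8)] for _ in range(3)]         # 3 units, D = 8

data_shards = split_rows(examples, workers=4)      # 4 shards of 3 examples
model_shards = split_columns(weights, workers=4)   # 4 shards of 2 columns
print(len(data_shards), len(data_shards[0]))
print(len(model_shards), len(model_shards[0][0]))
```

When D is much smaller than N, splitting rows is usually enough; when D approaches N, the parameters themselves become large enough that splitting columns is also worth considering.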

Page 9: Spark Based Distributed Deep Learning Framework For Big Data Applications

Introduction

Applications of Deep Learning

Computer Vision: Face Recognition
Finance: Fraud Detection …
Medicine: Medical Diagnosis …
Data Mining: Prediction, Classification …
Industry: Process Control …
Operational Analysis: Cash Flow Forecasting …
Sales and Marketing: Sales Forecasting …
Science: Pattern Recognition …

Page 10: Spark Based Distributed Deep Learning Framework For Big Data Applications

Some Examples

Figure: Input Layer → Hidden Layers → Output Layer (mapping). Labels: Mountain, River, City, Sun, Blue Cloud.

Page 11: Spark Based Distributed Deep Learning Framework For Big Data Applications

Some Examples

Figure: Input Layer → Hidden Layers → Output Layer (mapping). Result: The Face Successfully Recognized.

Page 12: Spark Based Distributed Deep Learning Framework For Big Data Applications

Some Examples

Figure: Input Layer → Hidden Layers → Output Layer (mapping). Words: love, Romeo, kiss, hugs, …, Happy End. Categories: Romance, Detective, Historical, Scientific, Technical.

Page 13: Spark Based Distributed Deep Learning Framework For Big Data Applications

How does it work?

https://theclevermachine.wordpress.com/tag/backpropagation/

Page 14: Spark Based Distributed Deep Learning Framework For Big Data Applications

Challenges

Distributed Computing Complexities

Heterogeneity

Openness

Security

Scalability

Fault Handling

Concurrency

Transparency

Page 15: Spark Based Distributed Deep Learning Framework For Big Data Applications

Apache Spark

Most Machine Learning algorithms are inherently iterative, because each iteration can improve the results.

With a disk-based approach, each iteration's output is written to disk, which makes reading it back slow.

In Spark, the output can be cached in memory (a distributed cache), which makes reading it back much faster.

Figures: Hadoop execution flow; Spark execution flow

Page 16: Spark Based Distributed Deep Learning Framework For Big Data Applications

Apache Spark

Started at UC Berkeley in 2009
Fast, general-purpose cluster computing system
Up to 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce
Popular for running iterative Machine Learning algorithms
Provides high-level APIs in Java, Scala, Python, and R
Combines SQL, streaming, and complex analytics
Runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3

Page 17: Spark Based Distributed Deep Learning Framework For Big Data Applications

Apache Spark: Spark Stack

Spark SQL: SQL and structured data processing
Spark Streaming: stream processing of live data streams
MLlib: Machine Learning algorithms
GraphX: graph processing

Page 18: Spark Based Distributed Deep Learning Framework For Big Data Applications

"Deep learning" is the new big trend in Machine Learning. It promises general, powerful, and fast machine learning, moving us one step closer to AI.

An algorithm is deep if the input is passed through several non-linear functions before being output. Many widely used learning algorithms (including decision trees, SVMs, and naive Bayes) are "shallow".
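To make the "several non-linear functions" idea concrete, here is a minimal sketch of a forward pass in which the input goes through three tanh layers in sequence. The weights, the layer sizes, and the names `layer` and `forward` are hypothetical and chosen only for illustration.

```python
import math
from typing import List

def layer(inputs: List[float], weights: List[List[float]]) -> List[float]:
    """One layer: a weighted sum per unit followed by a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs: List[float], layers: List[List[List[float]]]) -> List[float]:
    """Pass the input through every layer in turn; a deep model has several."""
    activations = inputs
    for weights in layers:
        activations = layer(activations, weights)
    return activations

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 hidden units -> 1 output.
network = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    [[1.0, -1.0], [0.5, 0.5]],
    [[0.7, -0.3]],
]
output = forward([1.0, 0.0, -1.0], network)
print(output)
```

A "shallow" model in this picture would correspond to a `network` list with a single entry.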

Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

Deep Learning in Big Data

Page 19: Spark Based Distributed Deep Learning Framework For Big Data Applications

A key task associated with Big Data Analytics is information retrieval.

Instead of using raw input for data indexing, Deep Learning can be utilized to generate high-level abstract data representations which will be used for semantic indexing.

These representations can reveal complex associations and factors (especially if raw input is Big Data), leading to semantic knowledge and understanding, for example by making search engines work more quickly and efficiently.

Deep Learning aids in providing a semantic and relational understanding of the data.

Deep Learning in Big Data

Semantic Indexing

Page 20: Spark Based Distributed Deep Learning Framework For Big Data Applications

Because the learnt complex data representations contain semantic and relational information instead of just raw bit data, they can be used directly for semantic indexing when each data point is represented by a vector. This allows a vector-based comparison, which is more efficient than comparing instances based directly on raw data.

Data instances that have similar vector representations are likely to have similar semantic meaning.

Thus, using vector representations of complex high-level data abstractions for indexing the data makes semantic indexing feasible.
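The vector-based comparison described above can be sketched with cosine similarity over a small in-memory index. The document ids, the three-dimensional vectors, and the `search` helper are made up for this example; a real index would hold representations learnt by the network.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index: Dict[str, List[float]], query: List[float],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank indexed items by similarity to the query vector."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical learnt representations, one vector per data instance.
index = {
    "doc_banking": [0.9, 0.1, 0.0],
    "doc_finance": [0.8, 0.3, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
results = search(index, query=[1.0, 0.2, 0.0])
print([doc_id for doc_id, _ in results])
```

Each comparison costs time proportional to the vector length, regardless of how large the raw instance was.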

Deep Learning in Big Data

Page 21: Spark Based Distributed Deep Learning Framework For Big Data Applications

Traditional methods for representing word vectors

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … ]

[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … ]

[government debt problems turning into banking crisis as has happened]

[saying that Europe needs unified banking regulation to replace the old]

Motel Say Good Cat Main

Snake Award Business Cola Twitter

Google Save Money Florida Post

Great Success Today Amazon Hotel

…. …. …. …. ….

Represent a word by its context
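A short sketch of why the traditional one-hot representation carries no similarity information: any two different words have a dot product of zero, however related they are. The five-word vocabulary is borrowed from the word grid on this slide; the helper names are assumptions.

```python
from typing import List

def one_hot(word: str, vocabulary: List[str]) -> List[int]:
    """Vector of zeros with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(a: List[int], b: List[int]) -> int:
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["motel", "hotel", "good", "great", "cat"]
motel = one_hot("motel", vocabulary)
hotel = one_hot("hotel", vocabulary)

print(dot(motel, hotel))  # 0: related words look completely unrelated
print(dot(motel, motel))  # 1: a word only matches itself
```

Representing a word through the contexts it appears in is one way around this limitation.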

Page 22: Spark Based Distributed Deep Learning Framework For Big Data Applications

Deep Learning in Big Data

Word2Vec (distributed representation of words)

Training data: The cake was just good (trained tweet)
Test data: The cake was just great (new unseen tweet)

Page 23: Spark Based Distributed Deep Learning Framework For Big Data Applications

Deep Learning in Big Data

Word2Vec (distributed representation of words)

Words close to "Good" in vector space (similarity score):
Good (1.0)
Great (0.938401)
Awesome (0.8912334)
Well (0.8242320)
Fine (0.7943241)
Outstanding (0.71239)
Normal (0.640323)
… (…)

Training data: The cake was just good (trained tweet)
Test data: The cake was just great (new unseen tweet)

Page 24: Spark Based Distributed Deep Learning Framework For Big Data Applications

Proposed System

The proposed system should deal with:
Concurrency
Asynchrony
Distributed Computing
Parallelism (model parallelism and data parallelism)

Page 25: Spark Based Distributed Deep Learning Framework For Big Data Applications

Architecture

Figure components: Master, Spark Driver, Parameter Servers, Model Replicas, Data Shards, HDFS data nodes (labels 1 to 6).

Page 26: Spark Based Distributed Deep Learning Framework For Big Data Applications

Domain Entities

Master

Start

Done

JobDone

DataShard

ReadyToProcess

FetchParameters

ParameterShard

ParameterRequest

LatestParameters

NeuralNetworkLayer

DoneFetchingParameters

Gradient

ForwardPass

BackwardPass

ChildLayer
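As a loose sketch of how a few of the entities listed above might interact, the code below models three of the messages as Python dataclasses and gives a parameter shard a `receive` method. The field names, the `receive` method, the learning-rate update, and the assignment of messages to the parameter shard are all assumptions inferred from the names on the slide, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ParameterRequest:
    replica_id: int          # hypothetical field: who is asking

@dataclass
class LatestParameters:
    weights: List[float]     # hypothetical field: current shard contents

@dataclass
class Gradient:
    values: List[float]      # hypothetical field: gradient for this shard

class ParameterShard:
    """Owns one slice of the model parameters and reacts to messages."""

    def __init__(self, weights: List[float], learning_rate: float) -> None:
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def receive(self, message: object) -> Optional[LatestParameters]:
        if isinstance(message, ParameterRequest):
            # Reply with a copy so callers cannot mutate the shard's state.
            return LatestParameters(list(self.weights))
        if isinstance(message, Gradient):
            self.weights = [w - self.learning_rate * g
                            for w, g in zip(self.weights, message.values)]
            return None
        raise TypeError(f"unsupported message: {message!r}")

shard = ParameterShard([1.0, -2.0], learning_rate=0.5)
shard.receive(Gradient([2.0, -2.0]))
reply = shard.receive(ParameterRequest(replica_id=1))
print(reply.weights)  # [0.0, -1.0]
```

Keeping each message a small immutable-looking value makes it straightforward to move the same protocol onto an actor or RPC layer later.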

Page 27: Spark Based Distributed Deep Learning Framework For Big Data Applications

Proposed System: Class Hierarchy

Figure workers: Master, Data Shard Worker, Parameter Shard Worker, Deep Layer Worker.

Figure messages: Start, Ready To Process, Fetch Parameters, Parameter Request, Latest Parameters, Fetching Parameters, Forward Pass, Backward Pass, Child Layer, Gradient, Output, Job Done.

Page 28: Spark Based Distributed Deep Learning Framework For Big Data Applications

Data Shards (HDFS) and the Corresponding Model Replica

Input-to-hidden parameters:
X1: W11 W12 W13 W14 W15 W16 …
X2: W21 W22 … … W26 …
X3: W31 W32 … … W36 …
…

Hidden-to-output parameters:
h1: W11 W12 W13 W14 W15 W16 …
h2: W21 W22 … … W26 …
h3: W31 W32 … … W36 …
…

Page 29: Spark Based Distributed Deep Learning Framework For Big Data Applications

Parameter Server

Figure: a row of parameters W and a column of inputs X.

Page 30: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow: Initialization

Figure components: Client, Master, Data Shards (HDFS), Parameter Shards (HDFS), three rows of Node1 to Node4.

Steps shown: 1. Start; Initialize Parameters

Page 31: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow: Initialization

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Start; Initialize Parameters; Initialize Neural Network Layers; 2. Ready To Process

Page 32: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Start; 2. Ready To Process; 4. FetchParams; 5. Parameter Request

Page 33: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow

Figure components: Client, Master, Data Shards, Parameter Shards (Initial Parameters), three rows of Node1 to Node4.

Steps shown: 1. Start; 2. Ready To Process; 4. FetchParams; 5. Parameter Request; 6. Latest Parameters

Page 34: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Start; 2. Ready To Process; 5. Parameter Request; 6. Latest Parameters; 7. DoneFetchingParams

Page 35: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Start; 2. Ready To Process; 5. Parameter Request; 6. Latest Parameters; 7. DoneFetchingParams; 8. Forward (training data examples, one by one)

Page 36: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Start; 2. Ready To Process; 7. DoneFetchingParams; 8. Forward; 7. Backward; 9. Gradient; 10. Latest Parameters; 11. Output (Logging)

Page 37: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow: Training (Learning) Phase

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Forward; 2. Gradient; 3. Latest Parameters; 4. DoneFetchingParams; 5. Backward; 6. Output (Logging)

Page 38: Spark Based Distributed Deep Learning Framework For Big Data Applications

Workflow: Training is Done

Figure components: Client, Master, Data Shards, Parameter Shards, three rows of Node1 to Node4.

Steps shown: 1. Gradient; 2. Latest Parameters; 3. Backward; 4. Output (Logging); 5. DoneFetchingParams; 6. Done; 7. JobDone

Page 39: Spark Based Distributed Deep Learning Framework For Big Data Applications

Learning Process

Figure: Model Replica 1 to Model Replica 6; Corresponding Parameter Shard; x0, x1, x2, x3, x4.

Page 40: Spark Based Distributed Deep Learning Framework For Big Data Applications

Figure panels: Cluster Nodes; Single Node

3D view of the Model (the convergence point is the global minimum)

Global minimum is the target

Page 41: Spark Based Distributed Deep Learning Framework For Big Data Applications

SGD Algorithm

procedure STARTASYNCHRONOUSLYFETCHINGPARAMETERS(parameters)
    parameters ← GETPARAMETERSFROMPARAMSERVER()

procedure STARTASYNCHRONOUSLYPUSHINGGRADIENTS(accruedgradients)
    SENDGRADIENTSTOPARAMSERVER(accruedgradients)
    accruedgradients ← 0

main
    global parameters, accruedgradients
    step ← 0
    accruedgradients ← 0
    while true do
        if (step mod n_fetch) == 0 then
            STARTASYNCHRONOUSLYFETCHINGPARAMETERS(parameters)
        data ← GETNEXTMINIBATCH()
        gradient ← COMPUTEGRADIENT(parameters, data)
        accruedgradients ← accruedgradients + gradient
        parameters ← parameters − α ∗ gradient
        if (step mod n_push) == 0 then
            STARTASYNCHRONOUSLYPUSHINGGRADIENTS(accruedgradients)
        step ← step + 1
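The pseudocode can be turned into a small single-process simulation. The sketch below is a simplification under stated assumptions: one model replica, synchronous fetch and push calls instead of asynchronous ones, a toy linear model fitted to y = 3x + 1, and hypothetical names (`ParameterServer`, `run_replica`, `compute_gradient`) and hyperparameters. It is meant to show the control flow of the loop, not the thesis implementation.

```python
import random
from typing import List, Tuple

class ParameterServer:
    """Holds the shared parameters and applies gradients pushed by replicas."""

    def __init__(self, dim: int, learning_rate: float) -> None:
        self.parameters = [0.0] * dim
        self.learning_rate = learning_rate

    def fetch(self) -> List[float]:
        return list(self.parameters)

    def push(self, accrued: List[float]) -> None:
        self.parameters = [p - self.learning_rate * g
                           for p, g in zip(self.parameters, accrued)]

def compute_gradient(parameters: List[float],
                     batch: List[Tuple[float, float]]) -> List[float]:
    """Mean squared-error gradient for the linear model y = w * x + b."""
    w, b = parameters
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    return [grad_w / len(batch), grad_b / len(batch)]

def run_replica(server: ParameterServer, data: List[Tuple[float, float]],
                steps: int, n_fetch: int, n_push: int, alpha: float,
                batch_size: int, seed: int) -> List[float]:
    rng = random.Random(seed)
    parameters = server.fetch()
    accrued = [0.0, 0.0]
    for step in range(steps):
        if step % n_fetch == 0:          # refresh the local copy
            parameters = server.fetch()
        batch = rng.sample(data, batch_size)
        gradient = compute_gradient(parameters, batch)
        accrued = [a + g for a, g in zip(accrued, gradient)]
        parameters = [p - alpha * g for p, g in zip(parameters, gradient)]
        if step % n_push == 0:           # send accumulated gradients
            server.push(accrued)
            accrued = [0.0, 0.0]
    return server.fetch()

# Noise-free toy data for y = 3x + 1 (hypothetical).
data = [(x / 100.0, 3.0 * (x / 100.0) + 1.0) for x in range(100)]
server = ParameterServer(dim=2, learning_rate=0.05)
w, b = run_replica(server, data, steps=3000, n_fetch=5, n_push=5,
                   alpha=0.05, batch_size=8, seed=0)
print(round(w, 2), round(b, 2))
```

In a distributed run, several replicas would execute this loop concurrently against the same server, so each replica may compute gradients on slightly stale parameters; `n_fetch` and `n_push` trade communication cost against that staleness.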

Page 42: Spark Based Distributed Deep Learning Framework For Big Data Applications

Experiments & Results

Sentiment Analysis

Page 43: Spark Based Distributed Deep Learning Framework For Big Data Applications

Traditional methods for representing word vectors

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 … ]

[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … ]

[government debt problems turning into banking crisis as has happened]

[saying that Europe needs unified banking regulation to replace the old]

Motel Say Good Cat Main

Snake Award Business Cola Twitter

Google Save Money Florida Post

Great Success Today Amazon Hotel

…. …. …. …. ….

Represent a word by its context

Page 44: Spark Based Distributed Deep Learning Framework For Big Data Applications

Deep Learning in Big Data

Word2Vec (distributed representation of words)

Words close to "Good" in vector space (similarity score):
Good (1.0)
Great (0.938401)
Awesome (0.8912334)
Well (0.8242320)
Fine (0.7943241)
Outstanding (0.71239)
Normal (0.640323)
… (…)

Training data: The cake was just good (trained tweet)
Test data: The cake was just great (new unseen tweet)

Page 45: Spark Based Distributed Deep Learning Framework For Big Data Applications

Word2Vec - Deep Net

Pipeline: Training Data → Tokenizer → Tokenized Data → Count Vector / Word2Vec (distributed representation) → Deep Net (nonlinear classifier) → Output
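The first stages of this pipeline can be sketched as follows: a tweet is tokenized, each known token is looked up in an embedding table, and the vectors are averaged into one fixed-length feature vector that a classifier could consume. The regular-expression tokenizer, the averaging step, and the two-dimensional embedding values are assumptions made for illustration; the slides do not specify how tweets are turned into network inputs.

```python
import re
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def tweet_vector(tokens: List[str], embeddings: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 2-dimensional word vectors standing in for Word2Vec output.
embeddings = {"good": [0.9, 0.1], "great": [0.8, 0.2], "cake": [0.1, 0.7]}

tokens = tokenize("The cake was just great")
features = tweet_vector(tokens, embeddings, dim=2)
print(tokens)
print(features)
```

Because "good" and "great" have nearby vectors, the training tweet and the unseen test tweet from the earlier slide end up with similar feature vectors.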

Page 46: Spark Based Distributed Deep Learning Framework For Big Data Applications

Deep Net Training

Page 47: Spark Based Distributed Deep Learning Framework For Big Data Applications
Page 48: Spark Based Distributed Deep Learning Framework For Big Data Applications

Cluster Specs

Assessment Cluster Specification (10 nodes)

Component       Specification
CPU             Intel Xeon 4 Core DP E5506 2.13GHz * 2EA
RAM             4GB Registered ECC DDR * 4EA
HDD             1TB SATA-2 7,200 RPM
OS              Ubuntu 12.04 LTS 64bit
Spark           Spark-1.6.0
Hadoop (HDFS)   Hadoop 2.6.0
Java            Oracle JDK 1.8.0_61 64 bit
Scala           Scala-12.9.1
Python          Python-2.7.9

Page 49: Spark Based Distributed Deep Learning Framework For Big Data Applications

Performance

Figure: Time Performance vs. Number of Nodes. X-axis: Number of Nodes in Cluster (2, 4, 6, 8, 10 nodes). Y-axis: Run Time (mins), 0 to 30.

Page 50: Spark Based Distributed Deep Learning Framework For Big Data Applications

Accuracy

Figure: Error Rate (y-axis, 0 to 50) vs. Iterations (x-axis).

Page 51: Spark Based Distributed Deep Learning Framework For Big Data Applications

Columns: N (row number), p/n (1 = positive, 0 = negative), sample from the positive and negative tweets corpus

1 0 Very sad about Iran.

2 0 where is my picture i feel naked

3 1 the cake was just great!

4 1 had a WONDERFUL day G_D is GREAT!!!!!

5 1 I have passed 70-542 exam today

6 0 #3turnoffwords this shit sucks

7 1 @alexrauchman I am happy you are staying around here.

8 1 praise God for this beautiful day!!!

9 0 probably guna get off soon since no one is talkin no more

10 0 i still Feel like a Douchebag

11 1 Just another day in paradise. ;)

12 1 No no no. Tonight goes on the books as the worst SYTYCD results show.

13 0 i couldnt even have one fairytale night

14 0 AFI are not at reading till sunday this sucks !!

Samples

Page 52: Spark Based Distributed Deep Learning Framework For Big Data Applications

Spark Metrics

Page 53: Spark Based Distributed Deep Learning Framework For Big Data Applications

Tweet Statistics

Page 54: Spark Based Distributed Deep Learning Framework For Big Data Applications

Conclusion

The main goal of this work was to build a Distributed Deep Learning Framework targeted at Big Data applications. We implemented the proposed system on top of Apache Spark, a well-known general-purpose data processing engine.

Deep network training in the proposed system relies on a well-known distributed Stochastic Gradient Descent method, namely Downpour SGD.

The system can be used to build Big Data applications or can be integrated into a Big Data analytics pipeline, as it showed satisfactory performance in terms of both time and accuracy.

However, there is a lot of room for further enhancement and new features.

Page 55: Spark Based Distributed Deep Learning Framework For Big Data Applications

Thank You For Your Attention