Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†
Machine Learning and Big Data
• Machine learning is widely used to derive useful information from large-scale data

Examples: Image Classification (Pictures), Video Analytics (Videos), Preference Prediction (User Activities), …
Big Data is Geo-Distributed
• A large amount of data is generated rapidly, all over the world
Centralizing Data is Infeasible [1, 2, 3]
• Moving data over wide-area networks (WANs) can be extremely slow
• It is also subject to data sovereignty laws

1. Vulimiri et al., NSDI'15  2. Pu et al., SIGCOMM'15  3. Viswanathan et al., OSDI'16
Geo-Distributed ML is Challenging
• No ML system is designed to run across data centers (up to 53X slowdown in our study)
Our Goal
• Develop a geo-distributed ML system
• Minimize communication over wide-area networks
• Retain the accuracy and correctness of ML algorithms
• Without requiring changes to the algorithms

Key Result: 1.8-53.5X speedup over state-of-the-art ML systems on WANs
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Background: Parameter Server Architecture
• The parameter server architecture has been widely adopted in many ML systems

[Diagram: worker machines 1..N each hold a partition of the training data (Data 1..Data N) and repeatedly read and update the ML model, which is sharded across parameter servers]
Synchronization is critical to the accuracy and correctness of ML algorithms
Deploy Parameter Servers on WANs
• Deploying parameter servers across data centers requires a lot of communication over WANs

[Diagram: parameter servers and worker machines in Data Center 1 and Data Center 2, jointly sharding one ML model across the WAN]
WAN: Low Bandwidth and High Cost
• WAN bandwidth is 15X smaller than LAN bandwidth on average, and up to 60X smaller
• In Amazon EC2, the monetary cost of WAN communication is up to 38X the cost of renting machines

[Bar chart: pairwise network bandwidth (Mb/s, 0-1000) among 11 Amazon EC2 regions: Virginia, California, Oregon, Ireland, Frankfurt, Tokyo, Seoul, Singapore, Sydney, Mumbai, São Paulo]
ML System Performance on WANs

[Bar chart: normalized execution time until convergence for Matrix Factorization, relative to LAN (= 1). IterStore: 3.7X (EC2-ALL), 3.5X (V/C WAN), 23.8X (S/S WAN). Bösen: 5.9X (EC2-ALL), 4.4X (V/C WAN), 24.2X (S/S WAN)]

1) Cui et al., "Exploiting Iterative-ness for Parallel ML Computations", SoCC'14
2) Wei et al., "Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics", SoCC'15
EC2-ALL spans 11 EC2 regions; V/C WAN is Virginia/California; S/S WAN is Singapore/São Paulo.

Running ML systems on WANs can seriously slow down ML applications
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Gaia System Overview
• Key idea: Decouple the synchronization model within the data center from the synchronization model between data centers

[Diagram: each data center keeps an approximately correct copy of the model on its own parameter servers; worker machines synchronize with their local servers (Local Sync), while parameter servers synchronize across data centers (Remote Sync)]
Communicate only significant updates over WANs
Key Finding: Study of Update Significance

[Chart: fraction of insignificant updates vs. the threshold of significant updates (10%, 5%, 1%, 0.5%, 0.1%, 0.05%, 0.01%) for Matrix Factorization, Topic Modeling, and Image Classification; headline callouts: 95.6%, 95.2%, 97.0%]

The vast majority of updates are insignificant
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Approximate Synchronous Parallel
• The significance filter: filter updates based on their significance
• ASP selective barrier: ensure significant updates are read in time
• Mirror clock: safeguard for pathological cases
The Significance Filter

[Diagram: a worker machine applies updates Δ1, Δ2 on parameter X at its local parameter server. The server aggregates them (Δ1, then Δ1 + Δ2) and evaluates the significance function |aggregated update / value| against the significance threshold (e.g., 1%); once the aggregated update is significant, it is sent over the WAN and the aggregate resets to 0]
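The filtering logic above can be sketched in a few lines. This is an illustrative sketch, not Gaia's actual code; the class and method names are hypothetical, and the significance function follows the slide's |aggregated update / value| form with a 1% threshold.

```python
class SignificanceFilter:
    """Aggregate local updates per parameter and release them over the WAN
    only once |aggregated update / value| exceeds the threshold."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold   # e.g. 1% of the current parameter value
        self.pending = {}            # parameter id -> aggregated update

    def apply_update(self, param_id, value, delta):
        """Fold in a local update; return the aggregated update to send
        over the WAN if it has become significant, else None."""
        agg = self.pending.get(param_id, 0.0) + delta
        self.pending[param_id] = agg
        if value != 0 and abs(agg / value) > self.threshold:
            self.pending[param_id] = 0.0    # reset the aggregate after sending
            return agg
        return None

f = SignificanceFilter(threshold=0.01)
assert f.apply_update("x", value=100.0, delta=0.5) is None    # 0.5%: held back
assert f.apply_update("x", value=100.0, delta=0.75) == 1.25   # 1.25%: sent
```

Insignificant updates are thus absorbed locally and only their aggregate ever crosses the WAN.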
ASP Selective Barrier

[Diagram, without the barrier: significant updates sent from Data Center 1's parameter server arrive too late at Data Center 2. With ASP: the sender first transmits a selective barrier naming the affected parameters, so Data Center 2 holds reads of those parameters until the significant updates arrive]

Only workers that depend on these parameters are blocked
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Put it All Together: The Gaia System

[Diagram: within each data center, worker machines talk to a Gaia parameter server consisting of a local server, a parameter store, and a significance filter. Across the data center boundary, a mirror client and mirror server exchange aggregated updates and selective barriers over separate control and data queues]
Control messages (barriers, etc.) are always prioritized
No change is required for ML algorithms and ML programs
Problem: Broadcast Significant Updates
Communication overhead is proportional to the number of data centers
Mitigation: Overlay Networks and Hubs
Save communication on WANs by aggregating the updates at hubs

[Diagram: data centers are organized into groups; each group's hub aggregates its members' updates and exchanges them with the other hubs]
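The aggregation step a hub performs can be sketched as follows (a hypothetical helper, not Gaia's actual code): group members forward their significant updates to the hub, which combines them per parameter and sends one message to the other hubs instead of every data center broadcasting to every other.

```python
def aggregate_at_hub(member_updates):
    """Combine per-parameter updates from a group's members into a single
    message that the hub forwards to the other hubs."""
    combined = {}
    for updates in member_updates:
        for param, delta in updates.items():
            combined[param] = combined.get(param, 0.0) + delta
    return combined

# Two data centers in one group report significant updates; the hub sends
# one combined message over the WAN instead of two separate broadcasts.
msg = aggregate_at_hub([{"w1": 0.5, "w2": -0.25}, {"w1": 0.25}])
assert msg == {"w1": 0.75, "w2": -0.25}
```

Since updates are additive, combining them at the hub preserves their effect while cutting the number of WAN messages.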
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Methodology
• Applications
  • Matrix Factorization with the Netflix dataset
  • Topic Modeling with the NYTimes dataset
  • Image Classification with the ILSVRC12 dataset
• Hardware platform
  • 22 machines with emulated EC2 WAN bandwidth
  • We validated the performance with a real EC2 deployment
• Baselines
  • IterStore (Cui et al., SoCC'14) and GeePS (Cui et al., EuroSys'16) on WANs
• Performance metrics
  • Execution time until algorithm convergence
  • Monetary cost of algorithm convergence
Performance – 11 EC2 Data Centers

[Bar chart: normalized execution time (Baseline, Gaia, LAN) for Matrix Factorization, Topic Modeling, and Image Classification; annotated speedups over Baseline: 3.8X, 3.7X, 6.0X and 3.7X, 4.8X, 8.5X]

Gaia achieves 3.7-6.0X speedup over Baseline
Gaia is at most 1.40X of LAN speeds
Performance and WAN Bandwidth

V/C WAN (Virginia/California): [bar chart, normalized execution time (Baseline, Gaia, LAN) for Matrix Factorization, Topic Modeling, and Image Classification; annotated speedups: 3.7X, 3.7X, 7.4X and 3.5X, 3.9X, 7.4X]

S/S WAN (Singapore/São Paulo): [bar chart, same layout; annotated speedups: 25.4X, 14.1X, 53.5X and 23.8X, 17.3X, 53.7X]

Gaia achieves 3.7-53.5X speedup over Baseline
Gaia is at most 1.23X of LAN speeds
Results – EC2 Monetary Cost

[Bar charts: normalized cost (communication cost, machine cost for network, machine cost for compute) for Baseline vs. Gaia on EC2-ALL, V/C WAN, and S/S WAN. Cost savings of Gaia: Matrix Factorization 2.6X, 5.7X, 18.7X; Topic Modeling 4.2X, 6.0X, 28.5X; Image Classification 8.5X, 10.7X, 59.0X]

Gaia is 2.6-59.0X cheaper than Baseline
More in the Paper
• Convergence proof of Approximate Synchronous Parallel (ASP)
• ASP vs. fully asynchronous
• Gaia vs. the centralizing-data approach
Key Takeaways
• The Problem: How to perform ML on geo-distributed data?
  • Centralizing data is infeasible; geo-distributed ML is very slow
• Our Gaia Approach
  • Decouple the synchronization model within the data center from that across data centers
  • Eliminate insignificant updates across data centers
  • A new synchronization model: Approximate Synchronous Parallel
  • Retain the correctness and accuracy of ML algorithms
• Key Results:
  • 1.8-53.5X speedup over state-of-the-art ML systems on WANs
  • At most 1.40X of LAN speeds
  • Without requiring changes to algorithms
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†
Executive Summary
• The Problem: How to perform ML on geo-distributed data?
  • Centralizing data is infeasible; geo-distributed ML is very slow
• Our Goal
  • Minimize communication over WANs
  • Retain the correctness and accuracy of ML algorithms
  • Without requiring changes to ML algorithms
• Our Gaia Approach
  • Decouple the synchronization model within the data center from that across data centers: eliminate insignificant updates on WANs
  • A new synchronization model: Approximate Synchronous Parallel
• Key Results:
  • 1.8-53.5X speedup over state-of-the-art ML systems on WANs
  • Within 1.40X of LAN speeds
Approximate Synchronous Parallel
• The significance filter: filter updates based on their significance
• ASP selective barrier: ensure significant updates are read in time
• Mirror clock: safeguard for pathological cases
Mirror Clock

[Diagram: without the safeguard, there is no guarantee under extreme network conditions. With the mirror clock, each data center's parameter server reports its clock to the others at the end of every iteration; a server at clock N + DS blocks (barrier) until its mirrors have reached clock N, guaranteeing all significant updates are seen after DS clocks]
Effect of Synchronization Mechanisms

[Plots: objective value vs. time for Gaia and Gaia_Async against the convergence value; Matrix Factorization (0-1000 seconds) and Topic Modeling (0-350 seconds)]
Methodology Details
• Hardware
  • A 22-node cluster; each node has a 16-core Intel Xeon CPU (E5-2698), an NVIDIA Titan X GPU, 64GB RAM, and a 40GbE NIC
• Application details
  • Matrix Factorization: SGD algorithm, 500 ranks
  • Topic Modeling: Gibbs sampling, 500 topics
• Convergence criteria
  • The value of the objective function changes less than 2% over the course of 10 iterations
• Significance threshold
  • 1%, and it shrinks over time
ML System Performance Comparison
• IterStore [Cui et al., SoCC'14] shows 10X performance improvement over PowerGraph [Gonzalez et al., OSDI'12] for Matrix Factorization
• PowerGraph matches the performance of GraphX [Gonzalez et al., OSDI'14], a Spark-based system
Matrix Factorization (1/3)
• Matrix factorization (also known as collaborative filtering) is a technique commonly used in recommender systems
Matrix Factorization (2/3)

[Diagram: the user-movie rating matrix is approximated by the product of a user-preference parameter matrix (θ) and a movie parameter matrix (x), each of a given rank]
Matrix Factorization (3/3)
• Objective function (L2 regularization)
• Solve with stochastic gradient descent (SGD)
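A toy SGD sketch of this factorization (a generic illustration at rank 2, not the paper's 500-rank Netflix setup; the function name and hyperparameters are illustrative):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn theta (user preference params) and x (movie params) so that
    theta[u] @ x[i] approximates the observed rating r for each (u, i, r),
    with L2 regularization on both factors."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal((n_users, rank))
    x = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for u, i, r in ratings:              # iterate observed entries only
            err = r - theta[u] @ x[i]        # prediction error
            theta_u = theta[u].copy()        # use pre-update value for x's step
            theta[u] += lr * (err * x[i] - reg * theta[u])
            x[i] += lr * (err * theta_u - reg * x[i])
    return theta, x

# Two users who both rate movie 0 much higher than movie 1:
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
theta, x = sgd_mf(ratings, n_users=2, n_items=2)
assert theta[0] @ x[0] > theta[0] @ x[1]     # learned ordering matches ratings
```

Each SGD step touches only the one user row and one movie row involved in the rating, which is what makes the algorithm easy to parallelize under a parameter server.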
Background – BSP
• BSP (Bulk Synchronous Parallel)
  • All machines need to receive all updates before proceeding to the next iteration

[Timeline: Workers 1-3 advance through clocks 0-3 in lockstep]
Background – SSP
• SSP (Stale Synchronous Parallel)
  • Allows the fastest worker to run ahead of the slowest worker by a bounded number of iterations

[Timeline: Workers 1-3 over clocks 0-3 with staleness = 1]
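The SSP progress rule can be stated in one line (an illustrative sketch; the function name is hypothetical): a worker may start a clock only if it would stay within the staleness bound of the slowest worker. With staleness = 0 this degenerates to BSP.

```python
def can_proceed(next_clock, all_clocks, staleness):
    """True if a worker may start iteration next_clock, i.e. it would be at
    most `staleness` clocks ahead of the slowest worker."""
    return next_clock - min(all_clocks) <= staleness

# With staleness = 1: a worker may start clock 2 while the slowest worker
# is at clock 1, but must block before clock 3.
assert can_proceed(next_clock=2, all_clocks=[1, 1], staleness=1)
assert not can_proceed(next_clock=3, all_clocks=[2, 1], staleness=1)
```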
Compare Against Centralizing Approach

| Application | Setting | Gaia Speedup over Centralize | Gaia-to-Centralize Cost Ratio |
|---|---|---|---|
| Matrix Factorization | EC2-ALL | 1.11 | 3.54 |
| Matrix Factorization | V/C WAN | 1.22 | 1.00 |
| Matrix Factorization | S/S WAN | 2.13 | 1.17 |
| Topic Modeling | EC2-ALL | 0.80 | 6.14 |
| Topic Modeling | V/C WAN | 1.02 | 1.26 |
| Topic Modeling | S/S WAN | 1.25 | 1.92 |
| Image Classification | EC2-ALL | 0.76 | 3.33 |
| Image Classification | V/C WAN | 1.12 | 1.07 |
| Image Classification | S/S WAN | 1.86 | 1.08 |
SSP Performance – 11 Data Centers (Matrix Factorization)

[Bar chart: normalized execution time for Baseline, Gaia, and LAN under BSP and SSP, on Amazon-EC2, Emulation-EC2, and Emulation-Full-Speed; annotated speedups: 2.0X, 2.0X, 1.5X, 1.5X, 1.8X, 1.8X, 1.3X, 1.3X, 3.8X, 3.7X, 3.0X, 2.7X]
SSP Performance – 11 Data Centers (Topic Modeling)

[Bar chart: normalized execution time for Baseline, Gaia, and LAN under BSP and SSP, on Emulation-EC2 and Emulation-Full-Speed; annotated speedups: 2.0X, 2.5X, 1.5X, 1.7X, 3.7X, 4.8X, 2.0X, 3.5X]
SSP Performance – V/C WAN

[Bar charts: normalized execution time (Baseline, Gaia, LAN) under BSP and SSP. Matrix Factorization: 3.7X, 2.6X, 3.5X, 2.3X. Topic Modeling: 3.7X, 3.1X, 3.9X, 3.2X]
SSP Performance – S/S WAN

[Bar charts: normalized execution time (Baseline, Gaia, LAN) under BSP and SSP. Matrix Factorization: 25X, 16X, 24X, 14X. Topic Modeling: 14X, 17X, 17X, 21X]