Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†
Machine Learning and Big Data
• Machine learning is widely used to derive useful information from large-scale data

Examples: Image Classification (Pictures), Video Analytics (Videos), Preference Prediction (User Activities), …
Big Data is Geo-Distributed
• A large amount of data is generated rapidly, all over the world
Centralizing Data is Infeasible [1, 2, 3]
• Moving data over wide-area networks (WANs) can be extremely slow
• It is also subject to data sovereignty laws

1. Vulimiri et al., NSDI'15  2. Pu et al., SIGCOMM'15  3. Viswanathan et al., OSDI'16
Geo-Distributed ML is Challenging
• No ML system is designed to run across data centers (up to 53X slowdown in our study)
Our Goal
• Develop a geo-distributed ML system
• Minimize communication over wide-area networks
• Retain the accuracy and correctness of ML algorithms
• Without requiring changes to the algorithms

Key Result: 1.8-53.5X speedup over state-of-the-art ML systems on WANs
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Background: Parameter Server Architecture
• The parameter server architecture has been widely adopted in many ML systems

[Diagram: worker machines 1..N each hold a partition of the training data (Data 1..Data N) and repeatedly read and update the ML model, which is sharded across parameter servers]
Synchronization is critical to the accuracy and correctness of ML algorithms
Deploy Parameter Servers on WANs
• Deploying parameter servers across data centers requires a lot of communication over WANs

[Diagram: parameter servers and worker machines in Data Center 1 and Data Center 2, jointly sharding one ML model across the WAN]
WAN: Low Bandwidth and High Cost
• WAN bandwidth is 15X smaller than LAN bandwidth on average, and up to 60X smaller
• In Amazon EC2, the monetary cost of WAN communication is up to 38X the cost of renting machines

[Bar chart: pairwise network bandwidth (Mb/s, 0-1000) among 11 Amazon EC2 regions: Virginia, California, Oregon, Ireland, Frankfurt, Tokyo, Seoul, Singapore, Sydney, Mumbai, São Paulo]
ML System Performance on WANs

[Bar chart: normalized execution time until convergence for Matrix Factorization, relative to LAN (= 1). IterStore: 3.7X (EC2-ALL), 3.5X (V/C WAN), 23.8X (S/S WAN). Bösen: 5.9X (EC2-ALL), 4.4X (V/C WAN), 24.2X (S/S WAN)]

1) Cui et al., "Exploiting Iterative-ness for Parallel ML Computations", SoCC'14
2) Wei et al., "Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics", SoCC'15
EC2-ALL spans 11 EC2 regions; V/C WAN is Virginia/California; S/S WAN is Singapore/São Paulo.

Running ML systems on WANs can seriously slow down ML applications
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Gaia System Overview
• Key idea: Decouple the synchronization model within the data center from the synchronization model between data centers

[Diagram: each data center keeps an approximately correct copy of the model on its own parameter servers; worker machines synchronize with their local servers (Local Sync), while parameter servers synchronize across data centers (Remote Sync)]
Communicate only significant updates over WANs
Key Finding: Study of Update Significance

[Chart: fraction of insignificant updates vs. the threshold of significant updates (10%, 5%, 1%, 0.5%, 0.1%, 0.05%, 0.01%) for Matrix Factorization, Topic Modeling, and Image Classification; headline callouts: 95.6%, 95.2%, 97.0%]

The vast majority of updates are insignificant
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Approximate Synchronous Parallel
• The significance filter: filter updates based on their significance
• ASP selective barrier: ensure significant updates are read in time
• Mirror clock: safeguard for pathological cases
The Significance Filter

[Diagram: a worker machine applies updates Δ1, Δ2 on parameter X at its local parameter server. The server aggregates them (Δ1, then Δ1 + Δ2) and evaluates the significance function |aggregated update / value| against the significance threshold (e.g., 1%); once the aggregated update is significant, it is sent over the WAN and the aggregate resets to 0]
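The filtering logic above can be sketched in a few lines. This is an illustrative sketch, not Gaia's actual code; the class and method names are hypothetical, and the significance function follows the slide's |aggregated update / value| form with a 1% threshold.

```python
class SignificanceFilter:
    """Aggregate local updates per parameter and release them over the WAN
    only once |aggregated update / value| exceeds the threshold."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold   # e.g. 1% of the current parameter value
        self.pending = {}            # parameter id -> aggregated update

    def apply_update(self, param_id, value, delta):
        """Fold in a local update; return the aggregated update to send
        over the WAN if it has become significant, else None."""
        agg = self.pending.get(param_id, 0.0) + delta
        self.pending[param_id] = agg
        if value != 0 and abs(agg / value) > self.threshold:
            self.pending[param_id] = 0.0    # reset the aggregate after sending
            return agg
        return None

f = SignificanceFilter(threshold=0.01)
assert f.apply_update("x", value=100.0, delta=0.5) is None    # 0.5%: held back
assert f.apply_update("x", value=100.0, delta=0.75) == 1.25   # 1.25%: sent
```

Insignificant updates are thus absorbed locally and only their aggregate ever crosses the WAN.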
ASP Selective Barrier

[Diagram, without the barrier: significant updates sent from Data Center 1's parameter server arrive too late at Data Center 2. With ASP: the sender first transmits a selective barrier naming the affected parameters, so Data Center 2 holds reads of those parameters until the significant updates arrive]

Only workers that depend on these parameters are blocked
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Put it All Together: The Gaia System

[Diagram: within each data center, worker machines talk to a Gaia parameter server consisting of a local server, a parameter store, and a significance filter. Across the data center boundary, a mirror client and mirror server exchange aggregated updates and selective barriers over separate control and data queues]
Control messages (barriers, etc.) are always prioritized
No change is required for ML algorithms and ML programs
Problem: Broadcast Significant Updates
Communication overhead is proportional to the number of data centers
Mitigation: Overlay Networks and Hubs
Save communication on WANs by aggregating the updates at hubs

[Diagram: data centers are organized into groups; each group's hub aggregates its members' updates and exchanges them with the other hubs]
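The aggregation step a hub performs can be sketched as follows (a hypothetical helper, not Gaia's actual code): group members forward their significant updates to the hub, which combines them per parameter and sends one message to the other hubs instead of every data center broadcasting to every other.

```python
def aggregate_at_hub(member_updates):
    """Combine per-parameter updates from a group's members into a single
    message that the hub forwards to the other hubs."""
    combined = {}
    for updates in member_updates:
        for param, delta in updates.items():
            combined[param] = combined.get(param, 0.0) + delta
    return combined

# Two data centers in one group report significant updates; the hub sends
# one combined message over the WAN instead of two separate broadcasts.
msg = aggregate_at_hub([{"w1": 0.5, "w2": -0.25}, {"w1": 0.25}])
assert msg == {"w1": 0.75, "w2": -0.25}
```

Since updates are additive, combining them at the hub preserves their effect while cutting the number of WAN messages.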
Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion
Methodology
• Applications
  • Matrix Factorization with the Netflix dataset
  • Topic Modeling with the NYTimes dataset
  • Image Classification with the ILSVRC12 dataset
• Hardware platform
  • 22 machines with emulated EC2 WAN bandwidth
  • We validated the performance with a real EC2 deployment
• Baselines
  • IterStore (Cui et al., SoCC'14) and GeePS (Cui et al., EuroSys'16) on WANs
• Performance metrics
  • Execution time until algorithm convergence
  • Monetary cost of algorithm convergence
Performance – 11 EC2 Data Centers

[Bar chart: normalized execution time (Baseline, Gaia, LAN) for Matrix Factorization, Topic Modeling, and Image Classification; annotated speedups over Baseline: 3.8X, 3.7X, 6.0X and 3.7X, 4.8X, 8.5X]

Gaia achieves 3.7-6.0X speedup over Baseline
Gaia is at most 1.40X of LAN speeds
Performance and WAN Bandwidth

V/C WAN (Virginia/California): [bar chart, normalized execution time (Baseline, Gaia, LAN) for Matrix Factorization, Topic Modeling, and Image Classification; annotated speedups: 3.7X, 3.7X, 7.4X and 3.5X, 3.9X, 7.4X]

S/S WAN (Singapore/São Paulo): [bar chart, same layout; annotated speedups: 25.4X, 14.1X, 53.5X and 23.8X, 17.3X, 53.7X]

Gaia achieves 3.7-53.5X speedup over Baseline
Gaia is at most 1.23X of LAN speeds
Results – EC2 Monetary Cost

[Bar charts: normalized cost (communication cost, machine cost for network, machine cost for compute) for Baseline vs. Gaia on EC2-ALL, V/C WAN, and S/S WAN. Cost savings of Gaia: Matrix Factorization 2.6X, 5.7X, 18.7X; Topic Modeling 4.2X, 6.0X, 28.5X; Image Classification 8.5X, 10.7X, 59.0X]

Gaia is 2.6-59.0X cheaper than Baseline
More in the Paper
• Convergence proof of Approximate Synchronous Parallel (ASP)
• ASP vs. fully asynchronous
• Gaia vs. the centralizing-data approach
Key Takeaways
• The Problem: How to perform ML on geo-distributed data?
  • Centralizing data is infeasible; geo-distributed ML is very slow
• Our Gaia Approach
  • Decouple the synchronization model within the data center from that across data centers
  • Eliminate insignificant updates across data centers
  • A new synchronization model: Approximate Synchronous Parallel
  • Retain the correctness and accuracy of ML algorithms
• Key Results:
  • 1.8-53.5X speedup over state-of-the-art ML systems on WANs
  • At most 1.40X of LAN speeds
  • Without requiring changes to algorithms
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†
Executive Summary
• The Problem: How to perform ML on geo-distributed data?
  • Centralizing data is infeasible; geo-distributed ML is very slow
• Our Goal
  • Minimize communication over WANs
  • Retain the correctness and accuracy of ML algorithms
  • Without requiring changes to ML algorithms
• Our Gaia Approach
  • Decouple the synchronization model within the data center from that across data centers: eliminate insignificant updates on WANs
  • A new synchronization model: Approximate Synchronous Parallel
• Key Results:
  • 1.8-53.5X speedup over state-of-the-art ML systems on WANs
  • Within 1.40X of LAN speeds
Approximate Synchronous Parallel
• The significance filter: filter updates based on their significance
• ASP selective barrier: ensure significant updates are read in time
• Mirror clock: safeguard for pathological cases
Mirror Clock

[Diagram: without the safeguard, there is no guarantee under extreme network conditions. With the mirror clock, each data center's parameter server reports its clock to the others at the end of every iteration; a server at clock N + DS blocks (barrier) until its mirrors have reached clock N, guaranteeing all significant updates are seen after DS clocks]
Effect of Synchronization Mechanisms

[Plots: objective value vs. time for Gaia and Gaia_Async against the convergence value; Matrix Factorization (0-1000 seconds) and Topic Modeling (0-350 seconds)]
Methodology Details
• Hardware
  • A 22-node cluster; each node has a 16-core Intel Xeon CPU (E5-2698), an NVIDIA Titan X GPU, 64GB RAM, and a 40GbE NIC
• Application details
  • Matrix Factorization: SGD algorithm, 500 ranks
  • Topic Modeling: Gibbs sampling, 500 topics
• Convergence criteria
  • The value of the objective function changes less than 2% over the course of 10 iterations
• Significance threshold
  • 1%, and it shrinks over time
ML System Performance Comparison
• IterStore [Cui et al., SoCC'14] shows 10X performance improvement over PowerGraph [Gonzalez et al., OSDI'12] for Matrix Factorization
• PowerGraph matches the performance of GraphX [Gonzalez et al., OSDI'14], a Spark-based system
Matrix Factorization (1/3)
• Matrix factorization (also known as collaborative filtering) is a technique commonly used in recommender systems
Matrix Factorization (2/3)

[Diagram: the user-movie rating matrix is approximated by the product of a user-preference parameter matrix (θ) and a movie parameter matrix (x), each of a given rank]
Matrix Factorization (3/3)
• Objective function (L2 regularization)
• Solve with stochastic gradient descent (SGD)
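A toy SGD sketch of this factorization (a generic illustration at rank 2, not the paper's 500-rank Netflix setup; the function name and hyperparameters are illustrative):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn theta (user preference params) and x (movie params) so that
    theta[u] @ x[i] approximates the observed rating r for each (u, i, r),
    with L2 regularization on both factors."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal((n_users, rank))
    x = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for u, i, r in ratings:              # iterate observed entries only
            err = r - theta[u] @ x[i]        # prediction error
            theta_u = theta[u].copy()        # use pre-update value for x's step
            theta[u] += lr * (err * x[i] - reg * theta[u])
            x[i] += lr * (err * theta_u - reg * x[i])
    return theta, x

# Two users who both rate movie 0 much higher than movie 1:
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
theta, x = sgd_mf(ratings, n_users=2, n_items=2)
assert theta[0] @ x[0] > theta[0] @ x[1]     # learned ordering matches ratings
```

Each SGD step touches only the one user row and one movie row involved in the rating, which is what makes the algorithm easy to parallelize under a parameter server.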
Background – BSP
• BSP (Bulk Synchronous Parallel)
  • All machines need to receive all updates before proceeding to the next iteration

[Timeline: Workers 1-3 advance through clocks 0-3 in lockstep]
Background – SSP
• SSP (Stale Synchronous Parallel)
  • Allows the fastest worker to run ahead of the slowest worker by a bounded number of iterations

[Timeline: Workers 1-3 over clocks 0-3 with staleness = 1]
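The SSP progress rule can be stated in one line (an illustrative sketch; the function name is hypothetical): a worker may start a clock only if it would stay within the staleness bound of the slowest worker. With staleness = 0 this degenerates to BSP.

```python
def can_proceed(next_clock, all_clocks, staleness):
    """True if a worker may start iteration next_clock, i.e. it would be at
    most `staleness` clocks ahead of the slowest worker."""
    return next_clock - min(all_clocks) <= staleness

# With staleness = 1: a worker may start clock 2 while the slowest worker
# is at clock 1, but must block before clock 3.
assert can_proceed(next_clock=2, all_clocks=[1, 1], staleness=1)
assert not can_proceed(next_clock=3, all_clocks=[2, 1], staleness=1)
```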
Compare Against Centralizing Approach

| Application | Setting | Gaia Speedup over Centralize | Gaia-to-Centralize Cost Ratio |
|---|---|---|---|
| Matrix Factorization | EC2-ALL | 1.11 | 3.54 |
| Matrix Factorization | V/C WAN | 1.22 | 1.00 |
| Matrix Factorization | S/S WAN | 2.13 | 1.17 |
| Topic Modeling | EC2-ALL | 0.80 | 6.14 |
| Topic Modeling | V/C WAN | 1.02 | 1.26 |
| Topic Modeling | S/S WAN | 1.25 | 1.92 |
| Image Classification | EC2-ALL | 0.76 | 3.33 |
| Image Classification | V/C WAN | 1.12 | 1.07 |
| Image Classification | S/S WAN | 1.86 | 1.08 |
SSP Performance – 11 Data Centers (Matrix Factorization)

[Bar chart: normalized execution time for Baseline, Gaia, and LAN under BSP and SSP, on Amazon-EC2, Emulation-EC2, and Emulation-Full-Speed; annotated speedups: 2.0X, 2.0X, 1.5X, 1.5X, 1.8X, 1.8X, 1.3X, 1.3X, 3.8X, 3.7X, 3.0X, 2.7X]
SSP Performance – 11 Data Centers (Topic Modeling)

[Bar chart: normalized execution time for Baseline, Gaia, and LAN under BSP and SSP, on Emulation-EC2 and Emulation-Full-Speed; annotated speedups: 2.0X, 2.5X, 1.5X, 1.7X, 3.7X, 4.8X, 2.0X, 3.5X]
SSP Performance – V/C WAN

[Bar charts: normalized execution time (Baseline, Gaia, LAN) under BSP and SSP. Matrix Factorization: 3.7X, 2.6X, 3.5X, 2.3X. Topic Modeling: 3.7X, 3.1X, 3.9X, 3.2X]
SSP Performance – S/S WAN

[Bar charts: normalized execution time (Baseline, Gaia, LAN) under BSP and SSP. Matrix Factorization: 25X, 16X, 24X, 14X. Topic Modeling: 14X, 17X, 17X, 21X]