ignacio (nacho) cano · ignacio (nacho) cano ph.d. defense paul g. allen school of computer science...

100
Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018

Upload: others

Post on 14-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Ignacio (Nacho) Cano

Ph.D. Defense

Paul G. Allen School of Computer Science & Engineering

University of Washington

December 2018

Page 2: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

••

Page 3: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

HDD TIER

SSD TIER

Page 4: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

HDD TIER

SSD TIER

Page 5: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Virtualization Layer

Physical Layer

HYPERVISOR

VM VM VM VM

HYPERVISOR

VM VM VM VM

Page 6: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Virtualization Layer

Physical Layer

HYPERVISOR

VM VM VMVM

HYPERVISOR

VMVM

VMVM

Page 7: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

West-US East-US

Page 8: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

West-US East-US

Page 9: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Sub-optimal systems efficiency and performance

Page 10: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Improve systems efficiency and performance

Page 11: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Machine Learning to the rescue!

Page 12: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

1. CURATOR: ML-based policy

2. ADARES: ML-based mechanism

3. PULPO: ML-System co-design

Page 13: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 14: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 15: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 16: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

historical traces

transfer learning

Page 17: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

sensing

reward

Page 18: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

safe online exploration

uncertainty

Page 19: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 20: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Scheduling of these tasks is key to overall cluster performance

Page 21: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• reinforcement learning

Page 22: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• reinforcement learning

Page 23: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

50% of clusters have 75% usage

Many clusters waste 25% of fast storage

Need smarter scheduling policies

Page 24: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

AGENT

ENVIRONMENT

Action atState st Reward rt

st+1

rt+1

Page 25: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 26: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

0

5

10

15

20

25

30

35

40

oltp oltp skewed oltp varying oltp and vdi dss

Impr

ovem

ent (

%)

latency ssd reads

METRIC AVG IMPROVEMENT

LATENCY 12 %

SSD READS 16 %

Page 27: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

reinforcement learning

Page 28: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 29: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 30: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Need a system that adaptively changes resources allocated to VMs in a cluster

Page 31: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• contextual bandits

• decoupled highly configurable

Page 32: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• contextual bandits

• decoupled highly configurable

Page 33: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

AGENT

ENVIRONMENT

Arm/Action atContext xt Reward rt

Page 34: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 35: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 36: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Adaptive and improve over time

Page 37: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Arm/Action atContext xt

Reward rt

SOFTWARE AGENT

CLUSTER (ENVIRONMENT)

Sensing

cluster-node-vmmeasurements

Execution

Filtering

recommendation

Predictive

predict+

learn

Decision

exploitexplore

+business

logic

Page 38: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 39: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Less under and overprovisioning

Start End

Under and overprovisioning

Page 40: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Increases utilization Keeps up with IOPS

Page 41: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

contextual bandits ••

• decoupled highly configurable•

Page 42: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 43: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 44: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 45: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Need a system to efficiently train ML models from geo-distributed datasets

Page 46: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Trade-off in-DC computation and communication X-DC communication

• Apache Hadoop YARN and Apache REEF

Page 47: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Trade-off in-DC computation and communication X-DC communication

• Apache Hadoop YARN and Apache REEF

Page 48: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

West-US Germany

Canada

DC Coordinator

More in-DC cmp. and comm.Less X-DC communication

Page 49: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 50: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Reduce communication between data centers

Page 51: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Multi-level communication trees across DCs

• Learning algorithms in terms of B/R primitives

Application Layer (DML/GDML)

• Generalized control plane

• Data aggregation, communication, etc. Apache REEF

• Federated version

• Single massive YARN cluster

• Network-aware resource requests

Apache Hadoop YARN

Page 52: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 53: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 54: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

● ●

101

103

2 4 6 8Data centers

X−D

C T

rans

fer (

GB)

● centralizeddistributeddistributed−fadl

CRITEO 10M CRITEO 50M – 2DC

●●

10−5

10−3

10−1

100 101 102 103

X−DC Transfer (GB)

Rel

ative

obj

. fun

ctio

n ● centralizeddistributeddistributed−fadl

Better models soonerOrders of magnitude reduction

Page 55: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

KAGGLE 500K KAGGLE 5M

10−7

10−5

10−3

10−1

0 2 4 6 8Relative Time

Rel

ative

obj

. fun

ctio

n centralized−bulkcentralized−streamdistributeddistributed−fadl

10−7

10−5

10−3

10−1

0 10 20 30Relative Time

Rel

ative

obj

. fun

ctio

n centralized−bulkcentralized−streamdistributeddistributed−fadl

Close to optimal in low dimensional models

Deteriorates with model size

Page 56: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

in-DC comp. and comm. X-DC communication

Apache Hadoop YARN Apache REEF

Page 57: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• Motivation• Challenges• CURATOR

• ADARES

• PULPO

• Conclusions

Page 58: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

We can use Machine Learning to optimize Distributed Systems

• CURATOR• RL-based scheduling

• ADARES• Contextual bandits-based adjustments

• PULPO• Co-designed ML-System

Page 59: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 60: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 61: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

We can use Machine Learning to optimize Distributed Systems

• CURATOR• RL-based scheduling

• ADARES• Contextual bandits-based adjustments

• PULPO• Co-designed ML-System

Page 62: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 63: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 64: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Operations interposed at the hypervisor level

and redirected to CVMs

Global view of cluster state

Integrated Compute-Storage

Data replicationDisk balancingVM Migration

Global view of cluster state

Scale

Page 65: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

KEY-VALUE STORE

I/O SERVICE(foreground)

STORAGE MGMT(background)

Page 66: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

KEY-VALUE STORE

STORAGE MGMTuses MapReduce

I/O SERVICE

Page 67: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

• extentsextent groups

eid egid

33 1984

44 1984

56 2010

Extent Id Map

egid physical location

1984 disk1,disk2

… …

Extent Group Id Map

file

33 44 56

0 1 2 3 4

Page 68: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 69: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Node A

Node B

Node C

Node D

Curator (A)Slave

Curator (C)Slave

Curator (D)Slave

Curator (B)Master

Metadata Ring

Page 70: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Curator (A)

Curator (C)

Curator (D) Curator (B)

120 -> mtime

120 -> atime

(120,mtime,atime)

Page 71: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 72: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Q(st, at) = (1� ↵)Q(st, at) + ↵(rt+1 + �maxaQ(st+1, a))

0 � 1

0 ↵ 1

Learned Value

Estimate of optimal future value

Learning Rate

Discount Factor

Old Value

Page 73: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

LEARNED not to run in those cases

run although cluster highly utilized and low latency

Page 74: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 75: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 76: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 77: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Hypervisor

CVM

CPUMemory

SSD

SSD

HDD

HDD

SCSI Controller

Sensing

I/O Hypervisor

CVM

CPUMemory

SSD

SSD

HDD

HDD

SCSI Controller

Sensing

I/O Hypervisor

CVM

CPUMemory

SSD

SSD

HDD

HDD

SCSI Controller

Sensing

I/O

VM Management Software

Cluster Manager

Execution

Decision

Predictive

Filtering

Page 78: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 79: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

w/o transfer learningmem noop

with transfer learning mem up

Low memory context

Page 80: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 81: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 82: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 83: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

West-US

More computationLess communication

Page 84: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •
Page 85: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Initialize w0

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

Page 86: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send w0

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

Page 87: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Compute Gradient

Compute Gradient

Compute Gradient

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

Page 88: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send Gradient

Send Gradient

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

Page 89: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Aggregate gradient

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

Page 90: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send aggregate gradient

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

Page 91: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Optimize

f̂1

Optimize

f̂3

Optimize

f̂2

DC-2 DC-3

DC-1 / Coordinator

no X-DC communication

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

Page 92: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send Descent Direction 3

Send Descent

Direction 2

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

Page 93: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Aggregate descent

direction

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

Page 94: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send aggregate descent

direction

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

Page 95: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Line Search

Line SearchLine Search

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

6. DCs do line search in parallel

Page 96: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send Line Search Results

Send Line Search Results

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

6. DCs do line search in parallel

Page 97: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Update model

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

6. DCs do line search in parallel

7. Update model with best step size

Page 98: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Send w1

DC-2 DC-3

DC-1 / Coordinator

1. Initialize w0

2. DCs compute gradient in parallel

3. Aggregate gradient

4. DCs local optimization in parallel

5. Aggregate descent direction

6. DCs do line search in parallel

7. Update model with best step size

8. Repeat

Page 99: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •

Federated YARN cluster

Container

DC-1 DC-2 DC-3

DC Slaves

Global Master

X-DC linksX-DC linksDC Masters

Page 100: Ignacio (Nacho) Cano · Ignacio (Nacho) Cano Ph.D. Defense Paul G. Allen School of Computer Science & Engineering University of Washington December 2018 • •