Jockey - EuroSys presentation (cs.brown.edu/~adf/work/EuroSys2012-talk.pdf)
TRANSCRIPT
Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
DATA PARALLEL CLUSTERS
Predictability
Deadline
VARIABLE LATENCY
[Figure: CDF of job latency. x-axis: latency [minutes], 0 to 40; y-axis: CDF, 0 to 1; annotation: 4.3x]
Why does latency vary?
1. Pipeline complexity
2. Noisy execution environment
Cosmos
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore
• Dryad
• SCOPE
DRYAD’S DAG WORKFLOW
[Figure: DAG workflow diagram. Labels: Cosmos Cluster; Pipeline; Job; Deadline; Stage; Tasks]
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough
Weights? Difficult for users to set
Utility curves? Capture deadline & penalty
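A utility curve can be read as a function from completion time to value: full value up to the deadline, then a penalty that grows with lateness. The sketch below is one illustrative shape only; the function name `make_utility`, the linear penalty, and the parameters `penalty_per_min` and `floor` are assumptions for illustration, not taken from the talk.

```python
def make_utility(deadline_min, max_utility=1.0, penalty_per_min=0.1, floor=-1.0):
    """Build a piecewise-linear utility curve over completion time (minutes).

    Hypothetical shape: constant utility until the deadline, then a linear
    drop per minute of lateness, bounded below by `floor`.
    """
    def utility(completion_min):
        if completion_min <= deadline_min:
            return max_utility
        late = completion_min - deadline_min
        return max(floor, max_utility - penalty_per_min * late)
    return utility


u = make_utility(deadline_min=50)
print(u(45))   # 1.0  (finished before the deadline)
print(u(55))   # 0.5  (5 minutes late)
print(u(120))  # -1.0 (penalty capped at the floor)
```

A single priority or weight cannot express "worth this much until minute 50, then rapidly less", which is the kind of information such a curve carries.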
OUR GOAL
Maximize utility while minimizing resources by dynamically adjusting the allocation
Jockey
• Large clusters
• Many users
• Prior execution
JOCKEY – MODEL
f(job state, allocation) -> remaining run time
JOCKEY – CONTROL LOOP
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
JOCKEY – PROGRESS INDICATOR
Per stage: (total running + total queuing) × (# complete / total tasks)
Job progress = Stage 1 + Stage 2 + Stage 3
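The progress indicator on these slides combines, for each stage, the total running plus total queuing time with the fraction of that stage's tasks that are complete, and then adds the stages together. A minimal sketch follows, assuming the two per-stage quantities are multiplied and that the sum is normalized to a 0 to 100 scale; the `Stage` type, its field names, and the normalization are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    total_running: float  # total running time of the stage's tasks, e.g. from a prior run
    total_queuing: float  # total queuing time of the stage's tasks
    complete: int         # number of tasks finished so far
    total_tasks: int      # number of tasks in the stage


def job_progress(stages):
    """Return progress in percent: each stage's work weighted by its completed fraction."""
    done = sum(
        (s.total_running + s.total_queuing) * s.complete / s.total_tasks
        for s in stages
    )
    total = sum(s.total_running + s.total_queuing for s in stages)
    return 100.0 * done / total


stages = [
    Stage(total_running=100, total_queuing=20, complete=10, total_tasks=10),
    Stage(total_running=50, total_queuing=10, complete=5, total_tasks=10),
    Stage(total_running=200, total_queuing=20, complete=0, total_tasks=4),
]
print(job_progress(stages))  # 37.5
```

Weighting by time rather than by task count means a stage with few but long tasks still moves the indicator in proportion to the work it represents.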
[Figure: job progress (0 to 100) and estimated job completion [min] (0 to 100), both plotted against time [min] (0 to 60)]
JOCKEY – CONTROL LOOP
Job model

| | 10 nodes | 20 nodes | 30 nodes |
|---|---|---|---|
| 1% complete | 60 minutes | 40 minutes | 25 minutes |
| 2% complete | 59 minutes | 39 minutes | 24 minutes |
| 3% complete | 58 minutes | 37 minutes | 22 minutes |
| 4% complete | 56 minutes | 36 minutes | 21 minutes |
| 5% complete | 54 minutes | 34 minutes | 20 minutes |

Deadline: 50 min. Completion: 1%
Deadline: 40 min. Completion: 3%
Deadline: 30 min. Completion: 5%
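One way to read the control loop over this job model is as a table lookup: given the current completion and the time left until the deadline, pick an allocation whose predicted remaining run time still fits. The sketch below uses the table values from the slides; the selection rule (smallest allocation that fits, largest as a fallback) and the name `choose_allocation` are simplifying assumptions, not necessarily the policy used in Jockey.

```python
# Predicted remaining run time in minutes, indexed by percent complete, then node count.
JOB_MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}


def choose_allocation(percent_complete, minutes_to_deadline, model=JOB_MODEL):
    """Return the smallest node count predicted to finish within the deadline."""
    row = model[percent_complete]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return max(row)  # nothing fits: fall back to the largest allocation (assumed policy)


print(choose_allocation(1, 50))  # 20
print(choose_allocation(3, 40))  # 20
print(choose_allocation(5, 30))  # 30
```

Re-running the lookup at every step is what makes the control dynamic: as the time to the deadline shrinks from 40 to 30 minutes while completion only reaches 5%, this rule moves from 20 to 30 nodes.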
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
analytic model? machine learning?
simulator
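To illustrate how a simulator can stand in for f(progress, allocation), the sketch below replays the tasks that remain in a job on a fixed number of nodes and reports when the last one finishes. It is a toy under stated assumptions: task durations are fixed and known, stages run one after another with a barrier between them, and each node runs one task at a time. The function `simulate` and its input format are hypothetical.

```python
import heapq


def simulate(remaining_stages, nodes):
    """Estimate remaining run time for a job on `nodes` nodes.

    remaining_stages: list of stages, each a list of task durations (minutes).
    Tasks in a stage are assigned greedily to the earliest free node; the
    next stage starts only after the previous one has fully finished.
    """
    clock = 0.0
    for tasks in remaining_stages:
        free_at = [clock] * nodes  # time at which each node becomes free
        heapq.heapify(free_at)
        stage_end = clock
        for duration in tasks:
            start = heapq.heappop(free_at)
            end = start + duration
            heapq.heappush(free_at, end)
            stage_end = max(stage_end, end)
        clock = stage_end
    return clock


job = [[4, 4, 4, 4], [6, 6]]
for n in (1, 2, 4):
    print(n, simulate(job, n))
```

Running such a simulation for many progress points and allocations ahead of time would yield a lookup table of predicted remaining run times, one cell per (progress, allocation) pair.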
JOCKEY

| Problem | Solution |
|---|---|
| Pipeline complexity | Use a simulator |
| Noisy environment | Dynamic control |
![Page 81: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/81.jpg)
Jockey in Action 81
![Page 82: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/82.jpg)
Jockey in Action 82
• Real job
![Page 83: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/83.jpg)
Jockey in Action 83
• Real job • Production cluster
![Page 84: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/84.jpg)
Jockey in Action 84
• Real job
• Production cluster
• CPU load: ~80%
![Page 85: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/85.jpg)
Jockey in Action
85
![Page 86: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/86.jpg)
Jockey in Action
86
![Page 87: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/87.jpg)
Jockey in Action
87
![Page 88: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/88.jpg)
Jockey in Action
88
Initial deadline: 140 minutes
![Page 89: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/89.jpg)
Jockey in Action
89
New deadline: 70 minutes
![Page 90: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/90.jpg)
Jockey in Action
90
New deadline: 70 minutes
Release resources due to excess pessimism
![Page 91: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/91.jpg)
Jockey in Action
91
“Oracle” allocation = Total allocation-hours / Deadline
![Page 92: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/92.jpg)
Jockey in Action
92
“Oracle” allocation = Total allocation-hours / Deadline
Available parallelism less than allocation
![Page 93: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/93.jpg)
Jockey in Action
93
“Oracle” allocation = Total allocation-hours / Deadline
Allocation above oracle
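The oracle formula on these slides can be worked through in a few lines. This is a minimal sketch, not the authors' evaluation code: the function names, the `(duration_hours, tokens)` interval format, and the definition of "fraction above oracle" as excess allocation-hours over requested allocation-hours are all assumptions made for illustration, and the numbers are invented.

```python
def oracle_allocation(total_allocation_hours, deadline_hours):
    """Constant allocation that spends the given allocation-hours exactly by the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline must be positive")
    return total_allocation_hours / deadline_hours


def fraction_above_oracle(requested, oracle):
    """Share of requested allocation-hours that sits above the oracle level.

    requested: list of (duration_hours, tokens) intervals (assumed format).
    """
    total = sum(d * t for d, t in requested)
    excess = sum(d * max(t - oracle, 0.0) for d, t in requested)
    return excess / total if total else 0.0


# Illustrative numbers only: a job needing 60 allocation-hours, 1-hour deadline.
oracle = oracle_allocation(60.0, 1.0)
requested = [(0.5, 100), (0.5, 40)]  # 100 tokens for 30 min, then 40 tokens
share = fraction_above_oracle(requested, oracle)
```

Here the first interval sits 40 tokens above the oracle level for half an hour, so 20 of the 70 requested allocation-hours count as excess.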
![Page 94: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/94.jpg)
Evaluation 94
![Page 95: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/95.jpg)
Evaluation 95
• Production cluster
![Page 96: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/96.jpg)
Evaluation 96
• Production cluster
• 21 jobs
![Page 97: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/97.jpg)
Evaluation 97
• Production cluster
• 21 jobs
• SLO met?
![Page 98: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/98.jpg)
Evaluation 98
• Production cluster
• 21 jobs
• SLO met?
• Cluster impact?
![Page 99: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/99.jpg)
Evaluation
99
![Page 100: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/100.jpg)
Evaluation 100
[Plot: x-axis "job completion time relative to deadline", 10% to 130%. Labels: "deadline", "Jobs which met the SLO".]
![Page 101: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/101.jpg)
Evaluation 101
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Labels: "deadline", "Jobs which met the SLO".]
![Page 102: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/102.jpg)
Evaluation 102
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: Jockey. Labels: "deadline", "Jobs which met the SLO".]
Missed 1 of 94 deadlines
![Page 103: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/103.jpg)
Evaluation 103
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: Jockey. Labels: "deadline", "Jobs which met the SLO".]
![Page 104: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/104.jpg)
Evaluation 104
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: Jockey. Labels: "deadline", "Jobs which met the SLO".]
![Page 105: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/105.jpg)
Evaluation 105
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: Jockey. Labels: "deadline", "Jobs which met the SLO", "1.4x".]
![Page 106: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/106.jpg)
Evaluation 106
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: max allocation, Jockey. Labels: "deadline", "Jobs which met the SLO".]
Allocated too many resources
Missed 1 of 94 deadlines
![Page 107: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/107.jpg)
Evaluation 107
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: max allocation, Jockey, Allocation from simulator. Labels: "deadline", "Jobs which met the SLO".]
Allocated too many resources
Simulator made good predictions: 80% finish before deadline
Missed 1 of 94 deadlines
![Page 108: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/108.jpg)
Evaluation 108
[Plot: CDF of job completion time relative to deadline (x-axis 10% to 130%, y-axis 0% to 100%). Series: max allocation, Jockey, Allocation from simulator, Control loop only. Labels: "deadline", "Jobs which met the SLO".]
Allocated too many resources
Simulator made good predictions: 80% finish before deadline
Control loop is stable and successful
Missed 1 of 94 deadlines
![Page 109: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/109.jpg)
Evaluation
109
![Page 110: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/110.jpg)
Evaluation 110
[Plot: x-axis "fraction of allocation above oracle", 0% to 100%.]
![Page 111: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/111.jpg)
Evaluation 111
[Plot: fraction of deadlines missed (y-axis, 0% to 20%) versus fraction of allocation above oracle (x-axis, 0% to 100%).]
![Page 112: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/112.jpg)
Evaluation 112
[Plot: fraction of deadlines missed (y-axis, 0% to 20%) versus fraction of allocation above oracle (x-axis, 0% to 100%). Series: Jockey.]
![Page 113: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/113.jpg)
Evaluation 113
[Plot: fraction of deadlines missed (y-axis, 0% to 20%) versus fraction of allocation above oracle (x-axis, 0% to 100%). Series: max allocation, Jockey.]
![Page 114: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/114.jpg)
Evaluation 114
[Plot: fraction of deadlines missed (y-axis, 0% to 20%) versus fraction of allocation above oracle (x-axis, 0% to 100%). Series: Allocation from simulator, max allocation, Control loop only, Jockey.]
![Page 115: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/115.jpg)
Conclusion 115
![Page 116: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/116.jpg)
116
Data parallel jobs are complex,
![Page 117: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/117.jpg)
117
Data parallel jobs are complex, yet users demand deadlines.
![Page 118: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/118.jpg)
118
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters,
![Page 119: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/119.jpg)
119
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters, making simple models inaccurate.
![Page 120: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/120.jpg)
Jockey 120
![Page 121: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/121.jpg)
simulator
121
![Page 122: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/122.jpg)
control-loop
122
![Page 123: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/123.jpg)
123
[Figure: five labels, each reading "Deadline".]
![Page 124: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/124.jpg)
124
![Page 126: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/126.jpg)
Co-authors
• Peter Bodík (Microsoft Research)
• Srikanth Kandula (Microsoft Research)
• Eric Boutín (Microsoft)
• Rodrigo Fonseca (Brown)
Questions? 126
Andrew Ferguson [email protected]
![Page 127: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/127.jpg)
Backup Slides
127
![Page 128: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/128.jpg)
Utility Curves
Deadline
For single jobs, scale doesn’t matter
For multiple jobs, use financial penalties
128
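A utility curve of the kind this slide names can be sketched as a function of completion time that is flat up to the deadline and falls afterwards. The shape, the `penalty_per_min` slope, and the function name are assumptions for illustration; the slide does not give the actual curve.

```python
def deadline_utility(completion_min, deadline_min, scale=1.0, penalty_per_min=1.0):
    """Flat utility until the deadline, then a linear penalty (assumed shape)."""
    if completion_min <= deadline_min:
        return scale
    late_by = completion_min - deadline_min
    return scale * (1.0 - penalty_per_min * late_by)


# For a single job, multiplying the curve by a positive constant does not
# change which completion time is preferred.
candidates = [60, 75, 90]
best_small = max(candidates, key=lambda t: deadline_utility(t, 70, scale=1.0))
best_large = max(candidates, key=lambda t: deadline_utility(t, 70, scale=100.0))
```

Both `best_small` and `best_large` pick the same completion time, which is one way to read "scale doesn't matter" for single jobs. Across several jobs the relative scales do matter, which is where a common unit such as a financial penalty comes in.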
![Page 129: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/129.jpg)
129
Jockey Resource allocation control loop
1. Slack
2. Hysteresis
3. Dead Zone
Prediction Run Time Utility
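The three refinements listed on this slide can be sketched as one control step. This is one plausible reading, not the talk's implementation: slack is taken to mean inflating predictions, hysteresis to mean moving only part of the way toward the target, and the dead zone to mean not raising the allocation for small predicted lateness. `ControlParams`, `control_step`, `predict_min`, and all default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControlParams:
    slack: float = 1.2          # inflate predictions by this factor (assumed value)
    hysteresis: float = 0.2     # share of the gap to the target closed per step
    dead_zone_min: float = 3.0  # predicted lateness (minutes) tolerated before raising


def control_step(current, predict_min, deadline_min, candidates, params):
    """Return the next allocation given the current one.

    predict_min(a) is a hypothetical callable returning the predicted
    completion time in minutes at allocation a.
    """
    def padded(a):
        return params.slack * predict_min(a)  # 1. slack: be pessimistic

    feasible = [a for a in sorted(candidates) if padded(a) <= deadline_min]
    target = feasible[0] if feasible else max(candidates)

    # 3. dead zone: do not raise the allocation for small predicted lateness
    lateness = padded(current) - deadline_min
    if target > current and lateness < params.dead_zone_min:
        return float(current)

    # 2. hysteresis: move only part of the way toward the target
    return current + params.hysteresis * (target - current)
```

With a toy predictor such as `lambda a: 1000.0 / a` and a 70-minute deadline, a job at 10 tokens is moved gradually toward 20 rather than jumping there, and a job only marginally behind schedule is left alone.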
![Page 130: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/130.jpg)
130
Cosmos
Resource sharing in Cosmos
• Resources are allocated with a form of fair sharing across business groups and their jobs. (Like Hadoop FairScheduler or CapacityScheduler)
• Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token. Token released on task completion.
• A token is a guaranteed share of CPU and memory
• To increase efficiency, unused tokens are re-allocated to jobs with available work
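The token scheme described above can be sketched as follows. The slide does not say how spare tokens are divided among jobs, so the round-robin hand-out here is an assumption, as are the function name and the dictionary inputs.

```python
def share_tokens(guaranteed, demand):
    """Return tokens in use per job.

    guaranteed: dict job -> guaranteed tokens (set by cluster policy).
    demand: dict job -> number of tasks that could run right now.
    Spare tokens are handed out round-robin (an assumed policy).
    """
    used = {j: min(guaranteed[j], demand.get(j, 0)) for j in guaranteed}
    spare = sum(guaranteed[j] - used[j] for j in guaranteed)
    while spare > 0:
        hungry = [j for j in sorted(guaranteed) if demand.get(j, 0) > used[j]]
        if not hungry:
            break  # nobody has available work; leave tokens idle
        for j in hungry:
            if spare == 0:
                break
            used[j] += 1  # one token per running or initializing task
            spare -= 1
    return used


usage = share_tokens(
    guaranteed={"a": 10, "b": 10, "c": 10},
    demand={"a": 2, "b": 20, "c": 13},
)
```

Job `a` leaves eight of its tokens unused, and they flow to `b` and `c`, which have work waiting. A job relying on such spare tokens gets no guarantee that they will still be available later.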
![Page 131: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/131.jpg)
131
Jockey
Progress indicator
• Can use many features of the job to build a progress indicator
• Earlier work (ParaTimer) concentrated on fraction of tasks completed
• Our indicator is very simple, but we found it performs best for Jockey’s needs
Total vertex initialization time
Total vertex run time
Fraction of completed vertices
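One way the three quantities named on this slide could combine is to weight each stage's completed fraction by that stage's total initialization plus run time. The slide does not give the formula, so this weighting, the function name, and the per-stage dictionary keys are assumptions.

```python
def progress(stages):
    """Weighted progress in [0, 1].

    stages: list of dicts with keys "init_time", "run_time" (totals for the
    stage, e.g. from a job profile) and "completed_fraction" (0.0 to 1.0).
    """
    total = sum(s["init_time"] + s["run_time"] for s in stages)
    if total == 0:
        return 0.0
    done = sum(
        s["completed_fraction"] * (s["init_time"] + s["run_time"]) for s in stages
    )
    return done / total


example = progress([
    {"init_time": 10, "run_time": 90, "completed_fraction": 1.0},
    {"init_time": 20, "run_time": 180, "completed_fraction": 0.5},
])
```

A plain count of completed vertices would treat every stage alike, whereas this weighting lets long-running stages contribute more to the reported progress.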
![Page 132: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/132.jpg)
132
Comparison with ARIA
• ARIA uses analytic models
• Designed for 3 stages: Map, Shuffle, Reduce
• Jockey’s control loop is robust due to control-theory improvements
• ARIA tested on small (66-node) cluster without a network bottleneck
• We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.
![Page 133: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/133.jpg)
133
Jockey
Latency prediction: C(p, a)
• Event-based simulator
– Same scheduling logic as actual Job Manager
– Captures important features of job progress
– Does not model input size variation or speculative re-execution of stragglers
– Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
• Analytic model
– Inspired by Amdahl’s Law: T = S + P/N
– S is remaining work on critical path, P is all remaining work, N is number of machines
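The analytic model T = S + P/N can be evaluated directly, and inverted to find how many machines a given amount of remaining time would need. A minimal sketch; the function names and the example numbers are illustrative and not from the talk.

```python
import math


def predicted_remaining_time(critical_path, total_work, machines):
    """T = S + P/N: S is remaining critical-path work, P all remaining work."""
    if machines <= 0:
        raise ValueError("machines must be positive")
    return critical_path + total_work / machines


def min_machines_for_deadline(critical_path, total_work, time_left):
    """Smallest N with S + P/N <= time_left, or None if no N is enough."""
    if time_left <= critical_path:
        return None  # even unlimited parallelism cannot beat the critical path
    return math.ceil(total_work / (time_left - critical_path))


t = predicted_remaining_time(critical_path=10, total_work=600, machines=20)
n = min_machines_for_deadline(critical_path=10, total_work=600, time_left=40)
```

With S = 10 and P = 600 (in minutes of work), 20 machines give T = 40 minutes. The critical-path term S is a floor that no allocation can reduce.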
![Page 134: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/134.jpg)
134
Jockey
Resource allocation control loop
• Executes in Dryad’s Job Manager
• Inputs: fraction of completed tasks in each stage, time job has spent running, utility function, precomputed values (for speedup)
• Output: Number of tokens to allocate
• Improved with techniques from control-theory
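The inputs and output listed above suggest a selection step along these lines: for each candidate allocation, look up the predicted remaining time, add the time already spent, score the result with the utility function, and keep the smallest allocation that scores best. This is a sketch under assumptions: `choose_tokens`, `remaining_time`, and the toy predictor and utility below are hypothetical, and job progress is collapsed to a single number rather than a per-stage fraction.

```python
def choose_tokens(progress, elapsed_min, utility, remaining_time, allocations):
    """Return the smallest allocation whose predicted completion maximizes utility.

    remaining_time(progress, a): hypothetical lookup into precomputed
    predictions of remaining minutes at progress `progress` and allocation `a`.
    utility(t): utility of finishing after t minutes in total.
    """
    best_alloc, best_utility = None, float("-inf")
    for a in sorted(allocations):
        predicted_total = elapsed_min + remaining_time(progress, a)
        u = utility(predicted_total)
        if u > best_utility:  # strict comparison: ties keep the smaller allocation
            best_alloc, best_utility = a, u
    return best_alloc


# Toy inputs, invented for illustration.
toy_remaining = lambda p, a: (1.0 - p) * 1000.0 / a
toy_utility = lambda t: 1.0 if t <= 70 else 1.0 - (t - 70)
tokens = choose_tokens(0.5, 30, toy_utility, toy_remaining, [5, 10, 20, 40])
```

Both 20 and 40 tokens meet the toy deadline, and the strict comparison keeps 20, so the job does not hold more of the cluster than it needs.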
![Page 135: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/135.jpg)
Jockey offline during job runtime
job profile
135
![Page 136: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/136.jpg)
Jockey
simulator
offline during job runtime
job profile
136
![Page 137: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/137.jpg)
Jockey
simulator
offline during job runtime
job stats
job profile
137
![Page 138: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/138.jpg)
Jockey
simulator
offline during job runtime
job stats latency predictions
job profile
138
![Page 139: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/139.jpg)
Jockey
simulator
offline during job runtime
utility function job stats latency
predictions
job profile
139
![Page 140: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric](https://reader030.vdocuments.net/reader030/viewer/2022041015/5ec6f125726f4b16e31d6e77/html5/thumbnails/140.jpg)
Jockey
simulator
offline during job runtime
running job
utility function job stats latency
predictions
resource allocation control loop
job profile
140