Elfen SchedulingFine-Grain Principled Borrowing from
Latency-Critical Workloads using Simultaneous Multithreading (SMT)
Xi YangAustralian National University
Stephen M BlackburnAustralian National University
Kathryn S McKinleyMicrosoft Research
Tail Latency Matters
2
Two second slowdown reduced revenue/user by 4.3%.
[Eric Schurman, Bing]
400 millisecond delay decreased searches/user by 0.59%.
[Jack Brutlag, Google]
Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”
“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”
0
50
100
150
200
0 20 40 60 80 100 120 140 160 180
Late
ncy
(ms)
RPS
Lucene alone 50%ileLucene alone 95%ileLucene alone 99%ile
High Responsiveness—Low Utilization
3
Intel Xeon-D 1540 Broadwell
LC
Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”
“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”
1 core, no SMT
Service Level Objective100ms SLO
0
50
100
150
200
0 20 40 60 80 100 120 140 160 180
Late
ncy
(ms)
RPS
Lucene alone 99%ile
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
0 20 40 60 80 100 120 140 160 180
Uti
lizat
ion
/ no
SM
T
RPS
34% w/ SMT
67% no SMT
LC
Soak up Slack with Batch?
4
LC batch
core core core core
shared cache
Co-running on different coresSMT turned off
responsiveness requires idle coresbecause OS descheduling is slow
LC idle LCidlecorebatchidleLC
shared cache shared cache
core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2
Co-running on different coresSMT turned off
Co-running on same core in SMT lanes
0
100
200
300
400
500
600
10 20 30 40 50 60 70 80 90 100
99%i
le la
tenc
y (m
s)
RPS
Lucene alonewith IPC 1.0with IPC 0.01
SMT Co-Runner
5
Even IPC 0.01 violates SLO at low load!
while(1);
while(1) {movnti();mfence();
}
IPC 1.0
IPC 0.01
1 core, 2 SMT lanes
SLO
Great utilization!
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20 30 40 50 60 70 80 90 100
Uti
lizat
ion
/ no
SM
T
RPS
6
FunctionalUnits
Load StoreQueue
IssueLogicLanes
timelucene
Simultaneous Multithreading OFF
7
lane 1
FunctionalUnits
Load StoreQueue
IssueLogicLanes
Dynamically shared
Staticallypartitioned
Round-robinshared
timelucene
IPC 0.01
Active SMT lanes share critical resources
lane 2
Simultaneous Multithreading ON
Principled Borrowing
8
time
busy idle
Batch borrows hardware when LC is idle
Batch releases hardware when LC is busy
Can we implement principled borrowing on current hardware?
LC
batch
Hardware is Ready — Software is Not
9
time
Batch lane calls “mwait”
Thread sleeps, releasing hardware to OS (~2K cycles)
OS supports thread sleeping, but not hardware sleepingrelease SMT hardware to other lane
OS schedules batch lane with any ready job
LC
batch
nanonap()
10
Semantics
Thread invoking nanonap releases SMT hardware without releasing SMT context
OS cannot schedule hardware context
OS can interrupt and wakeup thread
nanonap()
11
per_cpu_variable: nap_flag;void nanonap() {
enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();
}
timeLC
batch
nanonap
Elfen Scheduler
12
Instrument batch workloads todetect LC threads & nap
Bind latency-critical threads to LC laneBind batch threads to batch lane
No change to latency-critical threads
Elfen Scheduler
13
time
12
1 Batch thread borrows resources, continuously checks LC lane status
2 LC starts, batch calls nanonap() to release SMT hardware resources
OS touches nap_flag to wake up batch thread
/* fast path check injected into method body */check:if (!request_lane_idle)
slow_path();
slow_path() {nanonap(); }
1
2
/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {
cpu_task_map[thiscpu] = T; }idle_task() { // wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......
}
LC
batch
3
3
3
nanonap()
0
20
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%i
le la
tenc
y (m
s)
RPS
w antlr
w bloat
w eclipse
w fop
w hsqldb
w jython
w luindex
w lusearch
w pmd
w xalan
Lucene alone
Results: Borrow Idle
14
1 core, 2 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20 30 40 50 60 70 80 90 100
Uti
lizat
ion
/ no
SM
T
RPS
increased utilization 10x - 1.5x
0
20
40
60
80
100
120
140
200 400 600 800 1000
99%i
le la
tenc
y (m
s)
RPS
7 cores, 2x7 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 400 600 800 1000
Uti
lizat
ion
/ no
SM
T
RPS
4x - 0.19x
same latency!
Beyond Mutual Exclusion
15
time
Mutual exclusion
LC
batch
Concurrent co-running with a budget
• Fixed budget policy• Refresh budget policy• Dynamic budget policy
• Borrow idle policy
nanonap nanonap nanonap
16
0
20
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%i
le la
tenc
y (m
s)
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20 30 40 50 60 70 80 90 100
Uti
lizat
ion
/ no
SM
T
RPS
0
20
40
60
80
100
120
140
200 400 600 800 1000
99%i
le la
tenc
y (m
s)RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 400 600 800 1000
Uti
lizat
ion
/ no
SM
T
RPS
Borrow idle on 1 core Borrow idle on 7 cores Dynamic budget 7 cores
0
20
40
60
80
100
120
140
200 400 600 800 1000
99%i
le la
tenc
y (m
s)
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 400 600 800 1000
Uti
lizat
ion
/ no
SM
T
RPS
10x - 0.2x
Conclusion
17
Principled borrowing for SMT
nanonap() releases STM hardware without giving it to OS
Elfen scheduler implementation & policies
90% to 25% increase in utilization
Same tail latency!
Thank you!
https://github.com/elfenscheduler
BACKUP SLIDES
18
0
50
100
150
200
0 20 40 60 80 100 120 140 160 180
Late
ncy
(ms)
RPS
50%ile Lucene alone95%ile Lucene alone99%ile Lucene alone
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
0 20 40 60 80 100 120 140 160 180
Uti
lizat
ion
/ no
SM
T
RPS
High Responsiveness—Low Utilization
19
Intel Xeon-D 1540 Broadwell
67% no SMT
LC
34% w/ SMT
Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”
“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”
1 core, no SMT
Service Level Objective100ms SLO
LC
Results: Borrow Idle
20
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%
ile la
tenc
y (m
s)
RPS
Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan
1 core, 2 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20
30 40
50 60
70 80
90 100
Utilizatio
n /
no S
MT
RPS
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
7 cores, 2x7 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
increased utilization 90% to 37%
increased utilization 80% to 10%
21
Borrow idle on 1 cores
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%
ile la
tenc
y (m
s)
RPS
Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20
30 40
50 60
70 80
90 100
Utilization
/ n
o S
MT
RPS
Borrow idle on 7 cores Dynamic budget 7 cores
increased utilization 100% to 10%
Result
22
Result
23
BACKUP SLIDES
24
Tail Latency Matters
25
Two second slowdown reduced revenue/user by 4.3%.
[Eric Schurman, Bing]
400 millisecond delay decreased searches/user by 0.59%.
[Jack Brutlag, Google] p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines”
“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”
0
50
100
150
200
20 40 60 80 100 120
140 160
180
Late
ncy
(m
s)
RPS
50%ile
95%ile
99%ile
0
50
100
150
200
20 40 60 80 100 120
140 160
180
Late
ncy
(m
s)
RPS
50%ile
High Responsiveness—Low Utilization
26
0
50
100
150
200
20 40 60 80 100 120 140 160 180
Late
ncy
(ms)
RPS
50%ile95%ile99%ile
Lucene search benchmark on one core of an Intel Xeon-D 1540 Broadwell.
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
20 40
60 80
100 120
140 160
180
Utilizatio
n /
no S
MT
RPS
67% no SMT
LC
34% w/ SMT
p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines”
“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”
Soak up Slack with Batch?
27
core core core core
shared cache
LC batch
shared cache shared cache
Co-running on different cores SMT turned off Co-running on same core in SMT lanes
LC idle core LCbatch core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2
SMT Co-Runner
28
0
100
200
300
400
500
600
10 20 30 40 50 60 70 80 90 100
99%
ile la
tenc
y (m
s)
RPS
Lucene alonewith IPC 0.01with IPC 1
Even IPC 0.01 violates SLO at low load!
while(1);
while(1) {movnti();mfence();
}
IPC 1.0
IPC 0.01
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20
30 40
50 60
70 80
90 100
Utilization / n
o S
MT
RPS
Even Low IPC Co-Runner is Destructive
29
LC
FunctionalUnits
Load StoreQueue
IssueLogicLanes
Dynamically shared
Staticallypartitioned
Round-robinshared
timelucene
IPC 0.01
Critical resources are shared by active SMT lanes.
batch
Principled Borrowing
30
time
busy idle
Batch lane borrows hardware resources when LC lane is idle
Batch lane releases hardware resources when LC lane is busy
Can we implement principled borrowing on current hardware?
LC
batch
Hardware is Ready — Software is Not
31
time
Batch lane calls “mwait”
Thread sleeps, releasing hardware to OS (~2K cycles)
When LC lane is busy, batch lane must do nothing (nap)
OS schedules batch lane with any ready job
LC
batch
nanonap()
32
Thread nap semantics
Put SMT hardware lane in idle state
OS cannot schedule hardware context
OS can interrupt and wakeup thread
nanonap()
33
per_cpu_variable: nap_flag;void nanonap() {
enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();
}
time
batch lane naps
LC
batch
Elfen Scheduler
No change to latency critical threads
Bind latency critical threads to LC laneBind batch threads to batch lane
Instrument batch workloads to detect LC threads quickly & nap
34
Elfen Scheduler
35
time
1 2
3
1 Batch thread borrows resources continuously checks LC lane status
2 LC starts, batch calls nanonap() to release SMT hardware context
3 OS touches nap_flag to wake up batch thread
/* fast path check injected into method body */check:if (!request_lane_idle)
slow_path();
slow_path() { nanonap(); }
1
2
/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {
cpu_task_map[thiscpu] = T; }idle_task() {
// wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......
}
3
LC
batch
Results: Borrow Idle
36
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%
ile la
tenc
y (m
s)
RPS
Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan
1 core, 2 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20
30 40
50 60
70 80
90 100
Utilization
/ n
o S
MT
RPS
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
7 cores, 2x7 SMT lanes
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
increased utilization 90% to 37%
increased utilization 80% to 10%
Beyond Mutual Exclusion
37
time
Mutual exclusion
LC
batch
Concurrent co-running with a budget
• Fixed budget policy• Refresh budget policy• Dynamic budget policy
• Borrow idle policy
nanonap nanonap nanonap
38
Borrow idle on 1 cores
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
40
60
80
100
120
140
200 300 400 500 600 700 800 900 10001100
RPS
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
200 300
400 500
600 700
800 900
1000 1100
RPS
40
60
80
100
120
140
10 20 30 40 50 60 70 80 90 100
99%
ile la
tenc
y (m
s)
RPS
Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan
0
0.25
0.5
0.75
1
1.25
1.5
1.75
2
10 20
30 40
50 60
70 80
90 100
Utilization
/ n
o S
MT
RPS
Borrow idle on 7 cores Dynamic budget 7 cores
increased utilization 100% to 10%
Conclusion
39
Principled borrowing for SMT
nanonap() releases STM hardware without giving it to OS
Elfen scheduler implementation & policies
90% to 25% increase in utilization
Same tail latency!?
Thank you!
https: //github.com/elfenscheduler