QoS for accelerator-rich “fat” nodes
TRANSCRIPT
•Two types of tasks on nearly all systems
• Best effort (BE)
• Latency critical (LC)
2(c) Mattan Erez
– Latency critical
•User-facing: UI, games, video, …
•System-facing: sync, comm, …
– Best-effort
•User background tasks
•Learning, optimizations, …
•System management
– Latency sensitive
•CPU-oriented tasks
•Tightly-iterating algorithms
– Latency insensitive
•Graphics
•Big DNNs…
•Warehouse-scale computers (WSCs)
– User-facing (eventually) tasks are latency sensitive
– But WSC is wasted if no background tasks
http://e.huawei.com/uk/case-studies/global/2016/201609271019
•Traditional QoS goals:
– Fairness
– Completion-time targets
•No missed deadlines (hard realtime)
•Often translated to strict scheduling constraints
•Better QoS goals:
– Unfairness (priority)
– Completion-time targets
•Percentile deadline targets
•Tail statistics
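A percentile deadline target can be checked against measured completion times roughly as follows. This is a minimal sketch, not taken from the slides: the function names, the nearest-rank percentile definition, and the sample data are all assumptions chosen for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_percentile_target(samples, p, deadline):
    """True if the p-th percentile completion time is within the deadline."""
    return percentile(samples, p) <= deadline


def tail_ratio(samples, p=99):
    """Tail statistic: p-th percentile completion time relative to the median."""
    return percentile(samples, p) / percentile(samples, 50)


# Hypothetical completion times (ms) with one slow outlier.
completion_ms = [10, 11, 10, 12, 11, 10, 13, 11, 10, 40]
print(meets_percentile_target(completion_ms, 90, 15))  # 90th percentile vs. a 15 ms deadline
print(meets_percentile_target(completion_ms, 99, 15))  # 99th percentile vs. the same deadline
print(tail_ratio(completion_ms, 99))
```

A hard-realtime goal corresponds to checking the maximum; a percentile target tolerates a bounded fraction of slow completions, which is why the tail statistic is reported alongside it.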
•Context: single node within a WSC
– Shared between LC and BE tasks
– Accelerators
– Lots of cores
– Heterogeneous memories
Goals:
(1) Meet completion-time percentile targets
(2) No significant tail expansion
(3) Maximize best-effort task throughput
(4) Prioritize best-effort tasks as needed
•BE interference with LC tasks on an accelerator
– From: Haishan Zhu et al., “Kelp” HPCA 2019
– Many thanks to Google collaborators
[Figure: node software stack with Scheduler, QoS Runtime, OS, SW/HW Interface, LC Task, BE Task, and Switch]
•Performance interference with a TPU
– High sensitivity to DRAM interference
•Despite the TPU having its own DRAM and PCIe traffic being too low
– Need hardware solutions
•Sub-millisecond interleaving of task steps
Single ML task occupies accelerator
•ML task also occupies multiple CPU cores
– Preprocessing
– Beam search
– Parameter servers
– …
•How to control DRAM interference?
– NUMA subdomains (i.e., channel partitioning)
•We tried this on Xeon; it occasionally failed to control interference (quite surprising)
•Too coarse-grained
•Accelerators are challenging, but so are
– Lots of cores
•Completion time tracking to the rescue? (e.g., Zhu and Erez, “Dirigent”, ASPLOS 2016)
– Heterogeneous memories
•Priority, rather than partitioning, looks really promising
Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems
Haishan Zhu and Mattan Erez ASPLOS 2016
•How to ensure QoS?
– Don’t assign too much work
•Shut down unused resources
•Very wasteful
– Backfill with “compatible” work
•Not latency-critical
•Hopefully not too conflicting
•In practice severe waste
– Forced to schedule conservatively
– Performance variation a big concern
•Performance variation causes wasted resources
Performance variation of latency-sensitive tasks indicates the amount of resources wasted
[Figure: probability distributions of execution time for the Standalone, Contention, and Ideal cases, with the Deadline and 1/Target Throughput marked]
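One way to read the link between variation and waste is the following back-of-the-envelope sketch. It assumes the deadline must be set at the tail of the execution-time distribution, so the provisioned throughput is 1/deadline even though a typical execution finishes sooner. The function names and the numbers are hypothetical and not from the slides.

```python
def provisioned_throughput(deadline):
    """Throughput that can be promised when every task is budgeted a full deadline."""
    if deadline <= 0:
        raise ValueError("deadline must be positive")
    return 1.0 / deadline


def wasted_fraction(mean_time, tail_time):
    """Fraction of the budgeted time left idle on average when the deadline
    is set at tail_time but executions take mean_time on average."""
    if tail_time <= 0 or mean_time < 0:
        raise ValueError("times must be positive")
    return 1.0 - mean_time / tail_time


# Hypothetical example: mean execution time 1.0, tail (deadline) 1.25.
print(provisioned_throughput(1.25))
print(wasted_fraction(1.0, 1.25))
```

Under this assumption, a narrower distribution (tail closer to the mean) directly shrinks the wasted fraction, which is the motivation for shaping the completion-time distribution.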
•Dirigent uses application information to shape completion time distribution
– Also improves scheduling quality
Execution time prediction
[Figure: timelines divided into sample segments S1, S2, S3 along a Time axis, illustrating three steps: Profiling while Running Standalone, Average Execution Time Penalty, Execution Time Prediction]
– $\Delta T$: time duration of a sample segment
– $S_n$: program progress during segment $n$
– $MA(\cdot)$: moving average
– $\bar{P}_i$: average time penalty of the $i$th segment
– $\alpha_i = \dfrac{\text{profiled\_progress}_i}{\text{measured\_progress}_i}$
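The prediction idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact Dirigent algorithm: the class name, the `window` parameter, and the way contended samples are aligned to the standalone profile are all hypothetical. It keeps a standalone profile of progress per segment ($S_n$, each of length $\Delta T$), computes the slowdown ratio $\alpha_i$ for each contended sample, and scales the remaining standalone time by a moving average of $\alpha$. The per-segment average penalty $\bar{P}_i$ is not modeled separately here.

```python
from collections import deque


class ExecutionTimePredictor:
    def __init__(self, profiled_progress, delta_t, window=4):
        self.profile = list(profiled_progress)  # S_n from the standalone run
        self.delta_t = delta_t                  # duration of one sample segment
        self.alphas = deque(maxlen=window)      # recent slowdown ratios for MA(.)
        self.elapsed = 0.0                      # contended time so far
        self.done = 0.0                         # cumulative measured progress

    def standalone_time(self, progress):
        """Standalone time needed to reach `progress`, by walking the profile."""
        t = 0.0
        for s in self.profile:
            if progress >= s:
                progress -= s
                t += self.delta_t
            else:
                t += self.delta_t * progress / s
                return t
        return t

    def observe(self, measured_progress):
        """Record the progress measured during one contended segment of length delta_t."""
        before = self.standalone_time(self.done)
        self.done += measured_progress
        after = self.standalone_time(self.done)
        self.elapsed += self.delta_t
        if after > before:
            # Equals profiled progress / measured progress at the local standalone rate.
            self.alphas.append(self.delta_t / (after - before))

    def predict_total_time(self):
        """Elapsed time plus remaining standalone time scaled by the moving-average slowdown."""
        alpha = sum(self.alphas) / len(self.alphas) if self.alphas else 1.0
        total_standalone = self.delta_t * len(self.profile)
        remaining = total_standalone - self.standalone_time(self.done)
        return self.elapsed + alpha * remaining


# Hypothetical run: four standalone segments of 100 units each; under contention
# the task makes only 50 units of progress per segment.
predictor = ExecutionTimePredictor([100, 100, 100, 100], delta_t=1.0)
predictor.observe(50)
predictor.observe(50)
print(predictor.predict_total_time())
```

With a uniform 2x slowdown the prediction is twice the standalone time; the short moving-average window is what lets the estimate follow changes in contention during a run.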
Execution time predictor accuracy
[Figure: Execution Time Prediction Relative Error over fifty consecutive executions; axes: Error (0% to 10%) and Cycle (1.2E+09 to 1.7E+09)]
• 50 consecutive executions of raytrace (FG) and RS (BG)
• Predictions are made halfway through each FG execution
Dirigent results: single LC workloads
[Figure: probability distribution of Execution Time (3.4E+09 to 4.4E+09; Probability 0% to 60%), shown as a progressive build: first Baseline and StaticFreq; then StaticBoth and DirigentFreq added; then Dirigent added]
•Tradeoff between LC throughput and BE performance
– Precise control over the range of target deadlines
– Convert LC time slack to BE performance
– Consistent QoS satisfaction rate
[Figure: Ratios (0.0 to 1.4) versus Target Execution Time (1.00x to 1.18x in steps of 0.03x) for LC Time Avg, LC Time Std, and BE Throughput; data labels shown: 1.02, 1.02, 1.05, 1.08, 1.11, 1.14, 1.16]
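Converting LC time slack into BE performance can be pictured as a simple feedback step. The sketch below is an assumption-laden illustration, not the controller from the Dirigent paper: `be_level` stands for any discrete BE performance knob (for example a frequency step), and the function name, `margin`, and step size of one level are hypothetical.

```python
def adjust_be_level(predicted_time, target_time, be_level, max_level, margin=0.02):
    """Return the next BE performance level in the range 0..max_level.

    predicted_time: predicted LC completion time for the current execution
    target_time:    LC target execution time (deadline)
    margin:         fraction of the target kept as a safety band (assumed value)
    """
    slack = (target_time - predicted_time) / target_time
    if slack < 0:
        # LC task is predicted to miss its target: throttle the BE task.
        return max(0, be_level - 1)
    if slack > margin:
        # LC task has spare slack: hand it to the BE task.
        return min(max_level, be_level + 1)
    return be_level


print(adjust_be_level(1.10, 1.00, be_level=3, max_level=5))  # predicted miss
print(adjust_be_level(0.90, 1.00, be_level=3, max_level=5))  # spare slack
```

Relaxing `target_time` gives the loop more slack to hand over, which matches the tradeoff on the slide between the LC target and BE throughput.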