QoS for Accelerator-Rich “Fat” Nodes

Mattan Erez

The University of Texas at Austin

•Two types of tasks on nearly all systems

• Best effort (BE)

• Latency critical (LC)

(c) Mattan Erez

– Latency critical

• User-facing: UI, games, video, …

• System-facing: sync, comm, …

– Best-effort

• User background tasks

• Learning, optimizations, …

• System management

– Latency sensitive

• CPU-oriented tasks

•Tightly-iterating algorithms

– Latency insensitive

• Graphics

•Big DNNs…

•Warehouse-scale computers (WSCs)

– User-facing (eventually) tasks are latency sensitive

– But WSC is wasted if no background tasks

http://e.huawei.com/uk/case-studies/global/2016/201609271019

•What does QoS mean?

•Traditional QoS goals:

– Fairness

– Completion-time targets

• No missed deadlines (hard real-time)

•Often translated to strict scheduling constraints

•Better QoS goals:

– Unfairness (priority)

– Completion-time targets

• Percentile deadline targets

•Tail statistics

•Context: single node within a WSC

– Shared between LC and BE tasks

– Accelerators

– Lots of cores

– Heterogeneous memories

Goals:

(1) Meet completion-time percentile targets

(2) No significant tail expansion

(3) Maximize best-effort task throughput

(4) Prioritize best-effort tasks as needed
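Goals (1) and (2) can be made concrete with a small check on measured completion times. This is a minimal sketch, assuming a nearest-rank percentile; the function names (`percentile`, `meets_target`, `tail_expansion`) and the choice of the 99th percentile are illustrative assumptions, not taken from the talk.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def meets_target(samples, p, deadline):
    """Goal (1): the p-th percentile completion time is within the deadline."""
    return percentile(samples, p) <= deadline

def tail_expansion(standalone, shared, p=99):
    """Goal (2): ratio of the shared-node tail to the standalone tail (hypothetical metric)."""
    return percentile(shared, p) / percentile(standalone, p)
```

A ratio near 1.0 from `tail_expansion` would indicate that co-running BE tasks did not stretch the LC tail.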

•BE interference with LC tasks on an accelerator

– From: Haishan Zhu et al., “Kelp” HPCA 2019

– Many thanks to Google collaborators

[Figure: node software stack with Scheduler, QoS Runtime, OS, SW/HW Interface, LC Task, BE Task, and Switch]

•Performance interference with a TPU

– High sensitivity to DRAM interference

• Despite the TPU having its own DRAM, and PCIe traffic being too low to explain the slowdown

– Need hardware solutions

• Sub-millisecond interleaving of task steps

Single ML task occupies accelerator

•ML task also occupies multiple CPU cores

– Preprocessing

– Beam search

– Parameter servers

– …

•DRAM interference is dominant

– Aggressor microbenchmarks

•How to control DRAM interference?

– NUMA subdomains (i.e., channel partitioning)

• On the Xeon we tried, this occasionally failed to control interference (quite surprising)

• Too coarse-grained

Kelp Mechanism 1: Manage prefetchers

Kelp Mechanism 2: Backfilling

•Result summary – Kelp works pretty well

– Tradeoffs hard to visualize, so

•Challenges remain:

– Multi-socket can lead to worse interference

[Figure panels: CNN1, CNN2]

•Accelerators are challenging, but so are

– Lots of cores

• Completion-time tracking to the rescue? (e.g., Zhu and Erez, “Dirigent”, ASPLOS 2016)

– Heterogeneous memories

• Priority over partitioning looks really promising

Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems

Haishan Zhu and Mattan Erez ASPLOS 2016

•How to ensure QoS?

– Don’t assign too much work

• Shut down unused resources

•Very wasteful

– Backfill with “compatible” work

• Not latency-critical

•Hopefully not too conflicting

•In practice, severe waste

– Forced to schedule conservatively

– Performance variation a big concern

•Performance variation causes wasted resources

Performance variation of latency-sensitive tasks indicates the amount of resources wasted

[Figure: probability distribution of execution time for the Standalone, Contention, and Ideal cases, with the deadline marked at 1 / Target Throughput]

•Dirigent uses application information to shape completion time distribution

– Also improves scheduling quality

Execution time prediction

[Figure: three timelines of segments S1, S2, S3: Profiling while Running Standalone, Average Execution Time Penalty, and Execution Time Prediction]

$\Delta T$: time duration of a sample segment

$S_n$: program progress during segment $n$

$\mathrm{MA}(\cdot)$: moving average

$\bar{P}_i$: average time penalty of the $i$th segment

$\alpha_i = \dfrac{\text{profiled\_progress}_i}{\text{measured\_progress}_i}$
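The prediction idea can be sketched in a few lines. This is a simplified illustration, not the formulation from the Dirigent paper: it assumes a standalone profile of progress per segment of length ΔT (with positive progress in every segment), estimates a per-segment slowdown in the spirit of α (standalone progress rate over measured progress rate), and scales the remaining standalone time by a moving average of that slowdown. The class name, the window size, and the linear interpolation are all assumptions, and the average penalty term is not modeled.

```python
from bisect import bisect_right
from collections import deque

class ExecutionTimePredictor:
    """Hypothetical predictor: elapsed time + MA(slowdown) * remaining standalone time."""

    def __init__(self, profiled_progress, delta_t, window=4):
        # profiled_progress[n]: progress made in segment n when running standalone.
        self.delta_t = delta_t
        self.cumulative = [0.0]
        for s in profiled_progress:
            self.cumulative.append(self.cumulative[-1] + s)
        self.standalone_time = delta_t * len(profiled_progress)
        self.slowdowns = deque(maxlen=window)  # recent slowdown samples for the moving average
        self.elapsed = 0.0
        self.progress = 0.0

    def _standalone_time_at(self, progress):
        # Standalone time needed to reach `progress`, interpolating within a segment.
        progress = min(max(progress, 0.0), self.cumulative[-1])
        n = min(bisect_right(self.cumulative, progress) - 1, len(self.cumulative) - 2)
        seg = self.cumulative[n + 1] - self.cumulative[n]
        frac = (progress - self.cumulative[n]) / seg if seg > 0 else 0.0
        return (n + frac) * self.delta_t

    def observe(self, measured_progress):
        """Record the progress measured during one runtime segment of length delta_t."""
        before = self._standalone_time_at(self.progress)
        self.progress += measured_progress
        after = self._standalone_time_at(self.progress)
        self.elapsed += self.delta_t
        if after > before:
            # Time spent under contention divided by standalone time for the same progress.
            self.slowdowns.append(self.delta_t / (after - before))

    def predict_total_time(self):
        alpha = sum(self.slowdowns) / len(self.slowdowns) if self.slowdowns else 1.0
        remaining = self.standalone_time - self._standalone_time_at(self.progress)
        return self.elapsed + alpha * remaining
```

For example, with a uniform four-segment profile and a task observed running at half its standalone rate for two segments, the sketch predicts a total time of twice the standalone time.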

Execution time predictor accuracy

[Figure: Execution Time Prediction Relative Error. Relative error (0% to 10%) and cycle count (1.2E+09 to 1.7E+09) across fifty consecutive executions]

• 50 consecutive executions of raytrace (FG) and RS (BG)

• Predictions are made halfway through each FG execution

•In future, should also use application-level hints

Dirigent results: single LC workloads

[Figure: probability distribution of execution time (3.4E+09 to 4.4E+09; probability 0% to 60%) for Baseline, StaticFreq, StaticBoth, DirigentFreq, and Dirigent]

•Tradeoff between LC throughput and BE performance

– Precise control over the range of target deadlines

– Convert LC time slack to BE performance

– Consistent QoS satisfaction rate
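Converting LC time slack into BE performance can be pictured as a simple feedback step that runs once per control interval. The sketch below is an assumption-laden illustration, not Dirigent's controller: it assumes a single discrete BE resource knob (for instance a frequency level), a predicted LC completion time from some predictor, and a hypothetical 5% safety margin.

```python
def adjust_be_level(predicted_time, target_time, be_level, max_level, margin=0.05):
    """Return the next BE resource level given the predicted LC completion time.

    All names and the margin are hypothetical; level 0 is the most throttled setting.
    """
    if predicted_time > target_time:
        # LC task is predicted to miss its target: take resources away from BE.
        return max(0, be_level - 1)
    if predicted_time < target_time * (1.0 - margin):
        # LC task has slack beyond the margin: hand it to BE.
        return min(max_level, be_level + 1)
    # Within the margin: hold the current setting to avoid oscillation.
    return be_level
```

Raising the target execution time widens the slack the controller can hand to BE tasks, which is the tradeoff the bullets above describe.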

[Figure: ratios (0.0 to 1.4) of LC Time Avg, LC Time Std, and BE Throughput versus target execution time (1.00x to 1.18x); data labels: 1.02, 1.02, 1.05, 1.08, 1.11, 1.14, 1.16]
