TRANSCRIPT
Less is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda
Latency in Data Centers
• Latency is becoming a primary performance metric in data centers
• Low-latency applications
  – High-frequency trading
  – High-performance computing
  – Large-scale web applications
  – RAMClouds (want < 10μs RPCs)
• Desire predictable low-latency delivery of individual packets
2
Why Does Latency Matter?

[Diagram: a traditional application keeps its app logic and data structures on one machine; a large-scale web application fans a single request about Alice ("Who does she know? What has she done?") out across many App Logic servers holding friends (e.g., Minnie, Eric), pics, videos, and apps.]
• Latency limits the data access rate
  – Fundamentally limits applications
• Possibly 1000s of RPCs per operation
  – Microseconds matter, even at the tail (e.g., 99.9th percentile)
3
Reducing Latency
• Software and hardware are improving
  – Kernel bypass, RDMA; RAMCloud: software processing ~1μs
  – Low-latency switches forward packets in a few 100ns
  – Baseline fabric latency (propagation, switching) under 10μs is achievable.
• Queuing delay: random and traffic-dependent
  – Can easily reach 100s of microseconds or even milliseconds
  – One 1500B packet = 12μs @ 1Gbps
Goal: Reduce queuing delays to zero.
4
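The 12μs figure above follows directly from serialization delay; a quick check (the 100-packet queue in the final comment is an illustrative assumption, not a measurement from the talk):

```python
def serialization_delay_us(packet_bytes, link_bps):
    """Time to put one packet on the wire, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

# One 1500B packet at 1 Gbps takes ~12 us to serialize.
print(serialization_delay_us(1500, 1e9))
# Just 100 such packets queued ahead of you already adds ~1.2 ms of delay,
# which is how queuing dwarfs a sub-10us baseline fabric latency.
```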
Data Center Workloads
• Short messages [100B–10KB] → need low latency
• Large flows [1MB–100MB] → need high throughput
Low latency AND high throughput: we want baseline fabric latency AND high throughput.
5
Why do we need buffers?
• Main reason: to create “slack”
  – Handle temporary oversubscription
  – Absorb TCP’s rate fluctuations as it discovers path bandwidth
• Example: bandwidth-delay product rule of thumb
  – A single TCP flow needs C×RTT of buffering for 100% throughput.
[Plot: throughput vs. buffer size B for a single TCP flow. With B ≥ C×RTT, throughput stays at 100%; with B < C×RTT, throughput falls below 100%.]
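To put the rule of thumb in numbers, a small sketch (the 1 Gbps link and 100μs RTT are assumed values typical of a data-center path, not figures from the talk):

```python
def bdp_bytes(link_bps, rtt_s):
    """Bandwidth-delay product C x RTT: the buffering one TCP flow
    needs to sustain 100% throughput (the B >= C*RTT rule)."""
    return link_bps / 8 * rtt_s

# 1 Gbps link, 100 us round-trip time:
print(bdp_bytes(1e9, 100e-6))  # ~12500 bytes, i.e. roughly 8 full-size packets
```

Small buffers are enough at data-center RTTs; the problem is that bursty traffic still fills them, which is what the rest of the talk attacks.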
6
Overview of Our Approach

Main idea:
• Use “phantom queues” – signal congestion before any queuing occurs
• Use DCTCP [SIGCOMM’10] – mitigate the throughput loss that can occur without buffers
• Use hardware pacers – combat burstiness due to offload mechanisms like LSO and interrupt coalescing
7
Review: DCTCP

Switch:
• Set ECN mark when queue length > K.
[Diagram: queue of capacity B with marking threshold K — mark packets when the queue is above K, don’t mark below.]

Source:
• React in proportion to the extent of congestion → fewer fluctuations
  – Reduce window size based on the fraction of marked packets.

ECN marks             TCP                 DCTCP
1 0 1 1 1 1 0 1 1 1   Cut window by 50%   Cut window by 40%
0 0 0 0 0 0 0 0 0 1   Cut window by 50%   Cut window by 5%
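The source-side rule can be sketched as follows. The EWMA gain `g` and the converged-state example are illustrative assumptions; the table’s 40% and 5% cuts correspond to a converged α of 0.8 and 0.1 respectively:

```python
def dctcp_update(cwnd, alpha, frac_marked, g=1.0 / 16):
    """One RTT of DCTCP's source reaction.

    alpha: running estimate of the fraction of ECN-marked packets.
    frac_marked: fraction of packets marked in the last window.
    Returns the new (cwnd, alpha)."""
    alpha = (1 - g) * alpha + g * frac_marked  # EWMA across windows
    cwnd = cwnd * (1 - alpha / 2)              # cut in proportion to congestion
    return cwnd, alpha

# With alpha already converged to the marked fraction (8 of 10 packets marked),
# the window is cut by alpha/2 = 40%, matching the table above:
cwnd, _ = dctcp_update(100.0, alpha=0.8, frac_marked=0.8)
print(cwnd)  # ~60.0
```

TCP would cut by 50% in both rows; DCTCP’s proportional cut is what keeps the queue short without collapsing throughput.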
8
DCTCP vs TCP

[Plot: instantaneous queue length (KBytes) over time for TCP vs DCTCP, ECN marking threshold = 30KB. Setup: Win 7 hosts, Broadcom 1Gbps switch; scenario: 2 long-lived flows. From Alizadeh et al. [SIGCOMM’10].]
9
Achieving Zero Queuing Delay

[Diagram: incoming traffic into a switch queue served at rate C. With TCP, the buffer fills: ~1–10ms of queuing. With DCTCP, the queue hovers near the marking threshold K: ~100μs. How do we get ~zero latency?]
10
Phantom Queue

• A “bump on the wire” (NetFPGA implementation) sitting on a switch link of speed C.
• Key idea:
  – Associate congestion with link utilization, not buffer occupancy
  – Virtual queue (Gibbens & Kelly 1999; Kunniyur & Srikant 2001)
• The phantom queue drains at rate γC with γ < 1 and sets ECN marks once it exceeds the marking threshold; running the link below capacity creates “bandwidth headroom”.
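A toy model of the marking rule makes the mechanism concrete. The real device is a NetFPGA bump on the wire; `gamma` and `thresh_bytes` below are illustrative values, not the paper’s tuned parameters:

```python
class PhantomQueue:
    """Toy phantom (virtual) queue: a counter fed by the real packet
    stream but drained at gamma*C < C. ECN marks come from the *virtual*
    backlog, so congestion is signaled before any real queue builds."""

    def __init__(self, capacity_bps, gamma=0.95, thresh_bytes=6000):
        self.drain_Bps = gamma * capacity_bps / 8  # drain rate in bytes/sec
        self.thresh = thresh_bytes
        self.backlog = 0.0                         # virtual backlog in bytes
        self.last_t = 0.0

    def on_packet(self, t, size_bytes):
        """Account one arrival at time t; return True if it should be marked."""
        self.backlog = max(0.0, self.backlog - self.drain_Bps * (t - self.last_t))
        self.backlog += size_bytes
        self.last_t = t
        return self.backlog > self.thresh

# 1500B packets arriving back-to-back at 1 Gbps line rate (one every 12us):
pq = PhantomQueue(1e9)
marks = [pq.on_packet(i * 12e-6, 1500) for i in range(100)]
print(marks[0], marks[-1])  # early packets unmarked, later ones marked
```

Because the virtual queue drains slower than the link, a flow running at full line rate accumulates virtual backlog and gets marked even though the real switch queue stays empty.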
11
Throughput & Latency vs. PQ Drain Rate

[Plots: throughput (Mbps) and mean switch latency (μs) vs. PQ drain rate (600–1000 Mbps), for ECN marking thresholds ecn1k, ecn3k, ecn6k, ecn15k, and ecn30k.]
12
The Need for Pacing

• TCP traffic is very bursty
  – Made worse by CPU-offload optimizations like Large Send Offload and interrupt coalescing
  – Causes spikes in queuing, increasing latency
• Example: a 1Gbps flow on a 10G NIC sends 65KB bursts every 0.5ms
13
Impact of Interrupt Coalescing

Interrupt Coalescing   Receiver CPU (%)   Throughput (Gbps)   Burst Size (KB)
disabled               99                 7.7                 67.4
rx-frames=2            98.7               9.3                 11.4
rx-frames=8            75                 9.5                 12.2
rx-frames=32           53.2               9.5                 16.5
rx-frames=128          30.7               9.5                 64.0

More interrupt coalescing → lower CPU utilization and higher throughput, but more burstiness.
14
Hardware Pacer Module

• Algorithmic challenges:
  – At what rate to pace? The rate is found dynamically:
      R ← (1−η)R + ηR_measured + βQ_TB
  – Which flows to pace? Elephants: on each ACK with the ECN bit set, begin pacing the flow with some probability.

[Diagram: outgoing packets from the server NIC are looked up in a flow association table; paced flows pass through a token bucket rate limiter (rate R, backlog Q_TB) before TX, while un-paced traffic bypasses it.]
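The rate-update rule can be sketched directly; the values of η and β below are illustrative assumptions, not the paper’s tuned parameters:

```python
def pacer_rate_update(R, R_measured, Q_tb, eta=0.125, beta=16.0):
    """One step of the pacer's rate adaptation:
        R <- (1 - eta) * R + eta * R_measured + beta * Q_tb
    The EWMA term tracks the flow's measured sending rate; the beta*Q_tb
    term speeds the pacer up when the token bucket backlog Q_tb grows,
    so pacing smooths bursts without throttling the flow long-term."""
    return (1 - eta) * R + eta * R_measured + beta * Q_tb

# Measured rate below the current pace, empty bucket -> rate eases down:
print(pacer_rate_update(1000.0, 800.0, 0.0))  # 975.0
```

The Q_TB feedback is the key design choice: without it, a pacer set slightly too slow would build an ever-growing backlog in its own token bucket.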
15
Throughput & Latency vs. PQ Drain Rate (with Pacing)

[Plots: throughput (Mbps) and mean switch latency (μs) vs. PQ drain rate (600–1000 Mbps), for thresholds ecn1k–ecn30k; with pacing, mean switch latency gets as low as 5μsec.]
16
No Pacing vs Pacing (Mean Latency)

[Plots: mean switch latency (μs) vs. PQ drain rate (600–1000 Mbps), thresholds ecn1k–ecn30k, without pacing (left) and with pacing (right); with pacing, mean latency drops to about 5μsec.]
17
No Pacing vs Pacing (99th Percentile Latency)

[Plots: 99th-percentile latency (μs) vs. PQ drain rate (600–1000 Mbps), thresholds ecn1k–ecn30k, without pacing (left) and with pacing (right); with pacing, the 99th percentile drops to about 21μsec.]
18
The HULL Architecture
Phantom Queue
Hardware Pacer
DCTCP Congestion Control
19
Implementation and Evaluation

[Diagram: testbed topology — servers S1–S10 attached through NetFPGA modules NF1–NF6 to switch SW1.]

• Implementation
  – PQ, Pacer, and Latency Measurement modules implemented in NetFPGA
  – DCTCP in Linux (patch available online)
• Evaluation
  – 10-server testbed
  – Numerous micro-benchmarks
    • Static & dynamic workloads
    • Comparison with ‘ideal’ 2-priority QoS scheme
    • Different marking thresholds, switch buffer sizes
    • Effect of parameters
  – Large-scale ns-2 simulations
20
Dynamic Flow Experiment (20% load)

• 9 senders → 1 receiver (80% 1KB flows, 20% 10MB flows).

                       Switch Latency (μs)      10MB FCT (ms)
                       Avg       99th           Avg      99th
TCP                    111.5     1,224.8        110.2    349.6
DCTCP-30K              38.4      295.2          106.8    301.7
DCTCP-6K-Pacer         6.6       59.7           111.8    320.0
DCTCP-PQ950-Pacer      2.8       18.6           125.4    359.9
21
Conclusion

• The HULL architecture combines:
  – Phantom queues
  – DCTCP
  – Hardware pacing
• We trade some bandwidth (which is relatively plentiful) for significant latency reductions (often 10–40x compared to TCP and DCTCP).
22
Thank you!