The IEEE P802.1Qcz Project on Congestion Isolation
Paul Congdon, Tallac
NANOG 76, June 2019
The Case for Low-latency, Lossless, Large-Scale DCNs
• More and more latency-sensitive applications are being deployed in data centers
– Distributed Storage
– AI / Deep Learning
– Cloud HPC
– High-Frequency Trading
• RDMA is operating at larger scales thanks to RoCEv2
– Chuanxiong Guo, et al., Microsoft, “RDMA over Commodity Ethernet at Scale”, SIGCOMM 2016
– Y Zhu, H Eran, et al., Microsoft, Mellanox, “Congestion control for large-scale RDMA deployments”, SIGCOMM 2015
– Radhika Mittal, et al., UC Berkeley, Google, “TIMELY: RTT-based Congestion Control for the Datacenter”, SIGCOMM 2015
• The scale of Data Center Networks continues to grow
– Larger, faster clusters are better than a greater number of smaller clusters
– Server growth continues at 25% - 30%, putting pressure on cluster sizes and networking costs
Lossless DCN state-of-the-art
• DCNs are primarily L3 CLOS networks
• ECN is used for end-to-end congestion control
• Congestion feedback can be protocol and application specific
• PFC is used as a last resort to ensure a lossless environment, or not at all in low-loss environments.
• Traffic classes for PFC are mapped using DSCP as opposed to VLAN tags
[Figure: ECN control loop. Labels: Congestion, ECN Mark, Congestion Feedback (e.g. CNP (RoCEv2), DCQCN, ECE (TCP)), PFC, HoLB, Congested Flow, Victim Flow.]
Challenges going forward
• Scaling the high-performance data center
– More hops => more congestion points
– Faster links => more data in flight
• Switch buffer growth is not keeping up
KB of Packet Buffer by Commodity Switch Architecture
Architecture          Buffer/port (KB)   Buffer/port… (KB)
Gen1 (48x1 Gbps)      85                 85
Gen2 (48x10 Gbps)     192                19.2
Gen3 (32x40 Gbps)     384                9.6
Gen4 (32x100 Gbps)    512                5.12
Source: “Congestion Control for High-speed Extremely Shallow-buffered Datacenter Networks”. In Proceedings of APNet’17, Hong Kong, China, August 03-04, 2017, https://doi.org/10.1145/3106989.3107003
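The buffer trend can be sanity-checked with a little arithmetic. The sketch below takes the per-port buffer sizes as read from the chart (85, 192, 384 and 512 KB) and divides them by the port speed; the results reproduce the chart's second series (85, 19.2, 9.6, 5.12). It also estimates how long one port at line rate takes to fill its buffer share. The function names are illustrative, and 1 KB = 1000 bytes is an assumption, since the chart does not state its convention.

```python
# Per-port packet buffer by switch generation, as read from the chart.
# Assumption: 1 KB = 1000 bytes (the chart does not say which convention it uses).
GENERATIONS = [
    # (name, ports, port speed in Gbps, buffer per port in KB)
    ("Gen1", 48, 1, 85),
    ("Gen2", 48, 10, 192),
    ("Gen3", 32, 40, 384),
    ("Gen4", 32, 100, 512),
]


def buffer_per_gbps(buffer_kb, speed_gbps):
    """KB of buffer available per Gbps of port speed."""
    return buffer_kb / speed_gbps


def absorb_time_us(buffer_kb, speed_gbps):
    """Microseconds of line-rate traffic that one port's buffer share can hold."""
    bits = buffer_kb * 1000 * 8
    return bits / (speed_gbps * 1e9) * 1e6


if __name__ == "__main__":
    for name, ports, speed, kb in GENERATIONS:
        print(f"{name}: {buffer_per_gbps(kb, speed):.2f} KB/Gbps, "
              f"{absorb_time_us(kb, speed):.1f} us at line rate")
```

Under these assumptions the time a port can absorb a line-rate burst shrinks by more than an order of magnitude between Gen1 and Gen4, which is the pressure the slide describes.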
Existing 802.1 Congestion Management Tools
802.1Qbb - Priority-based Flow Control
[Figure: congestion at one switch triggers PFC on successive upstream links, causing HoLB.]
Concerns with over-use
• Head-of-Line blocking
• Congestion spreading
• Buffer Bloat, increasing latency
• Increased jitter reducing throughput
• Deadlocks with some implementations
802.1Qau - Congestion Notification
[Figure: congestion at a switch generates a Qau CNM sent back to the Reaction Point at the source.]
Concerns with deployment
• Layer 2 end-to-end congestion control
• NIC based rate-limiters (Reaction Points)
• Designed for non-IP based protocols
– FCoE
– RoCE v1
Project Background – P802.1Qcz
• Project Initiation
– November 2017 – IEEE 802.1 agreed to develop a Project Authorization Request (PAR) and Criteria for Standards Development (CSD) to amend IEEE 802.1Q with “Congestion Isolation”
– Amendment to IEEE 802.1Q-2018 to Support the isolation of congested data flows within data center environments, such as high-performance computing, distributed storage and central offices re-architected as data centers.
– Motivation discussed in draft report of “802 Network Enhancements For the Next Decade”
• https://mentor.ieee.org/802.1/dcn/18/1-18-0007-03-ICne-draft-report-lossless-data-center-networks.pdf
• Project Status
– July 2018 – Project Approved
– Jan 2019 – Initial drafts submitted as individual contributions
– March 2019 – Editor assigned and drafts are under construction
• So what is Congestion Isolation?
Congestion Isolation at a High Level
[Figure: two panels. "Today – Without Congestion Isolation": Congestion, PFC, HOLB, Congested Flow, Victim Flow. "Congestion Isolation": Congested Queue, Normal Queue, CIM sent upstream, Isolate at both switches.]
P802.1Qcz – Congestion Isolation - Goals
• Work in conjunction with higher-layer end-to-end congestion control (ECN, BBR, etc)
• Support larger, faster data centers (Low-Latency, High-Throughput)
• Support lossless and low-loss environments
• Improve performance of TCP and UDP based flows
• Reduce pressure on switch buffer growth
• Reduce the frequency of relying on PFC for a lossless environment
• Eliminate or significantly reduce HOLB caused by over-use of PFC
Congestion Isolation
[Figure: an upstream and a downstream switch, each with an ingress port (virtual queues) and an egress port holding a congested queue and a non-congested queue. Markers 1 to 4 correspond to the steps below.]
1. Identify the flow causing congestion and isolate locally
2. Signal to neighbor (CIM) when congested queue fills
3. Upstream isolates the flow too, eliminating head-of-line blocking
4. Last Resort! If congested queue continues to fill, invoke PFC for lossless
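The four steps can be sketched as a toy model of one egress port. This is only an illustration: the class name, the three thresholds, and the rule "isolate the flow whose packet pushes the normal queue over the threshold" are assumptions made for the example, not behavior defined by the project. Dequeuing is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class CISwitchPort:
    """Toy egress port with a normal queue and a congested queue (hypothetical model)."""
    isolate_threshold: int  # normal-queue depth that triggers local isolation (step 1)
    cim_threshold: int      # congested-queue depth that triggers a CIM (step 2)
    pfc_threshold: int      # congested-queue depth that triggers PFC (step 4)
    normal: list = field(default_factory=list)
    congested: list = field(default_factory=list)
    isolated: set = field(default_factory=set)
    signaled: set = field(default_factory=set)

    def receive(self, flow):
        """Enqueue one packet of `flow`; return signals for the upstream neighbor."""
        signals = []
        if flow in self.isolated:
            self.congested.append(flow)
            depth = len(self.congested)
            if depth > self.pfc_threshold:
                # Step 4: last resort, pause the congested traffic class.
                signals.append(("PFC", "congested-class"))
            elif depth > self.cim_threshold and flow not in self.signaled:
                # Step 2: tell the neighbor which flow to isolate.
                self.signaled.add(flow)
                signals.append(("CIM", flow))
        else:
            self.normal.append(flow)
            if len(self.normal) > self.isolate_threshold:
                # Step 1: simplistic detection, later packets use the congested queue.
                self.isolated.add(flow)
        return signals

    def on_cim(self, flow):
        """Step 3: the upstream neighbor isolates the same flow."""
        self.isolated.add(flow)
```

In this model a victim flow arriving at the upstream port after `on_cim` still lands in the normal queue, so it is not held behind the isolated flow.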
Congestion Isolation Critical Processes
1. Detecting flows causing congestion
2. Creating flows in the congested flow table
3. Signaling congested flow identity to neighbors
4. Isolating congested flows without ordering issues
5. Interaction with PFC generation
6. Detecting when congested flows are no longer congested
7. Signaling congested to non-congested flow transitions to neighbors
8. Un-isolating previously congested flows without ordering issues
Simulation Highlights
• Complete presentations on simulations are available on 802.1 public repository:
– http://www.ieee802.org/1/files/public/docs2017/new-dcb-shen-congestion-isolation-simulation-1117-v00.pdf
– http://www.ieee802.org/1/files/public/docs2018/new-dcb-shen-congestion-isolation-simulation-0118-v01.pdf
– http://www.ieee802.org/1/files/public/docs2018/cz-shen-congestion-isolation-simulation-0318-v01.pdf
• Set-up – OMNET++
• 2 Tier CLOS: 1152 servers, 72 switches, 100 GbE interface, 200 ns of link latency (about 40 meters)
• Traffic Patterns:
– Model data mining application with flow size distributions
– 50 clusters of 21 servers for many to many traffic
– 4 sets of 20:1 permanent many to one incast traffic
[Figure: 2-tier CLOS topology with TOR#1 … TOR#N below SPINE#1 … SPINE#N/2, showing many to many traffic and many to one incast traffic.]
FCT Comparison – Lossless Scenario (with PFC)
[Figure: five bar charts of average flow completion time (FCT, ms), Without CI vs. CI. Panels and percentage labels in slide order: all flows (38%), >10MB flows (23%), 1MB~10MB flows (31%), 100KB~1MB flows (49%), <100KB flows (63%).]
• The mice benefit the most.
FCT With Mice/Elephant Separation (3 Queue Model)
[Figure: five bar charts of average flow completion time (FCT, normalized to Without CI), Without CI vs. CI. Panels and percentage labels in slide order: all flows (25%), >10MB flows (26%), 1MB~10MB flows (26%), 100KB~1MB flows (36%), <100KB flows (15%).]
• With the 3 queue model, both “without CI” and “CI” have a mice prioritization mechanism.
• The performance of the mice is not improved as much.
Lossy Scenario (No PFC)
[Figure: bar chart of Overall Packet Loss Rate (%) for ECN only, CI no Signaling, and CI with Signaling. Percentage labels: 26%, 54%.]
• CI reduces packet loss rate, which means it also reduces packet retransmission and improves performance.
Lossless Scenario - Reducing the Impact of PFC
[Figure: two bar charts, each comparing Without CI, Mice/Elepha…, and With CI: Pause Frame Count Received by Switch, and Average Switch Queue XOFF Duration Percentage (%). Percentage labels on the slide: 66%, 53%, 42%, 29%.]
• CI reduces Pause frame count and XOFF duration.
• The reduction in XOFF duration is less significant than the reduction in Pause frame count, because a paused low priority queue usually takes longer to resume than a high priority queue.
Scaling Comparison
[Figure: Average Flow Completion Time (ms) – All Flows, plotted against Number of Servers (128, 288, 512, 1152) for Without CI, Mice/Elephant Only, and With CI.]
• Adding CI allows the data center size to scale.
Next Steps
• Continued draft development
• Current Design Team – participants affiliated with Cavium, Huawei, Marvell, Mellanox, Microsoft, Polytechnic University of Valencia
• How to help/participate?
– Provide review comments and feedback to me – [email protected]
– Participate in IEEE 802 Industry Connections activity toward next generation Data Centers (https://1.ieee802.org/802-nend/)
– Participate on the P802.1Qcz design team
– Participate in IEEE 802.1 meetings – vendor perspective, technical contribution, customer validation…
References
• P802.1Qcz project web-page
– https://1.ieee802.org/tsn/802-1qcz/
• Useful Presentations
– Objectives Discussion
• http://www.ieee802.org/1/files/public/docs2018/new-dcb-congdon-ci-objectives-0118-v02.pdf
– Technical overview of CI
• http://www.ieee802.org/1/files/public/docs2018/cz-congdon-congestion-isolation-review-0418-v1.pdf
– Simulation Results
• http://www.ieee802.org/1/files/public/docs2018/cz-sun-ci-simulation-update-0518-v01.pdf
– Possible changes to 802.1Q
• http://www.ieee802.org/1/files/public/docs2018/cz-congdon-ci-Q-changes-0618-v1.pdf
THANK YOU
Detecting Flows that Cause Congestion
• Expect not to specify a new mechanism
– Reference existing Quantized Congestion Notification (QCN) sampling approach (Clause 30.2.1)
– Reference IETF standards/recommendations?
– Allow implementation flexibility
[Figure: Queue 0 (Uncongested Flow Queue) with an Isolation Threshold, and Queue 1 (Congested Flow Queue).]
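Since the slide leaves detection to the implementation, the following is only one possible shape for it: sample arriving packets while the uncongested queue is above the isolation threshold, and treat the flow of a sampled packet as congested. It is a simplified stand-in for sampling in general, not the QCN algorithm of Clause 30.2.1; the function name, the fixed `sample_probability`, and the injectable `rng` are all assumptions made for the example.

```python
import random


def detect_congested_flow(queue_depth, isolation_threshold, flow_id,
                          sample_probability=0.1, rng=random.random):
    """Return flow_id if this arriving packet marks its flow as congested, else None.

    Hypothetical rule: only sample while the queue is above the isolation
    threshold, and pick packets with a fixed probability.
    """
    if queue_depth <= isolation_threshold:
        return None
    if rng() < sample_probability:
        return flow_id
    return None
```

With this rule, a flow that contributes more packets while the queue is above the threshold is more likely to be picked, which is the property sampling-based detection relies on.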
Creating flows in the congested flow table
• Should only be required to register congested flows
• A flow is a traditional 5-tuple
• No failure if flow table is full
• May be useful to know how many and when CIMs were generated
Congested Flow Table
Src IP           Dest IP          Protocol   Src Port   Dest Port   Packet Counter   Last Active Time   …
10.136.159.100   10.136.169.100   17         3245       4791        100              0x4a32fa32         …
…                …                …          …          …           …                …                  …
10.120.31.21     10.120.34.21     6          2345       80          20               0x4a33231f         …
10.189.32.20     10.189.31.21     6          23         81          1022             0x4a3323f4         …
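A minimal sketch of such a table, assuming a fixed capacity and a dictionary keyed by the 5-tuple. The class and field names are illustrative. A full table simply declines the new entry rather than raising an error, matching "no failure if flow table is full", and a counter records how many CIMs were generated.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Traditional 5-tuple identifying a flow."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


@dataclass
class FlowEntry:
    packet_counter: int = 0
    last_active: float = 0.0


class CongestedFlowTable:
    """Only congested flows are registered; capacity is an assumed parameter."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.cims_generated = 0
        self.rejected = 0

    def register(self, key, now):
        """Create or refresh an entry; return False (not an error) if the table is full."""
        entry = self.entries.get(key)
        if entry is None:
            if len(self.entries) >= self.capacity:
                self.rejected += 1
                return False
            entry = self.entries[key] = FlowEntry()
        entry.packet_counter += 1
        entry.last_active = now
        return True

    def is_congested(self, key):
        return key in self.entries

    def record_cim(self):
        self.cims_generated += 1
```

A flow that cannot be registered simply stays in the uncongested queue, so a full table degrades to today's behavior instead of failing.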
Signaling congested flow identity to neighbors
• Neighbor needs enough information to identify the same flow
– First n-bytes of frame or explicit packet fields?
– Leverage QCN format for Congestion Notification Message
• No adverse effects of single packet loss
• No requirement to identify flows within virtualization overlay encapsulation
• Mandatory or optional functionality?
[Figure: CIM frame layout]
Field                            Contents
Dest MAC Address                 Upstream Switch Port MAC Address
Src MAC Address                  Current Output Port MAC Address
Ethertype                        New Ethernet Type
Flow Identification Data (TBD)   Flow identifying information (e.g. IP Header, Transport Header, Virtualization/Tunnel encapsulation)
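To make the layout concrete, here is a sketch that builds such a frame using the "first n bytes of the frame" option for the flow identification data. The EtherType for CIM is not defined on the slide, so the code uses 0x88B5, one of the IEEE 802 Local Experimental EtherTypes, purely as a placeholder; `build_cim`, `mac_bytes` and the default of 64 bytes are likewise assumptions for illustration.

```python
import struct

# Placeholder only: the real CIM EtherType is "new" and not given on the slide.
# 0x88B5 is an IEEE 802 Local Experimental EtherType.
PLACEHOLDER_ETHERTYPE = 0x88B5


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def build_cim(upstream_port_mac, output_port_mac, offending_frame, n_bytes=64):
    """Build a CIM: Ethernet header plus the first n bytes of the congested flow's frame."""
    header = struct.pack(
        "!6s6sH",
        mac_bytes(upstream_port_mac),   # Dest MAC: upstream switch port
        mac_bytes(output_port_mac),     # Src MAC: current output port
        PLACEHOLDER_ETHERTYPE,
    )
    flow_identification = offending_frame[:n_bytes]
    return header + flow_identification
```

Copying the leading bytes keeps the sender simple but leaves header parsing to the neighbor; sending explicit packet fields would reverse that trade-off.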
Isolating congested flows without ordering issues
• Conformance will be specified as “externally observable behavior”
• Potential example approach as an informative Annex
[Figure: step-by-step example of numbered packets moving between the Non Congested queue and the Congested flow Queue, with T markers and per-queue counters, served by a WRR scheduler.]
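One way to obtain the externally observable behavior (no reordering within a flow) is sketched below. It is an assumption for illustration, not the approach of the informative Annex: when a flow is isolated, a marker is placed in the non-congested queue, and the flow's packets in the congested queue are held until that marker has drained. The class name, the `MARKER` sentinel, and the simple alternating scheduler (standing in for WRR with equal weights) are all hypothetical.

```python
from collections import deque

MARKER = object()  # hypothetical in-band marker, not something the slides define


class TwoQueuePort:
    def __init__(self):
        self.normal = deque()
        self.congested = deque()
        self.isolated = set()  # flows whose new packets go to the congested queue
        self.pending = set()   # isolated flows that still have older packets in `normal`

    def isolate(self, flow):
        if flow not in self.isolated:
            self.isolated.add(flow)
            self.pending.add(flow)
            self.normal.append((MARKER, flow))  # everything ahead of this is older

    def enqueue(self, flow, seq):
        queue = self.congested if flow in self.isolated else self.normal
        queue.append((flow, seq))

    def _pop_normal(self):
        while self.normal:
            tag, value = self.normal.popleft()
            if tag is MARKER:
                self.pending.discard(value)  # older packets of that flow have left
                continue
            return (tag, value)
        return None

    def _pop_congested(self):
        # The head packet is held while its flow still has older packets in `normal`.
        if self.congested and self.congested[0][0] not in self.pending:
            return self.congested.popleft()
        return None

    def drain(self):
        """Serve both queues alternately until empty; return packets in transmit order."""
        out = []
        normal_first = True
        while self.normal or self.congested:
            if normal_first:
                item = self._pop_normal() or self._pop_congested()
            else:
                item = self._pop_congested() or self._pop_normal()
            normal_first = not normal_first
            if item is not None:
                out.append(item)
        return out
```

Holding the head of the congested queue delays other isolated flows behind it, which is acceptable in this sketch because every flow in that queue is already considered congested.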
Interaction with PFC generation
• Priority-based Flow Control (PFC) is required for fully ‘lossless’ mode of operation
• The goal is that, if needed, PFC should be issued primarily on the congested traffic class. However…
– Due to CIM signaling delays, packets may exist in both un-congested and congested upstream traffic classes after isolation.
– The downstream switch/router needs to ‘know’ which traffic class was used by an upstream switch - Solution is to use Priority tagged or VLAN tagged format
[Figure: two panels, each showing Congested and Un-Congested Transmit Queues feeding Receive Buffers. First panel: Pause, In Flight packets, XOFF Threshold, XON Threshold. Second panel: STOP, In Flight packets, Packet Loss.]
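The XOFF/XON thresholds in the figure amount to hysteresis on the receive buffer of one traffic class. A minimal sketch, assuming depth is counted in packets and that the class and method names are illustrative only:

```python
class PfcReceiveBuffer:
    """Receive buffer for one traffic class with XOFF/XON hysteresis (toy model)."""

    def __init__(self, xoff_threshold, xon_threshold):
        if xon_threshold >= xoff_threshold:
            raise ValueError("XON threshold must be below XOFF threshold")
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.depth = 0
        self.paused = False

    def enqueue(self, n=1):
        """Add n packets; return 'XOFF' if a pause must be sent upstream."""
        self.depth += n
        if not self.paused and self.depth >= self.xoff_threshold:
            self.paused = True
            return "XOFF"
        return None

    def dequeue(self, n=1):
        """Remove n packets; return 'XON' if the upstream may resume."""
        self.depth = max(0, self.depth - n)
        if self.paused and self.depth <= self.xon_threshold:
            self.paused = False
            return "XON"
        return None
```

In a real switch the XOFF threshold must leave headroom for packets already in flight when the pause is sent, which is the "In Flight" region of the figure; the sketch does not model that headroom.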
Removing entries from congested flow table (a flow is no longer congested)
• Mechanisms are needed to free entries in the congested flow table
– flush all entries egressing a port if the congested queue empties – Obvious
– Define a recovery threshold in the congested flow queue and maintain order through scheduling
– Count packets of a flow in the congested flow queue – Ideal, but maybe difficult to implement
[Figure: Congested Flow Table with Flow ID and Cnt_congested_pkts columns, next to Queue 0 (Uncongested Flow Queue, with Isolation Threshold) and Queue 1 (Congested Flow Queue, with Recovery Threshold), with R markers and R_0 / R_1 values.]
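The third option, counting a flow's packets in the congested flow queue, can be sketched as follows. The class name is illustrative, and the dictionary of counters stands in for the Cnt_congested_pkts column of the congested flow table; this is a software model, so it says nothing about the hardware cost the slide warns about.

```python
from collections import deque


class CongestedFlowQueue:
    """Congested queue that frees a flow's table entry when its last packet leaves."""

    def __init__(self):
        self.queue = deque()
        self.pkts_in_queue = {}  # flow id -> packets currently in the congested queue

    def enqueue(self, flow):
        self.queue.append(flow)
        self.pkts_in_queue[flow] = self.pkts_in_queue.get(flow, 0) + 1

    def dequeue(self):
        """Transmit one packet; return the flow id if its entry was just freed."""
        flow = self.queue.popleft()
        self.pkts_in_queue[flow] -= 1
        if self.pkts_in_queue[flow] == 0:
            del self.pkts_in_queue[flow]
            return flow
        return None
```

The first option on the slide falls out as a special case: once the queue is empty, every counter has reached zero and all entries are gone.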
Signaling congested to non-congested transitions to neighbors
• Flows may remain congested upstream longer than necessary
• No natural mechanism to generate multiple messages
[Figure: three panels. (1) Two congested egress queues: CIMs are sent upstream and both flows are isolated. (2) Congestion subsides on one queue: the blue flow is no longer congested; it remains isolated upstream, but could be deallocated. (3) Generate CIM (deallocate) to inform the neighbor: the blue flow becomes non-congested, with potential out-of-order delivery.]
Un-isolating previously congested flows without ordering issues
• Specify externally observable behavior, not implementation details
• Similar mechanism when isolating a flow, but in reverse
[Figure: step-by-step example of numbered packets moving from the Congested flow Queue back to the Non Congested queue, with N markers and per-queue counters, served by a WRR scheduler.]