the ieee p802.1qcz project on congestion isolation

27
The IEEE P802.1Qcz Project on Congestion Isolation Paul Congdon Tellac NANOG 76 June 2019

Upload: others

Post on 11-Nov-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The IEEE P802.1Qcz Project on Congestion Isolation

The IEEE P802.1Qcz Project on Congestion Isolation

Paul CongdonTellac

NANOG 76June 2019

Page 2: The IEEE P802.1Qcz Project on Congestion Isolation

The Case for Low-latency, Lossless, Large-Scale DCNs

• More and more latency-sensitive applications are being deployed in data centers– Distributed Storage– AI / Deep Learning– Cloud HPC– High-Frequency Trading

• RDMA is operating at larger scales thanks to RoCEv2– Chuanxiong Guo, et al., Microsoft, ”RDMA over Commodity Ethernet at Scale”, SIGCOMM

2016– Y Zhu, H Eran, et al., Microsoft, Mellanox, “Congestion control for large-scale RDMA

deployments”, SIGCOMM 2015– Radhika Mittal, et al., UC Berkeley, Google, “TIMELY: RTT-based Congestion Control for the

Datacenter”, SIGCOMM 2015• The scale of Data Center Networks continues to grow

– Larger, faster clusters are better than more smaller size clusters– Server growth continues at 25% - 30% putting pressure on cluster sizes and networking costs

Page 3: The IEEE P802.1Qcz Project on Congestion Isolation

Lossless DCN state-of-the-art

• DCNs are primarily L3 CLOS networks

• ECN is used for end-to-end congestion control

• Congestion feedback can be protocol and application specific

• PFC used as a last resort to ensure lossless environment, or not at all in low-loss environments.

• Traffic classes for PFC are mapped using DSCP as opposed to VLAN tags

Congestion

HoLB

PFCECN Mark

Congestion Feedback (e.g. CNP(RoCEv2),

DCQCN, ECE(TCP))

Congested Flow

Victim Flow

ECN Control Loop

Page 4: The IEEE P802.1Qcz Project on Congestion Isolation

Challenges going forward

• Scaling the high-performance data center– More hops => more congestion points– Faster links => more data in flight

• Switch buffer growth is not keeping up

85

192

384

512

8519.2 9.6 5.12

Gen1 (48x1 Gbps) Gen2 (48x10 Gbps) Gen3 (32x40 Gbps) Gen4 (32x100 Gbps)

KBMemory

KB of Packet Buffer by Commodity Switch Architecture

Buffer/portBuffer/port…

Source: “Congestion Control for High-speed Extremely Shallow-buffered Datacenter Networks”. In Proceedings of APNet’17, Hong Kong, China, August 03-04, 2017, https://doi.org/10.1145/3106989.3107003

Page 5: The IEEE P802.1Qcz Project on Congestion Isolation

Existing 802.1 Congestion Management Tools802.1Qbb - Priority-based Flow Control 802.1Qau - Congestion Notification

Congestion

HoLB

PFCPF

C

PFC

PFC

PFC

PFC

HoLB

Congestion

Qau CNM

Concerns with over-usel Head-of-Line blocking

l Congestion spreading

l Buffer Bloat, increasing latency

l Increased jitter reducing throughput

l Deadlocks with some implementations

Concerns with deploymentl Layer 2 end-to-end congestion control

l NIC based rate-limiters (Reaction Points)

l Designed for non-IP based protocolsp FCoE

p RoCE – v1

Reaction Point

Page 6: The IEEE P802.1Qcz Project on Congestion Isolation

Project Background – P802.1Qcz• Project Initiation

– November 2017 – IEEE 802.1 agreed to develop a Project Authorization Request (PAR) and Criteria for Standards Development (CSD) to amend IEEE 802.1Q with “Congestion Isolation”

– Amendment to IEEE 802.1Q-2018 to Support the isolation of congested data flows within data center environments, such as high-performance computing, distributed storage and central offices re-architected as data centers.

– Motivation discussed in draft report of “802 Network Enhancements For the Next Decade”

• https://mentor.ieee.org/802.1/dcn/18/1-18-0007-03-ICne-draft-report-lossless-data-center-networks.pdf

• Project Status– July 2018 – Project Approved– Jan 2019 – Initial drafts submitted as individual contributions– March 2019 – Editor assigned and drafts are under construction

• So what is Congestion Isolation?

Page 7: The IEEE P802.1Qcz Project on Congestion Isolation

Congestion

HOLB

PFC

Congested Flow

Victim Flow

Congested Queue

Normal Queue

Today – Without Congestion Isolation

CIM

Isolate

Isolate

Congestion Isolation

Congestion Isolation at a High Level

Page 8: The IEEE P802.1Qcz Project on Congestion Isolation

P802.1Qcz – Congestion Isolation - Goals

• Work in conjunction with higher-layer end-to-end congestion control (ECN, BBR, etc)

• Support larger, faster data centers (Low-Latency, High-Throughput)

• Support lossless and low-loss environments• Improve performance of TCP and UDP based flows• Reduce pressure on switch buffer growth• Reduce the frequency of relying on PFC for a

lossless environment• Eliminate or significantly reduce HOLB caused by

over-use of PFC

Page 9: The IEEE P802.1Qcz Project on Congestion Isolation

Congestion Isolation

DownstreamUpstream1 3 1 3

2

4

2

4Ingress Port

(Virtual Queues)Egress Port Ingress Port

(Virtual Queues)Egress Port

Congested Queue

Non-CongestedQueue

1. Identify the flow causing congestion and isolate locally

CIM2. Signal to neighbor

when congested queue fills

Eliminate HoLBlocking

3. Upstream isolates the flow too, eliminating head-of-line blocking

PFC 4. Last Resort! If congested queue continues to fill, invoke PFC for lossless

Page 10: The IEEE P802.1Qcz Project on Congestion Isolation

Congestion Isolation Critical Processes

1. Detecting flows causing congestion2. Creating flows in the congested flow table3. Signaling congested flow identify to neighbors4. Isolating congested flows without ordering issues5. Interaction with PFC generation6. Detecting when congested flows are no longer congested7. Signaling congested to non-congested flow transitions to

neighbors8. Un-isolating previously congested flows without ordering

issues

Page 11: The IEEE P802.1Qcz Project on Congestion Isolation

Simulation Highlights

• Complete presentations on simulations are available on 802.1 public repository:

– http://www.ieee802.org/1/files/public/docs2017/new-dcb-shen-congestion-isolation-simulation-1117-v00.pdf– http://www.ieee802.org/1/files/public/docs2018/new-dcb-shen-congestion-isolation-simulation-0118-v01.pdf– http://www.ieee802.org/1/files/public/docs2018/cz-shen-congestion-isolation-simulation-0318-v01.pdf

• Set-up – OMNET++

• 2 Tier CLOS: 1152 servers, 72 switches, 100 GbE interface, 200 ns of link latency (about 40 meters)

• Traffic Patterns:

***

TOR#1 TOR#2 TOR#3 TOR#N

SPINE#N/2SPINE#1

*** *** ***

***

*** *** *** ***

***

• Model data mining application with flow size distributions

• 50 clusters of 21 servers for many to many traffic

• 4 sets of 20:1 permanent many to one incast traffic

***

Many to many traffic

***

Many to one incast traffic

Page 12: The IEEE P802.1Qcz Project on Congestion Isolation

0

0.5

1

1.5

Without CI CI

Average flow completion time(all flows)

0

24

6

810

Without CI CI

Average flow completion time(>10MB flows)

38%23%

FCT(

ms)

0

1

2

3

Without CI CI

Average flow completion time(1MB~10MB flows)

0

0.5

1

1.5

Without CI CI

Average flow completion time(100KB~1MB flows)

31%

49%

0

0.05

0.1

0.15

0.2

Without CI CI

Average flow completion time(<100KB flows)

63%

FCT(

ms)

FCT(

ms)

FCT(

ms)

FCT(

ms)

• The mice benefit the most.

FCT Comparison – Lossless Scenario (with PFC)

Page 13: The IEEE P802.1Qcz Project on Congestion Isolation

0

0.5

1

1.5

Without CI CI

Average flow completion time(all flows)

0

0.5

1

1.5

Without CI CI

Average flow completion time(>10MB flows)

25% 26%

FCT(

Nor

m. t

o W

ithou

t CI)

0

0.5

1

1.5

Without CI CI

Average flow completion time(1MB~10MB flows)

0

0.5

1

1.5

Without CI CI

Average flow completion time(100KB~1MB flows)

26%

36%

0

0.5

1

1.5

Without CI CI

Average flow completion time(<100KB flows)

15%

• With 3 queue model both “without CI” and “CI” have mice prioritization mechanism.

• The performance of the mice is not improved as much.

FCT(

Nor

m. t

o W

ithou

t CI)

FCT(

Nor

m. t

o W

ithou

t CI)

FCT(

Nor

m. t

o W

ithou

t CI)

FCT(

Nor

m. t

o W

ithou

t CI)

FTC With Mice/Elephant separation (3 Queue Model)

Page 14: The IEEE P802.1Qcz Project on Congestion Isolation

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

ECN only CI no Signaling CI with Signaling

Overall Packet Loss Rate(%)

26%

54%

Lossy Scenario (No PFC)

• CI reduces packet loss rate, which means it also reduces packet retransmission and improves performance.

Page 15: The IEEE P802.1Qcz Project on Congestion Isolation

0100000200000300000400000500000

Withou

t CI

Mice/El

epha…

With CI

Pause Frame Count Received by Switch

66%

05

10152025

Withou

t CI

Mice/El

epha…

With CI

Average Switch Queue XOFF Duration Percentage(%)

Lossless Scenario - Reducing the Impact of PFC

53%

42%

29%

• CI reduces Pause frame count and XOFF duration.• XOFF duration is less significant than Pause frame count, because usually

pause for low priority queue takes longer time to resume than high priority queue.

Page 16: The IEEE P802.1Qcz Project on Congestion Isolation

Scaling Comparison

00.20.40.60.8

11.21.41.61.8

2

128 288 512 1152

Average Flow Completion Time (ms) – All Flows

Without CIMice/Elephant OnlyWith CI

• Adding CI allows the data center size to scale.

Number of Servers

Page 17: The IEEE P802.1Qcz Project on Congestion Isolation

Next Steps

• Continued draft development• Current Design Team – participants affiliated with Cavium,

Huawei, Marvell, Mellanox, Microsoft, Polytechnic University of Valencia

• How to help/participate?– Provide review comments and feedback to me – [email protected]– Participate in IEEE 802 Industry Connections activity toward next generation

Data Centers (https://1.ieee802.org/802-nend/)– Participate on the P802.1Qcz design team– Participate in IEEE 802.1 meetings – vendor perspective, technical contribution,

customer validation…

Page 18: The IEEE P802.1Qcz Project on Congestion Isolation

References

• P802.1Qcz project web-page– https://1.ieee802.org/tsn/802-1qcz/

• Useful Presentations– Objectives Discussion

• http://www.ieee802.org/1/files/public/docs2018/new-dcb-congdon-ci-objectives-0118-v02.pdf

– Technical overview of CI• http://www.ieee802.org/1/files/public/docs2018/cz-congdon-congestion-isolation-review-

0418-v1.pdf– Simulation Results

• http://www.ieee802.org/1/files/public/docs2018/cz-sun-ci-simulation-update-0518-v01.pdf– Possible changes to 802.1Q

• http://www.ieee802.org/1/files/public/docs2018/cz-congdon-ci-Q-changes-0618-v1.pdf

Page 19: The IEEE P802.1Qcz Project on Congestion Isolation

THANK YOU

Page 20: The IEEE P802.1Qcz Project on Congestion Isolation

Detecting Flows that Cause Congestion

• Expect to not to specify a new mechanism– Reference existing Quantized Congestion Notification (QCN) sampling

approach (Clause 30.2.1)– Reference IETF standards/recommendations?– Allow implementation flexibility

Congested Flow Queue

Uncongested Flow Queue

Queue 0

Queue 1

Isolation Threshold

Page 21: The IEEE P802.1Qcz Project on Congestion Isolation

Creating flows in the congested flow table

• Should only be required to register congested flows• A flow is a traditional 5-tuple• No failure if flow table is full• May be useful to know how many and when CIMs were generated

Src IP Dest IP Protocol Src Port Dest Port Packet Counter

Last Active Time …

10.136.159.100 10.136.169.100 17 3245 4791 100 0x4a32fa32 …

… … … … … … … …

10.120.31.21 10.120.34.21 6 2345 80 20 0x4a33231f …

10.189.32.20 10.189.31.21 6 23 81 1022 0x4a3323f4 …

Congested Flow Table

Page 22: The IEEE P802.1Qcz Project on Congestion Isolation

Signaling congested flow identify to neighbors

• Neighbor needs enough information to identify the same flow– First n-bytes of frame or explicit packet fields?– Leverage QCN format for Congestion Notification Message

• No adverse effects of single packet loss• No requirement to identify flows within virtualization overlay encapsulation• Mandatory or optional functionality?

Dest MAC Address

Src MAC Address

Ethertype

Flow Identification Data (TBD) Flow identifying Information (e.g IP Header, Transport Header, Virtualization/Tunnel encapsulation).

Upstream Switch Port Mac Address

Current Output Port Mac Address

New Ethernet Type

Page 23: The IEEE P802.1Qcz Project on Congestion Isolation

Isolating congested flows without ordering issues

• Conformance will be specified as “externally observable behavior”

• Potential example approach as an informative Annex

121231 1 34

0 0

Non Congested

Congested flow Queue

121231 1 3

4

T

T

≤1 1≤

1231 1 3

4

T

T

1 1≤

2

5

42

6

2 2

6

T

T789

1 1≤1 0>45

4

0 0

2

6789 45

4

WRR

Page 24: The IEEE P802.1Qcz Project on Congestion Isolation

Interaction with PFC generation• Priority-based Flow Control (PFC) is required for fully ‘lossless’ mode of

operation• The goal is, that if needed, PFC should be issued primarily on the congested

traffic class. However…– Due to CIM signaling delays, packets may exist in both un-congested and

congested upstream traffic classes after isolation.– The downstream switch/router needs to ‘know’ which traffic class was used by

an upstream switch - Solution is to use Priority tagged or VLAN tagged format

Congested

Un-Congested

Transmit Queues Receive Buffers

Pause

In Flight

XOFFThreshold

XONThreshold

Congested

Un-Congested

Transmit Queues Receive Buffers

STOP

In Flight

Packet Loss

Page 25: The IEEE P802.1Qcz Project on Congestion Isolation

Removing entries from congested flow table (a flow is no longer congested)

• Mechanisms are needed to free entries in the congested flow table– flush all entries egressing a port if the congested queue empties –

Obvious– Define a recovery threshold in the congested flow queue and maintain

order through scheduling– Count packets of a flow in the congested flow queue – Ideal, but

maybe difficult to implement

Flow ID

Congested Flow Table

R 2 1

Congested Flow Queue

3 R

Uncongested Flow Queue

Queue 0

Queue 1

R_0

R_1

Isolation Threshold

= 1

= 1

Recovery Threshold

Cnt_congested_pkts

2

Page 26: The IEEE P802.1Qcz Project on Congestion Isolation

Signaling congested to non-congested transitions to neighbors

• Flows may remain congested upstream longer than necessary• No natural mechanism to generate multiple messages

CIM

Isolate

Isolate

Two congested egress queues

CIM

Isolate

Isolate

CIM

No LongerCongested

Blue Remains Isolated, but could

be deallocated

Congested subsides on one

Generate CIM

(deallocate)

Inform neighbor

CIM(deallocate)

Blue Flow becomes non-congested, potential out-of-

order

Page 27: The IEEE P802.1Qcz Project on Congestion Isolation

Un-isolating previously congested flows without ordering issues

• Specify externally observable behavior, not implementation details

• Similar mechanism when isolating a flow, but in reverse

Non Congested

Congested flow Queue

33

678

WRR

0 0

Non Congested

Congested flow Queue

3

678N

N9

WRR

≤1 1≤

Non Congested

Congested flow Queue 78N

N984

1 1≤

Non Congested

Congested flow Queue N

N984956

1 1≤1 0>

Non Congested

Congested flow Queue

84956

WRR

0 0