ultra low latency-cluj2012 · • flow control • message passing • interface • language •...

Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 1

Ultra Low Latency Calin Poenaru Cisco Systems Romania

October, 2012

© 2012 Cisco and/or its affiliates. All rights reserved. 2

Agenda

§ Common understanding of “Ultra Low Latency”

§ Usual ways to measure latency on the switches

§ Design critical choices and important considerations to have when building a ULL solution

2


•  Definition of latency: delay introduced in the communication between the time sender initiates it and the receiver receives and processes the information.

•  Example: Voice Over IP, Radar, Satellite Communication, Real time application

•  Different requirements / different user experience Example of market data: user experience vs. machine trading Examples of industry: Telecommunication vs. Financial

What is latency?

4

Consensus that performance without loss during stable and peek times is the ultimate goal


(Y-Axis in Logarithmic Scale)

Focus has been here

When focus should be here


Which Latency where? Evolution of the look at the full stack

6

BITS

FRAMES

PACKETS

SEGMENTS

DATA

•  L3 lookup •  Routing

•  L2 lookup •  Switching •  Buffering •  Flow control

•  Line encoding •  Framing

•  Transmission •  Reliability •  Flow Control

•  Message Passing

•  Interface •  Language

•  User space •  Libraries

Physical

Data Link

Network

Transport

Session

Presentation

Application

Physical

Data Link

Network

Transport

Session

Presentation

Application

transport


•  From RFC 1242: for store and forward devices:

The time interval starting when the last bit of the input frame reaches the input port and ending when the first bit of the output frame is seen on the output port.

LIFO: Last In First Out

How to measure the latency in the Network?

7

Source: RFC 1242


•  From RFC 1242: for bit forwarding devices (cut-through devices):

The time interval starting when the end of the first bit of the input frame reaches the input port and ending when the start of the first bit of the output frame is seen on the output port

FIFO: First In First Out


8

Source: RFC 1242


•  Measurement method: LIFO or FIFO?

•  LIFO = FIFO - (Packet size in bits/Speed)

•  Cable length: identical cable type and length

•  Identical amount of ports to test

•  Identical testing equipment: Chassis Testing cards Software Revision

•  Typical Latency tests:RFC 2544, 2889, 3918


9


Data Centre Architecture - Trends

Spectrum of Design Evolution

Ultra Low Latency

•  High Frequency Trading •  Layer 3 & Multicast •  No Virtualization •  Limited Physical Scale •  Nexus 3000 & UCS •  10G edge moving to

40G

Warehouse Scale

•  Layer 3 Edge (iBGP, ISIS) •  1000’s of racks •  Homogeneous Environment •  No Hypervisor virtualization •  1G edge moving to 10G •  Nexus 2000, 3000, 5500, 7000

& UCS

slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8

blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8























Virtualized Data Center

•  SP and Enterprise •  Hypervisor Virtualization •  Shared infrastructure

Heterogeneous •  1G Edge moving to 10G •  Nexus 1000v, 2000, 5500,

7000 & UCS

HPC/GRID

•  Layer 3 & Layer 2 •  No Virtualization •  iWARP & RoCE •  Nexus 2000, 3000,

5500, 7000 & UCS •  10G moving to 40G


•  Baud rate

•  The driver is not bandwidth in ULL

Design Consideration #1 : Speed

12

The Faster speed should be the faster communication, is that all?

1 G

10 G

40 G


•  Serialization Delay reduced with higher speeds

•  The less speed mismatch the better performance

Design Consideration #1 : Speed

13

1 GE 10 GE 40 GE 100 GE

64 byte 0.512 us 0.051us 0.013us 0.005us

128 bytes 1.024 us 0.102us 0.026us 0.010us

256 bytes 2.048 us 0.205us 0.051us 0.021us

512 bytes 4.096 us 0.410us 0.102us 0.041us

10, 40, 100 GE options to reduce serialization delay


Design Consideration #1 : Speed - Performance and Serialization delay

N3064 N3064 N3064

10GE Hosts

N3064

N3064 N3064

10GE

10GE

128 bytes = 2.802us 258 bytes = 2.907us 512 bytes = 3.531us

10GE Interconnects


N3064 N3064 N3064

10GE Hosts

N3064

N3016 N3016

40GE

40GE

40GE Interconnects



Ex: Up to 860 nanoseconds faster with 40 GE interconnects to an aggregation N3016

2.7 2.8

3.2

3.8 3.9 3.9 3.9 3.9

2.4 2.5 2.7

2.9 3.2

3.4 3.6

3.8

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

4.50

64 128 256 512 768 1024 1280 1518

Latency comparison 10 GE to 40 GE aggregation (L3 RFC 2544)

10-10-10 GE (3016 Spine)

10-40-10 GE (3016 Spine)


•  Small Flows/Messaging (Heart-beats, Keep-alive, delay sensitive application messaging)

•  Small – Medium Incast (Hadoop Shuffle, Scatter-Gather, Distributed Storage)

•  Large Flows (HDFS Insert, File Copy)

•  Large Incast (Hadoop Replication, Distributed Storage)

Design Consideration #2 – Congestion– What is the traffic type?

16


•  Uplink Speed Mismatch

•  Incast / Many to One conversations

Design Consideration #2 – Congestion – When are buffers needed?

17

1GE Acces

s

10GE

10GE

10GE

10GE

1GE Acces

s


§  A balanced fabric is a function of maximal throughput ‘and’ minimal loss => “Goodput”

§  Application-level throughput (goodput): Given by the total bytes received from all senders divided by the finishing time of the last sender.

Source : “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, Y. Chen, R Griffith,

WREN ’09

5 millisecond view Congestion Threshold exceeded

Data Center Design Goal: Optimizing the balance of end to end fabric latency with the ability to absorb traffic peaks and prevent any associated

traffic loss

Design Consideration #2 – Congestion – When are buffers needed?


1" 10"

19"

28"

37"

46"

55"

64"

73"

82"

91"

100"

109"

118"

127"

136"

145"

154"

163"

172"

181"

190"

199"

208"

217"

226"

235"

244"

253"

262"

271"

280"

289"

298"

307"

316"

325"

334"

343"

352"

361"

370"

379"

388"

397"

406"

415"

424"

433"

442"

451"

460"

469"

478"

487"

496"

505"

514"

523"

532"

541"

550"

559"

568"

577"

586"

595"

604"

613"

622"

631"

640"

649"

658"

667"

676"

685"

694"

703"

712"

721"

730"

739"

748"

757"

766"

775"

784"

793"

Job$Co

mple*

on$

Cell$Usage$

1G"Buffer"Used" 10G"Buffer"Used" 1G"Map"%" 10G"Map"%"

§  Moving from 1GE to 10GE actually lowers the buffer requirement on the switching layer §  By moving to 10GE, the data node has a larger input to receive data lessening the need

for buffers on the network as the total aggregate speed or amount of data does not increase substantially

§  Current system limits are primarily I/O and Compute capabilities (Disk I/O bound)

Buffering and link speeds Incast


Design Consideration #2 – Buffer Amount – Small Flow Buffer Delay

Crossbar Egress Buffer

128K

B F

low

C

ompl

etio

n Ti

me

128KB Flow 128KB Flow with Background 1GB Flows

Buffer Architecture Choice Matters

Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 21

Egress Buffer

INGRESS

EGRESS

INGRESS

Ingress per port Buffer

Scheduler

Crossbar Egress per port

Buffer EGRESS

Shared Memory Buffer

Scheduler

INGRESS

EGRESS

Crossbar

Scheduler

Design Consideration #2 – Buffer Amount – The Switch Architecture


Switch on Chip (SoC)

Output Buffer

Parser Decision Engine

Packet FIFO Block

Egress Process

Block (EP)

Packet ReWrite

1/10G Interfaces

1/10G Interfaces

Design Consideration #2 – Buffer Amount – The Switch Architecture


•  Deterministic vs. Lowest debate

Design Consideration #3 – Switching mode

1.15 1.13 1.14 1.18 1.25 1.25 1.24 1.24 1.22 1.20

0.00

0.50

1.00

1.50

2.00

2.50

64 128 256 512 768 1024 1280 1518 4096 9216

Late

ncy

(Mic

rose

cond

)

40 GE -> 10 GE

0.76 0.84

0.97

1.18

1.38

1.62

1.79

1.98

0

0.5

1

1.5

2

2.5

64 128 256 512 768 1024 1280 1518 4096 9216 La

tenc

y (M

icro

seco

nd)

10GB -> 40GE

Cut-through Store and Forward

Deterministic latency

Market Data Market Data

RFC 2544 measuring Latency for different switch architectures


•  What media to choose, optical or copper?

•  Propagation delay Pd= distance / speed

•  Electromagnetic speed: s=200 000 km/s

•  Light Speed: 300 000 km/s, fiber glass refraction 1.5

Design Consideration #4 : Physical Media Type – 10G and 40G

24

Delay for 1m

Fiber CX-1 RJ-45

Pd (ns) 5ns 4.3ns 5ns

Propagation Delay Pd


•  Copper UTP/RJ-45: +1usec in the conversion Base SX <-> Base T

Design Consideration #4 : Physical Media Type – 10G and 40G

25

Best Practice: Optical for 1GE, passive CX-1 or optical for 10/40GE


•  Active or Passive Twinax?

•  Passive CX-1 is 0.3ns latency

•  Active CX-1 adds 1ns latency

Design Consideration #4 : Physical Media Type

26

Active CX-1

Passive CX-1

Active process electrical signaling, needs power to drive the IC. Behave as an optical SFP transceiver


•  CDP / LLDP

•  STP

•  Layer 3

•  Multicast

•  Multiple Destination SPAN / Span ACL

•  NAT

•  Monitoring

Design Consideration #5 – Feature Set

27


Consideration #5 – Feature Set - Using Python to Buffer Monitoring

switch# python Python 2.7.2 (default, Jan 11 2012, 17:25:37) [GCC 3.4.3 (MontaVista 3.4.3-25.0.143.0800417 2008-02-22)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Loaded cisco NxOS lib! >>>

Interactive Python Shell

Run Python Script

switch# python bootflash:showBuffer.py Mon Jan 30 19:26:36 UTC 2012 |-----------------------------------------------------------------------------------------| Total Instant Usage 0 Remaining Instant Usage 46080 Max Cell Usage 0 Switch Cell Count 46080 |-----------------------------------------------------------------------------------------| #

Code for the Python script

showBuffer.py is in the next slide

….

•  Runs directly from NX-OS CLI •  Pass Arguments •  Displays the scripts on CLI

switch# show file bootflash:script.py

switch# python script.py arg1 arg2 ['/bootflash/test.py', ’arg1', ’arg2’]


Consideration #5 – Feature Set - Using Python to Buffer Monitoring

#!/usr/bin/python # Import the CLI class from cisco import * # Create Object objCli objCli = CLI('show hardware internal buffer info pkt-stats brief', False) # Display raw output: print objCli.get_raw_output()

|-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐| Total Instant Usage 0 Remaining Instant Usage 46080 Max Cell Usage 0 Switch Cell Count 46080 |-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐|

CLI: get_raw_output()

showBuffer.py Import Cisco library to use the

CLI class in your script

CLI class provides optimized access to the NxOS CLI

Second parameter to CLI specifies whether output should be printed. Default(True) is set

to print. Provides raw (unchanged)

output as printed by the NxOS CLI


12/03/27 08:02:23 INFO mapred.JobClient: map 69% reduce 0% 12/03/27 08:02:24 INFO mapred.JobClient: map 77% reduce 0% 12/03/27 08:02:25 INFO mapred.JobClient: map 87% reduce 0% 12/03/27 08:02:26 INFO mapred.JobClient: map 96% reduce 9% 12/03/27 08:02:27 INFO mapred.JobClient: map 98% reduce 10% 12/03/27 08:02:28 INFO mapred.JobClient: map 100% reduce 10% 12/03/27 08:02:29 INFO mapred.JobClient: map 100% reduce 27% 12/03/27 08:02:30 INFO mapred.JobClient: map 100% reduce 29% 12/03/27 08:02:32 INFO mapred.JobClient: map 100% reduce 32% 12/03/27 08:02:35 INFO mapred.JobClient: map 100% reduce 84%

Hadoop Job Status

Buffer usage statistics from the switch while

running Hadoop TeraSort

Hadoop job status output while running a 1GB TeraSort using

8 nodes

2012/03/27 08:02:23 0 * 2012/03/27 08:02:24 3810 -‐-‐-‐-‐-‐* 2012/03/27 08:02:25 1127 -‐* 2012/03/27 08:02:26 0 * 2012/03/27 08:02:27 0 * 2012/03/27 08:02:28 0 * 2012/03/27 08:02:29 0 * 2012/03/27 08:02:30 0 * 2012/03/27 08:02:31 0 * 2012/03/27 08:02:32 4921 -‐-‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:33 4299 -‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:34 6929 -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:35 0 *

Buffer Usage

•  Consideration #5 – Feature Set - Using Python to Enhance Monitoring

Design Considerations


•  What is the total port count needed?

•  What is the end design / scale targeted?

•  Is communication needed between servers, inside / outside POD

•  Uniform speed or higher uplink with cut-through switching?

•  What is the feature-set required?

Design Consideration #6 – Simplicity of the Network design – All in one Approach

31

Ingress UPC

Egress UPC

Unified Crossbar Fabric


NAT/PAT Classification and Translation

Output Buffer

1/10G Interfaces

Parsed/ Classified

Traffic ACL Table

4K

Unicast HOST Table 64K

Unicast/Multicast

Route Table (32K)

Packet ReWrite

NAT Translation Table (2K)

Egress Process

Block (EP)

•  NAT uses VACL space for classifying and identifying the traffic for NAT translation based on ingress interface

•  NAT translation table would provide actual translation info for packet ReWrite block for packet modification before sending the packet out of NAT interface

•  For Static NAT, ACL and Translation Table are updated as soon as the NAT static config is added

•  For dynamic NAT*, first packet is punted to CPU after ACL classifies it to be NAT flow and then software updates the translation table based on the flow info

VACL/NAT

* Post FCS

•  Design Consideration #6 – Simplicity of the Network design – All in one Approach


•  Traffic to be replicated is marked in the ingress flow

•  The replication occurs in the queuing engine and the mirrored traffic is placed in one of two multicast queues

Design Consideration #6 – Feature set - Linerate SPAN

UC Queue 0 UC Queue 1 UC Queue 2 UC Queue 3 UC Queue 4 UC Queue 5 UC Queue 6 UC Queue 7 MC Queue 0 MC Queue 1 MC Queue 2 MC Queue 3

MC Queue 0 MC Queue 1 Egress

port 3

Egress port 2

Ingress Port 1

data

….

data

data

Nexus 3000

A

B

Sniffer

.


Design Consideration #7 – Security

34

•  Use Hardware features: ACLs, PVLANs…

•  Use OS level security when possible


•  Precision Time Protocol: IEEE 1588v2

•  Nanosecond Precision

Design Consideration #8 – Application Precision

35

N3064 N3064 N3064 N3064

N3016 N3016

Telecommunications

Financial trading


Applications @ Switch •  Verify accuracy with 1PPS output •  PONG for Hop-by-Hop Latency Measurements •  Integration with ERSPAN for Accurate Timestamp of Monitored Traffic

Servers with PTP clients

Servers with PTP clients

Monitoring & performance

management system PTP-enabled Network

1PPS timing

1PPS timing 1PPS timing

•  Design Consideration #8 – Application Precision


Clock with frequency synchronizer

CPU

PTP Stack, Clock Servo

PCI-Local Bus SPI

eUSB Flash

USB Conn Console

4GB DRAM

Management Ethernet

Front panel Ethernet interface

XLAUI/XFI

1PPS

Timing logic

N3548 Ethernet Switch

1 2

3 4

Monticello ASIC 5

1.  1588 packet is timestamped at ingress of ASIC to record the arrive time (t2)

2.  Timestamp points to the first bit of the packet (following SFD)

3.  Packet is copied to CPU with timestamp and destination port

4.  The packet goes through PTP stack and other process

5.  The packet is sent out at egress port. (The corresponding timestamp for the TX packet is available from the FIFO TX time stamp) ASIC records the packet’s departure timestamp and delivers it to the PTP stack.


Typical Specifications Algorithm Boost Features •  Ultra Low Latency – <300ns •  Active Latency/Buffer Monitoring •  NAT @ Ultra Low Latency •  Intelligent Traffic Mirroring •  IEEE-1588 PTP w/Pulse Per Second

•  48x SFP+ – 100M / 1G / 10G / 40G

•  Line rate L2/L3, Unicast & Multicast

•  18MB Packet Buffer

•  32K IPv4 Route, 64K Host, 8K MC

•  4K Flexible ACL / QoS

•  Data Center TCP


L2 & L3, Unicast & Multicast

Independent of features enabled (NAT, ACL)

Independent of packet size

< 300 ns

Full mesh, 100% linerate


TimeStamp Packet

T0

T1 – T0

Min: 290ns Max: 312ns Avg: 298ns

Min: 285ns Max: 325ns Avg: 300ns

…


Shared Buffer

0

10

20

30

40

50

60

5% 10% 15% … 90% 95% 100%

# of

Sam

ples

Buffer Occupancy Level Percentage

0

10

20

30

40

50

60

5% 10% 15% … 90% 95% 100%

# of

Sam

ples

Buffer Occupancy Level Percentage

Packet


•  Ethernet is now suitable for ULL applications: •  Delays as low as 250ns •  Same delay for L2 and L3 •  No latency penalty when activate smart features (NAT/ACL) •  Ultralow jitter (8ns worst case)

•  SoC design combined with advanced features (buffers, monitoring, etc.) allowed high performance, ultra-low latency switching

•  Pick carefully the features that you need for your network (availability, buffering, 10GE vs 1GE, Latency) in order to reach the required performance

Thank you.

ultra low latency-cluj2012 · • flow control • message passing • interface • language •...

Documents