ultra low latency-cluj2012 · • flow control • message passing • interface • language •...
TRANSCRIPT
Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 1
Ultra Low Latency Calin Poenaru Cisco Systems Romania
October, 2012
© 2012 Cisco and/or its affiliates. All rights reserved. 2
Agenda
§ Common understanding of “Ultra Low Latency”
§ Usual ways to measure latency on the switches
§ Design critical choices and important considerations to have when building a ULL solution
2
Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 3 3
© 2012 Cisco and/or its affiliates. All rights reserved. 4
• Definition of latency: delay introduced in the communication between the time sender initiates it and the receiver receives and processes the information.
• Example: Voice Over IP, Radar, Satellite Communication, Real time application
• Different requirements / different user experience Example of market data: user experience vs. machine trading Examples of industry: Telecommunication vs. Financial
What is latency?
4
Consensus that performance without loss during stable and peek times is the ultimate goal
© 2012 Cisco and/or its affiliates. All rights reserved. 5
(Y-Axis in Logarithmic Scale)
Focus has been here
When focus should be here
© 2012 Cisco and/or its affiliates. All rights reserved. 6
Which Latency where? Evolution of the look at the full stack
6
BITS
FRAMES
PACKETS
SEGMENTS
DATA
• L3 lookup • Routing
• L2 lookup • Switching • Buffering • Flow control
• Line encoding • Framing
• Transmission • Reliability • Flow Control
• Message Passing
• Interface • Language
• User space • Libraries
Physical
Data Link
Network
Transport
Session
Presentation
Application
Physical
Data Link
Network
Transport
Session
Presentation
Application
transport
© 2012 Cisco and/or its affiliates. All rights reserved. 7
• From RFC 1242: for store and forward devices:
The time interval starting when the last bit of the input frame reaches the input port and ending when the first bit of the output frame is seen on the output port.
LIFO: Last In First Out
How to measure the latency in the Network?
7
Source: RFC 1242
© 2012 Cisco and/or its affiliates. All rights reserved. 8
• From RFC 1242: for bit forwarding devices (cut-through devices):
The time interval starting when the end of the first bit of the input frame reaches the input port and ending when the start of the first bit of the output frame is seen on the output port
FIFO: First In First Out
How to measure the latency in the Network?
8
Source: RFC 1242
© 2012 Cisco and/or its affiliates. All rights reserved. 9
• Measurement method: LIFO or FIFO?
• LIFO = FIFO - (Packet size in bits/Speed)
• Cable length: identical cable type and length
• Identical amount of ports to test
• Identical testing equipment: Chassis Testing cards Software Revision
• Typical Latency tests:RFC 2544, 2889, 3918
How to measure the latency in the Network?
9
Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 10 10
© 2012 Cisco and/or its affiliates. All rights reserved. 11
Data Centre Architecture - Trends
Spectrum of Design Evolution
Ultra Low Latency
• High Frequency Trading • Layer 3 & Multicast • No Virtualization • Limited Physical Scale • Nexus 3000 & UCS • 10G edge moving to
40G
Warehouse Scale
• Layer 3 Edge (iBGP, ISIS) • 1000’s of racks • Homogeneous Environment • No Hypervisor virtualization • 1G edge moving to 10G • Nexus 2000, 3000, 5500, 7000
& UCS
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
slot 1 slot 2 slot 3 slot 4 slot 5 slot 6 slot 7 slot 8
blade1 blade2 blade3 blade4 blade5 blade6 blade7 blade8
Virtualized Data Center
• SP and Enterprise • Hypervisor Virtualization • Shared infrastructure
Heterogeneous • 1G Edge moving to 10G • Nexus 1000v, 2000, 5500,
7000 & UCS
HPC/GRID
• Layer 3 & Layer 2 • No Virtualization • iWARP & RoCE • Nexus 2000, 3000,
5500, 7000 & UCS • 10G moving to 40G
© 2012 Cisco and/or its affiliates. All rights reserved. 12
• Baud rate
• The driver is not bandwidth in ULL
Design Consideration #1 : Speed
12
The Faster speed should be the faster communication, is that all?
1 G
10 G
40 G
© 2012 Cisco and/or its affiliates. All rights reserved. 13
• Serialization Delay reduced with higher speeds
• The less speed mismatch the better performance
Design Consideration #1 : Speed
13
1 GE 10 GE 40 GE 100 GE
64 byte 0.512 us 0.051us 0.013us 0.005us
128 bytes 1.024 us 0.102us 0.026us 0.010us
256 bytes 2.048 us 0.205us 0.051us 0.021us
512 bytes 4.096 us 0.410us 0.102us 0.041us
10, 40, 100 GE options to reduce serialization delay
© 2012 Cisco and/or its affiliates. All rights reserved. 14
Design Consideration #1 : Speed - Performance and Serialization delay
N3064 N3064 N3064
10GE Hosts
N3064
N3064 N3064
10GE
10GE
128 bytes = 2.802us 258 bytes = 2.907us 512 bytes = 3.531us
10GE Interconnects
128 bytes = 2.45us 258 bytes = 2.65us 512 bytes = 2.951us
N3064 N3064 N3064
10GE Hosts
N3064
N3016 N3016
40GE
40GE
40GE Interconnects
128 bytes = 0.352us 258 bytes = 0.257us 512 bytes = 0.580us
© 2012 Cisco and/or its affiliates. All rights reserved. 15
Ex: Up to 860 nanoseconds faster with 40 GE interconnects to an aggregation N3016
2.7 2.8
3.2
3.8 3.9 3.9 3.9 3.9
2.4 2.5 2.7
2.9 3.2
3.4 3.6
3.8
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
64 128 256 512 768 1024 1280 1518
Latency comparison 10 GE to 40 GE aggregation (L3 RFC 2544)
10-10-10 GE (3016 Spine)
10-40-10 GE (3016 Spine)
© 2012 Cisco and/or its affiliates. All rights reserved. 16
• Small Flows/Messaging (Heart-beats, Keep-alive, delay sensitive application messaging)
• Small – Medium Incast (Hadoop Shuffle, Scatter-Gather, Distributed Storage)
• Large Flows (HDFS Insert, File Copy)
• Large Incast (Hadoop Replication, Distributed Storage)
Design Consideration #2 – Congestion– What is the traffic type?
16
© 2012 Cisco and/or its affiliates. All rights reserved. 17
• Uplink Speed Mismatch
• Incast / Many to One conversations
Design Consideration #2 – Congestion – When are buffers needed?
17
1GE Acces
s
10GE
10GE
10GE
10GE
1GE Acces
s
© 2012 Cisco and/or its affiliates. All rights reserved. 18
§ A balanced fabric is a function of maximal throughput ‘and’ minimal loss => “Goodput”
§ Application-level throughput (goodput): Given by the total bytes received from all senders divided by the finishing time of the last sender.
Source : “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, Y. Chen, R Griffith,
WREN ’09
5 millisecond view Congestion Threshold exceeded
Data Center Design Goal: Optimizing the balance of end to end fabric latency with the ability to absorb traffic peaks and prevent any associated
traffic loss
Design Consideration #2 – Congestion – When are buffers needed?
© 2012 Cisco and/or its affiliates. All rights reserved. 19
1" 10"
19"
28"
37"
46"
55"
64"
73"
82"
91"
100"
109"
118"
127"
136"
145"
154"
163"
172"
181"
190"
199"
208"
217"
226"
235"
244"
253"
262"
271"
280"
289"
298"
307"
316"
325"
334"
343"
352"
361"
370"
379"
388"
397"
406"
415"
424"
433"
442"
451"
460"
469"
478"
487"
496"
505"
514"
523"
532"
541"
550"
559"
568"
577"
586"
595"
604"
613"
622"
631"
640"
649"
658"
667"
676"
685"
694"
703"
712"
721"
730"
739"
748"
757"
766"
775"
784"
793"
Job$Co
mple*
on$
Cell$Usage$
1G"Buffer"Used" 10G"Buffer"Used" 1G"Map"%" 10G"Map"%"
§ Moving from 1GE to 10GE actually lowers the buffer requirement on the switching layer § By moving to 10GE, the data node has a larger input to receive data lessening the need
for buffers on the network as the total aggregate speed or amount of data does not increase substantially
§ Current system limits are primarily I/O and Compute capabilities (Disk I/O bound)
Buffering and link speeds Incast
© 2012 Cisco and/or its affiliates. All rights reserved. 20
Design Consideration #2 – Buffer Amount – Small Flow Buffer Delay
Crossbar Egress Buffer
128K
B F
low
C
ompl
etio
n Ti
me
128KB Flow 128KB Flow with Background 1GB Flows
Buffer Architecture Choice Matters
Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 21
Egress Buffer
INGRESS
EGRESS
INGRESS
Ingress per port Buffer
Scheduler
Crossbar Egress per port
Buffer EGRESS
Shared Memory Buffer
Scheduler
INGRESS
EGRESS
Crossbar
Scheduler
Design Consideration #2 – Buffer Amount – The Switch Architecture
© 2012 Cisco and/or its affiliates. All rights reserved. 22
Switch on Chip (SoC)
Output Buffer
Parser Decision Engine
Packet FIFO Block
Egress Process
Block (EP)
Packet ReWrite
1/10G Interfaces
1/10G Interfaces
Design Consideration #2 – Buffer Amount – The Switch Architecture
© 2012 Cisco and/or its affiliates. All rights reserved. 23
• Deterministic vs. Lowest debate
Design Consideration #3 – Switching mode
1.15 1.13 1.14 1.18 1.25 1.25 1.24 1.24 1.22 1.20
0.00
0.50
1.00
1.50
2.00
2.50
64 128 256 512 768 1024 1280 1518 4096 9216
Late
ncy
(Mic
rose
cond
)
40 GE -> 10 GE
0.76 0.84
0.97
1.18
1.38
1.62
1.79
1.98
0
0.5
1
1.5
2
2.5
64 128 256 512 768 1024 1280 1518 4096 9216 La
tenc
y (M
icro
seco
nd)
10GB -> 40GE
Cut-through Store and Forward
Deterministic latency
Market Data Market Data
RFC 2544 measuring Latency for different switch architectures
© 2012 Cisco and/or its affiliates. All rights reserved. 24
• What media to choose, optical or copper?
• Propagation delay Pd= distance / speed
• Electromagnetic speed: s=200 000 km/s
• Light Speed: 300 000 km/s, fiber glass refraction 1.5
Design Consideration #4 : Physical Media Type – 10G and 40G
24
Delay for 1m
Fiber CX-1 RJ-45
Pd (ns) 5ns 4.3ns 5ns
Propagation Delay Pd
© 2012 Cisco and/or its affiliates. All rights reserved. 25
• Copper UTP/RJ-45: +1usec in the conversion Base SX <-> Base T
Design Consideration #4 : Physical Media Type – 10G and 40G
25
Best Practice: Optical for 1GE, passive CX-1 or optical for 10/40GE
© 2012 Cisco and/or its affiliates. All rights reserved. 26
• Active or Passive Twinax?
• Passive CX-1 is 0.3ns latency
• Active CX-1 adds 1ns latency
Design Consideration #4 : Physical Media Type
26
Active CX-1
Passive CX-1
Active process electrical signaling, needs power to drive the IC. Behave as an optical SFP transceiver
© 2012 Cisco and/or its affiliates. All rights reserved. 27
• CDP / LLDP
• STP
• Layer 3
• Multicast
• Multiple Destination SPAN / Span ACL
• NAT
• Monitoring
Design Consideration #5 – Feature Set
27
© 2012 Cisco and/or its affiliates. All rights reserved. 28
Consideration #5 – Feature Set - Using Python to Buffer Monitoring
switch# python Python 2.7.2 (default, Jan 11 2012, 17:25:37) [GCC 3.4.3 (MontaVista 3.4.3-25.0.143.0800417 2008-02-22)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Loaded cisco NxOS lib! >>>
Interactive Python Shell
Run Python Script
switch# python bootflash:showBuffer.py Mon Jan 30 19:26:36 UTC 2012 |-----------------------------------------------------------------------------------------| Total Instant Usage 0 Remaining Instant Usage 46080 Max Cell Usage 0 Switch Cell Count 46080 |-----------------------------------------------------------------------------------------| #
Code for the Python script
showBuffer.py is in the next slide
….
• Runs directly from NX-OS CLI • Pass Arguments • Displays the scripts on CLI
switch# show file bootflash:script.py
switch# python script.py arg1 arg2 ['/bootflash/test.py', ’arg1', ’arg2’]
© 2012 Cisco and/or its affiliates. All rights reserved. 29
Consideration #5 – Feature Set - Using Python to Buffer Monitoring
#!/usr/bin/python # Import the CLI class from cisco import * # Create Object objCli objCli = CLI('show hardware internal buffer info pkt-stats brief', False) # Display raw output: print objCli.get_raw_output()
|-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐| Total Instant Usage 0 Remaining Instant Usage 46080 Max Cell Usage 0 Switch Cell Count 46080 |-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐|
CLI: get_raw_output()
showBuffer.py Import Cisco library to use the
CLI class in your script
CLI class provides optimized access to the NxOS CLI
Second parameter to CLI specifies whether output should be printed. Default(True) is set
to print. Provides raw (unchanged)
output as printed by the NxOS CLI
© 2012 Cisco and/or its affiliates. All rights reserved. 30
12/03/27 08:02:23 INFO mapred.JobClient: map 69% reduce 0% 12/03/27 08:02:24 INFO mapred.JobClient: map 77% reduce 0% 12/03/27 08:02:25 INFO mapred.JobClient: map 87% reduce 0% 12/03/27 08:02:26 INFO mapred.JobClient: map 96% reduce 9% 12/03/27 08:02:27 INFO mapred.JobClient: map 98% reduce 10% 12/03/27 08:02:28 INFO mapred.JobClient: map 100% reduce 10% 12/03/27 08:02:29 INFO mapred.JobClient: map 100% reduce 27% 12/03/27 08:02:30 INFO mapred.JobClient: map 100% reduce 29% 12/03/27 08:02:32 INFO mapred.JobClient: map 100% reduce 32% 12/03/27 08:02:35 INFO mapred.JobClient: map 100% reduce 84%
Hadoop Job Status
Buffer usage statistics from the switch while
running Hadoop TeraSort
Hadoop job status output while running a 1GB TeraSort using
8 nodes
2012/03/27 08:02:23 0 * 2012/03/27 08:02:24 3810 -‐-‐-‐-‐-‐* 2012/03/27 08:02:25 1127 -‐* 2012/03/27 08:02:26 0 * 2012/03/27 08:02:27 0 * 2012/03/27 08:02:28 0 * 2012/03/27 08:02:29 0 * 2012/03/27 08:02:30 0 * 2012/03/27 08:02:31 0 * 2012/03/27 08:02:32 4921 -‐-‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:33 4299 -‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:34 6929 -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐* 2012/03/27 08:02:35 0 *
Buffer Usage
• Consideration #5 – Feature Set - Using Python to Enhance Monitoring
Design Considerations
© 2012 Cisco and/or its affiliates. All rights reserved. 31
• What is the total port count needed?
• What is the end design / scale targeted?
• Is communication needed between servers, inside / outside POD
• Uniform speed or higher uplink with cut-through switching?
• What is the feature-set required?
Design Consideration #6 – Simplicity of the Network design – All in one Approach
31
Ingress UPC
Egress UPC
Unified Crossbar Fabric
© 2012 Cisco and/or its affiliates. All rights reserved. 32
NAT/PAT Classification and Translation
Output Buffer
1/10G Interfaces
Parsed/ Classified
Traffic ACL Table
4K
Unicast HOST Table 64K
Unicast/Multicast
Route Table (32K)
Packet ReWrite
NAT Translation Table (2K)
Egress Process
Block (EP)
• NAT uses VACL space for classifying and identifying the traffic for NAT translation based on ingress interface
• NAT translation table would provide actual translation info for packet ReWrite block for packet modification before sending the packet out of NAT interface
• For Static NAT, ACL and Translation Table are updated as soon as the NAT static config is added
• For dynamic NAT*, first packet is punted to CPU after ACL classifies it to be NAT flow and then software updates the translation table based on the flow info
VACL/NAT
* Post FCS
• Design Consideration #6 – Simplicity of the Network design – All in one Approach
© 2012 Cisco and/or its affiliates. All rights reserved. 33
• Traffic to be replicated is marked in the ingress flow
• The replication occurs in the queuing engine and the mirrored traffic is placed in one of two multicast queues
Design Consideration #6 – Feature set - Linerate SPAN
UC Queue 0 UC Queue 1 UC Queue 2 UC Queue 3 UC Queue 4 UC Queue 5 UC Queue 6 UC Queue 7 MC Queue 0 MC Queue 1 MC Queue 2 MC Queue 3
MC Queue 0 MC Queue 1 Egress
port 3
Egress port 2
Ingress Port 1
data
….
data
data
Nexus 3000
A
B
Sniffer
.
© 2012 Cisco and/or its affiliates. All rights reserved. 34
Design Consideration #7 – Security
34
• Use Hardware features: ACLs, PVLANs…
• Use OS level security when possible
© 2012 Cisco and/or its affiliates. All rights reserved. 35
• Precision Time Protocol: IEEE 1588v2
• Nanosecond Precision
Design Consideration #8 – Application Precision
35
N3064 N3064 N3064 N3064
N3016 N3016
Telecommunications
Financial trading
© 2012 Cisco and/or its affiliates. All rights reserved. 36
Applications @ Switch • Verify accuracy with 1PPS output • PONG for Hop-by-Hop Latency Measurements • Integration with ERSPAN for Accurate Timestamp of Monitored Traffic
Servers with PTP clients
Servers with PTP clients
Monitoring & performance
management system PTP-enabled Network
1PPS timing
1PPS timing 1PPS timing
• Design Consideration #8 – Application Precision
© 2012 Cisco and/or its affiliates. All rights reserved. 37
Clock with frequency synchronizer
CPU
PTP Stack, Clock Servo
PCI-Local Bus SPI
eUSB Flash
USB Conn Console
4GB DRAM
Management Ethernet
Front panel Ethernet interface
XLAUI/XFI
1PPS
Timing logic
N3548 Ethernet Switch
1 2
3 4
Monticello ASIC 5
1. 1588 packet is timestamped at ingress of ASIC to record the arrive time (t2)
2. Timestamp points to the first bit of the packet (following SFD)
3. Packet is copied to CPU with timestamp and destination port
4. The packet goes through PTP stack and other process
5. The packet is sent out at egress port. (The corresponding timestamp for the TX packet is available from the FIFO TX time stamp) ASIC records the packet’s departure timestamp and delivers it to the PTP stack.
Cisco Confidential © 2010 Cisco and/or its affiliates. All rights reserved. 38 38
© 2012 Cisco and/or its affiliates. All rights reserved. 39
Typical Specifications Algorithm Boost Features • Ultra Low Latency – <300ns • Active Latency/Buffer Monitoring • NAT @ Ultra Low Latency • Intelligent Traffic Mirroring • IEEE-1588 PTP w/Pulse Per Second
• 48x SFP+ – 100M / 1G / 10G / 40G
• Line rate L2/L3, Unicast & Multicast
• 18MB Packet Buffer
• 32K IPv4 Route, 64K Host, 8K MC
• 4K Flexible ACL / QoS
• Data Center TCP
© 2012 Cisco and/or its affiliates. All rights reserved. 40
L2 & L3, Unicast & Multicast
Independent of features enabled (NAT, ACL)
Independent of packet size
< 300 ns
Full mesh, 100% linerate
© 2012 Cisco and/or its affiliates. All rights reserved. 41
TimeStamp Packet
T0
T1 – T0
Min: 290ns Max: 312ns Avg: 298ns
Min: 285ns Max: 325ns Avg: 300ns
…
© 2012 Cisco and/or its affiliates. All rights reserved. 42
Shared Buffer
0
10
20
30
40
50
60
5% 10% 15% … 90% 95% 100%
# of
Sam
ples
Buffer Occupancy Level Percentage
0
10
20
30
40
50
60
5% 10% 15% … 90% 95% 100%
# of
Sam
ples
Buffer Occupancy Level Percentage
Packet
© 2012 Cisco and/or its affiliates. All rights reserved. 43
• Ethernet is now suitable for ULL applications: • Delays as low as 250ns • Same delay for L2 and L3 • No latency penalty when activate smart features (NAT/ACL) • Ultralow jitter (8ns worst case)
• SoC design combined with advanced features (buffers, monitoring, etc.) allowed high performance, ultra-low latency switching
• Pick carefully the features that you need for your network (availability, buffering, 10GE vs 1GE, Latency) in order to reach the required performance
Thank you.