UC Berkeley Casper Workshop, June 13 2014
Kevin Deierling, Mellanox Technologies – kevind at mellanox.com
100Gb/s Highways: Innovation Required at Every Layer
© 2014 Mellanox Technologies 2- Mellanox Technologies -
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive End-to-End InfiniBand and Ethernet Portfolio (VPI)
Host/Fabric Software, ICs, Switches/Gateways, Adapter Cards, Cables/Modules
[Diagram labels: Virtual Protocol Interconnect; Server / Compute; Switch / Gateway; Storage Front / Back-End; Metro / WAN; 56G IB & FCoIB; 56G InfiniBand; 10/40/56GbE & FCoE; 10/40/56GbE]
Bandwidth & Latency Roadmap
Heritage of High Speed, Low Latency Interconnect and Large Scale Systems
Scaling from two servers to 1000's of interconnected nodes
End-to-end connectivity solutions that deliver value for challenging workloads
Moore’s Law & Dennard Scaling
▪ Moore’s Law: Chip transistor count doubles roughly every two years
• Linear shrink of 30% results in half the area
• Keep the cost/area about constant while shrinking
▪ Dennard Scaling: As a MOSFET transistor shrinks it gets:
• Faster
• Lower power (constant power density)
• Smaller/lighter
Robert H. Dennard et al., IEEE JSSC, Oct 1974
No compromises: Everything gets better as transistors shrink
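The arithmetic behind these two bullets can be sketched in a few lines. This is an illustration, not material from the talk: the function names are hypothetical, and the Dennard factors are the textbook constant-field scaling relations (dimensions and voltage scale together), which the slide summarizes but does not spell out.

```python
def shrink_area_ratio(linear_shrink=0.30):
    """Area ratio after shrinking every linear dimension by `linear_shrink`."""
    scale = 1.0 - linear_shrink      # 30% shrink -> each dimension is 0.7x
    return scale * scale             # area goes as the square: about 0.49


def dennard_scaling(scale):
    """Textbook constant-field scaling factors (assumed, not from the slide).

    `scale` is the linear scale factor, e.g. 0.7 for a 30% shrink.
    """
    area = scale ** 2                   # transistor footprint
    delay = scale                       # gate delay shrinks with dimensions
    power_per_transistor = scale ** 2   # voltage and current both scale down
    return {
        "area": area,
        "delay": delay,
        "power_per_transistor": power_per_transistor,
        "power_density": power_per_transistor / area,  # stays at 1.0
    }


if __name__ == "__main__":
    print(f"area ratio after 30% linear shrink: {shrink_area_ratio():.2f}")
    print(dennard_scaling(0.7))
```

Under these assumptions a 0.7x shrink roughly halves the area per transistor while power density stays flat, which is the "no compromises" regime the slide describes.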
Now for the Bad News …
▪ Dennard scaling broke about a decade ago
• Both power density & performance stopped scaling
▪ Higher power & new process/fabs = higher costs
▪ The economic half of Moore’s law is crumbling too
http://www.ni.com/white-paper/14565/en/
But Half of Moore’s Law Still Going - Performance thru Parallelism
▪ So with only half of Moore’s law intact, what is a multi-billion-dollar fab to do?
▪ Not faster cores but more and more of them …
http://www.ni.com/white-paper/14565/en/
Multi-Core Processors & Virtualization Are a Perfect Match
▪ Hardware Multi-Core Parallelism
• Natural consequence of Moore’s Law continuing in conjunction with the end of Dennard scaling
• Multi-threaded programming is hard
• Difficult to achieve perfect parallelism
- Gustafson’s Law: Speedup = s’ + p*n (ideally the serial fraction s’ = 0)
▪ Virtualization
• Multiple Virtual Machines on a single physical server
• Completely separate tasks are easy to parallelize
• The Virtual Machine is a software construct supported by a software hypervisor
- Software is easy to move – needs virtual IP addresses
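The Gustafson's Law formula on this slide can be made concrete with a small sketch. The function name is hypothetical, and the code assumes the usual convention that the serial and parallel fractions sum to one (p = 1 - s'), which the slide implies but does not state.

```python
def gustafson_speedup(n, serial_fraction):
    """Scaled speedup s' + p*n on n processors, assuming p = 1 - s'."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    return serial_fraction + parallel_fraction * n


if __name__ == "__main__":
    for s in (0.0, 0.1, 0.5):
        print(f"s'={s:.1f}: speedup on 16 cores = {gustafson_speedup(16, s):.2f}")
```

With s' = 0 the speedup equals the core count, the ideal case the slide mentions. Independent virtual machines approximate that case because they share almost no serial work.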
100Gb/s Requires Innovating at Every Layer
▪ Application Layer
• Message format
▪ Presentation Layer
• Coding 1’s and 0’s
▪ Session Layer
• Authentication, Permissions, Persistence
▪ Transport Layer
• End-to-end error control
▪ Network Layer
• Addressing, routing
▪ Link Layer
• Error detection, flow control
▪ Physical Layer
• Bit stream, physical medium, mapping bits to analog symbols
HYBRID MODEL
Parallel Processing of Big Data Problems & Decision Analytics
▪ Large search indexes are now over 100 petabytes!
▪ If you are spending $10 million on a cluster, efficiency matters
▪ Better accelerate every part of Hadoop processing
• Direct access of processes to file system
• RDMA-enabled distributed file systems
- HDFS
- GPFS
- Ceph
• Accelerate Hadoop itself (map shuffle)
Innovation Required @ 100Gb/s
▪ Transport Layer: Innovation Required
• TCP/IP dropped packets are a non-starter.
• Rear-ending someone is not the best way to figure out there is congestion
• Explicit notification required
• + RDMA, virtual NICs, virtual traffic steering, affinity
▪ Network Layer
• Virtual as well as physical routing (easy VM migration)
▪ Link Layer
• Lossless networks using flow control
- PFC (on/off) flow control is a blunt instrument
- Round trip latency of 1us requires ~40Kbytes buffering per logical link
- IETF starting to consider credit based flow control for Ethernet modeled after InfiniBand
▪ Physical Layer
• 100Gb/s signaling means 10ps symbol period!!
- 3 mm pulse of light in free space!
- Much less than 1cm on FR4 … Not feasible at this rate
• Lower symbol rate required through either:
- Parallel streams: ex: 4x25Gb/s
- Multi-bit/symbol: ex: PAM4, WDM
TCP/IP Implicit Congestion Notification: aka dropped packets and timeouts
PFC: Priority Flow Control
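The physical-layer numbers on this slide follow from simple arithmetic, sketched below. The helper names are hypothetical. The lane and bits-per-symbol parameters model the two options the slide lists (parallel streams and multi-bit symbols). Treating PAM4 as 2 bits per symbol is standard, but the combinations shown are illustrative and not claims about any specific product.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # free-space speed of light


def symbol_period_ps(bit_rate_gbps, lanes=1, bits_per_symbol=1):
    """Symbol period in picoseconds for one lane of a multi-lane link."""
    symbol_rate_hz = bit_rate_gbps * 1e9 / (lanes * bits_per_symbol)
    return 1e12 / symbol_rate_hz


def pulse_length_mm(period_ps):
    """Distance light travels in free space during one symbol period."""
    return SPEED_OF_LIGHT_M_PER_S * period_ps * 1e-12 * 1e3


if __name__ == "__main__":
    options = [
        ("100G serial, 1 bit/symbol", 1, 1),
        ("4 x 25G parallel streams", 4, 1),
        ("1 lane, PAM4 (2 bits/symbol)", 1, 2),
    ]
    for name, lanes, bits in options:
        t = symbol_period_ps(100, lanes, bits)
        print(f"{name}: {t:.0f} ps per symbol, "
              f"{pulse_length_mm(t):.1f} mm in free space")
```

A single serial 100Gb/s stream gives the 10ps period and roughly 3 mm pulse quoted above. Splitting into four lanes stretches the period to 40ps per lane, which is why 4x25Gb/s is the practical starting point.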
RDMA: Critical for Efficient use of Data Center Resources
ZERO Copy Remote Data Transfer
Low Latency, High Performance Data Transfers
Kernel Bypass
Protocol Offload
InfiniBand - 56Gb/s
RoCE* – 40Gb/s
* RDMA over Converged Ethernet
[Diagram labels: Application and Buffer on each of two hosts; USER, KERNEL, HARDWARE layers]
RDMA: How it Works
RDMA over InfiniBand or Ethernet
[Diagram labels: RACK 1 (Application 1) and RACK 2 (Application 2); USER, KERNEL, HARDWARE layers; OS, NIC, HCA, TCP/IP; Buffer 1 shown at each stage of the transfer]
Need to Accelerate Virtual Overlay Networks!
Overlay Network Virtualization: Isolation, Simplicity, Scalability
Virtual Overlay Networks Simplify Management and VM Migration
ConnectX-3 Pro Overlay Accelerators Enable Bare Metal Performance
[Diagram labels: Physical View: two servers (VM1 to VM4, VM5 to VM8) connected by Mellanox SDN Switches & Routers; Virtual View: Virtual Domains 1 to 3 over NVGRE/VXLAN Overlay Networks; OpenFlow; Virtual Network Management API]
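To see why overlays add per-packet work worth offloading, here is a minimal sketch of the VXLAN encapsulation header. The slide gives no packet formats, so the layout below is taken from the VXLAN specification (RFC 7348) rather than from the talk. The function names are hypothetical, and the overhead constant assumes an untagged outer Ethernet frame over IPv4.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid (RFC 7348)

# Outer Ethernet (14) + IPv4 (20) + UDP (8) + VXLAN (8) bytes per packet,
# assuming no VLAN tag on the outer frame.
VXLAN_OVERHEAD_IPV4 = 14 + 20 + 8 + 8


def vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit virtual network ID."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte then 24 reserved bits.
    # Word 2: 24-bit VNI then 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_vni(header):
    """Extract the VNI from a VXLAN header, checking the valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8


if __name__ == "__main__":
    hdr = vxlan_header(5001)
    print(hdr.hex(), vxlan_vni(hdr), VXLAN_OVERHEAD_IPV4)
```

Every packet between virtual domains carries this wrapper, so checksum and segmentation handling for the inner packet has to happen either in host software or in the adapter.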
100Gb/s in QSFP28 Package
▪ To fit 100Gb/s in QSFP package requires:
• Low power electronics
• 4x25+ Gb/s modulators and detectors
• Silicon photonics integration:
- no lenses for the laser
- no isolators
- no TEC
[Diagram labels: Mellanox QSFP Module; TX (Modulator) with Modulator Driver & CDR; RX (Photo Detector) with TIA & CDR]
TIA – Transimpedance Amplifier
CDR – Clock Data Recovery
Single Laser Reduces Power and Cost
Fully integrated parallel optical engine Tx and Rx chips
[Diagram labels: Ge PDs; Lasers; Modulators; Splitter]
Franz-Keldysh Modulator scales to >50G
▪ 5dB ER with 2.8 Vpp
▪ 2 Vpp possible
▪ Integrates w/WDM
[Figure labels: 28GHz; 25Gb/s Eye]
Wavelength Division Multiplexing
▪ Wavelength division multiplexing
• Many parallel channels over a single fiber, reducing cabling complexity and cost
▪ WDM: 100G Ethernet LR4 uses 4 wavelengths
Echelle Gratings as Mux/Demux
▪ Echelle gratings scale from 4 to 40+ channels
▪ As much as 10x smaller than Arrayed Waveguide Gratings
▪ Better control of wavelength registration and very low cross talk
[Figure labels: input; output; Reflective facet; Slab; SEM of etched facet; Spectra on a 12 channel multiplexer (Tx, Rx)]
Echelle Gratings Enable 4x25G WDM over Single Fiber
▪ QSFP provides great density
▪ WDM link uses standard SMF duplex fiber (same as 10G today)
▪ 2 km reach
[Diagram labels: Laser bank; Modulator bank; Mux; DeMux; Ge PIN detector bank; Ge Detectors; VOA; Lasers; Power Monitors; Modulators]
WDM Provides Scalability
▪ Have demonstrated 1+Tb/s link
[Plot: TE/TM Spectrum, Loss (dB) vs. Wavelength (nm), 1525 to 1565 nm]
[Plot: Responsivity (A/W) vs. Wavelength (nm), 1530 to 1560 nm, TE and TM]
40 channel WDM Receiver Layout: 16mm x 11.5mm
Connecting Single Fiber is Easier and Cheaper than Connecting 40!
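The scaling claim is channel-count arithmetic, sketched below. The function name is hypothetical. The 25 Gb/s per-channel rate is carried over from the 4x25G slide as an assumption; the talk states only "1+Tb/s" for the 40-channel case, not the per-channel rate used in the demonstration.

```python
def wdm_aggregate_gbps(channels, per_channel_gbps=25.0):
    """Aggregate rate of a WDM link: one wavelength per channel on one fiber."""
    if channels < 1:
        raise ValueError("need at least one channel")
    return channels * per_channel_gbps


if __name__ == "__main__":
    for channels in (4, 12, 40):
        total = wdm_aggregate_gbps(channels)
        print(f"{channels:2d} wavelengths x 25 Gb/s = {total:.0f} Gb/s on one fiber")
```

Under that assumption, 4 wavelengths give the 100Gb/s QSFP link and 40 wavelengths reach 1 Tb/s, all without adding fibers.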
Summary
▪ Commercial 100Gb/s is around the corner
• But really hard!
▪ Dennard Scaling has cracked
▪ And with it Moore’s Law will break, not because of the laws of physics but because of the laws of economics
▪ So scaling will be at the cluster level, driving the requirement for low latency, high performance interconnects
▪ Innovation required at every level
• Application layer: IPC, application direct-access file systems
• Transport layer: RDMA, Application Acceleration, Convergence
• Network: Better be thinking virtually
• Link: Lossless but not just stop lights
• Phy: 100Gb/s @ VCSEL limits. Silicon photonics required to move beyond 100Gb/s. WDM in the data center is not far behind.
▪ Market drivers will be efficiency and performance for Cloud, Web 2.0, Big Data
Thank You! Questions?