10 Gigabit Ethernet vs InfiniBand (Basics)
TRANSCRIPT
![Page 1: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/1.jpg)
InfiniBand and 10-Gigabit Ethernet for Dummies
A Tutorial at Supercomputing '09

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Pavan Balaji
Argonne National Laboratory
E-mail: [email protected]
http://www.mcs.anl.gov/~balaji

Matthew Koop
NASA Goddard
E-mail: [email protected]
http://www.cse.ohio-state.edu/~koop
![Page 2: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/2.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 3: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/3.jpg)
Current and Next Generation Applications and Computing Systems

• Big demand for:
– High Performance Computing (HPC)
– File-systems, multimedia, database, visualization
– Enterprise multi-tier datacenters
• Processor performance continues to grow
– Chip density doubling every 18 months (multi-core)
• Commodity networking also continues to grow
– Increase in speed and features at affordable pricing
• Clusters are increasingly popular for designing next-generation computing systems
– Scalability, modularity, and upgradeability with commodity compute and network technologies
![Page 4: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/4.jpg)
Trends for Computing Clusters in the Top 500 List

• Top 500 list of Supercomputers (www.top500.org): share of clusters over time

| Date | Clusters | Date | Clusters |
| --- | --- | --- | --- |
| Jun. 2001 | 33/500 (6.6%) | Nov. 2005 | 360/500 (72.0%) |
| Nov. 2001 | 43/500 (8.6%) | Jun. 2006 | 364/500 (72.8%) |
| Jun. 2002 | 80/500 (16.0%) | Nov. 2006 | 361/500 (72.2%) |
| Nov. 2002 | 93/500 (18.6%) | Jun. 2007 | 373/500 (74.6%) |
| Jun. 2003 | 149/500 (29.8%) | Nov. 2007 | 406/500 (81.2%) |
| Nov. 2003 | 208/500 (41.6%) | Jun. 2008 | 400/500 (80.0%) |
| Jun. 2004 | 291/500 (58.2%) | Nov. 2008 | 410/500 (82.0%) |
| Nov. 2004 | 294/500 (58.8%) | Jun. 2009 | 410/500 (82.0%) |
| Jun. 2005 | 304/500 (60.8%) | Nov. 2009 | To be announced |
![Page 5: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/5.jpg)
Integrated High-End Computing Environments

[Diagram: a compute cluster (compute nodes and a frontend node) connected over a LAN to a storage cluster (a meta-data manager plus I/O server nodes holding meta-data and data), and over a LAN/WAN to an enterprise multi-tier datacenter for visualization and mining: Tier 1 routers/servers, Tier 2 application servers, and Tier 3 database servers, interconnected by switches.]
![Page 6: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/6.jpg)
Networking and I/O Requirements

• Good Systems Area Network with excellent performance (low latency and high bandwidth) for inter-processor communication (IPC) and I/O
• Good Storage Area Network with high-performance I/O
• Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
• Quality of Service (QoS) for interactive applications
• RAS (Reliability, Availability, and Serviceability)
• All of the above at low cost
![Page 7: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/7.jpg)
Major Components in Computing Systems

• Hardware components
– Processing core and memory sub-system
– I/O bus
– Network adapter
– Network switch
• Software components
– Communication software

[Diagram: two processors (P0, P1), each with four cores and memory, attached through an I/O bus to a network adapter and on to a network switch; processing, I/O, and network bottlenecks are marked along this path.]
![Page 8: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/8.jpg)
Processing Bottlenecks in Traditional Protocols

• E.g., TCP/IP, UDP/IP
• Generic architecture for all network interfaces
• The host handles almost all aspects of communication
– Data buffering (copies on sender and receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)
• Signaling between different layers
– Hardware interrupt whenever a packet arrives or is sent
– Software signals between layers to handle protocol processing at different priority levels
![Page 9: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/9.jpg)
Bottlenecks in Traditional I/O Interfaces and Networks

• Traditionally relied on bus-based technologies
– E.g., PCI, PCI-X, shared Ethernet
– One bit per wire
– Performance increased by raising the clock speed or widening the bus
– Not scalable:
• Cross-talk between bits
• Skew between wires
• Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds
![Page 10: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/10.jpg)
InfiniBand ("Infinite Bandwidth") and 10-Gigabit Ethernet

• Industry networking standards
• Processing bottleneck:
– Hardware-offloaded protocol stacks with user-level communication access
• Network bottleneck:
– Bit-serial differential signaling
• An independent pair of wires transmits independent data (called a lane)
• Scalable to any number of lanes
• Easy to increase the clock speed of a lane (since each lane consists of only a pair of wires)
![Page 11: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/11.jpg)
Interplay with I/O Technologies

• InfiniBand was initially intended to replace I/O bus technologies with a networking-like technology
– That is, bit-serial differential signaling
– With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this goal has become mostly irrelevant
• Both IB and 10GE today come as network adapters that plug into existing I/O technologies
![Page 12: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/12.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 13: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/13.jpg)
Trends in I/O Interfaces with Servers
• Network performance depends on:
– Networking technology (adapter + switch)
– Network interface (the "last mile" bottleneck)

| Technology | Introduced | Bandwidth |
| --- | --- | --- |
| PCI | 1990 | 33 MHz/32-bit: 1.05 Gbps (shared, bidirectional) |
| PCI-X | 1998 (v1.0), 2003 (v2.0) | 133 MHz/64-bit: 8.5 Gbps; 266-533 MHz/64-bit: 17 Gbps (shared, bidirectional) |
| HyperTransport (HT) by AMD | 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1) | 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) |
| PCI-Express (PCIe) by Intel | 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard) | Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps) |
| Intel QuickPath | 2009 | 153.6-204.8 Gbps per link |
![Page 14: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/14.jpg)
Growth in Commodity Network Technology

Representative commodity networks and their entry into the market:

| Technology | Bandwidth |
| --- | --- |
| Ethernet (1979 - ) | 10 Mbit/sec |
| Fast Ethernet (1993 - ) | 100 Mbit/sec |
| Gigabit Ethernet (1995 - ) | 1000 Mbit/sec |
| ATM (1995 - ) | 155/622/1024 Mbit/sec |
| Myrinet (1993 - ) | 1 Gbit/sec |
| Fibre Channel (1994 - ) | 1 Gbit/sec |
| InfiniBand (2001 - ) | 2 Gbit/sec (1X SDR) |
| 10-Gigabit Ethernet (2001 - ) | 10 Gbit/sec |
| InfiniBand (2003 - ) | 8 Gbit/sec (4X SDR) |
| InfiniBand (2005 - ) | 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR) |
| InfiniBand (2007 - ) | 32 Gbit/sec (4X QDR) |
| InfiniBand (2011 - ) | 64 Gbit/sec (4X EDR) |

Bandwidth has grown 16 times in the last 9 years.
![Page 15: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/15.jpg)
Capabilities of High-Performance Networks

• Intelligent network interface cards
• Support entire protocol processing completely in hardware (hardware protocol offload engines)
• Provide a rich communication interface to applications
– User-level communication capability
– Gets rid of intermediate data buffering requirements
• No software signaling between communication layers
– All layers are implemented on a dedicated hardware unit, not on a shared host CPU
![Page 16: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/16.jpg)
Previous High-Performance Network Stacks

• Virtual Interface Architecture (VIA)
– Standardized by Intel, Compaq, and Microsoft
• Fast Messages (FM)
– Developed by UIUC
• Myricom GM
– Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance communication requirements
– Hardware-offloaded protocol stack
– Support for fast and secure user-level access to the protocol stack
![Page 17: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/17.jpg)
IB Trade Association
• The IB Trade Association was formed by seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry members participated in the effort to define the IB architecture specification
• The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
– Latest version 1.2.1 released January 2008
• http://www.infinibandta.org
![Page 18: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/18.jpg)
IB Hardware Acceleration
• Some IB models have multiple hardware accelerators
– E.g., Mellanox IB adapters
• Protocol offload engines
– Completely implement layers 2-4 in hardware
• Additional hardware-supported features are also present
– RDMA, multicast, QoS, fault tolerance, and many more
![Page 19: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/19.jpg)
10-Gigabit Ethernet Consortium
• The 10GE Alliance was formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
• http://www.ethernetalliance.org
• Upcoming 40-Gbps (servers) and 100-Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links
![Page 20: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/20.jpg)
Ethernet Hardware Acceleration
• Interrupt coalescing
– Improves throughput, but degrades latency
• Jumbo frames
– No latency impact; incompatible with existing switches
• Hardware checksum engines
– Checksum performed in hardware is significantly faster
– Shown to have minimal benefit independently
• Segmentation offload engines
– Supported by most 10GE products because of its backward compatibility; considered "regular" Ethernet
– Heavily used in the "server-on-steroids" model
![Page 21: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/21.jpg)
TOE and iWARP Accelerators
• TCP Offload Engines (TOE)
– Hardware acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Strictly refers to the IC on the network adapter that implements TCP/IP
– In practice, usually refers to the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by the IETF and the RDMA Consortium
– Supports acceleration features (like IB) for Ethernet
• http://www.ietf.org & http://www.rdmaconsortium.org
![Page 22: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/22.jpg)
Converged Enhanced Ethernet
• Popularly known as Datacenter Ethernet
• Combines a number of (optional) Ethernet standards under one umbrella; sample enhancements include:
– Priority-based flow control: link-level flow control for each Class of Service (CoS)
– Enhanced Transmission Selection: bandwidth assignment to each CoS
– Datacenter Bridging Exchange protocols: congestion notification, priority classes
– End-to-end congestion notification: per-flow congestion control to supplement per-link flow control
![Page 23: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/23.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 24: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/24.jpg)
IB, 10GE and their Convergence
• InfiniBand
– Architecture and Basic Hardware Components
– Novel Features
– IB Verbs Interface
– Management and Services
• 10-Gigabit Ethernet Family
– Architecture and Components
– Existing Implementations of 10GE/iWARP
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect
– InfiniBand over Ethernet
– RDMA over Converged Enhanced Ethernet
![Page 25: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/25.jpg)
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
• Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management
![Page 26: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/26.jpg)
A Typical IB Network

Three primary components:
• Channel adapters
• Switches/routers
• Links and connectors
![Page 27: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/27.jpg)
Components: Channel Adapters

• Used by processing and I/O units to connect to the fabric
• Consume and generate IB packets
• Programmable DMA engines with protection features
• May have multiple ports
– Independent buffering channeled through Virtual Lanes
• Host Channel Adapters (HCAs)
![Page 28: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/28.jpg)
Components: Switches and Routers

• Relay packets from one link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast
![Page 29: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/29.jpg)
Components: Links & Repeaters

• Network links
– Copper, optical, or printed-circuit wiring on a backplane
– Not directly addressable
• Traditional adapters are built for copper cabling
– Restricted by cable length (signal integrity)
• Intel Connects: optical cables with copper-to-optical conversion hubs (acquired by Emcore)
– Up to 100 m length
– 550 picoseconds copper-to-optical conversion latency
• Also available from other vendors (Luxtera) (Courtesy Intel)
• Repeaters (Vol. 2 of the InfiniBand specification)
![Page 30: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/30.jpg)
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
• Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management
![Page 31: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/31.jpg)
InfiniBand Communication Model

[Diagram: basic InfiniBand communication semantics.]
![Page 32: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/32.jpg)
Queue Pair Model

• Communication in InfiniBand uses a Queue Pair (QP) model for all data transfer
• Each QP has two queues:
– Send Queue (SQ)
– Receive Queue (RQ)
• A QP must be linked to a Completion Queue (CQ)
– Gives notification of operation completion from QPs (a verbs sketch follows below)

[Diagram: a QP (send and receive queues) and a CQ on an InfiniBand device.]
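The QP/CQ objects above map directly onto the OpenFabrics verbs API covered later in this tutorial. As a minimal sketch (assuming libibverbs, with the device context and protection domain obtained elsewhere), a CQ and a QP could be created like this:

```c
/* Minimal sketch (not a full program): creating a CQ and a QP with the
 * OpenFabrics libibverbs API. "ctx" and "pd" are assumed to come from
 * ibv_open_device() and ibv_alloc_pd(); error handling is abbreviated. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ may serve both the send and receive queues of a QP. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* CQE depth */, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 64,   /* outstanding send WQEs (SQ depth)    */
            .max_recv_wr  = 64,   /* outstanding receive WQEs (RQ depth) */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,    /* Reliable Connection; see transport services later */
    };
    return ibv_create_qp(pd, &attr);
}
```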
![Page 33: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/33.jpg)
Queue Pair Model: WQEs and CQEs

• Entries used for QP communication are data structures called Work Queue Entries (WQEs)
– Called "Wookies"
• Completed WQEs are placed in the CQ with additional information
– They are then called CQEs ("Cookies")

[Diagram: WQEs on a QP's send/recv queues and CQEs on the CQ of an InfiniBand device.]
![Page 34: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/34.jpg)
WQEs and CQEs
• Send WQEs contain data about which buffer to send from, how much to send, etc.
• Receive WQEs contain data about which buffer to receive into, how much to receive, etc.
• CQEs contain data about which QP the completed WQE was posted on and how much data actually arrived (see the polling sketch below)
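A hedged sketch of pulling CQEs out of a CQ with libibverbs; `ibv_poll_cq()` is the standard non-blocking poll, and the fields read below are the ones the slide mentions (the batch size of 16 is illustrative):

```c
/* Sketch: draining completed WQEs (now CQEs) from a completion queue. */
#include <infiniband/verbs.h>
#include <stdio.h>

void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);   /* returns up to 16 CQEs, never blocks */

    for (int i = 0; i < n; i++) {
        if (wc[i].status != IBV_WC_SUCCESS) {
            fprintf(stderr, "WQE %llu failed: %s\n",
                    (unsigned long long)wc[i].wr_id,
                    ibv_wc_status_str(wc[i].status));
            continue;
        }
        /* wc[i].qp_num identifies the QP the WQE was posted on;
         * wc[i].byte_len is how much data actually arrived (receives). */
        printf("QP 0x%x: WQE %llu completed, %u bytes\n",
               wc[i].qp_num, (unsigned long long)wc[i].wr_id, wc[i].byte_len);
    }
}
```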
![Page 35: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/35.jpg)
Memory Registration

Before we do any communication: all memory used for communication must be registered.

1. Registration request
• Send virtual address and length
2. Kernel handles the virtual-to-physical mapping and pins the region into physical memory
• A process cannot map memory that it does not own (security!)
3. HCA caches the virtual-to-physical mapping and issues a handle
• Includes an l_key and r_key
4. Handle is returned to the application (a verbs sketch follows below)
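In libibverbs terms, steps 1-4 collapse into a single `ibv_reg_mr()` call; a minimal sketch (the buffer allocation and access flags are illustrative):

```c
/* Sketch: registering a buffer with the HCA. The returned ibv_mr holds
 * the l_key/r_key handles issued in step 3. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* Access flags state what local and remote peers may do with the
     * region; the remote flags matter for RDMA (next slides). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    /* mr->lkey authorizes local use; mr->rkey is handed to a remote
     * peer to authorize RDMA into this region. */
    return mr;
}
```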
![Page 36: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/36.jpg)
Memory Protection

For security, keys are required for all operations that touch buffers.

• To send or receive data, the l_key must be provided to the HCA
– The HCA verifies access to local memory
• For RDMA, the initiator must have the r_key for the remote virtual address
– Possibly exchanged with a send/recv (an illustrative exchange record follows below)
– The r_key is needed for RDMA operations and is not encrypted in IB
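The remote virtual address and r_key are usually exchanged out-of-band before any RDMA is issued. A small illustrative sketch; the struct name and layout are hypothetical, not part of any specification:

```c
/* Hypothetical out-of-band exchange record (illustrative only): the
 * minimum a peer must learn before it can issue RDMA to our buffer. */
#include <stdint.h>

struct rdma_buffer_info {
    uint64_t addr;  /* remote virtual address of the registered buffer */
    uint32_t rkey;  /* r_key issued by the HCA at registration time    */
    uint32_t len;   /* usable length of the region                     */
};
/* Typically serialized and exchanged over a send/recv pair or a plain
 * socket; note again that the r_key travels unencrypted on IB. */
```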
![Page 37: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/37.jpg)
Communication in the Channel Semantics (Send/Receive Model)

• The processor is involved only to:
1. Post a receive WQE
2. Post a send WQE
3. Pull completed CQEs out of the CQ
• The send WQE contains information about the send buffer
• The receive WQE contains information about the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data
• Reliability is handled by a hardware ACK between the devices (a posting sketch follows below)

[Diagram: sender and receiver, each with a processor, memory, a send/receive buffer, and a QP plus CQ on its InfiniBand device.]
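A sketch of the two postings in channel semantics with libibverbs; `lkey` comes from the memory registration shown earlier, and the `wr_id` values are illustrative:

```c
/* Sketch of channel (send/receive) semantics: the receiver pre-posts a
 * receive WQE, the sender posts a send WQE. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_recv(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);   /* step 1 on the receiver */
}

int post_send(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,          /* channel semantics           */
        .send_flags = IBV_SEND_SIGNALED,    /* produce a CQE on completion */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);    /* step 2 on the sender */
}
```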
![Page 38: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/38.jpg)
Communication in the Memory Semantics (RDMA Model)

• The initiator processor is involved only to:
1. Post a send WQE
2. Pull the completed CQE out of the send CQ
• No involvement from the target processor
• The send WQE contains information about both the send buffer and the receive buffer
• Reliability is handled by a hardware ACK between the devices (a posting sketch follows below)

[Diagram: initiator and target, each with memory, buffers, a QP and a CQ; data moves directly between registered buffers.]
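The same posting interface expresses memory semantics; only the initiator posts, naming both buffers. A sketch, with the remote address and r_key assumed to have been exchanged beforehand:

```c
/* Sketch of memory (RDMA) semantics: the WQE names both the local
 * buffer (lkey) and the remote buffer (remote_addr + rkey). */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, void *lbuf, uint32_t len, uint32_t lkey,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)lbuf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,    /* IBV_WR_RDMA_READ for reads */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);    /* target CPU is never involved */
}
```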
![Page 39: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/39.jpg)
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
• Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management
![Page 40: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/40.jpg)
Hardware Protocol Offload
[Diagram: the IB protocol stack; complete hardware implementations exist.]
![Page 41: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/41.jpg)
Link Layer Capabilities
• CRC-based Data Integrity
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
• IB WAN Capability
![Page 42: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/42.jpg)
CRC-based Data Integrity
• Two forms of CRC achieve both early error detection and end-to-end reliability
– Invariant CRC (ICRC) covers fields that do not change per link (per network hop)
• E.g., routing headers (if there are no routers), transport headers, data payload
• 32-bit CRC (compatible with the Ethernet CRC)
• End-to-end reliability (does not include the I/O bus)
– Variant CRC (VCRC) covers everything
• Erroneous packets do not have to reach the destination before being discarded
• Early error detection
![Page 43: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/43.jpg)
Buffering and Flow Control
• IB provides absolute credit-based flow control
– The receiver guarantees that it has enough space allotted for N blocks of data
– The receiver occasionally updates the available credits
• Credits have no relation to the number of messages, only to the total amount of data being sent
– One 1 MB message is equivalent to 1024 1 KB messages (except for rounding off at message boundaries); a toy model follows below
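A toy model (not IB wire format) of the accounting described above; credits cover bytes, never message counts:

```c
/* Toy sketch of absolute credit-based flow control: the sender may
 * transmit only while the receiver-granted byte credit covers the data. */
#include <stdbool.h>
#include <stdint.h>

struct credit_state {
    uint64_t granted;   /* total bytes the receiver has allotted so far */
    uint64_t sent;      /* total bytes the sender has transmitted       */
};

static bool can_send(const struct credit_state *s, uint32_t bytes)
{
    /* One 1 MB message consumes the same credit as 1024 1 KB messages. */
    return s->sent + bytes <= s->granted;
}

static void on_credit_update(struct credit_state *s, uint64_t new_granted)
{
    if (new_granted > s->granted)   /* occasional update from the receiver */
        s->granted = new_granted;
}
```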
![Page 44: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/44.jpg)
Link Layer Capabilities
• CRC-based Data Integrity
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
• IB WAN Capability
![Page 45: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/45.jpg)
Virtual Lanes
• Multiple virtual links within the same physical link
– Between 2 and 16
• Separate buffers and flow control per lane
– Avoids head-of-line blocking
• VL15 is reserved for management
• Each port supports one or more data VLs
![Page 46: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/46.jpg)
Service Levels and QoS
• Service Level (SL):
– Packets may operate at one of 16 different SLs
– Their meaning is not defined by IB
• SL-to-VL mapping:
– The SL determines which VL on the next link is to be used
– Each port (switches, routers, end nodes) has an SL-to-VL mapping table configured by subnet management
• Partitions:
– Fabric administration (through the Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows
![Page 47: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/47.jpg)
Traffic Segregation Benefits

• Segregation of server, network and storage traffic (IPC, load balancing, web caches, ASP) on the same physical network
• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link
• Provides the benefits of independent, separate networks while eliminating the cost and difficulty of maintaining two or more networks

[Diagram: servers, an IP network (routers, switches, VPNs, DSLAMs) and a storage area network (RAID, NAS, backup) sharing one InfiniBand fabric over separate Virtual Lanes. Courtesy: Mellanox Technologies]
![Page 48: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/48.jpg)
Switching (Layer-2 Routing) and Multicast

• Each port has one or more associated Local Identifiers (LIDs)
– Switches look up which port to forward a packet to based on its destination LID (DLID)
– This information is maintained at the switch
• For multicast packets, the switch needs to maintain multiple output ports to forward the packet to
– The packet is replicated on each appropriate output port
– Ensures at-most-once delivery and loop-free forwarding
– There is an interface for a group management protocol
• Create, join/leave, prune, delete group
![Page 49: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/49.jpg)
Destination-based Switching/Routing

• Switching: IB supports Virtual Cut-Through (VCT)
• Routing: unspecified by the IB spec
– Up*/Down* and Shift are popular routing engines supported by OFED
• Fat-tree is a popular topology for IB clusters
– Different over-subscription ratios may be used

[Diagram: an example IB switch block diagram (Mellanox 144-port) with spine blocks and leaf blocks.]
![Page 50: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/50.jpg)
IB Switching/Routing: An Example

• Someone has to set up these forwarding tables and give every port an LID
– The "Subnet Manager" does this work (more discussion on this later)
• Different routing algorithms may give different paths (a toy lookup sketch follows below)

[Diagram: spine and leaf blocks; end nodes with LID 2 and LID 4; a switch forwarding table mapping DLID 2 to out-port 1 and DLID 4 to out-port 4.]
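A toy sketch of the lookup a switch performs; the structure is illustrative, not an actual switch implementation:

```c
/* Toy sketch of destination-LID forwarding: a linear forwarding table
 * indexed by DLID, programmed by the Subnet Manager, yields the port. */
#include <stdint.h>

#define UNICAST_LIDS 49152   /* unicast LID space: 0x0001 - 0xBFFF */

struct ib_switch {
    uint8_t out_port[UNICAST_LIDS];   /* filled in by the Subnet Manager */
};

static uint8_t forward(const struct ib_switch *sw, uint16_t dlid)
{
    return sw->out_port[dlid];   /* e.g., DLID 2 -> port 1, DLID 4 -> port 4 */
}
```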
![Page 51: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/51.jpg)
IB Multicast Example

[Diagram: a multicast packet replicated across the fabric.]
![Page 52: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/52.jpg)
IB WAN Capability
• Getting increased attention for:
– Remote storage, remote visualization
– Cluster aggregation (cluster-of-clusters)
• IB-optical switches from multiple vendors
– Obsidian Research Corporation: www.obsidianresearch.com
– Network Equipment Technology (NET): www.net.com
– Layer 1 changes from copper to optical; everything else stays the same
• Low-latency copper-optical-copper conversion
• Large link-level buffers for flow control
– Data messages do not have to wait for round-trip hops
– Important in the wide-area network
![Page 53: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/53.jpg)
Hardware Protocol Offload
[Diagram: the IB protocol stack; complete hardware implementations exist.]
![Page 54: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/54.jpg)
IB Network Layer Capabilities
• Most capabilities are similar to those of the link layer, but as applied to IB routers
– Routers can send packets across subnets (subnets are management domains, not administrative domains)
– Subnet management packets are consumed by routers, not forwarded to the next subnet
• Several additional features as well
– E.g., routing and flow labels
![Page 55: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/55.jpg)
Routing and Flow Labels
• Routing follows the IPv6 packet format
– Easy interoperability with wide-area translations
– The link layer might still need to be translated to the appropriate layer-2 protocol (e.g., Ethernet, SONET)
• Flow labels allow routers to specify which packets belong to the same connection
– Switches can optimize communication by sending packets with the same label in order
– Flow labels can change in the router, but packets belonging to one label will always change together
![Page 56: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/56.jpg)
Hardware Protocol Offload
[Diagram: the IB protocol stack; complete hardware implementations exist.]
![Page 57: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/57.jpg)
IB Transport Services

| Service Type | Connection Oriented | Acknowledged | Transport |
| --- | --- | --- | --- |
| Reliable Connection | Yes | Yes | IBA |
| Unreliable Connection | Yes | No | IBA |
| Reliable Datagram | No | Yes | IBA |
| Unreliable Datagram | No | No | IBA |
| RAW Datagram | No | No | Raw |

• Each transport service can have zero or more QPs associated with it
– E.g., you can have 4 QPs based on RC and one based on UD (see the sketch below)
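In verbs terms, the service type is fixed when a QP is created. A small sketch, assuming the CQ/PD setup shown earlier (RD has no commonly used verbs QP type):

```c
/* Sketch: the transport service is chosen at QP creation time via
 * qp_type; RC, UC and UD correspond to rows of the table above. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp_of_type(struct ibv_pd *pd, struct ibv_cq *cq,
                                 enum ibv_qp_type type) /* IBV_QPT_RC/_UC/_UD */
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = type,
    };
    return ibv_create_qp(pd, &attr);
}
```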
![Page 58: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/58.jpg)
Trade-offs in Different Transport Types

[Diagram: comparison of the transport types.]
![Page 59: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/59.jpg)
Shared Receive Queue (SRQ)

• SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
– Introduced in specification v1.2
• Instead of one RQ per connection (m buffers on each of n-1 connections), one SRQ with p buffers serves all connections, where 0 < p << m*(n-1) (a verbs sketch follows below)

[Diagram: a process with one RQ per connection vs. a process with one SRQ for all connections.]
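A sketch of creating an SRQ and attaching QPs to it with libibverbs; the queue depths are illustrative:

```c
/* Sketch: all connections draw receive WQEs from one shared pool
 * (posted with ibv_post_srq_recv() instead of ibv_post_recv()). */
#include <infiniband/verbs.h>

struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = {
            .max_wr  = 1024,   /* the shared pool: p WQEs, << m*(n-1) */
            .max_sge = 1,
        },
    };
    return ibv_create_srq(pd, &attr);
}

struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq *srq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,   /* no private RQ: receives come from the SRQ */
        .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}
```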
![Page 60: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/60.jpg)
eXtended Reliable Connection (XRC)

• Each QP takes at least one page of memory
– Connections between all processes are very costly for RC
• With M nodes and N processes per node: RC connections (M²-1)*N vs. XRC connections (M-1)*N (a toy computation follows below)
• New IB transport added: eXtended Reliable Connection
– Allows connections between nodes instead of processes

[Diagram: RC connections vs. XRC connections.]
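Plugging example numbers into the counts as stated on the slide shows why XRC matters at scale (a toy computation, using the slide's formulas as given):

```c
/* Toy computation: (M^2 - 1) * N for RC versus (M - 1) * N for XRC. */
#include <stdio.h>

int main(void)
{
    long M = 1024, N = 8;   /* example: 1024 nodes, 8 processes/node */
    printf("RC : %ld QPs\n", (M * M - 1) * N);   /* 8,388,600 */
    printf("XRC: %ld QPs\n", (M - 1) * N);       /* 8,184     */
    return 0;
}
```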
![Page 61: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/61.jpg)
IB Overview
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
• Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management
![Page 62: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/62.jpg)
Concepts in IB Management
• Agents
– Processes or hardware units running on each adapter, switch, and router (everything on the network)
– Provide the capability to query and set parameters
• Managers
– Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
– Used for interactions between the manager and agents (or between agents)
• Messages
![Page 63: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/63.jpg)
Subnet Manager

[Diagram: a subnet manager discovering and configuring the fabric: compute nodes and switches, active and inactive links, and multicast join/setup messages.]
![Page 64: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/64.jpg)
10GE Overview
• 10-Gigabit Ethernet Family
– Architecture and Components
• Stack Layout
• Out-of-Order Data Placement
• Dynamic and Fine-grained Data Rate Control
– Existing Implementations of 10GE/iWARP
![Page 65: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/65.jpg)
IB and 10GE: Commonalities and Differences

| | IB | iWARP/10GE |
| --- | --- | --- |
| Hardware acceleration | Supported | Supported (for TOE and iWARP) |
| RDMA | Supported | Supported (for iWARP) |
| Atomic operations | Supported | Not supported |
| Multicast | Supported | Supported |
| Data placement | Ordered | Out-of-order (for iWARP) |
| Data rate control | Static and coarse-grained | Dynamic and fine-grained (for TOE and iWARP) |
| QoS | Prioritization | Prioritization and fixed-bandwidth QoS |
![Page 66: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/66.jpg)
iWARP Architecture and Components
• RDMA Protocol (RDMAP)
– Feature-rich interface
– Security management
• Remote Direct Data Placement (RDDP)
– Data placement and delivery
– Multi-stream semantics
– Connection management
• Marker PDU Aligned (MPA)
– Middle-box fragmentation
– Data integrity (CRC)

[Diagram: iWARP offload engines in the stack: the application or library sits on RDMAP, over RDDP, over MPA/TCP or SCTP, over IP, implemented in hardware on the network adapter (e.g., 10GigE) beneath the device driver. Courtesy iWARP specification]
![Page 67: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/67.jpg)
Decoupled Data Placement and Data Delivery

• Place data as it arrives, whether in or out of order
• If data is out of order, place it at the appropriate offset
• Issues from the application's perspective:
– The second half of a message having been placed does not mean that the first half has arrived as well
– One message having been placed does not mean that the previous messages have been placed
![Page 68: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/68.jpg)
Protocol Stack Issues with Out-of-Order Data Placement

• The receiver network stack has to understand each frame of data
– If the frame is unchanged during transmission, this is easy!
• Issues to consider:
– Can we guarantee that the frame will be unchanged?
– What happens when intermediate switches segment data?
![Page 69: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/69.jpg)
Switch Splicing

Intermediate Ethernet switches (e.g., those which support splicing) can segment a frame into multiple segments, or coalesce multiple segments into a single segment.

[Diagram: a switch splicing frames.]
![Page 70: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/70.jpg)
Marker PDU Aligned (MPA) Protocol
• A deterministic approach to finding segment boundaries
• Approach:
– Places strips of data (markers) at regular intervals, based on the data sequence number
– The interval is set to 512 bytes (small enough to ensure that each Ethernet frame carries at least one)
• The minimum IP packet size is 536 bytes (RFC 879)
– Each strip points to the RDDP header
• Each segment independently has enough information about where it needs to be placed (a small offset computation follows below)
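Since markers recur every 512 bytes of the MPA stream, a receiver can compute where the first marker falls in any arriving segment from its stream offset alone. A small illustrative sketch:

```c
/* Illustrative sketch: locating the first MPA marker in a segment
 * whose first byte sits at stream offset "seg_start" relative to the
 * starting sequence number agreed at connection setup. */
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512u

static uint32_t first_marker_offset(uint32_t seg_start)
{
    uint32_t into_interval = seg_start % MPA_MARKER_INTERVAL;
    return into_interval ? MPA_MARKER_INTERVAL - into_interval : 0;
}
```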
![Page 71: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/71.jpg)
MPA Frame Format
[Diagram: an MPA frame: segment length, RDDP header, data payload (if any) interleaved with markers, pad, and CRC.]
![Page 72: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/72.jpg)
Dynamic and Fine-grained Rate Control

• Part of the Ethernet standard, not iWARP
– Network vendors use a separate interface to support it
• Dynamic bandwidth allocation to flows, based on the interval between two packets in a flow
– E.g., one stall for every packet sent on a 10 Gbps network corresponds to a bandwidth allocation of 5 Gbps
– Complicated because of TCP windowing behavior
• Important for high-latency/high-bandwidth networks
– Large windows exposed on the receiver side
– Receiver overflow controlled through rate control
![Page 73: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/73.jpg)
Prioritization vs. Fixed Bandwidth QoS
• Can allow for simple prioritization:
– E.g., connection 1 performs better than connection 2
– 8 classes provided (a connection can be in any class)
• Similar to SLs in InfiniBand
– Two priority classes for high-priority traffic
• E.g., management traffic or your favorite application
• Or can allow for specific bandwidth requests:
– E.g., can request 3.62 Gbps of bandwidth
– Packet pacing and stalls are used to achieve this
• Query functionality to find out the "remaining bandwidth"
![Page 74: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/74.jpg)
10GE Overview
• 10-Gigabit Ethernet Family
– Architecture and Components
• Stack Layout
• Out-of-Order Data Placement
• Dynamic and Fine-grained Data Rate Control
– Existing Implementations of 10GE/iWARP
![Page 75: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/75.jpg)
Current Usage of Ethernet

[Diagram: a regular Ethernet cluster (system area network or cluster environment) and an iWARP cluster, with TOE-equipped hosts, connected across a wide-area network in a distributed cluster environment.]
![Page 76: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/76.jpg)
Software iWARP-based Compatibility

• Regular Ethernet adapters and TOEs are fully compatible with each other
• Compatibility with iWARP is required
• Software iWARP emulates the functionality of iWARP on the host
– Fully compatible with hardware iWARP
– Internally utilizes the host TCP/IP stack

[Diagram: regular Ethernet and iWARP hosts in one Ethernet environment.]
![Page 77: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/77.jpg)
Different iWARP Implementations

[Diagram: stack diagrams comparing iWARP implementations: user-level software iWARP (OSU, OSC) and kernel-level software iWARP with MPA-modified TCP (OSU, ANL), both over regular Ethernet adapters; high-performance sockets over offloaded TCP/IP on TCP offload engines; and fully offloaded iWARP/TCP/IP on iWARP-compliant adapters (Chelsio, NetEffect, Ammasso).]
![Page 78: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/78.jpg)
IB and 10GE Convergence
• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect
– InfiniBand over Ethernet
– RDMA over Converged Enhanced Ethernet
![Page 79: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/79.jpg)
Virtual Protocol Interconnect (VPI)
• Single network firmware supports both IB and Ethernet
• Autosensing of the layer-2 protocol
– Can be configured to automatically work with either IB or Ethernet networks
• Multi-port adapters can use one port on IB and another on Ethernet
• Multiple use modes:
– Datacenters with IB inside the cluster and Ethernet outside
– Clusters with an IB network and Ethernet management

[Diagram: applications over IB verbs or sockets; the IB transport/network/link layers and a hardware TCP/IP-over-Ethernet path sit behind an IB port and an Ethernet port.]
![Page 80: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/80.jpg)
(InfiniBand) RDMA over Ethernet (IBoE)

• Native convergence of the IB network and transport layers with the Ethernet link layer
• IB packets are encapsulated in Ethernet frames
• The IB network layer already uses IPv6 frames
• Pros:
– Works natively in Ethernet environments
– Has all the benefits of IB verbs
• Cons:
– Network bandwidth is limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps
– Some IB native link-layer features are optional in (regular) Ethernet

[Diagram: application over IB verbs; IB transport and network layers in hardware over an Ethernet link layer.]
![Page 81: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/81.jpg)
(InfiniBand) RDMA over Converged Enhanced Ethernet (RoCEE)

• Very similar to IB over Ethernet
– Often used interchangeably with IBoE
– Can be used to explicitly specify that the link layer is Converged Enhanced Ethernet (CEE)
• Pros:
– Works natively in Ethernet environments
– Has all the benefits of IB verbs
– CEE is very similar to the link layer of native IB, so there are no missing features
• Cons:
– Network bandwidth is limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps

[Diagram: application over IB verbs; IB transport and network layers in hardware over a CEE link layer.]
![Page 82: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/82.jpg)
IB and 10GE: Feature Comparison
| | IB | iWARP/10GE | IBoE | RoCEE |
| --- | --- | --- | --- | --- |
| Hardware acceleration | Yes | Yes | Yes | Yes |
| RDMA | Yes | Yes | Yes | Yes |
| Atomic operations | Yes | No | Yes | Yes |
| Multicast | Optional | No | Optional | Optional |
| Data placement | Ordered | Out-of-order | Ordered | Ordered |
| Prioritization | Optional | Optional | Optional | Yes |
| Fixed BW QoS (ETS) | No | Optional | Optional | Yes |
| Ethernet compatibility | No | Yes | Yes | Yes |
| TCP/IP compatibility | Yes (using IPoIB) | Yes | No | No |
![Page 83: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/83.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 84: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/84.jpg)
IB Hardware Products
• Many IB vendors: Mellanox, Voltaire and QLogic
– Aligned with many server vendors: Intel, IBM, SUN, Dell
– And many integrators: Appro, Advanced Clustering, Microway, …
• Broadly two kinds of adapters
– Offloading (Mellanox) and onloading (QLogic)
• Adapters with different interfaces:
– Dual-port 4X with PCI-X (64-bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree adapters
– No memory on the HCA; uses system memory (through PCIe)
– Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
– SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
• Some 12X SDR adapters exist as well (24 Gbps each way)
![Page 85: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/85.jpg)
Tyan Thunder S2935 Board

Similar boards from Supermicro with LOM features are also available. (Courtesy Tyan)
![Page 86: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/86.jpg)
IB Hardware Products (contd.)
• Customized adapters to work with IB switches
– Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
– 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
– 3456-port "Magnum" switch from SUN used at TACC
• 72-port "nano magnum"
– 36-port Mellanox InfiniScale IV QDR switch silicon in early 2008
• Up to 648-port QDR switch by SUN
– New IB switch silicon from QLogic introduced at SC '08
• Up to 846-port QDR switch by QLogic
• Switch routers with gateways
– IB-to-FC; IB-to-IP
![Page 87: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/87.jpg)
IB Software Products
• Low-level software stacks
– VAPI (verbs-level API) from Mellanox
– Modified and customized VAPI from other vendors
– New initiative: OpenFabrics (formerly OpenIB)
• http://www.openfabrics.org
• Open-source code available with Linux distributions
• Initially IB; later extended to incorporate iWARP
• High-level software stacks
– MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)
![Page 88: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/88.jpg)
10G, 40G and 100G Ethernet Products
• 10GE adapters
– Intel, Myricom, Mellanox (ConnectX)
• 10GE/iWARP adapters
– Chelsio, NetEffect (now owned by Intel)
• 10GE switches
– Fulcrum Microsystems
• Low-latency switch based on 24-port silicon
• FM4000 switch with IP routing and TCP/UDP support
– Fujitsu, Woven Systems (144 ports), Myricom (512 ports), Quadrics (96 ports), Force10, Cisco, Arista (formerly Arastra)
• 40GE and 100GE switches
– Nortel Networks
• 10GE downlinks with 40GE and 100GE uplinks
![Page 89: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/89.jpg)
Mellanox ConnectX Architecture
• Early adapter supporting IB/10GE convergence
– Support for VPI and IBoE
• Includes other features as well
– Hardware support for virtualization
– Quality of service
– Stateless offloads

(Courtesy Mellanox)
![Page 90: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/90.jpg)
OpenFabrics
• Open-source organization (formerly OpenIB)
– www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
– Support for Linux and Windows
– Design of a complete stack with 'best of breed' components
• Gen1
• Gen2 (current focus)
• Users can download the entire stack and run it
– Latest release is OFED 1.4.2
– OFED 1.5 is underway
![Page 91: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/91.jpg)
OpenFabrics Stack with Unified Verbs Interface

[Diagram: a single verbs interface (libibverbs) at user level, backed by vendor libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3), the corresponding kernel drivers (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3), and each vendor's adapters.]

A minimal program against this unified interface follows below.
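The same libibverbs calls run unmodified over any provider shown above; the library dispatches to the right one at runtime. A minimal, self-contained sketch:

```c
/* Minimal program against the unified verbs interface. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs)
        return 1;

    for (int i = 0; i < num; i++)
        printf("RDMA device %d: %s\n", i, ibv_get_device_name(devs[i]));

    if (num > 0) {
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (ctx) {
            struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */
            /* ... create CQs/QPs and register memory as in the earlier
             * sketches ... */
            if (pd)
                ibv_dealloc_pd(pd);
            ibv_close_device(ctx);
        }
    }
    ibv_free_device_list(devs);
    return 0;
}
```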
![Page 92: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/92.jpg)
OpenFabrics on Convergent IB/10GE
• For IBoE and RoCEE, the upper-level stacks remain completely unchanged
• Within the hardware:
– Transport and network layers remain completely unchanged
– Both IB and Ethernet (or CEE) link layers are supported on the network adapter
• Note: the OpenFabrics stack is not valid for the Ethernet path in VPI
– That still uses sockets and TCP/IP

[Diagram: the verbs interface (libibverbs) over the ConnectX user library (libmlx4) and kernel driver (ib_mlx4), with the adapter exposing both a 10GE and an IB port.]
![Page 93: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/93.jpg)
OpenFabrics Software Stack

[Diagram: the full OpenFabrics stack: user-space applications (clustered DB access, sockets-based access, various MPIs, file-system access, block storage, IP-based apps) over upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), the user-level verbs API and UDAPL, diagnostic tools and OpenSM, the kernel mid-layer (connection manager abstraction, SA client, MAD services, SMA) with kernel-bypass paths, and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs.]

Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (initiator); iSER = iSCSI RDMA Protocol (initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Library; HCA = Host Channel Adapter; R-NIC = RDMA NIC.
![Page 94: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/94.jpg)
InfiniBand in the Top500: Systems Performance

[Chart: InfiniBand's share of systems performance in recent Top500 lists.]

Percentage share of InfiniBand is steadily increasing
![Page 95: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/95.jpg)
Large-scale InfiniBand Installations

• 151 IB clusters (30.2%) in the June '09 TOP500 list (www.top500.org)
• Installations in the Top 30 (15 of them):
  – 129,600 cores (RoadRunner) at LANL (1st)
  – 51,200 cores (Pleiades) at NASA Ames (4th)
  – 62,976 cores (Ranger) at TACC (8th)
  – 26,304 cores (Juropa) at FZJ, Germany (10th)
  – 30,720 cores (Dawning) at Shanghai (15th)
  – 14,336 cores at New Mexico (17th)
  – 14,384 cores at Tata CRL, India (18th)
  – 18,224 cores at LLNL (19th)
  – 12,288 cores at GENCI-CINES, France (20th)
  – 8,320 cores in UK (25th)
  – 8,320 cores in UK (26th)
  – 8,064 cores (DKRZ) in Germany (27th)
  – 12,032 cores at JAXA, Japan (28th)
  – 10,240 cores at TEP, France (29th)
  – 13,728 cores in Sweden (30th)
• More are getting installed!
![Page 96: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/96.jpg)
10GE Installations
• Several enterprise computing domains
  – Enterprise datacenters (HP, Intel) and financial markets
  – Animation firms (e.g., Universal Studios created "The Hulk" and many new movies using 10GE)
• Scientific computing installations
  – 5,600-core installation at Purdue with Chelsio/iWARP
  – 640-core installation at the University of Heidelberg, Germany
  – 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and a Woven Systems switch
  – 256-core installation at Argonne National Lab with Myri-10G
• Integrated systems
  – BG/P uses 10GE for I/O (ranks 3, 7, 9, 14, 24 in the Top 25)
• ESnet to install a $62M 100GE infrastructure for the US DOE
![Page 97: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/97.jpg)
Dual IB/10GE Systems

• Such systems are being integrated
• E.g., the T2K-Tsukuba system (300 TFlop system)
• Systems at three sites (Tsukuba, Tokyo, Kyoto)
• Internal connectivity: quad-rail IB ConnectX network
• External connectivity: 10GE

(Courtesy Taisuke Boku, University of Tsukuba)
![Page 98: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/98.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 99: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/99.jpg)
Case Studies
• Low-level Performance
• Message Passing Interface (MPI)
• File Systems
![Page 100: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/100.jpg)
Low-level Latency Measurements

[Charts: small-message (2 bytes-4K) and large-message latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE.]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
![Page 101: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/101.jpg)
Low-level Uni- and Bi-directional Bandwidth Measurements

[Charts: uni- and bi-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE.]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
![Page 102: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/102.jpg)
Case Studies
• Low-level Performance
• Message Passing Interface (MPI)
• File Systems
![Page 103: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/103.jpg)
Message Passing Interface (MPI)
• De-facto message passing standard
– Point-to-point communication
– Collective communication (broadcast, multicast, reduction, barrier)
– MPI-1 and MPI-2 available; MPI-3 under discussion
• Has been implemented for various past commodity networks (Myrinet, Quadrics)
• How can it be designed and efficiently implemented for InfiniBand and iWARP? (A minimal MPI example follows.)
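As a concrete illustration of the two communication styles listed above, here is a minimal MPI program of our own (not tutorial material) combining a point-to-point exchange with a collective reduction. The same source runs unchanged over IB, iWARP or plain TCP, since the MPI library hides the network underneath.

```c
/* Minimal sketch of MPI point-to-point and collective communication. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: rank 0 sends a token to rank 1 (if it exists). */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    /* Collective: every rank contributes its rank number; rank 0 gets the sum. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```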
![Page 104: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/104.jpg)
MVAPICH/MVAPICH2 Software
• High-performance MPI library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 975 organizations in 51 countries
  – More than 34,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 8th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
  – Also supports a uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
![Page 105: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/105.jpg)
MPICH2 Software Stack
• High-performance and widely portable MPI
  – Supports MPI-1, MPI-2 and MPI-2.1
  – Supports multiple networks (TCP, IB, iWARP, Myrinet)
  – Commercial support by many vendors
    • IBM (integrated stack distributed by Argonne)
    • Microsoft, Intel (in process of integrating their stack)
  – Used by many derivative implementations
    • E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
    • MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
  – Available with many software distributions
  – Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
  – http://www.mcs.anl.gov/research/projects/mpich2
![Page 106: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/106.jpg)
One-way Latency: MPI over IB

[Charts: small- and large-message one-way latency (us) vs. message size for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies range from 1.06 to 2.77 us across the adapters.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
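Latencies like these are typically obtained with a ping-pong microbenchmark. The sketch below is our own simplified version, in the spirit of (but not identical to) the OSU osu_latency test: it times many round trips of a small message between two ranks and reports half the average.

```c
/* Simplified one-way latency measurement: half the average ping-pong
 * round-trip time between ranks 0 and 1. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank;
    char buf[4] = {0};                  /* 4-byte message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```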
![Page 107: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/107.jpg)
Bandwidth: MPI over IB
[Charts: uni-directional bandwidth peaks of 936.5, 1389.4, 1399.8, 1952.9 and 3022.1 million bytes/s, and bi-directional peaks of 1519.8, 2457.4, 2718.3, 3621.4 and 5553.5 million bytes/s, vs. message size for the same MVAPICH/adapter combinations.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch
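Peak bandwidth is usually measured with a window of outstanding non-blocking messages rather than one message at a time. The following is a rough sketch of our own, in the spirit of the OSU osu_bw test but not the actual benchmark; the window size and message size are arbitrary choices.

```c
/* Simplified unidirectional bandwidth measurement: rank 0 keeps a
 * window of non-blocking sends in flight toward rank 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* 1 MB per message */
#define WINDOW   16          /* messages in flight at a time */
#define ITERS    64

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    /* Real benchmarks close each iteration with an ack from the
     * receiver; here a final barrier bounds the measured interval. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("~%.1f million bytes/s\n",
               (double)MSG_SIZE * WINDOW * ITERS / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```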
![Page 108: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/108.jpg)
One-way Latency: MPI over iWARP
[Chart: one-way latency (us) vs. message size (1 byte-4K) for Chelsio with TCP/IP (15.47 us small-message) and with iWARP (6.88 us small-message).]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
![Page 109: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/109.jpg)
Bandwidth: MPI over iWARP
[Charts: uni-directional bandwidth peaks at 1231.8 million bytes/s with iWARP vs. 839.8 with TCP/IP; bi-directional bandwidth peaks at 2260.8 vs. 855.3.]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
![Page 110: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/110.jpg)
Convergent Technologies: MPI Latency

[Charts: small-message (0-1K) and large-message MPI latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE.]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
![Page 111: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/111.jpg)
Convergent Technologies: MPI Uni- and Bi-directional Bandwidth

[Charts: uni- and bi-directional bandwidth (MBps) vs. message size (0 bytes-2M) for Native IB, VPI-IB, VPI-Eth and IBoE.]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
![Page 112: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/112.jpg)
Case Studies
• Low-level Performance
• Message Passing Interface (MPI)
• File Systems
![Page 113: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/113.jpg)
Sample Diagram of State-of-the-Art File Systems

• Sample file systems:
  – Lustre, Panasas, GPFS, Sistina/Redhat GFS
  – PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Diagram: computing nodes connected through a network to I/O servers and a metadata server.]
![Page 114: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/114.jpg)
Lustre Performance
[Charts: write and read throughput (MBps) with 4 OSSs vs. number of clients (1-4), for IPoIB and native IB.]

• Lustre over native IB
  – Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB
• Memory copies in IPoIB
  – Reduced throughput and high overhead; I/O servers are saturated
![Page 115: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/115.jpg)
CPU Utilization
[Chart: CPU utilization (%) vs. number of clients (1-4) for IPoIB reads/writes and native IB reads/writes.]

• 4 OSS nodes, IOzone record size 1 MB (a simplified throughput-test sketch follows)
• Offers potential for greater scalability
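IOzone-style throughput numbers like these come from writing and reading a file in fixed-size records. The C sketch below is a much-simplified stand-in for IOzone's sequential write test (our illustration only, not the benchmark): it writes 1 MB records to a path you supply, e.g. on a Lustre or NFS mount, and reports throughput.

```c
/* Simplified sequential-write throughput test: write a 1 GB file in
 * 1 MB records and report MB/s. Usage: ./wtest /mnt/lustre/testfile */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define RECORD  (1 << 20)   /* 1 MB record size */
#define RECORDS 1024        /* 1 GB total */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";
    char *buf = malloc(RECORD);
    memset(buf, 'x', RECORD);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < RECORDS; i++)
        if (write(fd, buf, RECORD) != RECORD) { perror("write"); return 1; }
    fsync(fd);              /* ensure data actually reached the servers */
    gettimeofday(&t1, NULL);
    close(fd);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("write throughput: %.1f MB/s\n", (double)RECORDS / sec);
    free(buf);
    return 0;
}
```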
![Page 116: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/116.jpg)
NFS/RDMA Performance
[Charts: write and read throughput (MB/s) on tmpfs vs. number of threads (1-15) for the Read-Read and Read-Write designs.]

• IOzone read bandwidth up to 913 MB/s (Sun x2200s with x8 PCIe)
• Read-Write design by OSU, available with the latest OpenSolaris
• NFS/RDMA is being added into OFED 1.4

R. Noronha, L. Chai, T. Talpey and D. K. Panda, "Designing NFS With RDMA For Security, Performance and Scalability", ICPP '07
![Page 117: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/117.jpg)
Summary of Design Performance Results

• Current-generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance
• IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
• Convergence between IB and 10GigE is emerging
• Significant performance improvement is expected in the near future
![Page 118: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/118.jpg)
Presentation Overview
• Introduction
• Why InfiniBand and 10-Gigabit Ethernet?
• Overview of IB, 10GE, their Convergence and Features
• IB and 10GE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
![Page 119: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/119.jpg)
Concluding Remarks
• Presented network architectures & trends in clusters
• Presented background and details of IB and 10GE
  – Highlighted the main features of IB and 10GE and their convergence
  – Gave an overview of IB and 10GE HW/SW products
  – Discussed sample performance numbers in designing various high-end systems with IB and 10GE
• IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
![Page 120: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/120.jpg)
Funding Acknowledgments
Our research is supported by the following organizations
• Funding support by
• Equipment support by
![Page 121: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/121.jpg)
Personnel Acknowledgments

Current Students
  – M. Kalaiya (M.S.)
  – K. Kandalla (M.S.)
  – P. Lai (Ph.D.)
  – M. Luo (Ph.D.)
  – G. Marsh (Ph.D.)
  – X. Ouyang (Ph.D.)
  – S. Potluri (M.S.)
  – H. Subramoni (Ph.D.)

Past Students
  – P. Balaji (Ph.D.)
  – D. Buntinas (Ph.D.)
  – S. Bhagvat (M.S.)
  – L. Chai (Ph.D.)
  – B. Chandrasekharan (M.S.)
  – T. Gangadharappa (M.S.)
  – K. Gopalakrishnan (M.S.)
  – W. Huang (Ph.D.)
  – W. Jiang (M.S.)
  – S. Kini (M.S.)
  – M. Koop (Ph.D.)
  – R. Kumar (M.S.)
  – S. Krishnamoorthy (M.S.)
  – P. Lai (Ph.D.)
  – J. Liu (Ph.D.)
  – A. Mamidala (Ph.D.)
  – S. Naravula (Ph.D.)
  – R. Noronha (Ph.D.)
  – G. Santhanaraman (Ph.D.)
  – J. Sridhar (M.S.)
  – S. Sur (Ph.D.)
  – K. Vaidyanathan (Ph.D.)
  – A. Vishnu (Ph.D.)
  – J. Wu (Ph.D.)
  – W. Yu (Ph.D.)

Current Post-Doc
  – E. Mancini

Current Programmer
  – J. Perkins
![Page 122: 10 Gigabit Ethernet vs Infinibandbasic](https://reader033.vdocuments.net/reader033/viewer/2022061105/543cc46cb1af9f5f378b47e1/html5/thumbnails/122.jpg)
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu

[email protected]@mcs.anl.gov