
Page 1: 10 Gigabit Ethernet vs Infinibandbasic

InfiniBand and 10-Gigabit Ethernet for Dummies

A Tutorial at Supercomputing '09 by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Pavan Balaji
Argonne National Laboratory
E-mail: [email protected]
http://www.mcs.anl.gov/~balaji

Matthew Koop
NASA Goddard
E-mail: matthew.koop@nasa.gov
http://www.cse.ohio-state.edu/~koop

Page 2: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 2

Page 3: 10 Gigabit Ethernet vs Infinibandbasic

Current and Next Generation Applications and Computing Systems

• Big demand for
– High Performance Computing (HPC)
– File-systems, multimedia, database, visualization
– Enterprise multi-tier datacenters

• Processor performance continues to grow
– Chip density doubling every 18 months (multi-cores)

• Commodity networking also continues to grow
– Increase in speed and features & affordable pricing

• Clusters are increasingly becoming popular to design next generation computing systems
– Scalability, Modularity and Upgradeability with compute and network technologies

Supercomputing '09 3

Page 4: 10 Gigabit Ethernet vs Infinibandbasic

Trends for Computing Clusters in the Top 500 List

• Top 500 list of Supercomputers (www.top500.org)

Jun. 2001: 33/500 (6.6%)     Nov. 2005: 360/500 (72.0%)
Nov. 2001: 43/500 (8.6%)     Jun. 2006: 364/500 (72.8%)
Jun. 2002: 80/500 (16.0%)    Nov. 2006: 361/500 (72.2%)
Nov. 2002: 93/500 (18.6%)    Jun. 2007: 373/500 (74.6%)
Jun. 2003: 149/500 (29.8%)   Nov. 2007: 406/500 (81.2%)
Nov. 2003: 208/500 (41.6%)   Jun. 2008: 400/500 (80.0%)
Jun. 2004: 291/500 (58.2%)   Nov. 2008: 410/500 (82.0%)
Nov. 2004: 294/500 (58.8%)   Jun. 2009: 410/500 (82.0%)
Jun. 2005: 304/500 (60.8%)   Nov. 2009: To be announced

Supercomputing '09 4

Page 5: 10 Gigabit Ethernet vs Infinibandbasic

Integrated High-End Computing Environments

[Figure: a compute cluster (compute nodes and a frontend node on a LAN) connected over a LAN/WAN to a storage cluster (a meta-data manager and I/O server nodes holding meta-data and data), and an enterprise multi-tier datacenter for visualization and mining with routers/servers, application servers, and database servers (Tier 1 through Tier 3) connected by switches.]

Supercomputing '09 5

Page 6: 10 Gigabit Ethernet vs Infinibandbasic

Networking and I/O Requirements

• Good Systems Area Network with excellent performance (low latency and high bandwidth) for inter-processor communication (IPC) and I/O

• Good Storage Area Networks with high performance I/O

• Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity

• Quality of Service (QoS) for interactive applications

• RAS (Reliability, Availability, and Serviceability)

• With low cost

Supercomputing '09 6

Page 7: 10 Gigabit Ethernet vs Infinibandbasic

Major Components in Computing Systems

• Hardware Components
– Processing Core and Memory sub-system
– I/O Bus
– Network Adapter
– Network Switch

• Software Components
– Communication software

[Figure: two processors (P0, P1), each with four cores and memory, attached through an I/O bus to a network adapter and network switch; the diagram marks the processing, I/O, and network bottlenecks.]

Supercomputing '09 7

Page 8: 10 Gigabit Ethernet vs Infinibandbasic

Processing Bottlenecks in Traditional Protocols

• Ex: TCP/IP, UDP/IP

• Generic architecture for all network interfaces

• Host handles almost all aspects of communication
– Data buffering (copies on sender and receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)

• Signaling between different layers
– Hardware interrupt whenever a packet arrives or is sent
– Software signals between different layers to handle protocol processing in different priority levels

Supercomputing '09 8

Page 9: 10 Gigabit Ethernet vs Infinibandbasic

Bottlenecks in Traditional I/O Interfaces and Networks

• Traditionally relied on bus-based technologies
– E.g., PCI, PCI-X, Shared Ethernet
– One bit per wire
– Performance increase through:
• Increasing clock speed
• Increasing bus width
– Not scalable:
• Cross talk between bits
• Skew between wires
• Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds

Supercomputing '09 9

Page 10: 10 Gigabit Ethernet vs Infinibandbasic

InfiniBand ("Infinite Bandwidth") and 10-Gigabit Ethernet

• Industry Networking Standards

• Processing Bottleneck:
– Hardware offloaded protocol stacks with user-level communication access

• Network Bottleneck:
– Bit serial differential signaling
• Independent pairs of wires to transmit independent data (called a lane)
• Scalable to any number of lanes
• Easy to increase clock speed of lanes (since each lane consists only of a pair of wires)

Supercomputing '09 10

Page 11: 10 Gigabit Ethernet vs Infinibandbasic

Interplay with I/O Technologies

• InfiniBand initially intended to replace I/O bus technologies with networking-like technology
– That is, bit serial differential signaling
– With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now

• Both IB and 10GE today come as network adapters that plug into existing I/O technologies

Supercomputing '09 11

Page 12: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 12

Page 13: 10 Gigabit Ethernet vs Infinibandbasic

Trends in I/O Interfaces with Servers

• Network performance depends on
– Networking technology (adapter + switch)
– Network interface (last mile bottleneck)

PCI                    1990                         33MHz/32bit: 1.05Gbps (shared bidirectional)
PCI-X                  1998 (v1.0), 2003 (v2.0)     133MHz/64bit: 8.5Gbps (shared bidirectional)
                                                    266-533MHz/64bit: 17Gbps (shared bidirectional)
HyperTransport (HT)    2001 (v1.0), 2004 (v2.0),    102.4Gbps (v1.0), 179.2Gbps (v2.0),
by AMD                 2006 (v3.0), 2008 (v3.1)     332.8Gbps (v3.0), 409.6Gbps (v3.1)
PCI-Express (PCIe)     2003 (Gen1), 2007 (Gen2),    Gen1: 4X (8Gbps), 8X (16Gbps), 16X (32Gbps)
by Intel               2009 (Gen3 standard)         Gen2: 4X (16Gbps), 8X (32Gbps), 16X (64Gbps)
                                                    Gen3: 4X (~32Gbps), 8X (~64Gbps), 16X (~128Gbps)
Intel QuickPath        2009                         153.6-204.8Gbps per link

Supercomputing '09 13

Page 14: 10 Gigabit Ethernet vs Infinibandbasic

Growth in Commodity Network Technologies

Representative commodity networks and their entries into the market:

Ethernet (1979 -)               10 Mbit/sec
Fast Ethernet (1993 -)          100 Mbit/sec
Gigabit Ethernet (1995 -)       1000 Mbit/sec
ATM (1995 -)                    155/622/1024 Mbit/sec
Myrinet (1993 -)                1 Gbit/sec
Fibre Channel (1994 -)          1 Gbit/sec
InfiniBand (2001 -)             2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 -)    10 Gbit/sec
InfiniBand (2003 -)             8 Gbit/sec (4X SDR)
InfiniBand (2005 -)             16 Gbit/sec (4X DDR)
                                24 Gbit/sec (12X SDR)
InfiniBand (2007 -)             32 Gbit/sec (4X QDR)
InfiniBand (2011 -)             64 Gbit/sec (4X EDR)

16 times in the last 9 years

Supercomputing '09 14

Page 15: 10 Gigabit Ethernet vs Infinibandbasic

Capabilities of High-Performance Networks

• Intelligent Network Interface Cards

• Support entire protocol processing completely in hardware (hardware protocol offload engines)

• Provide a rich communication interface to applications
– User-level communication capability
– Gets rid of intermediate data buffering requirements

• No software signaling between communication layers
– All layers are implemented on a dedicated hardware unit, and not on a shared host CPU

Supercomputing '09 15

Page 16: 10 Gigabit Ethernet vs Infinibandbasic

Previous High-Performance Network Stacks

• Virtual Interface Architecture
– Standardized by Intel, Compaq, Microsoft

• Fast Messages (FM)
– Developed by UIUC

• Myricom GM
– Proprietary protocol stack from Myricom

• These network stacks set the trend for high-performance communication requirements
– Hardware offloaded protocol stack
– Support for fast and secure user-level access to the protocol stack

Supercomputing '09 16

Page 17: 10 Gigabit Ethernet vs Infinibandbasic

IB Trade Association

• IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)

• Goal: To design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies

• Many other industry partners participated in the effort to define the IB architecture specification

• IB Architecture (Volume 1, Version 1.0) was released to public on Oct 24, 2000
– Latest version 1.2.1 released January 2008

• http://www.infinibandta.org

Supercomputing '09 17

Page 18: 10 Gigabit Ethernet vs Infinibandbasic

IB Hardware Acceleration

• Some IB models have multiple hardware accelerators
– E.g., Mellanox IB adapters

• Protocol Offload Engines
– Completely implement layers 2-4 in hardware

• Additional hardware supported features also present
– RDMA, Multicast, QoS, Fault Tolerance, and many more

Supercomputing '09 18

Page 19: 10 Gigabit Ethernet vs Infinibandbasic

10-Gigabit Ethernet Consortium

• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step

• Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet

• http://www.ethernetalliance.org

• Upcoming 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG

• Energy-efficient and power-conscious protocols
– On-the-fly link speed reduction for under-utilized links

Supercomputing '09 19

Page 20: 10 Gigabit Ethernet vs Infinibandbasic

Ethernet Hardware Acceleration

• Interrupt Coalescing
– Improves throughput, but degrades latency

• Jumbo Frames
– No latency impact; incompatible with existing switches

• Hardware Checksum Engines
– Checksum performed in hardware is significantly faster
– Shown to have minimal benefit independently

• Segmentation Offload Engines
– Supported by most 10GE products because of its backward compatibility; considered "regular" Ethernet
– Heavily used in the "server-on-steroids" model

Supercomputing '09 20

Page 21: 10 Gigabit Ethernet vs Infinibandbasic

TOE and iWARP Accelerators

• TCP Offload Engines (TOE)
– Hardware acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Actually refers to the IC on the network adapter that implements TCP/IP
– In practice, usually referred to as the entire network adapter

• Internet Wide-Area RDMA Protocol (iWARP)
– Standardized by IETF and the RDMA Consortium
– Supports acceleration features (like IB) for Ethernet
– http://www.ietf.org & http://www.rdmaconsortium.org

Supercomputing '09 21

Page 22: 10 Gigabit Ethernet vs Infinibandbasic

Converged Enhanced Ethernet

• Popularly known as Datacenter Ethernet

• Combines a number of (optional) Ethernet standards into one umbrella; sample enhancements include:
– Priority-based flow control: link-level flow control for each Class of Service (CoS)
– Enhanced Transmission Selection: bandwidth assignment to each CoS
– Datacenter Bridging Exchange Protocols: congestion notification, priority classes
– End-to-end congestion notification: per-flow congestion control to supplement per-link flow control

Supercomputing '09 22

Page 23: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 23

Page 24: 10 Gigabit Ethernet vs Infinibandbasic

IB, 10GE and their Convergence

• InfiniBand
– Architecture and Basic Hardware Components
– Novel Features
– IB Verbs Interface
– Management and Services

• 10-Gigabit Ethernet Family
– Architecture and Components
– Existing Implementations of 10GE/iWARP

• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect
– InfiniBand over Ethernet
– RDMA over Converged Enhanced Ethernet

Supercomputing '09 24

Page 25: 10 Gigabit Ethernet vs Infinibandbasic

IB Overview

• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management

Supercomputing '09 25

Page 26: 10 Gigabit Ethernet vs Infinibandbasic

A Typical IB Network

Three primary components:
– Channel Adapters
– Switches/Routers
– Links and connectors

Supercomputing '09 26

Page 27: 10 Gigabit Ethernet vs Infinibandbasic

Components: Channel Adapters

• Used by processing and I/O units to connect to fabric

• Consume & generate IB packets

• Programmable DMA engines with protection features

• May have multiple ports
– Independent buffering channeled through Virtual Lanes

• Host Channel Adapters (HCAs)

Supercomputing '09 27

Page 28: 10 Gigabit Ethernet vs Infinibandbasic

Components: Switches and Routers

• Relay packets from a link to another

• Switches: intra-subnet

• Routers: inter-subnet

• May support multicast

Supercomputing '09 28

Page 29: 10 Gigabit Ethernet vs Infinibandbasic

Components: Links & Repeaters

• Network Links
– Copper, Optical, Printed Circuit wiring on Back Plane
– Not directly addressable

• Traditional adapters built for copper cabling
– Restricted by cable length (signal integrity)

• Intel Connects: Optical cables with copper-to-optical conversion hubs (acquired by Emcore)
– Up to 100m length
– 550 picoseconds copper-to-optical conversion latency

• Available from other vendors (Luxtera) (Courtesy Intel)

• Repeaters (Vol. 2 of InfiniBand specification)

Supercomputing '09 29

Page 30: 10 Gigabit Ethernet vs Infinibandbasic

IB Overview

• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management

Supercomputing '09 30

Page 31: 10 Gigabit Ethernet vs Infinibandbasic

InfiniBand Communication Model

[Figure: Basic InfiniBand Communication Semantics]

Supercomputing '09 31

Page 32: 10 Gigabit Ethernet vs Infinibandbasic

Queue Pair Model

Communication in InfiniBand uses a Queue Pair (QP) model for all data transfer

• Each QP has two queues
– Send Queue (SQ)
– Receive Queue (RQ)

• A QP must be linked to a Completion Queue (CQ)
– Gives notification of operation completion from QPs

[Figure: a QP (Send and Recv queues) and a CQ inside an InfiniBand device]

Supercomputing '09 32
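The minimal C sketch below (not part of the tutorial) illustrates the objects just described using the OpenFabrics libibverbs API: a Completion Queue and a Queue Pair whose Send and Receive queues both report completions to that CQ. The device choice, queue depths and missing error handling are simplifying assumptions.

```c
/* Minimal sketch: create a CQ and a Reliable Connection QP with libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);   /* enumerate HCAs */
    struct ibv_context *ctx  = ibv_open_device(devs[0]);    /* open the first HCA */
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);            /* protection domain */

    /* Completion Queue: CQEs for completed work requests are placed here */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    /* Queue Pair: one Send Queue and one Receive Queue, both reporting to cq */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,        /* Reliable Connection transport */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created QP number 0x%x\n", qp->qp_num);

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```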

Page 33: 10 Gigabit Ethernet vs Infinibandbasic

Queue Pair Model: WQEs and CQEs

• Entries used for QP communication are data structures called Work Queue Requests (WQEs)
– Called "Wookies"

• Completed WQEs are placed in the CQ with additional information
– They are now called CQEs ("Cookies")

[Figure: WQEs posted to the QP's Send/Recv queues and CQEs placed in the CQ inside the InfiniBand device]

Supercomputing '09 33

Page 34: 10 Gigabit Ethernet vs Infinibandbasic

WQEs and CQEs

• Send WQEs contain data about what buffer to send from, how much to send, etc.

• Receive WQEs contain data about what buffer to receive into, how much to receive, etc.

• CQEs contain data about which QP the completed WQE was posted on, how much data actually arrived

Supercomputing '09 34

Page 35: 10 Gigabit Ethernet vs Infinibandbasic

Memory Registration

Before we do any communication: all memory used for communication must be registered

1. Registration Request
• Process sends virtual address and length

2. Kernel handles virtual->physical mapping and pins region into physical memory
• Process cannot map memory that it does not own (security!)

3. HCA caches the virtual-to-physical mapping and issues a handle
• Includes an l_key and r_key

4. Handle is returned to application

Supercomputing '09 35
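As a sketch of the registration step (continuing the earlier example; the protection domain and buffer size are assumptions), libibverbs exposes this through ibv_reg_mr, whose return value carries the l_key/r_key handles mentioned above.

```c
/* Minimal sketch: register a buffer for communication with libibverbs. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);                 /* communication buffer */

    /* Kernel pins the pages; HCA records the virtual-to-physical mapping.
     * Access flags request local writes plus remote RDMA read/write. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* mr->lkey is used in local WQEs; mr->rkey is handed to the peer
     * (e.g., over a send/recv exchange) so it can issue RDMA operations. */
    return mr;
}
```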

Page 36: 10 Gigabit Ethernet vs Infinibandbasic

Memory Protection

For security, keys are required for all operations that touch buffers

• To send or receive data, the l_key must be provided to the HCA
– HCA verifies access to local memory

• For RDMA, the initiator must have the r_key for the remote virtual address
– Possibly exchanged with a send/recv
– r_key is not encrypted in IB
– r_key is needed for RDMA operations

Supercomputing '09 36

Page 37: 10 Gigabit Ethernet vs Infinibandbasic

Communication in the Channel Semantics (Send/Receive Model)

Processor is involved only to:
1. Post receive WQE
2. Post send WQE
3. Pull out completed CQEs from the CQ

Send WQE contains information about the send buffer

Receive WQE contains information on the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data

[Figure: sender and receiver processors with their memories, QPs and CQs; data moves from the send buffer to the receive buffer through the InfiniBand devices, with a hardware ACK returned to the sender]

Supercomputing '09 37
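A minimal sketch of the channel semantics with libibverbs follows (assuming the QP, CQ and registered memory region from the earlier examples, and an already-established connection): the receiver pre-posts a receive WQE, the sender posts a send WQE, and both sides pull CQEs out of the CQ.

```c
/* Minimal sketch: post receive and send WQEs, then poll for completions. */
#include <stdint.h>
#include <infiniband/verbs.h>

void post_recv(struct ibv_qp *qp, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)mr->addr, .length = len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    ibv_post_recv(qp, &wr, &bad);          /* receive WQE: where data may land */
}

void post_send(struct ibv_qp *qp, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)mr->addr, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);          /* send WQE: what to send */
}

void wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)   /* poll until one CQE arrives */
        ;
}
```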

Page 38: 10 Gigabit Ethernet vs Infinibandbasic

Communication in the Memory Semantics (RDMA Model)

Initiator processor is involved only to:
1. Post send WQE
2. Pull out completed CQE from the send CQ

No involvement from the target processor

Send WQE contains information about the send buffer and the receive buffer

[Figure: the initiator's InfiniBand device writes directly into the target's memory; a hardware ACK is returned without involving the target CPU]

Supercomputing '09 38
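A corresponding sketch of the memory semantics (assuming the local QP and memory region from before, and that the remote address and r_key were obtained from the peer, e.g. via an earlier send/recv): only the initiator posts a WQE.

```c
/* Minimal sketch: post an RDMA write; the target CPU is not involved. */
#include <stdint.h>
#include <infiniband/verbs.h>

void post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, size_t len,
                     uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)mr->addr, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,   /* memory semantics */
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;  /* target virtual address */
    wr.wr.rdma.rkey        = remote_rkey;  /* r_key granting remote access */

    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);
}
```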

Page 39: 10 Gigabit Ethernet vs Infinibandbasic

IB Overview

• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management

Supercomputing '09 39

Page 40: 10 Gigabit Ethernet vs Infinibandbasic

Hardware Protocol Offload

[Figure: IB protocol layers; complete hardware implementations exist]

Supercomputing '09 40

Page 41: 10 Gigabit Ethernet vs Infinibandbasic

Link Layer Capabilities

• CRC-based Data Integrity

• Buffering and Flow Control

• Virtual Lanes, Service Levels and QoS

• Switching and Multicast

• IB WAN Capability

Supercomputing '09 41

Page 42: 10 Gigabit Ethernet vs Infinibandbasic

CRC-based Data Integrity

• Two forms of CRC to achieve both early error detection and end-to-end reliability

– Invariant CRC (ICRC) covers fields that do not change per link (per network hop)
• E.g., routing headers (if there are no routers), transport headers, data payload
• 32-bit CRC (compatible with Ethernet CRC)
• End-to-end reliability (does not include I/O bus)

– Variant CRC (VCRC) covers everything
• Erroneous packets do not have to reach the destination before being discarded
• Early error detection

Supercomputing '09 42

Page 43: 10 Gigabit Ethernet vs Infinibandbasic

Buffering and Flow Control

• IB provides an absolute credit-based flow control
– Receiver guarantees that it has enough space allotted for N blocks of data
– Occasional update of available credits by the receiver

• Has no relation to the number of messages, but only to the total amount of data being sent
– One 1MB message is equivalent to 1024 1KB messages (except for rounding off at message boundaries)

Supercomputing '09 43

Page 44: 10 Gigabit Ethernet vs Infinibandbasic

Link Layer Capabilities

• CRC-based Data Integrity

• Buffering and Flow Control

• Virtual Lanes, Service Levels and QoS

• Switching and Multicast

• IB WAN Capability

Supercomputing '09 44

Page 45: 10 Gigabit Ethernet vs Infinibandbasic

Virtual Lanes

• Multiple virtual links within same physical link
– Between 2 and 16

• Separate buffers and flow control
– Avoids Head-of-Line Blocking

• VL15: reserved for management

• Each port supports one or more data VLs

Supercomputing '09 45

Page 46: 10 Gigabit Ethernet vs Infinibandbasic

Service Levels and QoS

• Service Level (SL):
– Packets may operate at one of 16 different SLs
– Meaning not defined by IB

• SL to VL mapping:
– SL determines which VL on the next link is to be used
– Each port (switches, routers, end nodes) has an SL to VL mapping table configured by the subnet management

• Partitions:
– Fabric administration (through Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows

Supercomputing '09 46

Page 47: 10 Gigabit Ethernet vs Infinibandbasic

Traffic Segregation Benefits

• Segregation of Server, Network and Storage Traffic (IPC, load balancing, web caches, ASP) on the same physical network

• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link

• Provides the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks

[Figure: servers sharing one InfiniBand fabric, with Virtual Lanes carrying IP network traffic (routers, switches, VPNs, DSLAMs) and storage area network traffic (RAID, NAS, backup) side by side]

Courtesy: Mellanox Technologies

Supercomputing '09 47

Page 48: 10 Gigabit Ethernet vs Infinibandbasic

Switching (Layer-2 Routing) and Multicast

• Each port has one or more associated LIDs (Local Identifiers)
– Switches look up which port to forward a packet to based on its destination LID (DLID)
– This information is maintained at the switch

• For multicast packets, the switch needs to maintain multiple output ports to forward the packet to
– Packet is replicated to each appropriate output port
– Ensures at-most-once delivery & loop-free forwarding
– There is an interface for a group management protocol
• Create, join/leave, prune, delete group

Supercomputing '09 48

Page 49: 10 Gigabit Ethernet vs Infinibandbasic

Destination-based Switching/Routing

[Figure: an example IB switch block diagram (Mellanox 144-port) built from spine blocks and leaf blocks]

• Switching: IB supports Virtual Cut-Through (VCT)

• Routing: unspecified by the IB spec
– Up*/Down*, Shift are popular routing engines supported by OFED

• Fat-Tree is a popular topology for IB clusters
– Different over-subscription ratios may be used

Supercomputing '09 49

Page 50: 10 Gigabit Ethernet vs Infinibandbasic

IB Switching/Routing: An Example

[Figure: a two-level switch (spine and leaf blocks) connecting end ports P1 (LID 2) and P2 (LID 4); an example forwarding table maps DLID 2 to out-port 1 and DLID 4 to out-port 4]

• Someone has to set up these tables and give every port an LID
– The "Subnet Manager" does this work (more discussion on this later)

• Different routing algorithms may give different paths

Supercomputing '09 50
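Purely as an illustration of the forwarding step described above (this is conceptual C, not an IB API, and the table contents simply mirror the example on the slide): the switch indexes a table by the packet's destination LID to pick an output port.

```c
/* Conceptual sketch: destination-LID-based forwarding table lookup. */
#include <stdio.h>
#include <stdint.h>

#define MAX_LID 48   /* illustrative table size */

/* Linear forwarding table: entry i gives the out-port for DLID i
 * (0 = no route programmed by the Subnet Manager in this toy example) */
static uint8_t lft[MAX_LID + 1] = { [2] = 1, [4] = 4 };

static int forward(uint16_t dlid)
{
    if (dlid > MAX_LID || lft[dlid] == 0)
        return -1;                          /* unknown destination */
    return lft[dlid];
}

int main(void)
{
    printf("DLID 2 -> out-port %d\n", forward(2));
    printf("DLID 4 -> out-port %d\n", forward(4));
    return 0;
}
```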

Page 51: 10 Gigabit Ethernet vs Infinibandbasic

IB Multicast Example

[Figure: an example of IB multicast forwarding through the switch fabric]

Supercomputing '09 51

Page 52: 10 Gigabit Ethernet vs Infinibandbasic

IB WAN Capability

• Getting increased attention for:
– Remote Storage, Remote Visualization
– Cluster Aggregation (Cluster-of-clusters)

• IB-Optical switches by multiple vendors
– Obsidian Research Corporation: www.obsidianresearch.com
– Network Equipment Technology (NET): www.net.com
– Layer-1 changes from copper to optical; everything else stays the same
• Low-latency copper-optical-copper conversion

• Large link-level buffers for flow control
– Data messages do not have to wait for round-trip hops
– Important in the wide-area network

Supercomputing '09 52

Page 53: 10 Gigabit Ethernet vs Infinibandbasic

Hardware Protocol Offload

[Figure: IB protocol layers; complete hardware implementations exist]

Supercomputing '09 53

Page 54: 10 Gigabit Ethernet vs Infinibandbasic

IB Network Layer Capabilities

• Most capabilities are similar to those of the link layer, but as applied to IB routers
– Routers can send packets across subnets (subnets are management domains, not administrative domains)
– Subnet management packets are consumed by routers, not forwarded to the next subnet

• Several additional features as well
– E.g., routing and flow labels

Supercomputing '09 54

Page 55: 10 Gigabit Ethernet vs Infinibandbasic

Routing and Flow Labels

• Routing follows the IPv6 packet format
– Easy interoperability with wide-area translations
– Link layer might still need to be translated to the appropriate layer-2 protocol (e.g., Ethernet, SONET)

• Flow Labels allow routers to specify which packets belong to the same connection
– Switches can optimize communication by sending packets with the same label in order
– Flow labels can change in the router, but packets belonging to one label will always do so

Supercomputing '09 55

Page 56: 10 Gigabit Ethernet vs Infinibandbasic

Hardware Protocol Offload

[Figure: IB protocol layers; complete hardware implementations exist]

Supercomputing '09 56

Page 57: 10 Gigabit Ethernet vs Infinibandbasic

IB Transport Services

Service Type             Connection Oriented    Acknowledged    Transport
Reliable Connection      Yes                    Yes             IBA
Unreliable Connection    Yes                    No              IBA
Reliable Datagram        No                     Yes             IBA
Unreliable Datagram      No                     No              IBA
RAW Datagram             No                     No              Raw

• Each transport service can have zero or more QPs associated with it
– e.g., you can have 4 QPs based on RC and one based on UD

Supercomputing '09 57
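A small sketch (continuing the earlier QP-creation example; sizes are illustrative) of how the transport service is chosen in practice: the qp_type field of the QP creation attributes selects among the transports that the verbs interface exposes, such as Reliable Connection, Unreliable Connection or Unreliable Datagram.

```c
/* Minimal sketch: select the transport service at QP creation time. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp_of_type(struct ibv_pd *pd, struct ibv_cq *cq,
                                 enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = type,   /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    };
    return ibv_create_qp(pd, &attr);
}
```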

Page 58: 10 Gigabit Ethernet vs Infinibandbasic

Trade-offs in Different Transport Types

[Figure: trade-offs among the different transport types]

Supercomputing '09 58

Page 59: 10 Gigabit Ethernet vs Infinibandbasic

Shared Receive Queue (SRQ)

• SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
– Introduced in specification v1.2

• 0 < p << m*(n-1)

[Figure: a process with one RQ per connection (m receive WQEs each, across n-1 connections) versus a process with a single SRQ of p receive WQEs shared by all connections]

Supercomputing '09 59
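The sketch below (continuing the earlier examples; the pool size is an assumption) shows how an SRQ is created and attached to a QP with libibverbs, so that many connections draw receive WQEs from one shared pool instead of keeping one receive queue each.

```c
/* Minimal sketch: create a Shared Receive Queue and attach a QP to it. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp_with_srq(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 1024, .max_sge = 1 },   /* p shared receive WQEs */
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,               /* receives come from the shared pool */
        .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);  /* receive buffers are posted with
                                         ibv_post_srq_recv() on the SRQ */
}
```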

Page 60: 10 Gigabit Ethernet vs Infinibandbasic

eXtended Reliable Connection (XRC)

M = # of nodes
N = # of processes/node

RC Connections: (M^2-1)*N        XRC Connections: (M-1)*N

• Each QP takes at least one page of memory
– Connections between all processes is very costly for RC

• New IB transport added: eXtended Reliable Connection
– Allows connections between nodes instead of processes

Supercomputing '09 60

Page 61: 10 Gigabit Ethernet vs Infinibandbasic

IB Overview

• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• Channel and memory semantics
– Novel Features
• Hardware Protocol Offload
– Link, network and transport layer features
– Management and Services
• Subnet Management
• Hardware support for scalable network management

Supercomputing '09 61

Page 62: 10 Gigabit Ethernet vs Infinibandbasic

Concepts in IB Management

• Agents
– Processes or hardware units running on each adapter, switch, router (everything on the network)
– Provide capability to query and set parameters

• Managers
– Make high-level decisions and implement them on the network fabric using the agents

• Messaging schemes
– Used for interactions between the manager and agents (or between agents)

• Messages

Supercomputing '09 62

Page 63: 10 Gigabit Ethernet vs Infinibandbasic

Subnet Manager

[Figure: a subnet manager discovering and configuring the fabric: compute nodes and switches with active and inactive links, and multicast join / multicast setup exchanges handled by the subnet manager]

Supercomputing '09 63

Page 64: 10 Gigabit Ethernet vs Infinibandbasic

10GE Overview

• 10-Gigabit Ethernet Family
– Architecture and Components
• Stack Layout
• Out-of-Order Data Placement
• Dynamic and Fine-grained Data Rate Control
– Existing Implementations of 10GE/iWARP

Supercomputing '09 64

Page 65: 10 Gigabit Ethernet vs Infinibandbasic

IB and 10GE: Commonalities and Differences

                         IB                           iWARP/10GE
Hardware Acceleration    Supported                    Supported (for TOE and iWARP)
RDMA                     Supported                    Supported (for iWARP)
Atomic Operations        Supported                    Not supported
Multicast                Supported                    Supported
Data Placement           Ordered                      Out-of-order (for iWARP)
Data Rate-control        Static and Coarse-grained    Dynamic and Fine-grained (for TOE and iWARP)
QoS                      Prioritization               Prioritization and Fixed Bandwidth QoS

Supercomputing '09 65

Page 66: 10 Gigabit Ethernet vs Infinibandbasic

iWARP Architecture and Components

• RDMA Protocol (RDMAP)
– Feature-rich interface
– Security Management

• Remote Direct Data Placement (RDDP)
– Data Placement and Delivery
– Multi Stream Semantics
– Connection Management

• Marker PDU Aligned (MPA)
– Middle Box Fragmentation
– Data Integrity (CRC)

[Figure: iWARP offload engines; the user-level application or library sits over RDMAP, RDDP and MPA, which run over SCTP or TCP/IP in hardware on the network adapter (e.g., 10GigE), with a device driver in between]

(Courtesy iWARP Specification)

Supercomputing '09 66

Page 67: 10 Gigabit Ethernet vs Infinibandbasic

Decoupled Data Placement and Data Delivery

• Place data as it arrives, whether in or out-of-order

• If data is out-of-order, place it at the appropriate offset

• Issues from the application's perspective:
– The second half of the message having been placed does not mean that the first half of the message has arrived as well
– If one message has been placed, it does not mean that the previous messages have been placed

Supercomputing '09 67

Page 68: 10 Gigabit Ethernet vs Infinibandbasic

Protocol Stack Issues with Out-of-Order Data Placement

• The receiver network stack has to understand each frame of data
– If the frame is unchanged during transmission, this is easy!

• Issues to consider:
– Can we guarantee that the frame will be unchanged?
– What happens when intermediate switches segment data?

Supercomputing '09 68

Page 69: 10 Gigabit Ethernet vs Infinibandbasic

Switch Splicing

[Figure: a switch splicing frames on the path between endpoints]

Intermediate Ethernet switches (e.g., those which support splicing) can segment a frame into multiple segments or coalesce multiple segments into a single segment

Supercomputing '09 69

Page 70: 10 Gigabit Ethernet vs Infinibandbasic

Marker PDU Aligned (MPA) Protocol

• Deterministic approach to find segment boundaries

• Approach:
– Places strips of data at regular intervals (based on data sequence number)
– Interval is set to be 512 bytes (small enough to ensure that each Ethernet frame has at least one)
• Minimum IP packet size is 536 bytes (RFC 879)
– Each strip points to the RDDP header
• Each segment independently has enough information about where it needs to be placed

Supercomputing '09 70
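An illustrative sketch of the marker placement rule just described (this is not iWARP library code; the frame offsets are made up): with a 512-byte interval keyed to the stream sequence number, every frame of at least the minimum IP packet size contains at least one marker, so a receiver can locate the RDDP header in any frame it sees.

```c
/* Illustrative sketch: where MPA markers fall in a byte stream. */
#include <stdio.h>
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512u

/* Sequence-number offset of the first marker at or after frame_start */
static uint32_t first_marker_in(uint32_t frame_start)
{
    uint32_t rem = frame_start % MPA_MARKER_INTERVAL;
    return rem ? frame_start + (MPA_MARKER_INTERVAL - rem) : frame_start;
}

int main(void)
{
    /* A frame covering bytes [1000, 1536) of the stream: one marker lands at 1024 */
    uint32_t start = 1000, len = 536;
    uint32_t m = first_marker_in(start);
    printf("frame [%u, %u): marker at offset %u (%s)\n",
           start, start + len, m, m < start + len ? "inside" : "missed");
    return 0;
}
```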

Page 71: 10 Gigabit Ethernet vs Infinibandbasic

MPA Frame Format

[Figure: an MPA frame carries a segment length, the RDDP header, the data payload (if any), padding and a CRC]

Supercomputing '09 71

Page 72: 10 Gigabit Ethernet vs Infinibandbasic

Dynamic and Fine-grained Rate Control

• Part of the Ethernet standard, not iWARP
– Network vendors use a separate interface to support it

• Dynamic bandwidth allocation to flows based on the interval between two packets in a flow
– E.g., one stall for every packet sent on a 10 Gbps network refers to a bandwidth allocation of 5 Gbps
– Complicated because of TCP windowing behavior

• Important for high-latency/high-bandwidth networks
– Large windows exposed on the receiver side
– Receiver overflow controlled through rate control

Supercomputing '09 72

Page 73: 10 Gigabit Ethernet vs Infinibandbasic

Prioritization vs. Fixed Bandwidth QoS

• Can allow for simple prioritization:
– E.g., connection 1 performs better than connection 2
– 8 classes provided (a connection can be in any class)
• Similar to SLs in InfiniBand
– Two priority classes for high-priority traffic
• E.g., management traffic or your favorite application

• Or can allow for specific bandwidth requests:
– E.g., can request 3.62 Gbps bandwidth
– Packet pacing and stalls used to achieve this

• Query functionality to find out "remaining bandwidth"

Supercomputing '09 73

Page 74: 10 Gigabit Ethernet vs Infinibandbasic

10GE Overview

• 10-Gigabit Ethernet Family
– Architecture and Components
• Stack Layout
• Out-of-Order Data Placement
• Dynamic and Fine-grained Data Rate Control
– Existing Implementations of 10GE/iWARP

Supercomputing '09 74

Page 75: 10 Gigabit Ethernet vs Infinibandbasic

Current Usage of Ethernet

[Figure: a regular Ethernet cluster and an iWARP cluster, each a system area network or cluster environment, connected across a wide-area network through regular Ethernet, TOE and iWARP adapters to form a distributed cluster environment]

Supercomputing '09 75

Page 76: 10 Gigabit Ethernet vs Infinibandbasic

Software iWARP based Compatibility

• Regular Ethernet adapters and TOEs are fully compatible

• Compatibility with iWARP required

• Software iWARP emulates the functionality of iWARP on the host
– Fully compatible with hardware iWARP
– Internally utilizes host TCP/IP stack

[Figure: an Ethernet environment mixing regular Ethernet, TOE and iWARP adapters]

Supercomputing '09 76

Page 77: 10 Gigabit Ethernet vs Infinibandbasic

Different iWARP Implementations

[Figure: three iWARP implementation stacks side by side.
– User-level software iWARP (OSU, OSC): application and sockets over a user-level software iWARP layer, using the host TCP/IP stack and regular Ethernet adapters.
– Kernel-level software iWARP (OSU, ANL): application and high-performance sockets over kernel-level iWARP and TCP (modified with MPA)/IP, running on TCP offload engines.
– Hardware iWARP (Chelsio, NetEffect, Ammasso): application over offloaded iWARP, TCP and IP on iWARP-compliant adapters.]

Supercomputing '09 77

Page 78: 10 Gigabit Ethernet vs Infinibandbasic

IB and 10GE Convergence

• InfiniBand/Ethernet Convergence Technologies
– Virtual Protocol Interconnect
– InfiniBand over Ethernet
– RDMA over Converged Enhanced Ethernet

Supercomputing '09 78

Page 79: 10 Gigabit Ethernet vs Infinibandbasic

Virtual Protocol Interconnect (VPI)

• Single network firmware to support both IB and Ethernet

• Autosensing of layer-2 protocol
– Can be configured to automatically work with either IB or Ethernet networks

• Multi-port adapters can use one port on IB and another on Ethernet

• Multiple use modes:
– Datacenters with IB inside the cluster and Ethernet outside
– Clusters with IB network and Ethernet management

[Figure: a VPI adapter with two stacks side by side: applications use IB verbs over the IB transport, network and link layers on the IB port, or sockets over TCP/IP (with hardware TCP/IP support) over the Ethernet link layer on the Ethernet port]

Supercomputing '09 79

Page 80: 10 Gigabit Ethernet vs Infinibandbasic

(InfiniBand) RDMA over Ethernet (IBoE)

• Native convergence of IB network and transport layers with the Ethernet link layer

• IB packets encapsulated in Ethernet frames

• IB network layer already uses IPv6 frames

• Pros:
– Works natively in Ethernet environments
– Has all the benefits of IB verbs

• Cons:
– Network bandwidth limited to Ethernet switches (currently 10Gbps), even though IB can provide 32Gbps
– Some IB native link-layer features are optional in (regular) Ethernet

[Figure: application and IB verbs over the IB transport and network layers in hardware, carried over an Ethernet link layer]

Supercomputing '09 80

Page 81: 10 Gigabit Ethernet vs Infinibandbasic

(InfiniBand) RDMA over Converged Enhanced Ethernet (RoCEE)

• Very similar to IB over Ethernet
– Often used interchangeably with IBoE
– Can be used to explicitly specify that the link layer is Converged Enhanced Ethernet (CEE)

• Pros:
– Works natively in Ethernet environments
– Has all the benefits of IB verbs
– CEE is very similar to the link layer of native IB, so there are no missing features

• Cons:
– Network bandwidth limited to Ethernet switches (currently 10Gbps), even though IB can provide 32Gbps

[Figure: application and IB verbs over the IB transport and network layers in hardware, carried over a CEE link layer]

Supercomputing '09 81

Page 82: 10 Gigabit Ethernet vs Infinibandbasic

IB and 10GE: Feature Comparison

                          IB                   iWARP/10GE      IBoE        RoCEE
Hardware Acceleration     Yes                  Yes             Yes         Yes
RDMA                      Yes                  Yes             Yes         Yes
Atomic Operations         Yes                  No              Yes         Yes
Multicast                 Optional             No              Optional    Optional
Data Placement            Ordered              Out-of-order    Ordered     Ordered
Prioritization            Optional             Optional        Optional    Yes
Fixed BW QoS (ETS)        No                   Optional        Optional    Yes
Ethernet Compatibility    No                   Yes             Yes         Yes
TCP/IP Compatibility      Yes (using IPoIB)    Yes             No          No

Supercomputing '09 82

Page 83: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 83

Page 84: 10 Gigabit Ethernet vs Infinibandbasic

IB Hardware Products

• Many IB vendors: Mellanox, Voltaire and Qlogic
– Aligned with many server vendors: Intel, IBM, SUN, Dell
– And many integrators: Appro, Advanced Clustering, Microway, ...

• Broadly two kinds of adapters
– Offloading (Mellanox) and Onloading (Qlogic)

• Adapters with different interfaces:
– Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT

• MemFree Adapter
– No memory on HCA; uses system memory (through PCIe)
– Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)

• Different speeds
– SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
• Some 12X SDR adapters exist as well (24 Gbps each way)

Supercomputing '09 84

Page 85: 10 Gigabit Ethernet vs Infinibandbasic

Tyan Thunder S2935 Board

(Courtesy Tyan)

Similar boards from Supermicro with LOM features are also available

Supercomputing '09 85

Page 86: 10 Gigabit Ethernet vs Infinibandbasic

IB Hardware Products (contd.)

• Customized adapters to work with IB switches
– Cray XD1 (formerly by Octigabay), Cray CX1

• Switches:
– 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
– 3456-port "Magnum" switch from SUN used at TACC
• 72-port "nano magnum"
– 36-port Mellanox InfiniScale IV QDR switch silicon in early 2008
• Up to 648-port QDR switch by SUN
– New IB switch silicon from Qlogic introduced at SC '08
• Up to 846-port QDR switch by Qlogic

• Switch Routers with Gateways
– IB-to-FC; IB-to-IP

Supercomputing '09 86

Page 87: 10 Gigabit Ethernet vs Infinibandbasic

IB Software Products

• Low-level software stacks
– VAPI (Verbs-Level API) from Mellanox
– Modified and customized VAPI from other vendors
– New initiative: OpenFabrics (formerly OpenIB)
• http://www.openfabrics.org
• Open-source code available with Linux distributions
• Initially IB; later extended to incorporate iWARP

• High-level software stacks
– MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)

Supercomputing '09 87

Page 88: 10 Gigabit Ethernet vs Infinibandbasic

10G, 40G and 100G Ethernet Products

• 10GE adapters
– Intel, Myricom, Mellanox (ConnectX)

• 10GE/iWARP adapters
– Chelsio, NetEffect (now owned by Intel)

• 10GE switches
– Fulcrum Microsystems
• Low latency switch based on 24-port silicon
• FM4000 switch with IP routing and TCP/UDP support
– Fujitsu, Woven Systems (144 ports), Myricom (512 ports), Quadrics (96 ports), Force10, Cisco, Arista (formerly Arastra)

• 40GE and 100GE switches
– Nortel Networks
• 10GE downlinks with 40GE and 100GE uplinks

Supercomputing '09 88

Page 89: 10 Gigabit Ethernet vs Infinibandbasic

Mellanox ConnectX Architecture

• Early adapter supporting IB/10GE convergence
– Support for VPI and IBoE

• Includes other features as well
– Hardware support for Virtualization
– Quality of Service
– Stateless Offloads

(Courtesy Mellanox)

Supercomputing '09 89

Page 90: 10 Gigabit Ethernet vs Infinibandbasic

OpenFabrics

• Open source organization (formerly OpenIB)
– www.openfabrics.org

• Incorporates both IB and iWARP in a unified manner
– Support for Linux and Windows
– Design of complete stack with `best of breed' components
• Gen1
• Gen2 (current focus)

• Users can download the entire stack and run
– Latest release is OFED 1.4.2
– OFED 1.5 is underway

Supercomputing '09 90

Page 91: 10 Gigabit Ethernet vs Infinibandbasic

OpenFabrics Stack with Unified Verbs Interface

[Figure: the unified verbs interface (libibverbs) at user level sits over vendor user-space libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3), which pair with the corresponding kernel modules (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3) driving Mellanox, QLogic, IBM and Chelsio adapters]

Supercomputing '09 91

Page 92: 10 Gigabit Ethernet vs Infinibandbasic

OpenFabrics on Convergent IB/10GE

• For IBoE and RoCEE, the upper-level stacks remain completely unchanged

• Within the hardware:
– Transport and network layers remain completely unchanged
– Both IB and Ethernet (or CEE) link layers are supported on the network adapter

• Note: The OpenFabrics stack is not valid for the Ethernet path in VPI
– That still uses sockets and TCP/IP

[Figure: the verbs interface (libibverbs) over the ConnectX user library (libmlx4) and kernel module (ib_mlx4), with the ConnectX adapter exposing both 10GE and IB ports]

Supercomputing '09 92

Page 93: 10 Gigabit Ethernet vs Infinibandbasic

OpenFabrics Software Stack

[Figure: the OpenFabrics software stack. At application level, clustered DB access, sockets-based access, various MPIs, file-system access, block storage access and IP-based apps, plus diagnostic tools and OpenSM, sit over user APIs (user-level verbs/API for InfiniBand and iWARP R-NICs, user-level MAD API, UDAPL, SDP library). Kernel space provides upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), a mid-layer (connection manager abstraction, connection managers, SA client, MAD, SMA) and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs, with kernel-bypass paths for data access.

Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC]

Supercomputing '09 93

Page 94: 10 Gigabit Ethernet vs Infinibandbasic

InfiniBand in the Top500 Systems Performance

[Figure: Top500 systems performance share by interconnect]

Percentage share of InfiniBand is steadily increasing

Supercomputing '09 94

Page 95: 10 Gigabit Ethernet vs Infinibandbasic

Large-scale InfiniBand Installations

• 151 IB clusters (30.2%) in the June '09 TOP500 list (www.top500.org)

• Installations in the Top 30 (15 of them):

129,600 cores (RoadRunner) at LANL (1st)          12,288 cores at GENCI-CINES, France (20th)
51,200 cores (Pleiades) at NASA Ames (4th)        8,320 cores in UK (25th)
62,976 cores (Ranger) at TACC (8th)               8,320 cores in UK (26th)
26,304 cores (Juropa) at FZ Jülich, Germany (10th) 8,064 cores (DKRZ) in Germany (27th)
30,720 cores (Dawning) at Shanghai (15th)         12,032 cores at JAXA, Japan (28th)
14,336 cores at New Mexico (17th)                 10,240 cores at TEP, France (29th)
14,384 cores at Tata CRL, India (18th)            13,728 cores in Sweden (30th)
18,224 cores at LLNL (19th)                       More are getting installed!

Supercomputing '09 95

Page 96: 10 Gigabit Ethernet vs Infinibandbasic

10GE Installations

• Several Enterprise Computing Domains
– Enterprise Datacenters (HP, Intel) and Financial Markets
– Animation firms (e.g., Universal Studios created "The Hulk" and many new movies using 10GE)

• Scientific Computing Installations
– 5,600-core installation at Purdue with Chelsio/iWARP
– 640-core installation at the University of Heidelberg, Germany
– 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and Woven Systems switch
– 256-core installation at Argonne National Lab with Myri-10G

• Integrated Systems
– BG/P uses 10GE for I/O (ranks 3, 7, 9, 14, 24 in the Top 25)

• ESnet to install $62M 100GE infrastructure for US DOE

Supercomputing '09 96

Page 97: 10 Gigabit Ethernet vs Infinibandbasic

Dual IB/10GE Systems

• Such systems are being integrated

• E.g., the T2K-Tsukuba system (300 TFlop system)

• Systems at three sites (Tsukuba, Tokyo, Kyoto)

• Internal connectivity: quad-rail IB ConnectX network

• External connectivity: 10GE

(Courtesy Taisuke Boku, University of Tsukuba)

Supercomputing '09 97

Page 98: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 98

Page 99: 10 Gigabit Ethernet vs Infinibandbasic

Case Studies

• Low-level Performance

• Message Passing Interface (MPI)

• File Systems

Supercomputing '09 99

Page 100: 10 Gigabit Ethernet vs Infinibandbasic

Low-level Latency Measurements

[Figure: small-message and large-message latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches

Supercomputing '09 100

Page 101: 10 Gigabit Ethernet vs Infinibandbasic

Low-level Uni- and Bi-directional Bandwidth Measurements

[Figure: uni-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches

Supercomputing '09 101

Page 102: 10 Gigabit Ethernet vs Infinibandbasic

Case Studies

• Low-level Performance

• Message Passing Interface (MPI)

• File Systems

Supercomputing '09 102

Page 103: 10 Gigabit Ethernet vs Infinibandbasic

Message Passing Interface (MPI)

• De-facto message passing standard
– Point-to-point communication
– Collective communication (broadcast, multicast, reduction, barrier)
– MPI-1 and MPI-2 available; MPI-3 under discussion

• Has been implemented for various past commodity networks (Myrinet, Quadrics)

• How can it be designed and efficiently implemented for InfiniBand and iWARP?

Supercomputing '09 103
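For readers new to MPI, the minimal point-to-point sketch below (illustrative only; the performance results on the following slides come from micro-benchmarks, not this program) shows the send/recv interface that implementations such as MVAPICH and MPICH2 map onto IB verbs or iWARP underneath.

```c
/* Minimal MPI point-to-point sketch: rank 0 and rank 1 exchange an integer. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got reply %d\n", buf);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```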

Page 104: 10 Gigabit Ethernet vs Infinibandbasic

MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and 10GE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 975 organizations in 51 countries
– More than 34,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 8th ranked 62,976-core cluster (Ranger) at TACC
– Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)
– Also supports uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/

Supercomputing '09 104

Page 105: 10 Gigabit Ethernet vs Infinibandbasic

MPICH2 Software Stack

• High-performance and Widely Portable MPI
– Supports MPI-1, MPI-2 and MPI-2.1
– Supports multiple networks (TCP, IB, iWARP, Myrinet)
– Commercial support by many vendors
• IBM (integrated stack distributed by Argonne)
• Microsoft, Intel (in process of integrating their stack)
– Used by many derivative implementations
• E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
• MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
– Available with many software distributions
– Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
– http://www.mcs.anl.gov/research/projects/mpich2

Supercomputing '09 105

Page 106: 10 Gigabit Ethernet vs Infinibandbasic

One-way Latency: MPI over IB

[Figure: small- and large-message one-way latency (us) vs. message size for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies range from about 1.06 to 2.77 us]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back

Supercomputing '09 106

Page 107: 10 Gigabit Ethernet vs Infinibandbasic

Bandwidth: MPI over IB

[Figure: unidirectional and bidirectional bandwidth (MillionBytes/sec) vs. message size for the same five MVAPICH/adapter combinations; peak unidirectional bandwidth ranges from about 936.5 to 3022.1 MB/s and peak bidirectional bandwidth from about 1519.8 to 5553.5 MB/s]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch

Supercomputing '09 107

Page 108: 10 Gigabit Ethernet vs Infinibandbasic

One-way Latency: MPI over iWARP

[Figure: one-way latency (us) vs. message size for Chelsio TCP/IP (15.47 us small-message latency) and Chelsio iWARP (6.88 us)]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Supercomputing '09 108

Page 109: 10 Gigabit Ethernet vs Infinibandbasic

Bandwidth: MPI over iWARP

[Figure: unidirectional and bidirectional bandwidth (MillionBytes/sec) vs. message size for Chelsio TCP/IP and Chelsio iWARP; peak unidirectional bandwidth of about 839.8 (TCP/IP) vs. 1231.8 MB/s (iWARP) and peak bidirectional bandwidth of about 855.3 vs. 2260.8 MB/s]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Supercomputing '09 109

Page 110: 10 Gigabit Ethernet vs Infinibandbasic

Convergent Technologies: MPI Latency

[Figure: small- and large-message MPI latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches

Supercomputing '09 110

Page 111: 10 Gigabit Ethernet vs Infinibandbasic

Convergent Technologies: MPI Uni- and Bi-directional Bandwidth

[Figure: uni-directional and bi-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches

Supercomputing '09 111

Page 112: 10 Gigabit Ethernet vs Infinibandbasic

Case Studies

• Low-level Performance

• Message Passing Interface (MPI)

• File Systems

Supercomputing '09 112

Page 113: 10 Gigabit Ethernet vs Infinibandbasic

Sample Diagram of State-of-the-Art File Systems

• Sample file systems:
– Lustre, Panasas, GPFS, Sistina/Redhat GFS
– PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Figure: computing nodes and a metadata server connected through the network to I/O servers]

Supercomputing '09 113

Page 114: 10 Gigabit Ethernet vs Infinibandbasic

Lustre Performance

[Figure: write and read throughput (MBps) with 4 OSSs for 1-4 clients, IPoIB vs. Native IB]

• Lustre over Native IB
– Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB

• Memory copies in IPoIB and Native IB
– Reduced throughput and high overhead; I/O servers are saturated

Supercomputing '09 114

Page 115: 10 Gigabit Ethernet vs Infinibandbasic

CPU Utilization

[Figure: CPU utilization (%) for 1-4 clients, comparing IPoIB and Native IB reads and writes]

• 4 OSS nodes, IOzone record size 1MB

• Offers potential for greater scalability

Supercomputing '09 115

Page 116: 10 Gigabit Ethernet vs Infinibandbasic

NFS/RDMA Performance

[Figure: write (tmpfs) and read (tmpfs) throughput (MB/s) vs. number of threads (1-15) for the Read-Read and Read-Write designs]

• IOzone read bandwidth up to 913 MB/s (Sun x2200's with x8 PCIe)
• Read-Write design by OSU, available with the latest OpenSolaris
• NFS/RDMA is being added into OFED 1.4

R. Noronha, L. Chai, T. Talpey and D. K. Panda, "Designing NFS With RDMA For Security, Performance and Scalability", ICPP '07

Supercomputing '09 116

Page 117: 10 Gigabit Ethernet vs Infinibandbasic

Summary of Design Performance Results

• Current generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance

• IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes

• Convergence between IB and 10GigE is emerging

• Significant performance improvement is expected in the near future

Supercomputing '09 117

Page 118: 10 Gigabit Ethernet vs Infinibandbasic

Presentation Overview

• Introduction

• Why InfiniBand and 10-Gigabit Ethernet?

• Overview of IB, 10GE, their Convergence and Features

• IB and 10GE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

Supercomputing '09 118

Page 119: 10 Gigabit Ethernet vs Infinibandbasic

Concluding Remarks

• Presented network architectures & trends in clusters

• Presented background and details of IB and 10GE
– Highlighted the main features of IB and 10GE and their convergence
– Gave an overview of IB and 10GE hw/sw products
– Discussed sample performance numbers in designing various high-end systems with IB and 10GE

• IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions

Supercomputing '09 119

Page 120: 10 Gigabit Ethernet vs Infinibandbasic

Funding Acknowledgments

Our research is supported by the following organizations

• Funding support by

• Equipment support by

Supercomputing '09 120

Page 121: 10 Gigabit Ethernet vs Infinibandbasic

Personnel Acknowledgments

Current Students
– M. Kalaiya (M.S.)
– K. Kandalla (M.S.)
– P. Lai (Ph.D.)
– M. Luo (Ph.D.)
– G. Marsh (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Potluri (M.S.)
– H. Subramoni (Ph.D.)

Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph.D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Current Post-Doc
– E. Mancini

Current Programmer
– J. Perkins

Supercomputing '09 121

Page 122: 10 Gigabit Ethernet vs Infinibandbasic

Web Pointers

http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu

[email protected]
[email protected]
[email protected]

Supercomputing '09 122