Transcript
Page 1: To Infiniband and Beyond


To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters

Teresa Kaltz, PhD
Research Computing

December 3, 2009

Page 2: To Infiniband and Beyond

Interconnect Types on Top 500


On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems.

Michael Feldman, HPCwire Editor

Page 3: To Infiniband and Beyond

Top 500 Interconnects 2002-2009

[Chart: number of Top 500 systems (0-500) per year, 2002-2009, by interconnect family: Ethernet, Infiniband, Other.]

Page 4: To Infiniband and Beyond

What is Infiniband Anyway?

•  Open, standard interconnect architecture

–  http://www.infinibandta.org/index.php
–  Complete specification available for download

•  Complete "ecosystem"
–  Both hardware and software

•  High bandwidth, low latency, switch-based

•  Allows remote direct memory access (RDMA)


Page 5: To Infiniband and Beyond

Why Remote DMA?

•  TCP offload engines reduce overhead by offloading protocol processing, such as checksums, to the NIC

•  Data is still copied twice on receive: NIC → kernel → user

•  The solution is Remote DMA (RDMA)


Per-byte overhead:
  User-system copy        16.5 %
  TCP checksum            15.2 %
  Network-memory copy     31.8 %

Per-packet overhead:
  Driver                   8.2 %
  TCP+IP+ARP protocols     8.2 %
  OS overhead             19.8 %

Page 6: To Infiniband and Beyond

What is RDMA?


Page 7: To Infiniband and Beyond

Infiniband Signalling Rate

•  Each link is a point-to-point serial connection

•  Links are usually aggregated into groups of four (4X)

•  Unidirectional effective bandwidth (see the worked arithmetic below)
–  SDR 4X: 1 GB/s
–  DDR 4X: 2 GB/s
–  QDR 4X: 4 GB/s

•  Bidirectional bandwidth is twice unidirectional

•  Many factors impact measured performance!
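The 1, 2 and 4 GB/s figures follow directly from the per-lane signalling rates (2.5, 5 and 10 Gb/s for SDR, DDR and QDR) and the 8b/10b encoding, which carries 8 data bits per 10 signalled bits. A minimal sketch of the arithmetic in C:

    #include <stdio.h>

    /* Effective unidirectional bandwidth of an aggregated InfiniBand link,
     * assuming standard per-lane signalling rates and 8b/10b encoding. */
    static double effective_gbytes(double gbaud_per_lane, int lanes)
    {
        double data_gbits = gbaud_per_lane * lanes * 8.0 / 10.0; /* 8b/10b */
        return data_gbits / 8.0;                                 /* bits -> bytes */
    }

    int main(void)
    {
        printf("SDR 4X: %.0f GB/s\n", effective_gbytes(2.5, 4));  /* 1 GB/s */
        printf("DDR 4X: %.0f GB/s\n", effective_gbytes(5.0, 4));  /* 2 GB/s */
        printf("QDR 4X: %.0f GB/s\n", effective_gbytes(10.0, 4)); /* 4 GB/s */
        return 0;
    }

Measured throughput is always somewhat lower once transport headers, host-bus overhead and MTU effects are taken into account, which is the point of the "many factors" caveat above.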


Page 8: To Infiniband and Beyond

Infiniband Roadmap from IBTA


Page 9: To Infiniband and Beyond

DDR 4X Unidirectional Bandwidth


•  Achieved bandwidth limited by PCIe x8 Gen 1 on the host: at roughly 250 MB/s per lane, x8 Gen 1 offers about 2 GB/s of raw bandwidth per direction, so after PCIe packet overhead the host bus delivers less than the 2 GB/s DDR 4X link rate

•  Current platforms mostly ship with PCIe Gen 2, which doubles the per-lane rate

Page 10: To Infiniband and Beyond

QDR 4X Unidirectional Bandwidth

Source: http://mvapich.cse.ohio-state.edu/performance/interNode.shtml

•  There still appears to be a bottleneck at the host when using QDR

Page 11: To Infiniband and Beyond

Latency Measurements: IB vs GbE


Page 12: To Infiniband and Beyond

Infiniband Latency Measurements


Page 13: To Infiniband and Beyond

Infiniband Silicon Vendors

•  Both switch and HCA parts
–  Mellanox: InfiniScale, InfiniHost
–  QLogic: TrueScale, InfiniPath

•  Many OEMs use their silicon

•  Large switches
–  Parts arranged in fat tree topology


Page 14: To Infiniband and Beyond

Infiniband Switch Hardware

[Figure: switch product line built from 24 port silicon: 24, 48, 96, 144 and 288 port models.]

•  24 port silicon product line shown in the figure above

•  Scales to thousands of ports (see the fat tree arithmetic below)

•  Host-based and hardware-based subnet management

•  Current generation (QDR) based on 36 port silicon

•  Up to 864 ports in a single switch!!
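As a rough illustration of how small-radix silicon scales to large port counts: in a two-level fat tree built from radix-k crossbar chips, each leaf chip gives half its ports to hosts and half to spine chips, exposing about k*(k/2) external ports. With 24 port silicon that is 288 ports, matching the largest director above; the 864 port QDR chassis and "thousands of ports" fabrics go beyond this simple two-level arrangement by adding levels and chips. The sketch below is the generic fat tree rule of thumb, not any vendor's internal design:

    #include <stdio.h>

    /* External ports of a two-level fat tree of radix-k switch chips:
     * k leaf chips, each with k/2 host-facing ports and k/2 uplinks. */
    static int two_level_fat_tree_ports(int radix)
    {
        return radix * (radix / 2);
    }

    int main(void)
    {
        printf("24-port silicon: %d ports\n", two_level_fat_tree_ports(24)); /* 288 */
        printf("36-port silicon: %d ports\n", two_level_fat_tree_ports(36)); /* 648 */
        return 0;
    }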

Page 15: To Infiniband and Beyond

Infiniband Topology

•  Infiniband uses credit-based flow control
–  Need to avoid loops in topology that may produce deadlock

•  Common topology for small and medium size networks is tree (CLOS)

•  Mesh/torus more cost effective for large clusters (>2500 hosts)


Page 16: To Infiniband and Beyond

Infiniband Routing

•  Infiniband is statically routed (see the sketch below)

•  Subnet management software discovers the fabric and generates a set of routing tables
–  Most subnet managers support multiple routing algorithms

•  Tables updated with changes in topology only

•  Often cannot achieve theoretical bisection bandwidth with static routing

•  QDR silicon introduces adaptive routing
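To make "statically routed" concrete, here is a minimal sketch, using hypothetical data structures rather than OpenSM internals, of the kind of linear forwarding table a subnet manager computes for each switch: every destination LID maps to exactly one output port, and the table only changes when the fabric is re-swept after a topology change.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_LID 48                    /* toy fabric size (made up) */
    static uint8_t out_port[MAX_LID];     /* destination LID -> output port */

    /* Toy routing pass: spread destination LIDs round-robin across the
     * uplink ports, the flavour of static balancing a subnet manager does. */
    static void compute_routes(int num_uplinks, int first_uplink_port)
    {
        for (uint16_t lid = 1; lid < MAX_LID; lid++)
            out_port[lid] = (uint8_t)(first_uplink_port + lid % num_uplinks);
    }

    int main(void)
    {
        compute_routes(4, 12);            /* 4 uplinks on ports 12-15 (made up) */
        printf("LID 7 -> port %u\n", out_port[7]);
        printf("LID 8 -> port %u\n", out_port[8]);
        return 0;
    }

Because each mapping is fixed, two heavy flows whose destinations land on the same uplink share that link even if other uplinks sit idle, which is why static routing often falls short of theoretical bisection bandwidth.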


Page 17: To Infiniband and Beyond

HPCC Random Ring Benchmark

[Chart: average ring bandwidth (MB/s, 0-1600) versus number of enclosures for four routing configurations, "Routing 1" through "Routing 4".]

Page 18: To Infiniband and Beyond

Infiniband Specification for Software

•  IB specification does not define an API

•  Actions are known as "verbs"
–  Services provided to upper layer protocols
–  Send verb, receive verb, etc

•  Community has standardized around an open source distribution called OFED to provide verbs (see the sketch below)

•  Some Infiniband software is also available from vendors
–  Subnet management
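For a taste of what programming to the verbs layer looks like, below is a minimal sketch using the libibverbs API shipped with OFED: open the first HCA, allocate a protection domain, and register a buffer so the adapter is allowed to DMA into it. Error handling is trimmed and no queue pairs or transfers are shown; this illustrates the style of the API rather than a complete RDMA program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        /* Open the first HCA and allocate a protection domain. */
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Register a buffer so the HCA may read/write it directly (RDMA). */
        void *buf = malloc(4096);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        printf("registered 4 KB at %p, lkey=0x%x rkey=0x%x\n",
               buf, mr->lkey, mr->rkey);

        /* A real program would now create completion queues and queue pairs,
         * exchange keys and QP numbers with the peer, and post work requests. */
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }

Build against libibverbs (e.g. cc verbs_sketch.c -libverbs).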


Page 19: To Infiniband and Beyond

Application Support of Infiniband

•  All MPI implementations support native IB (see the minimal example below)
–  OpenMPI, MVAPICH, Intel MPI

•  Existing socket applications
–  IP over IB
–  Sockets direct protocol (SDP)

•  Does NOT require re-link of application

•  Oracle uses RDS (reliable datagram sockets)
–  First available in Oracle 10g R2

•  Developer can program to "verbs" layer
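To see why native IB support in the MPI library is all an application needs, here is a minimal MPI ping in C: nothing in it is interconnect-specific, and the same source runs over GbE or Infiniband depending on how the MPI implementation is built and configured.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Rank 0 sends one integer to rank 1; the MPI library decides whether
         * the message travels over TCP, native IB verbs, shared memory, etc. */
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With MVAPICH the IB path is used by default; with Open MPI of this era the transport can be pinned at run time (for example mpirun --mca btl self,openib ...). Either way the application source is unchanged.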


Page 20: To Infiniband and Beyond

Infiniband Software Layers


Page 21: To Infiniband and Beyond

OFED Software

•  Openfabrics Enterprise Distribution software from the Openfabrics Alliance
–  http://www.openfabrics.org/

•  Contains everything needed to run Infiniband
–  HCA drivers
–  verbs implementation
–  subnet management
–  diagnostic tools (see the port query sketch below)

•  Versions qualified together
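OFED's diagnostic tools (ibv_devinfo, ibstat and friends) are thin consumers of the same verbs layer. A minimal sketch of the underlying query, assuming port 1 on the first HCA:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0)
            return 1;

        struct ibv_context *ctx = ibv_open_device(devs[0]);

        /* Query port 1: link state, LID assigned by the subnet manager, MTU. */
        struct ibv_port_attr attr;
        if (ibv_query_port(ctx, 1, &attr) == 0) {
            printf("%s port 1: state=%d lid=%u max_mtu=%d\n",
                   ibv_get_device_name(devs[0]),
                   attr.state, attr.lid, attr.max_mtu);
        }

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }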


Page 22: To Infiniband and Beyond

Openfabrics Software Components


Page 23: To Infiniband and Beyond

"High Performance" Ethernet

•  1 GbE cheap and ubiquitous
–  hardware acceleration
–  multiple multiport NICs
–  supported in kernel

•  10 GbE still used primarily as uplinks from edge switches and as backbone

•  Some vendors providing 10 GbE to the server
–  low cost NIC on motherboard
–  HCAs with performance proportional to cost


Page 24: To Infiniband and Beyond

RDMA over Ethernet

•  NIC capable of RDMA is called an RNIC

•  RDMA is the primary method of reducing latency on the host side

•  Multiple vendors have RNICs
–  Mainstream: Broadcom, Intel, etc.
–  Boutique: Chelsio, Mellanox, etc.

•  New Ethernet standards
–  "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc


Page 25: To Infiniband and Beyond

What is iWarp?

•  RDMA Consortium (RDMAC) standardized some protocols which are now part of the IETF Remote Direct Data Placement (RDDP) working group

•  http://www.rdmaconsortium.org/home

•  Also defined SRP, iSER in addition to verbs

•  iWARP supported in OFED

•  Most specification work complete in ~2003


Page 26: To Infiniband and Beyond

RDMA over Ethernet?


The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet) is a working name.

You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names.

Tom Talpey, Microsoft Corporation
Paul Grun, System Fabric Works
August 2009

Page 27: To Infiniband and Beyond

The Future: InfiniFibreNet

•  Vendors moving towards "converged fabrics"

•  Using same "fabric" for both networking and storage

•  Storage protocols and IB over Ethernet

•  Storage protocols over Infiniband
–  NFS over RDMA, Lustre

•  Gateway switches and converged adapters
–  Various combinations of Ethernet, IB and FC


Page 28: To Infiniband and Beyond

Any Questions?


THANK YOU!

(And no mention of The Cloud)

