To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters
Teresa Kaltz, PhD, Research Computing
December 3, 2009

DESCRIPTION

Harvard HPC Seminar Series
Teresa Kaltz, PhD, High Performance Technical Computing, FAS, Harvard

Due to the wide availability and low cost of high speed networking, commodity clusters have become the de facto standard for building high performance parallel computing systems. This talk will introduce the leading technology for high speed interconnects, called Infiniband, and compare its deployment and performance to Ethernet. In addition, some emerging interconnect technologies and trends in cluster networking will be discussed.

TRANSCRIPT

Page 1: To Infiniband and Beyond

To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters

Teresa Kaltz, PhD, Research Computing

December 3, 2009

Page 2: To Infiniband and Beyond

Interconnect Types on Top 500


On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems.

Michael Feldman, HPCwire Editor

Page 3: To Infiniband and Beyond

Top 500 Interconnects 2002-2009

[Chart: number of Top 500 systems by interconnect family (Ethernet, Infiniband, Other), 2002-2009; vertical axis 0-500 systems]

Page 4: To Infiniband and Beyond

What is Infiniband Anyway?

•  Open, standard interconnect architecture

–  http://www.infinibandta.org/index.php
–  Complete specification available for download

•  Complete "ecosystem"
–  Both hardware and software

•  High bandwidth, low latency, switch-based
•  Allows remote direct memory access (RDMA)

Page 5: To Infiniband and Beyond

Why Remote DMA?

•  TCP offload engines reduce overhead by offloading protocol processing such as checksumming
•  2 copies on receive: NIC → kernel → user
•  Solution is Remote DMA (RDMA)

Percent overhead per byte:
–  User-system copy: 16.5 %
–  TCP checksum: 15.2 %
–  Network-memory copy: 31.8 %

Percent overhead per packet:
–  Driver: 8.2 %
–  TCP+IP+ARP protocols: 8.2 %
–  OS overhead: 19.8 %

Page 6: To Infiniband and Beyond

What is RDMA?

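This slide's diagram is not reproduced here, but the idea can be made concrete in code. Below is a sketch (not from the talk) of what a one-sided RDMA write looks like using the "verbs" API discussed later in the presentation. The function name and parameters are illustrative, and all of the setup is assumed to have happened already: the queue pair is connected and the peer's buffer address and rkey were exchanged out of band.

    /* Sketch only: post a one-sided RDMA WRITE with libibverbs.
     * Assumes qp is already connected (INIT -> RTR -> RTS) and that
     * remote_addr/rkey were exchanged out of band, e.g. over TCP. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf, /* local source buffer */
            .length = len,
            .lkey   = mr->lkey,             /* from ibv_reg_mr() */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,  /* one-sided operation */
            .send_flags = IBV_SEND_SIGNALED,  /* request a local completion */
            .wr.rdma.remote_addr = remote_addr,
            .wr.rdma.rkey        = rkey,
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* The HCA writes the data directly into the peer's registered memory;
         * the remote CPU and kernel never touch the transfer. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

Completion would then be detected by polling the local completion queue with ibv_poll_cq(); nothing runs on the remote host.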

Page 7: To Infiniband and Beyond

Infiniband Signalling Rate

•  Each link is a point-to-point serial connection
•  Usually aggregated into groups of four
•  Unidirectional effective bandwidth (derivation below)
–  SDR 4X: 1 GB/s
–  DDR 4X: 2 GB/s
–  QDR 4X: 4 GB/s

•  Bidirectional bandwidth twice unidirectional
•  Many factors impact measured performance!
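These round numbers follow from the per-lane signalling rates (2.5, 5 and 10 Gb/s for SDR, DDR and QDR) and the 8b/10b line encoding, which delivers 8 data bits for every 10 bits on the wire. A quick derivation, added here for reference:

\begin{align*}
\text{SDR 4X: } 4 \times 2.5~\text{Gb/s} \times \tfrac{8}{10} &= 8~\text{Gb/s} = 1~\text{GB/s} \\
\text{DDR 4X: } 4 \times 5~\text{Gb/s} \times \tfrac{8}{10} &= 16~\text{Gb/s} = 2~\text{GB/s} \\
\text{QDR 4X: } 4 \times 10~\text{Gb/s} \times \tfrac{8}{10} &= 32~\text{Gb/s} = 4~\text{GB/s}
\end{align*}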


Page 8: To Infiniband and Beyond

Infiniband Roadmap from IBTA


Page 9: To Infiniband and Beyond

DDR 4X Unidirectional Bandwidth


•  Achieved bandwidth limited by PCIe x8 Gen 1 (see the arithmetic below)

•  Current platforms mostly ship with PCIe Gen 2
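A back-of-the-envelope check, added here: PCIe Gen 1 lanes also signal at 2.5 GT/s with 8b/10b encoding, so an x8 slot's raw ceiling already equals the DDR 4X link rate, and PCIe packet (TLP) overhead pushes the achievable payload rate below it.

\[
\text{PCIe Gen 1 x8: } 8 \times 2.5~\text{GT/s} \times \tfrac{8}{10} = 16~\text{Gb/s} = 2~\text{GB/s per direction (before TLP overhead)}
\]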

Page 10: To Infiniband and Beyond

QDR 4X Unidirectional Bandwidth

Source: http://mvapich.cse.ohio-state.edu/performance/interNode.shtml

•  Still appears to be a bottleneck at the host when using QDR (see the arithmetic below)
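The same back-of-the-envelope arithmetic, again added here, suggests why: PCIe Gen 2 doubles the lane rate to 5 GT/s, so an x8 slot's raw ceiling exactly matches the QDR 4X link rate, leaving no headroom for PCIe protocol overhead.

\[
\text{PCIe Gen 2 x8: } 8 \times 5~\text{GT/s} \times \tfrac{8}{10} = 32~\text{Gb/s} = 4~\text{GB/s per direction (before TLP overhead)}
\]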

Page 11: To Infiniband and Beyond

Latency Measurements: IB vs GbE


Page 12: To Infiniband and Beyond

Infiniband Latency Measurements


Page 13: To Infiniband and Beyond

Infiniband Silicon Vendors

•  Both switch and HCA parts
–  Mellanox: Infiniscale, Infinihost
–  Qlogic: Truescale, Infinipath

•  Many OEMs use their silicon
•  Large switches
–  Parts arranged in fat tree topology

Page 14: To Infiniband and Beyond

Infiniband Switch Hardware

[Product photos: 24, 48, 96, 144 and 288 port switch models]

•  24 port silicon product line (pictured above)
•  Scales to thousands of ports
•  Host-based and hardware-based subnet management
•  Current generation (QDR) based on 36 port silicon
•  Up to 864 ports in single switch!!

Page 15: To Infiniband and Beyond

Infiniband Topology

•  Infiniband uses credit-based flow control
–  Need to avoid loops in topology that may produce deadlock

•  Common topology for small and medium size networks is tree (CLOS); see the sizing sketch below
•  Mesh/torus more cost effective for large clusters (>2500 hosts)
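For a sense of scale, a sizing sketch added here (not from the slide): in a non-blocking two-tier fat tree built from n-port switch ASICs, each edge switch dedicates half its ports to hosts and half to uplinks, so the fabric tops out at n^2/2 host ports.

\[
\text{max host ports} = \frac{n^{2}}{2}, \qquad n = 36 \;\Rightarrow\; \frac{36^{2}}{2} = 648
\]

Growing past that requires a third switching tier, with more switches and cables per host, which is one reason mesh/torus topologies become attractive for very large clusters.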


Page 16: To Infiniband and Beyond

Infiniband Routing

•  Infiniband is statically routed
•  Subnet management software discovers fabric and generates set of routing tables
–  Most subnet managers support multiple routing algorithms

•  Tables updated with changes in topology only
•  Often cannot achieve theoretical bisection bandwidth with static routing
•  QDR silicon introduces adaptive routing

Page 17: To Infiniband and Beyond

HPCC Random Ring Benchmark

17

0

200

400

600

800

1000

1200

1400

1600

Avg

Ban

dwid

th (M

B/s

)

Number of Enclosures

"Routing 1"

"Routing 2"

"Routing 3"

"Routing 4"

Page 18: To Infiniband and Beyond

Infiniband Specification for Software

•  IB specification does not define an API
•  Actions are known as "verbs"
–  Services provided to upper layer protocols
–  Send verb, receive verb, etc.

•  Community has standardized around an open source distribution called OFED to provide verbs (see the sketch after this list)
•  Some Infiniband software is also available from vendors
–  Subnet management
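To give a flavour of coding against the verbs layer, here is a minimal sketch (not from the talk) that uses libibverbs, the verbs library shipped with OFED, to list the Infiniband devices in a host and report the state of each device's first port. Build with something like cc ibquery.c -libverbs.

    /* List Infiniband devices and the state of port 1 on each. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int n = 0;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        for (int i = 0; i < n; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            if (!ctx)
                continue;

            struct ibv_port_attr port;
            if (ibv_query_port(ctx, 1, &port) == 0)  /* ports are numbered from 1 */
                printf("%-16s LID 0x%04x  %s\n",
                       ibv_get_device_name(devs[i]), port.lid,
                       port.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active");

            ibv_close_device(ctx);
        }

        ibv_free_device_list(devs);
        return 0;
    }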


Page 19: To Infiniband and Beyond

Application Support of Infiniband

•  All MPI implementations support native IB (a minimal example follows this list)
–  OpenMPI, MVAPICH, Intel MPI

•  Existing socket applications
–  IP over IB
–  Sockets direct protocol (SDP)
•  Does NOT require re-link of application

•  Oracle uses RDS (reliable datagram sockets)
–  First available in Oracle 10g R2

•  Developer can program to "verbs" layer
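Most applications see Infiniband's latency and bandwidth through MPI rather than the verbs layer. The following ping-pong sketch (illustrative, not from the talk) contains no IB-specific code at all; OpenMPI, MVAPICH or Intel MPI will run it over native IB when launched on an IB-connected cluster.

    /* Two-rank ping-pong: reports the average one-way (half round-trip) latency. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[8] = {0};
        const int iters = 10000;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("average one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        MPI_Finalize();
        return 0;
    }

Launch with two ranks on two different nodes (e.g. mpirun -np 2 -host node1,node2 ./pingpong); the MPI library selects the native IB transport underneath.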


Page 20: To Infiniband and Beyond

Infiniband Software Layers


Page 21: To Infiniband and Beyond

OFED Software

•  OpenFabrics Enterprise Distribution software from the OpenFabrics Alliance
–  http://www.openfabrics.org/

•  Contains everything needed to run Infiniband
–  HCA drivers
–  verbs implementation
–  subnet management
–  diagnostic tools

•  Versions qualified together

Page 22: To Infiniband and Beyond

Openfabrics Software Components


Page 23: To Infiniband and Beyond

"High Performance" Ethernet

•  1 GbE cheap and ubiquitous
–  hardware acceleration
–  multiple multiport NICs
–  supported in kernel

•  10 GbE still used primarily as uplinks from edge switches and as backbone
•  Some vendors providing 10 GbE to server
–  low cost NIC on motherboard
–  HCAs with performance proportional to cost

Page 24: To Infiniband and Beyond

RDMA over Ethernet

•  NIC capable of RDMA is called an RNIC
•  RDMA is primary method of reducing latency on host side
•  Multiple vendors have RNICs
–  Mainstream: Broadcom, Intel, etc.
–  Boutique: Chelsio, Mellanox, etc.

•  New Ethernet standards
–  "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc.

Page 25: To Infiniband and Beyond

What is iWarp?

•  RDMA Consortium (RDMAC) standardized some protocols which are now part of the IETF Remote Direct Data Placement (RDDP) working group
•  http://www.rdmaconsortium.org/home
•  Also defined SRP, iSER in addition to verbs
•  iWARP supported in OFED
•  Most specification work complete in ~2003

Page 26: To Infiniband and Beyond

RDMA over Ethernet?


The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet) is a working name.

You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names.

Tom Talpey, Microsoft Corporation
Paul Grun, System Fabric Works
August 2009

Page 27: To Infiniband and Beyond

The Future: InfiniFibreNet

•  Vendors moving towards "converged fabrics"
•  Using same "fabric" for both networking and storage
•  Storage protocols and IB over Ethernet
•  Storage protocols over Infiniband
–  NFS over RDMA, Lustre

•  Gateway switches and converged adapters
–  Various combinations of Ethernet, IB and FC

Page 28: To Infiniband and Beyond

Any Questions?


THANK YOU!

(And no mention of The Cloud)