Open Fabrics BOF Supercomputing 2009

www.openfabrics.org
Tziporet Koren, Gilad Shainer, Yiftah Shahar, Stan Smith, Hal Rosenstock, Jeff Squyres, DK Panda, Bob Woodruff, Betsy Zeller
Rev. 1.0

Upload: jemma

Post on 15-Jan-2016



TRANSCRIPT

Page 1: Open Fabrics BOF Supercomputing 2009

www.openfabrics.org

Open Fabrics BOF Supercomputing 2009

Tziporet Koren, Gilad Shainer, Yiftah Shahar, Stan Smith, Hal Rosenstock, Jeff Squyres, DK Panda, Bob Woodruff, Betsy Zeller

Rev. 1.0

Page 2: Open Fabrics BOF Supercomputing 2009

Agenda

Open Fabrics Linux Update (15 minutes)
• OFED 1.4.1, 1.4.2, and OFED 1.5 releases
• OFED 1.6 plans and roadmap

Open Fabrics Windows Update (15 minutes)
• WinOF 2.1 release
• WinOF 2.2 plans and roadmap

Open Discussion (60 minutes)
• OFED scalability: should we have a scalability roadmap?
• Should we be including MPIs in the OFED releases?
• OFA solutions for Ethernet clusters for HPC
• Questions and feedback from the community

Page 3: Open Fabrics BOF Supercomputing 2009

Linux: OFED Components

HCA/NIC Drivers
• IB: IBM, Mellanox, QLogic
• iWARP: Chelsio, Intel

Core: Verbs, mad, SMA, CMA, SA cache
IPoIB
SDP
SRP and SRP Target
iSER
RDS
Qlogic_VNIC
uDAPL
OSM
Diagnostic tools
iSER Target
NFS-RDMA

Bonding module
Open iSCSI
MPI Components
• MVAPICH
• Open MPI
• MVAPICH2
• Benchmark tests

Proprietary MPIs: Intel, HP, Platform MPI
Proprietary SMs: Sun, Voltaire, QLogic, Mellanox

Legend: OFA Development, Add on, Tested with

Page 4: Open Fabrics BOF Supercomputing 2009

Update from Sonoma ’09 Session

Progress: Provide user space components in tarballs, as requested by the distros

Page 5: Open Fabrics BOF Supercomputing 2009

OFED 1.4.1 – Released May 2009

New features
• Added support for RHEL 5.3 and SLES 11
• NFS/RDMA: beta quality, with support for RHEL 5.2, 5.3 and SLES 10 SP2
• Updated MPI packages: MVAPICH 1.1.0-3355, Open MPI 1.3.2
• Updated bonding package: ib-bonding-0.9.0-40
• Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
• Updated OpenSM version to include critical bug fixes
• Fixed RDS iWARP support
• Low-level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
• Added a module parameter to control the number of MTTs per segment in Mellanox HCAs (mlx4 & mthca)
• mstflint update
• Enhanced OpenSM and management tools: user interface, HA, routing enhancements, and more (details in the backup slides)

Page 6: Open Fabrics BOF Supercomputing 2009

OFED 1.4.2 – Released August 2009

New features
• Critical bug fixes only
• Fixes to NES (Intel iWARP) driver
• Fixes to support running with Lustre installed
• NFS/RDMA critical bug fixes

Minimal QA
• Thus, recommended only for people hitting these critical bugs

Page 7: Open Fabrics BOF Supercomputing 2009

OFED 1.5 – Release December 2009

New features
• Added support for Red Hat EL 5.4 and EL 4.8, and SLES 10 SP3
• Added support for kernel.org 2.6.29 and 2.6.30
• uDAPL scalability enhancements, new UCM provider
• Hardware driver for new QLogic QDR HCA
• All user space packages released as tarballs for easier distro integration
• MVAPICH2 1.4
• Open MPI 1.3.3
• Several new enhancements to OpenSM and management tools for improved scalability, performance, QoS, routing, etc. (see backup slides for details)
• Bug fixes
• SDP Zero Copy, and other performance improvements

OFED 1.5-RDMAoE Branch
• Experimental branch of OFED-1.5 that also includes support for Mellanox RDMAoE
• For those who want to try out this new technology
• Open Fabrics board has voted to include this code in OFED/WinOF
  • Which release should it go into? OFED-1.5? Or wait until the code is accepted upstream and there is a standard spec?

Page 8: Open Fabrics BOF Supercomputing 2009

OFED 1.5 OS Matrix

List of Supported Kernels for OFED 1.5
• RHEL4: up6, up7, up8
• RHEL5: up2, up3, up4
• SLES10: SP2, SP3
• SLES 11
• Fedora Core 11*
• OpenSuSE 11*
• Kernel.org: 2.6.18-2.6.30

* Minimal QA for these versions.
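As an illustration only, the kernel.org range in the matrix above can be written as a small version check. The names `parse_kernel_version` and `is_supported_kernel` are assumptions made for this sketch; they are not part of OFED or its installer.

```python
import re

# Kernel.org range quoted on the slide for OFED 1.5: 2.6.18 through 2.6.30.
OFED_1_5_KERNEL_RANGE = ((2, 6, 18), (2, 6, 30))


def parse_kernel_version(release):
    """Return the leading 'X.Y.Z' of a kernel release string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_supported_kernel(release, supported=OFED_1_5_KERNEL_RANGE):
    """True if the release falls inside the inclusive supported range."""
    low, high = supported
    return low <= parse_kernel_version(release) <= high
```

On a running system the release string would typically come from `platform.release()`. Distribution kernels carry backports, so a check like this is only meaningful for the kernel.org entry, not for the RHEL or SLES rows.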

Page 9: Open Fabrics BOF Supercomputing 2009

OFED 1.6 Plans

Preliminary Schedule
• Release in Nov 2010
• Detailed schedule will be derived from the above

Preliminary Feature List
• Kernel.org: 2.6.33 and 2.6.34
• SR-IOV support
• Mellanox Vnic for BridgeX
• MMU notification for MPI (if accepted by the kernel)
• New HW from vendors (if any)
• RDMAoE (if not already in an earlier release)

Page 10: Open Fabrics BOF Supercomputing 2009

OFED 1.6 OS Matrix

• kernel.org: kernel 2.6.33 and 2.6.34
• RHEL4: up6, up7, up8 (may be dropped entirely if RHEL 6 is out; let's talk in the meeting)
• RHEL5: up2, up3, up4, up5
• RHEL6
• SLES10: SP2, SP3, SP4
• SLES 11: SP1
• Fedora Core: latest
• OpenSuSE: latest

• New for OFED 1.6 in bold
• Drop support for items in blue

Page 12: Open Fabrics BOF Supercomputing 2009

Windows OpenFabrics (WinOF)

WinOF 2.1, released 9/30/2009
• Winverbs fully integrated into IB core feature set.
• OFED compatibility layers
  • libibverbs, libmad, libumad, librdmacm
• OFED diagnostics on OFED compat layers
  • ibaddr, ibnetdiscover, ibroute, ibstat, saquery, sminfo…
• Installer fully integrated with DriverStore + PNP.
• OFED uDAT/uDAPL code base on Windows.
• Server 2008 HPC integration
• Numerous bug fixes.

Page 13: Open Fabrics BOF Supercomputing 2009

WinOF Roadmap

WinOF 2.2
• Release target Q1’2010, freeze in Q4’09
• Features:
  • Windows 7 & Server 2008 R2 fully supported.
  • NDIS 6.0 IPoIB driver based on WHQL’ed source.
  • OpenSM 3.3.3 (WinOF 2.1 @ 3.0.0 ~OFED 1.2+).
  • SRP multi-path fixes.

WinOF 2.3
• Release target Q4’2010, freeze early Q4’2010
• Connected Mode IPoIB.

Page 15: Open Fabrics BOF Supercomputing 2009

OFA Scalability

• Challenges and Goals
• Infrastructure Scalability
• ULPs/Apps Scalability
• Possible Improvements

Page 16: Open Fabrics BOF Supercomputing 2009

Challenges and Goals

• Scale out to 10K-20K or more nodes
• Performance
• Reliability
• Sometimes hard to differentiate feature from scalability
• Focus additional attention/resources on issues
• Get ready for more detailed discussion at Sonoma

Page 17: Open Fabrics BOF Supercomputing 2009

Infrastructure Scalability/Features

Improved multicore affinity/awareness/support
• Binding to specific hw threads in a core
  • e.g. http://arstechnica.com/hardware/news/2009/09/ibms-8-core-power7-twice-the-muscle-half-the-transistors.ars
• Interrupt distribution
• Binding HCAs/RNICs to NUMA nodes

Multicast
• Reliable multicast
  • New IBTA optional feature
• Better UD multicast performance
  • Small message mcast latency, with just two members in the mcast group, is 2x to 3x that of unicast latency between the same pair

Flow control for SRQ
• New IBTA optional feature

CM extension
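For the NUMA binding item, a minimal sketch of the first step (finding out which NUMA node each adapter sits on) might look like the following. It assumes the usual Linux sysfs layout, in which `/sys/class/infiniband/<device>/device/numa_node` holds the node number; the function name `hca_numa_nodes` is hypothetical and not an OFED interface.

```python
import os


def hca_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to its NUMA node, or None if unknown.

    Assumes <sysfs_root>/<device>/device/numa_node holds the node number
    (-1 when the platform reports no affinity).
    """
    nodes = {}
    if not os.path.isdir(sysfs_root):
        return nodes  # no RDMA devices visible, or not a Linux host
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name, "device", "numa_node")
        try:
            with open(path) as handle:
                value = int(handle.read().strip())
        except (OSError, ValueError):
            value = -1  # treat unreadable or malformed entries as unknown
        nodes[name] = value if value >= 0 else None
    return nodes
```

A launcher could then pin a process to CPUs on the same node as its adapter, for example with `os.sched_setaffinity`; which CPUs belong to which node is a separate lookup that this sketch leaves out.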

Page 18: Open Fabrics BOF Supercomputing 2009

Infrastructure Scalability/Features

Fault tolerance
• Application-transparent fault detection, isolation, recovery
• Multiple HCAs/NICs with transparent failover

IB monitoring
• Performance counters, throughput, hotspots, degraded links
• This is IB's Achilles' heel...
  • Need much better monitoring tools to discover congestion and bottlenecks

Adaptive routing
• HCA out-of-order delivery
• Switch logic for state info & adaptive algorithm, etc.

Page 19: Open Fabrics BOF Supercomputing 2009

Infrastructure Scalability

SA aspects
• Primarily PathRecord
  • OpenSM
  • SA client

RDMA CM
• Resolve address
  • ARP query scalability
• Resolve route
  • SA PathRecord query scalability

Page 20: Open Fabrics BOF Supercomputing 2009

Infrastructure Scalability

CM
• Higher abstraction model
  • Current APIs are cumbersome & difficult to use

OpenSM
• Stateful failover
  • Replication
  • Eliminate client re-registration

Congestion manager

Page 21: Open Fabrics BOF Supercomputing 2009

Possible Infrastructure Improvements

• Adaptive MAD retransmission
• Better duplicate transaction handling by SA (and MAD?)
• SA scalability in terms of PathRecord responses
  • More parallelization
    • Shadow DB?
    • SA distribution beyond node
• Tunable retry mechanism for various components
• RDMA CM API addition and ACM (Assistant to the IB CM)
  • Does this address the higher abstraction model requested?
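One simple reading of "adaptive MAD retransmission" and a "tunable retry mechanism" is a retry schedule whose timeout grows with each unanswered attempt, with random jitter so that many clients do not retry in lockstep. The sketch below is only an assumption about what such a policy could look like; the function name, parameters, and default values are hypothetical and do not come from the OFED MAD layer.

```python
import random


def retry_timeouts(base_ms=100, factor=2.0, max_ms=4000, retries=5,
                   jitter=0.25, rng=None):
    """Return a list of per-attempt timeouts in milliseconds.

    Each timeout is base_ms * factor**attempt, capped at max_ms, then
    scaled by a random factor in [1 - jitter, 1 + jitter].
    """
    rng = rng or random.Random()
    timeouts = []
    for attempt in range(retries):
        nominal = min(base_ms * factor ** attempt, max_ms)
        scale = 1.0 + rng.uniform(-jitter, jitter)
        timeouts.append(nominal * scale)
    return timeouts
```

All five parameters are the tunables here. A policy that adapts to measured response times rather than to the attempt count would be another valid interpretation of the slide.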

Page 22: Open Fabrics BOF Supercomputing 2009

Possible ULPs/Apps Improvements

MPI
• Don’t query PathRecords per core
• Hardware collective support
  • Common API
• Reliable multicast
• ummunotify

BoIB (Boot over IB)
• SM improvements for handling non-responsive SMAs as node transitions from boot ROM to kernel
• InfiniBand as boot interface without Ethernet suspenders

Bonding
• Load balancing
  • Not just active/standby (failover)

DHCP
• Use raw (mmap) rather than BSD socket interface due to inadequate performance

Others?
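The "don't query PathRecords per core" item can be pictured as a cache keyed by node pair: the first caller that needs a path to a given destination performs the SA query, and later callers reuse the result. The sketch below is a hypothetical single-process illustration; `query_sa` stands in for whatever real SA query an MPI library would issue, and sharing the cache between processes on a node (for example through shared memory) is left out.

```python
import threading


class PathRecordCache:
    """Cache path lookups per (source LID, destination LID) pair."""

    def __init__(self, query_sa):
        self._query_sa = query_sa  # hypothetical callable: (slid, dlid) -> record
        self._records = {}
        self._lock = threading.Lock()
        self.sa_queries = 0        # number of lookups that reached the SA

    def lookup(self, slid, dlid):
        """Return the cached record, querying the SA only on a miss."""
        key = (slid, dlid)
        with self._lock:
            if key not in self._records:
                self._records[key] = self._query_sa(slid, dlid)
                self.sa_queries += 1
            return self._records[key]
```

With 16 ranks on a node all talking to the same peer, this issues one SA query instead of 16. Holding the lock during the query keeps the sketch short but serializes misses; a real implementation would likely avoid that.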

Page 24: Open Fabrics BOF Supercomputing 2009

MPI Distribution in OFED: Rationale

Open source MPIs were initially included to “bootstrap” the OFED project

MPI was the main user for OFED, so this seemed like a natural pairing

Made it (significantly) easier for customers to get their MPI jobs running on InfiniBand

Also necessary for political buy-in: unify under one, standard verbs API (vs. different MVAPI stacks)

QA testing of MPI + OFED is still extremely valuable

This is not a discussion of removing MPI + OFED QA

Page 25: Open Fabrics BOF Supercomputing 2009

MPI Distribution in OFED: Pros

MPI is still the most common OFED “customer”

HPC customers get network stack + MPI in one package

Helps rapid MPI deployment on new clusters (out-of-box)

The MPI-selector function allows selecting the MPI stack of choice during installation

Customers get QA assurance of specific MPI + OFED version tuples

Helps to test multiple functionalities of the OFED stack and IB/iWARP fabric with a comprehensive suite of MPI-level benchmarks

Page 26: Open Fabrics BOF Supercomputing 2009

MPI Distribution in OFED: Cons

MPIs have their own QA cycles

MPI+OFED QA testing is more for OFED, not MPI

Bundling induces project scheduling difficulties between OFED and various MPI packages

Red Hat and SuSE both say “Don’t do this!”
• They both already include the open source MPIs
• Makes it more difficult for them to take OFED drops

Many users will download the latest-n-greatest MPIs anyway, not the ones included in OFED

Page 28: Open Fabrics BOF Supercomputing 2009

OFA Solutions for Ethernet clusters for HPC

Would like to get some community feedback on success stories for building HPC clusters using Ethernet.

What works well?

Things that need improvement to make it easier?

Other?

Page 30: Open Fabrics BOF Supercomputing 2009

Backup Slides

Page 31: Open Fabrics BOF Supercomputing 2009

Open SM – OFED 1.4.1 and 1.4.2

Versions in OFED 1.4.2:

libibumad-1.3.1, libibmad-1.3.1, opensm-3.3.1, infiniband-diags-1.5.1

User Interface:

Unified configuration file

Configuration reloading on the fly

Improved Plugin interface – multiple plugins are supported

Multicast: IPv6 Solicited Node consolidation

Better diagnostic tools (new - ibsendtrap)

HA:

• OpenSM will query Standby SMs periodically

• Standby OpenSM notifies Master SM about priority change (Trap 144)

Page 32: Open Fabrics BOF Supercomputing 2009

Open SM - OFED 1.4.2 Routing

Cached routing (-R ftree,updn,minhop)

LMC improvements:

Preserve base LIDs routes

Ensure LMC paths balancing over different switches/chassis

Ordered paths balancing

Ports are sorted by switch loads

Port order file option (--guid_routing_order_file option)

Better LASH support: Mesh geometry analysis, Paths balancing over multiple links

General

Port IDs for Up/Down

Min hop weights

Connecting root nodes with Up/Down

Connecting IO nodes with FatTree

Page 33: Open Fabrics BOF Supercomputing 2009

Open SM - new features in OFED 1.5

Scalability & performance

Optimized SL2VL setup

Parallel LFTs setup

Parallel MFTs setup

Routing & multicast

FTree improvements

Routing engine reloading

Mesh switch reordering optimizations

MGID to MLID compression

QoS improvements

SL2VL setup optimization

QoS/LASH co-exist

Major bug fix

MCG join/leave fixes

Clean delayed MCG deletion