
Page 1: Mellanox End to End Solution And InfiniBand Fabric ...topic.it168.com/factory/ssc2012/Mellanox.pdfJun 30, 2012  · ConnectX® -2 InfiniBand BridgeX ® VPI ConnectX®-2 InfiniBand

Mellanox End to End Solution And InfiniBand Fabric Application introduction

Liu Bo, Mellanox System Engineer

September 19, 2012

© 2011 MELLANOX TECHNOLOGIES - MELLANOX CONFIDENTIAL - 2

Company Overview (Ticker: MLNX)

Leading provider of high-throughput, low-latency server and storage interconnect
• FDR 56Gb/s InfiniBand and 10/40GbE
• Reduces application wait-time for data
• Dramatically increases ROI on data center infrastructure

Company headquarters:
• Yokneam, Israel; Sunnyvale, California
• ~950 employees; worldwide

Solid financial position
• Record revenue in FY’11; $259.3M
• Q2’12 revenue = $133.5M; up 110.7% Y-o-Y
• Q3’12 guidance ~$150.0M to $155.0M
• Cash + investments @ 6/30/12 = $327.8M

Fortune 100 Penetration
• 5 of top 10 Global Banks
• 10 of top 10 Automotive Manufacturers
• 5 of top 10 Pharmaceutical Companies
• 9 of top 10 Oil and Gas Companies

Positioned to Capture the “Big Data” Opportunity

Proliferation of Data will be a Catalyst for Growth

Markets: High-Performance Computing, Web 2.0, Cloud Computing, Enterprise Data Center

[Chart: The Digital Universe (1): 1.8 zettabytes in 2011E growing to 7.9 zettabytes in 2015E; 2011 – 2015 CAGR = 45%]

(1) Source: IDC Digital Universe study, June 2011
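The growth figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the 45% CAGR compounds over the four years from 2011 to 2015; the function name is illustrative and not part of the original deck.

```python
def project_growth(start: float, cagr: float, years: int) -> float:
    """Compound `start` at the annual growth rate `cagr` for `years` years."""
    return start * (1.0 + cagr) ** years


# 1.8 zettabytes in 2011, 45% CAGR, four compounding years (2011 -> 2015).
projected_2015 = project_growth(1.8, 0.45, 4)
print(f"Projected 2015 size: {projected_2015:.2f} ZB")
```

The result lands close to the 7.9 zettabytes quoted for 2015, so the chart's endpoints and growth rate are mutually consistent.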

Markets that Require Fast Interconnect

Markets: Web 2.0, DB/Enterprise, HPC, Financial Services, Cloud, Storage

• Up to 10X Performance and Simulation Runtime
• 33% Higher GPU Performance
• Unlimited Scalability
• Lowest Latency
• 62% Better Execution Time
• 42% Faster Messages Per Second
• 12X More Throughput
• Support More Users at Higher Bandwidth
• Improve and Guarantee SLAs
• 10X Database Query Performance
• 4X Faster VM Migration
• More VMs per Server and More Bandwidth per VM
• 2X Hadoop Performance
• 13X Memcached Performance
• 4X Price/Perf

Mellanox storage acceleration software provides >80% more IOPS (I/O operations per second)

Mellanox Interconnect Products Enable Customer Choice

Adapters, Switches, Cables, Software

10Gb/s, 40Gb/s, 56Gb/s Ethernet and InfiniBand

Application Acceleration
• Big Data
• Storage
• TCP/UDP
• Database
• Trading
• HPC

Virtual Protocol Interconnect (VPI) Provides High Performance Over any Converged Interconnect with Same Software Infrastructure

Host/Fabric Software

Leading Supplier of End-to-End Connectivity Solutions for Servers and Storage

Virtual Protocol Interconnect

Storage Front / Back-End Server / Compute Switch / Gateway

56G IB & FCoIB 56G InfiniBand

10/40GigE & FCoE 10/40GigE

Industry’s Only End-to-End InfiniBand and Ethernet Portfolio

ICs Switches/Gateways Adapter Cards Cables

Fibre Channel

Virtual Protocol Interconnect

Mellanox Multi-Protocol/VPI Connectivity Technology Efficient, Flexible and Scalable for Maximum ROI

Applications Transparency enables Data Center Agility

[Diagram: application domains running over the Mellanox VPI Connectivity Solution: Financial; Trading, Analytics; Cloud Computing; Cloud & Web 2.0; Clustered Database; Web Services; HPC Applications; Database Apps; CRM, ERP Apps; Business Logic; .NET, JAVA; Web 2.0 Storage]

Service-Oriented Cloud Infrastructure

Running Any Protocol over Any Convergence Fabric

Applications: App1, App2, App3, App4, … AppX

Protocols
• Storage: NFS, CIFS, iSCSI, NFS-RDMA, SRP, iSER, Fibre Channel, Clustered
• Networking: TCP/IP/UDP, Sockets
• Clustering: MPI, DAPL, RDS, Sockets
• Management: SNMP, SMI-S, OpenView, Tivoli, BMC, Computer Associates

Acceleration Engines: Networking, Virtualization, Clustering, Storage, RDMA

Converged Fabric: 40Gig Ethernet, 56Gig InfiniBand

[Diagram: the converged fabric connects to Ethernet (LAN), FC (SAN), NAS storage, and InfiniBand (IB) storage]

Mellanox Advanced InfiniBand Solutions

Solution areas: Server and Storage High-Speed Connectivity; Networking Efficiency/Scalability; Application Accelerations; Host/Fabric Software Management

- Collectives Accelerations (FCA/CORE-Direct)
- GPU Accelerations (GPUDirect)
- MPI/SHMEM
- RDMA
- Quality of Service
- Adaptive Routing
- Congestion Management
- Traffic aware Routing (TARA)
- UFM, FabricIT
- Integration with job schedulers
- Inbox Drivers
- Latency
- Bandwidth
- CPU Utilization
- Message rate

Mellanox Product Line

Delivering Unified I/O - 10/20/40G InfiniBand to 10GigE and 1/2/4/8G FC

Systems: InfiniBand Switches
• 36-port 40Gb/s Switch Systems: IS5025, IS5030, IS5031, IS5035
• 20 and 40Gb/s Modular Switches: IS5100 (108 ports), IS5200 (216 ports), IS5300 (324 ports), IS5600 (648 ports)
• IS5022: 8-port Non-blocking Remotely-managed 40Gb/s Switch System

Systems: Gateway
• BridgeX® BX5020

Adapters: Adapter Cards
• ConnectX®-2 InfiniBand Dual Port QSFP w/ PCIe 2.0
• ConnectX®-2 InfiniBand Single Port QSFP w/ PCIe 2.0
• ConnectX®-2 VPI QSFP IB and SFP+ 10GigE
• ConnectX®-2 Ethernet Dual Port SFP+ w/ PCIe 2.0

Silicon
• Switch Silicon: InfiniScale IV
• Gateway Silicon: BridgeX® VPI
• Adapter Silicon: ConnectX®-2 InfiniBand

Software
• Fabric Management: FabricIT

Cables

Comprehensive End-to-End Ethernet Product Portfolio

Switches
• SX1036 - 36p
• SX1016 - 64p
• SX1024 - 48x10G + 12x40G
• Vantage 6024 - 24p
• SX6512 - 216p
• SX6518 - 324p
• SX6536 - 648p

NICs

Cables

Management software

InfiniBand QDR Edge Switch Comparison

Model | IS5022 | IS5023 | IS5024 | IS5025 | IS5030 | IS5035 | 4036
Ports | 8 | 18 | 36 | 36 | 36 | 36 | 36
Switch Capacity | 640Gb/s | 1.44Tb/s | 2.88Tb/s | 2.88Tb/s | 2.88Tb/s | 2.88Tb/s | 2.88Tb/s
Management: level | - | - | - | - | Chassis management | Full management | Full management
Management: embedded SM | - | - | - | - | SM 108 nodes | SM 648 nodes | SM 648 nodes
Management: FW update / software | In-band FW update | In-band FW update | In-band FW update | In-band FW update | FabricIT™, UFM™ | FabricIT™, UFM™ | UFM™
CPU | - | - | - | - | PPC405 | PPC460 | PPC460
Management Eth Ports | - | - | - | - | 1 | 2 | 1
Management USB port | - | - | - | - | Yes | Yes | Yes
LEDs | Status, UnitID, Fan, PortErr | Status, UnitID, Fan, PortErr | Status, UnitID, Fan, PortErr | Status, Fan, PS1, PS2 | Status, Fan, PS1, PS2 | Status, Fan, PS1, PS2 | Info, Fan, PS, SM
Design | No FRUs | No FRUs | No FRUs | Fan/PS FRU | Fan/PS FRU | Fan/PS FRU | Fan/PS FRU
AC power inlet location | Connector side panel | Connector side panel | Connector side panel | P/S side panel | P/S side panel | P/S side panel | P/S side panel
# of Power Supplies | 1 | 1 | 1 | 1 (2nd optional) | 1 (2nd optional) | 1 (2nd optional) | 2

InfiniBand QDR Director Switch Comparison

Model | IS5100 | IS5200 | IS5300 | IS5600 | 4200 | 4700
Ports | 108 | 216 | 324 | 648 | 144/162 | 324/648
Height (Shelf will add 1U-2U) | 6U | 9U | 16U | 29U | 11U | 19U
Switch Capacity | 8.64Tb/s | 17.28Tb/s | 25.9Tb/s | 51.8Tb/s | 11.52Tb/s | 25.92Tb/s
Spine modules | 3 | 6 | 9 | 18 | 4 | 9
Leaf modules (max) | 6 | 12 | 18 | 36 | 8 | 18
Management: software | FabricIT™, UFM™ (May’11) | FabricIT™, UFM™ (May’11) | FabricIT™, UFM™ (May’11) | FabricIT™, UFM™ (May’11) | UFM™ | UFM™
Management: embedded SM | SM 648 nodes | SM 648 nodes | SM 648 nodes | SM 648 nodes | SM 648 nodes | SM 648 nodes
Power Supplies (Hot swappable, redundant) | 2 + 1 | 3 + 1 | 4 + 2 | 8 + 2 | Up to 4 (N + N) | Up to 6 (N + N)
Fans (Hot swappable, redundant) | 4 Type4 chassis, 1 Type3 per spine | 4 Type4 chassis, 1 Type3 per spine | 2 Type1 + 2 Type2, 1 Type3 per spine | 4 Type1 + 4 Type2, 1 Type3 per spine | One Horizontal, One Vertical | One Horizontal, One Vertical

FDR Switch Comparison

Model | SX6025 | SX6036 | SX6506 | SX6512 | SX6518 | SX6536
Ports | 36 | 36 | 108 | 216 | 324 | 648
Switch Capacity | 4.032Tb/s | 4.032Tb/s | | | |
Height (Shelf will add 1U-2U) | 1U | 1U | 6U | 9U | 16U | 29U
Spine modules | - | - | 3 | 6 | 9 | 18
Leaf modules (max) | - | - | 6 | 12 | 18 | 36
Management: embedded SM | - | SM 648 nodes | SM 648 nodes | SM 648 nodes | SM 648 nodes | SM 648 nodes
Management: FW update / software | In-band FW update | MX-OS™, UFM™ | MX-OS™, UFM™ | MX-OS™, UFM™ | MX-OS™, UFM™ | MX-OS™, UFM™
CPU | - | PPC460 | PPC460 | PPC460 | PPC460 | PPC460
LEDs | Status, Fan, PS1, PS2, PortERR, UnitID | Status, Fan, PS1, PS2, PortERR, UnitID | IS5X00 like + PortERR, UnitID | IS5X00 like + PortERR, UnitID | IS5X00 like + PortERR, UnitID | IS5X00 like + PortERR, UnitID
# of Power Supplies | 1 (2nd optional) | 1 (2nd optional) | 2 + 1 | 3 + 1 | 4 + 2 | 8 + 2
New features | VPI, FEC, power governor, IB router | VPI, FEC, power governor, IB router | VPI, FEC, power governor, IB router | VPI, FEC, power governor, IB router | VPI, FEC, power governor, IB router | VPI, FEC, power governor, IB router

Mellanox Grid Director 4036E Physical Specs

Based on the Mellanox Grid Director 4036 + IB-ETH bridging silicon
34 x QDR/DDR/SDR (auto-negotiating) InfiniBand ports (QSFP)
2 x 1/10GbE ports (SFP+)
Redundant management module
Shared FRUs with 4036:
• Rail kits, PS, fan units
Measurements:
• 19"/1U high / 21" deep

[Picture: 1U-high chassis with Mellanox IB-ETH bridge silicon, 34 x 40Gb/s IB QSFP ports and 2 x 1/10GbE SFP+ ports]

IB Gateway System: BX5020

Dual hot-swappable redundant power supplies
Replaceable fan drawer
Embedded management
• PowerPC CPU, GigE and RS232 out-of-band management ports
• Uplinks: 4 QSFP (IB)
• Downlinks: 16 SFP+
• 12 1/10 GigE combination EN ports
• Requires CX/CX2 HCAs

InfiniBand Foundations

InfiniBand Trade Association (IBTA)

Founded in 1999

Actively markets and promotes InfiniBand from an industry perspective through public relations engagements, developer conferences and workshops

Steering Committee Members:

InfiniBand is a Switch Fabric Architecture

► Interconnect technology connecting CPUs and I/O
► Super high performance
• High bandwidth (starting at 10Gbps and up to 60Gbps) – lots of headroom!
• Low latency – fast application response across the cluster
• Low CPU utilization with RDMA (Remote Direct Memory Access) – unlike Ethernet, communication bypasses the OS and the CPUs
► Increased application performance
► Single port solution for all LAN, SAN, and application communication
► High reliability: Subnet Manager with redundancy
► InfiniBand is a technology that was designed for large scale grids and clusters

First industry standard high speed interconnect!

InfiniBand Roadmap

SDR - Single Data Rate
DDR - Double Data Rate
QDR - Quad Data Rate
FDR - Fourteen Data Rate
EDR - Enhanced Data Rate
HDR - High Data Rate
NDR - Next Data Rate

InfiniBand Resources

InfiniBand software is developed under the OpenFabrics open source alliance: http://www.openfabrics.org/index.html

The InfiniBand standard is developed by the InfiniBand Trade Association: http://www.infinibandta.org/home

What is InfiniBand?

Industry standard defined by the InfiniBand Trade Association
• Originated in 1999

The InfiniBand™ specification defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems

InfiniBand is a pervasive, low-latency, high-bandwidth interconnect which requires low processing overhead and is ideal to carry multiple traffic types (clustering, communications, storage, management) over a single connection.

As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications that scale from small scale to large scale

Source: InfiniBand® Trade Association (IBTA) www.infinibandta.org

InfiniBand Components Overview

Host Channel Adapter (HCA)
• Device that terminates an IB link, executes transport-level functions, and supports the verbs interface

Switch
• A device that routes packets from one link to another within the same IB subnet

Router
• A device that transports packets between different IBA subnets

Bridge
• InfiniBand to Ethernet

[Diagram: an InfiniBand subnet in which switches connect processor nodes (each with an HCA), a storage subsystem (RAID, consoles), gateways to Ethernet and Fibre Channel, and the Subnet Manager]

Host Channel Adapters (HCA)

Equivalent to a NIC (Ethernet)
- GUID (Global Unique ID), analogous to a MAC address

Converts PCI to InfiniBand
CPU offload of transport operations
End-to-end QoS and congestion control
Communicates via Queue Pairs (QPs)
HCA options (4x link):
• Single Data Rate: 2.5Gb/s * 4 = 10Gb/s
• Double Data Rate: 5Gb/s * 4 = 20Gb/s
• Quad Data Rate: 10Gb/s * 4 = 40Gb/s
• Fourteen Data Rate: 14Gb/s * 4 = 56Gb/s

HCA Physical Address: Global Unique Identifier (GUID)

GUID - 64 bit

Host Channel Adapters (HCAs) and all switches require GUID and LID addresses

3 types of GUIDs per ASIC
- Node = identifies the HCA as an entity
- Port = identifies the port as a port
- System = allows multiple GUIDs to be combined into one entity

Global Unique Identifier: “like an Ethernet MAC address”
- Assigned by the IB vendor
- Persistent through reboots

IB Fabric L2 Switching Addressing Local Identifier (LID)

Local Identifier: “like a dynamic IP address”

LID - 16 bit

Host Channel Adapters (HCAs) and switches all require GUID and LID addresses

• Assigned by the SM when the port becomes active
• Not persistent through reboots
• Address ranges
  0x0000 = Reserved
  0x0001 - 0xBFFF = Unicast
  0xC000 - 0xFFFE = Multicast
  0xFFFF = Reserved for special use
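The address ranges above amount to a simple range check on the 16-bit LID. The following is a minimal sketch; the function name is hypothetical, and it assumes the multicast range begins at 0xC000, which is my understanding of the InfiniBand specification and matches the 48K unicast / 16K multicast split quoted later in the deck.

```python
def classify_lid(lid: int) -> str:
    """Return the address class of a 16-bit InfiniBand LID."""
    if not 0x0000 <= lid <= 0xFFFF:
        raise ValueError(f"LID {lid:#x} does not fit in 16 bits")
    if lid == 0x0000:
        return "reserved"
    if lid <= 0xBFFF:
        return "unicast"
    if lid <= 0xFFFE:
        # Assumption: multicast starts at 0xC000.
        return "multicast"
    return "reserved (special use)"


for sample in (0x0000, 0x0025, 0xBFFF, 0xC000, 0xFFFF):
    print(f"{sample:#06x}: {classify_lid(sample)}")
```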

Node & Switch Main identifiers

IB Port Basic Identifiers
• Port number
• Host Channel Adapter – HCA (IB “NIC”)
• Globally unique identifier – GUID, 64 bit (like a MAC address), e.g. 00:02:C9:02:00:41:38:30
  - Each 36-port “basic” switch has its own switch and system GUID
  - All ports belonging to the same “basic” switch share the switch GUID
• Local Identifier – LID
• Virtual Lane – VL: used to separate different bandwidth and QoS classes on the same physical port

LID
• Local identifier assigned to any IB device by the SM and used for packet “routing” within an IB fabric
• All ports of the same ASIC unit use the same LID

[Diagram: a host HCA (CA) connected over an InfiniBand link to a switch, with the master SM attached to the fabric. The GUIDs shown are 00:02:C9:02:00:41:27:12 (switch GUID) and 00:02:C9:02:00:41:38:35; the LIDs shown are 37, 14, 12, and 8. On each port, a mux/de-mux separates traffic packets on VL 0-7 from link control on VL-15.]
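Since a GUID is a 64-bit value usually written as eight colon-separated hex bytes, converting between the two forms is straightforward. A small sketch follows; the helper names are hypothetical, and the sample value is the GUID quoted on this slide.

```python
def parse_guid(text: str) -> int:
    """Convert 'xx:xx:xx:xx:xx:xx:xx:xx' into a 64-bit integer."""
    parts = text.split(":")
    if len(parts) != 8 or any(len(p) != 2 for p in parts):
        raise ValueError(f"expected 8 colon-separated hex bytes, got {text!r}")
    return int("".join(parts), 16)


def format_guid(value: int) -> str:
    """Convert a 64-bit integer back into colon-separated hex bytes."""
    return ":".join(f"{byte:02X}" for byte in value.to_bytes(8, "big"))


guid = parse_guid("00:02:C9:02:00:41:38:30")
print(hex(guid))
print(format_guid(guid))
```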

Partitioning - Pkey to VLAN mapping

Define up to 64 partitions in a single Partition by mapping port and Ethernet VLAN to InfiniBand PKEY

Packet Flow QOS Management

Criteria Types
• Groups of source ports
• Groups of destination ports
• Partitions
• QoS classes
• Application Service IDs

[Diagram: packets from fabric nodes and users are categorized by these criteria into service levels (SL 0-3, SL 4, SL 6, SL 7, SL 8, SL 10, SL 12) for traffic such as Default, GPFS, IPoIB, SDP, RDS, Native Multicast, and Clock Sync. The service levels map onto virtual lanes (VL-0 to VL-5) over the physical link, with weights of 32 or 64 in the low-priority and high-priority VL arbitration tables. Example user classes shown: Health, Bonds, Private, Stocks, Government.]

Introducing

Subnet Manager (SM)

Subnet Manager (SM) Rules & Roles

Every subnet must have at least one
- Manages all elements in the IB fabric
- Discovers subnet topology
- Assigns LIDs to devices
- Calculates and programs switch chip forwarding tables (LFT pathing)
- Monitors changes in the subnet

Implemented anywhere in the fabric
- Node, switch, specialized device

No more than one active SM allowed
- 1 active (master); the remaining are standby (HA)

Subnet Administrator (SA)

The SA is typically an extension of the SM

A passive entity that provides a database
- Subnet topology
- Device types
- Device characteristics

Responds to queries
- Paths between HCAs
- Event notification
- Persistent information
- Switch forwarding tables

Used to keep multiple SMs in sync

InfiniBand Switch Operation

InfiniBand packets are ‘destination routed’ based on the Destination Local Identifier (DLID) field in the header

DLID is a 16 bit address
- 48K values are used for unicast
- 16K values are used for multicast

At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number
- E.g. the InfiniScale III ASIC supports all 48K possible LFT entries
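The LFT lookup described above is a plain array index. The sketch below models it in software purely for illustration; the class and method names are hypothetical, the port numbers are made up, and a real switch ASIC does this in hardware with entries programmed by the SM.

```python
LFT_SIZE = 0xC000  # indices 0x0000-0xBFFF, i.e. 48K entries


class LinearForwardingTable:
    """Toy model of a switch LFT: unicast DLID -> outgoing port number."""

    def __init__(self) -> None:
        self._ports = [None] * LFT_SIZE

    def program(self, dlid: int, out_port: int) -> None:
        """Store the exit port for a unicast DLID (the SM's job in a real fabric)."""
        self._check_unicast(dlid)
        self._ports[dlid] = out_port

    def lookup(self, dlid: int) -> int:
        """Index the table with the DLID and return the exit port."""
        self._check_unicast(dlid)
        port = self._ports[dlid]
        if port is None:
            raise LookupError(f"no LFT entry programmed for DLID {dlid}")
        return port

    @staticmethod
    def _check_unicast(dlid: int) -> None:
        if not 0x0001 <= dlid <= 0xBFFF:
            raise ValueError(f"DLID {dlid:#x} is not a unicast LID")


lft = LinearForwardingTable()
lft.program(37, 1)   # illustrative: LID 37 is reached through port 1
lft.program(14, 5)   # illustrative: LID 14 is reached through port 5
print(lft.lookup(37))
```

A flat list is used here because the table is dense and indexed directly by DLID, which is what makes the lookup constant-time.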

Subnet Management

Each subnet must have a Subnet Manager (SM)

Every entity (CA, SW, Router) must support a Subnet Management Agent (SMA)

The SM handles topology discovery and fabric maintenance; additional SMs are shown as standby

Multipathing: LMC supports multiple LIDs per port (e.g. LMC: 1 gives LID = 6,7)

[Diagram: HCAs, a TCA, and IB switches, each with an SMA, managed by one Subnet Manager with three standby SMs]
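The LMC example on this slide (LMC 1, LIDs 6 and 7) follows from the rule that a port answers to 2^LMC consecutive LIDs starting at its base LID. A minimal sketch, assuming (per my reading of the InfiniBand specification) that LMC ranges from 0 to 7 and that the base LID has its low LMC bits clear; the function name is hypothetical.

```python
def lids_for_port(base_lid: int, lmc: int) -> list:
    """Return every LID a port responds to for a given base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be between 0 and 7")
    count = 1 << lmc  # 2**LMC LIDs per port
    if base_lid % count != 0:
        raise ValueError("base LID must be aligned to 2**LMC")
    return list(range(base_lid, base_lid + count))


print(lids_for_port(6, 1))  # the slide's example: LMC 1, base LID 6
```

Each extra LID can be routed along a different path through the fabric, which is how LMC enables multipathing.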

OpenSM - Features

OpenSM (osm) is an InfiniBand compliant subnet manager.
Included in the Linux OpenFabrics Enterprise Distribution.
Ability to run several instances of osm on the cluster in a Master/Slave(s) configuration for redundancy.
Partitions (p-key) support
QoS support
Enhanced routing algorithms:
• Min-hop
• Up-down
• Fat-tree
• LASH
• DOR

Management Model

Subnet Management Interface (pure InfiniBand management)
• QP0 (virtualized per port)
• Always uses VL15
• MADs called SMPs – LID or Direct-Routed
• No Flow Control
• Agents: Subnet Manager (SM) Agent, Subnet Manager

General Service Interface (other management features)
• QP1 (virtualized per port)
• Uses any VL except 15
• MADs called GMPs - LID-Routed
• Subject to Flow Control
• Agents: Baseboard Management Agent, Communication Mgmt (Mgr/Agent), Performance Management Agent, Device Management Agent, Vendor-Specific Agent, Application-Specific Agent, SNMP Tunneling Agent, Subnet Administration (an Agent)

Running OpenSM

Command line
• Default (no parameters): scans and initializes the IB fabric and will occasionally sweep for changes
• opensm -h for usage flags
  E.g. to start with up-down routing: opensm --routing_engine updn
• Run is logged to two files:
  - /var/log/messages – opensm messages, registers only general major events
  - /var/log/opensm.log - details of reported errors

Start on boot
• As a daemon:
  - /etc/init.d/opensmd start|stop|restart|status
  - /etc/opensm.conf for default parameters
    # ONBOOT
    # To start OpenSM automatically set ONBOOT=yes
    ONBOOT=yes

SM detection
• /etc/init.d/opensmd status
  - Shows opensm runtime status on a machine
• sminfo
  - Shows master and standby subnet managers running on the cluster

OpenSM Command Line parameters

A few important command line parameters:

-c, --cache-options
  Write out a list of all tunable OpenSM parameters, including their current values from the command line as well as defaults for others, into the file /var/cache/opensm. This file can then be modified to change OSM parameters, such as HOQ (Head of Queue timer).

-g, --guid
  This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. This option is used if the SM needs to bind to Port 2 of an HCA.

-R, --routing_engine
  This option chooses a routing engine instead of the Min Hop algorithm (default). Supported engines: updn, file, ftree, lash

-x, --honor_guid2lid
  This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under /var/cache/opensm

-V
  This option sets the maximum verbosity level and forces log flushing.
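When OpenSM is launched from a script, the options above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the helper name is hypothetical, the flags are the ones listed on this slide, and the GUID in the usage line is an illustrative value.

```python
SUPPORTED_ENGINES = ("updn", "file", "ftree", "lash")  # as listed on the slide


def build_opensm_argv(routing_engine=None, guid=None,
                      honor_guid2lid=False, max_verbosity=False):
    """Assemble an opensm command line from the options described above."""
    argv = ["opensm"]
    if routing_engine is not None:
        if routing_engine not in SUPPORTED_ENGINES:
            raise ValueError(f"unsupported routing engine: {routing_engine}")
        argv += ["-R", routing_engine]
    if guid is not None:
        argv += ["-g", guid]      # bind to a specific local port GUID
    if honor_guid2lid:
        argv.append("-x")         # honor the guid2lid file after Standby
    if max_verbosity:
        argv.append("-V")         # maximum verbosity, forced log flushing
    return argv


# Illustrative GUID value; substitute the port GUID of the local HCA.
print(" ".join(build_opensm_argv("updn", guid="0x0002c90200413830")))
```

With no arguments the helper returns just `["opensm"]`, which corresponds to the default Min Hop behavior.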

Routing Algorithms

Min Hop algorithm (DEFAULT)
• Based on the minimum hops to each node where the path length is optimized.

UPDN unicast routing algorithm
• Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
  - Root GUID list file can be specified using the -a option

Fat Tree unicast routing algorithm
• This algorithm optimizes routing for a congestion-free “shift” communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
  - Root GUID list file can be specified using the -a option

Additional algorithms
• LASH. Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing.
• DOR. Provides deadlock-free routes for hypercube and mesh clusters.
• Table Based. A file method which can load routes from a table.
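The hop-count idea behind Min Hop can be illustrated with a breadth-first search over the fabric graph. This is only a sketch of the distance computation, not OpenSM's implementation (which also balances paths across ports); the topology and node names are hypothetical.

```python
from collections import deque


def min_hops(links: dict, source: str) -> dict:
    """Breadth-first search: minimum number of hops from `source` to each node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in hops:          # first visit is the shortest path
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical two-level fabric: two hosts, two leaf switches, one spine.
fabric = {
    "hostA": ["leaf1"],
    "leaf1": ["hostA", "spine1"],
    "spine1": ["leaf1", "leaf2"],
    "leaf2": ["spine1", "hostB"],
    "hostB": ["leaf2"],
}
print(min_hops(fabric, "hostA"))
```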

IB Fabric Protocol Layers

IB Architecture Layers

Software Transport Verbs and Upper Layer Protocols:
- Interface between application programs and hardware
- Allows support of legacy protocols such as TCP/IP
- Defines methodology for management functions

Transport:
- Delivers packets to the appropriate Queue Pair; message assembly/de-assembly, access rights, etc.

Network:
- How packets are routed between different partitions/subnets

Data Link (symbols and framing):
- Flow control (credit-based); how packets are routed from source to destination on the same partition/subnet

Physical:
- Signal levels and frequency; media; connectors

Distributed Computing using IB

[Diagram: packet fields shown include DLID, VL (defines QoS), Pkey, D-QP, hop, and payload length; labels also include ETH, RDMA, and Reliable]

IB Packet

Software Stack

Physical Layer – Link Rate

InfiniBand uses serial stream of bits for data transfer Link Speed

• Single Data Rate (SDR) - 2.5Gb/s per lane (10Gb/s for 4x) • Double Data Rate (DDR) - 5Gb/s per lane (20Gb/s for 4x) • Quad Data Rate (QDR) - 10Gb/s per lane (40Gb/s for 4x) • Fourteen Data Rate (FDR) - 14Gb/s per lane (56Gb/s for 4x) • Enhanced Data rate (EDR) - 25Gb/s per lane (100Gb/s for 4x)

Link width • 1x – One differential pair per Tx/Rx • 4x – Four differential pairs per Tx/Rx • 12x - Twelve differential pairs per

Tx and per Rx Link rate

• Multiplication of the link width and link speed • Most common shipping today is 4x ports

[Figure: a 4X QDR cable carries four 1x links of 10Gbps each, giving 40Gbps transmit (TRX) and 40Gbps receive (RCV)]
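The link rate rule on this slide (link width multiplied by link speed) can be sketched in a few lines of Python. The per-lane figures are the ones listed on the slide; the function and dictionary names are illustrative assumptions, not part of any InfiniBand tool.

```python
# Per-lane signalling rates in Gb/s, as listed on the slide.
LANE_SPEED_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "FDR": 14.0, "EDR": 25.0}


def link_rate_gbps(width: int, speed: str) -> float:
    """Return the raw link rate: link width (lanes) times per-lane speed."""
    if width not in (1, 4, 12):
        raise ValueError("the slide lists link widths of 1x, 4x and 12x")
    return width * LANE_SPEED_GBPS[speed]


for speed in LANE_SPEED_GBPS:
    print(f"4x {speed}: {link_rate_gbps(4, speed):g} Gb/s")
```

These are raw signalling rates; the usable data rate is lower once line encoding is taken into account.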


InfiniBand Electrical Interface (Physical Layer Link Rate)

1X Link is the basic building block
- Differential pair of conductors for RX
- Differential pair of conductors for TX

Link rate per type
- Timed at 2.5 GHz with SDR
- Doubled to 5 GHz with DDR
- Quadrupled to 10 GHz with QDR

[Figure: a 1x link between two ports, one differential pair for TX (TRX) and one for RX (RCV)]


Physical Layer Cont’

Media types
• Printed Circuit Board: several inches
• Copper: 20m SDR, 10m DDR, 7m QDR
• Fiber: 300m SDR, 150m DDR, 100/300m QDR

64/66 encoding on FDR links
• Encoding makes it possible to send digital high-speed signals over a longer distance
• x actual data bits are sent on the line by y bits
• 64/66 * 56 = 54.3Gbps

8/10 bit encoding (SDR, DDR, and QDR)
• x/y line efficiency (example: 80% * 40 = 32Gbps)

Industry standard components
• Copper cables / Connectors
• Optical cables
• Backplane connectors

[Figure: media examples: FR4 PCB, 4X CX4, 4x CX4 Fiber, 4X QSFP Copper]
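The x/y line-efficiency arithmetic can be checked with a short Python sketch. The function name is an illustrative assumption; the inputs are the raw 4x rates and encodings named on the slide. Evaluated exactly, 64/66 of 56 Gb/s comes to roughly 54.3 Gb/s.

```python
from fractions import Fraction


def effective_rate_gbps(raw_gbps: int, data_bits: int, line_bits: int) -> float:
    """Data rate left after x/y encoding: x data bits ride on y line bits."""
    return float(Fraction(data_bits, line_bits) * raw_gbps)


print(effective_rate_gbps(40, 8, 10))             # 4x QDR with 8/10 encoding
print(round(effective_rate_gbps(56, 64, 66), 1))  # 4x FDR with 64/66 encoding
```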


IB Headers

LRH: Local Routing Header – Includes LIDs, SL, etc

BTH: Base Transport Header – includes opcode, destination QP, partition, etc.

[Figure: packet layouts]
• IB Headers | Encapsulation Header | IP Datagram | IB CRC
• IB Headers | Encapsulation Header | ARP | IB CRC
• IB Headers = LRH (Link Layer), GRH (Network Layer), BTH and DETH (Transport Layer), …, ICRC, VCRC


Link Layer Message Flow Example

• Application accesses HW to post message request (incoming message size: up to 2 GByte)
• HW schedules execution
• HW dis-assembles the message into routable units (packets) for transfer; valid IB routable unit size is 256 bytes to 4 KBytes
• HW sends packets on the serial link

[Figure: transactions made of messages, each message split into packets and sent on the link]
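The dis-assembly step can be sketched as splitting a message into MTU-sized routable units. This is a toy model, not HCA firmware: the function name is hypothetical, and the intermediate MTU values (512, 1024, 2048) are an assumption drawn from the InfiniBand specification, since the slide only gives the 256-byte to 4-KByte range.

```python
VALID_MTUS = (256, 512, 1024, 2048, 4096)  # bytes; the slide gives only the range
MAX_MESSAGE = 2**31                        # "up to 2 GByte" per the slide


def segment_message(length: int, mtu: int = 2048) -> list[int]:
    """Return the payload size of each packet needed to carry `length` bytes."""
    if mtu not in VALID_MTUS:
        raise ValueError(f"MTU must be one of {VALID_MTUS}")
    if not 0 <= length <= MAX_MESSAGE:
        raise ValueError("message length out of range")
    full, rest = divmod(length, mtu)
    sizes = [mtu] * full
    if rest or length == 0:
        sizes.append(rest)  # last (or only) packet carries the remainder
    return sizes


print(segment_message(5000, mtu=2048))
```

The receiving HCA performs the reverse operation, reassembling the packets into the original message before writing it to memory.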


Link Layer Priority Implementation SL to VL Mapping

• Packet specifies service level
• Virtual lanes: each link in the fabric may support a different number of VLs
• Data sent on serial link

LRH: Local Routing Header – Includes LIDs, SL, etc

IB Headers = LRH GRH BTH DETH … ICRC VCRC

[Figure: transactions and messages split into packets, which are placed on virtual lanes of the link]


Link Layer Priority Implementation SL to VL Mapping

• Packet specifies service level
• Service level is mapped to a Virtual Lane
• Virtual lanes: each link in the fabric may support a different number of VLs
• Flow control: credit-based flow control per VL
• Data sent on serial link

[Figure: packets of a message mapped onto the virtual lanes of one physical link]
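Because each link may support a different number of VLs, the same service level can land on different virtual lanes on different links. The sketch below illustrates this with a made-up round-robin mapping policy; in a real fabric the SL-to-VL tables are programmed by the Subnet Manager, and the function name here is hypothetical.

```python
NUM_SLS = 16  # the SL field in the LRH is 4 bits wide


def build_sl2vl(num_data_vls: int) -> list[int]:
    """Hypothetical policy: spread the 16 service levels over the data VLs."""
    if not 1 <= num_data_vls <= 15:
        raise ValueError("a port supports between 1 and 15 data VLs")
    return [sl % num_data_vls for sl in range(NUM_SLS)]


wide_link = build_sl2vl(8)    # a link that supports 8 data VLs
narrow_link = build_sl2vl(2)  # a link that supports only 2 data VLs

packet_sl = 5  # service level carried in the packet's LRH
print(wide_link[packet_sl], narrow_link[packet_sl])
```

The packet keeps the same SL end to end; only the VL chosen at each hop changes.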


Link Layer Message Flow Example

• Data is written into the HCA input buffer per VL (Virtual Lane input buffers)
• HW schedules execution of the message to system memory
• Data is written to / read from system memory by HW

[Figure: packets arriving on the link are reassembled into messages and transactions]


Link Layer – Flow Control

Credit-based link-level flow control
• Link flow control assures NO packet loss within the fabric, even in the presence of congestion
• Link receivers grant packet receive buffer space credits per Virtual Lane
• Flow control credits are issued in 64 byte units

Separate flow control per Virtual Lane provides:
• Alleviation of head-of-line blocking
• Virtual Fabrics: congestion and latency on one VL does not impact traffic with guaranteed QoS on another VL, even though they share the same physical link

[Figure: transmitter (arbitration, mux, link control) sends packets to the receiver (de-mux, receive buffers, link control), which returns credits]
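The credit mechanism can be illustrated with a small toy model. It is only a sketch under simplifying assumptions (one counter per VL, whole-packet credit accounting); the class and method names are hypothetical. Only the 64-byte credit unit comes from the slide.

```python
import math

CREDIT_BYTES = 64  # the slide states that credits are issued in 64-byte units


class VirtualLaneReceiver:
    """Toy receive buffer for one virtual lane (hypothetical model)."""

    def __init__(self, buffer_bytes: int) -> None:
        self.free_credits = buffer_bytes // CREDIT_BYTES

    @staticmethod
    def credits_for(packet_bytes: int) -> int:
        return math.ceil(packet_bytes / CREDIT_BYTES)

    def try_send(self, packet_bytes: int) -> bool:
        """Sender transmits only if the receiver has granted enough credits."""
        needed = self.credits_for(packet_bytes)
        if needed > self.free_credits:
            return False  # sender waits; the packet is never dropped
        self.free_credits -= needed
        return True

    def drain(self, packet_bytes: int) -> None:
        """Receiver forwarded a packet, so its credits are returned."""
        self.free_credits += self.credits_for(packet_bytes)


lanes = {0: VirtualLaneReceiver(4096), 1: VirtualLaneReceiver(4096)}
results = [lanes[0].try_send(2048) for _ in range(3)]  # third packet must wait
vl1_ok = lanes[1].try_send(2048)  # VL1 has its own credits, so it is unaffected
lanes[0].drain(2048)
retry_ok = lanes[0].try_send(2048)
print(results, vl1_ok, retry_ok)
```

Keeping a separate credit pool per VL is what lets traffic on VL1 proceed while VL0 is stalled.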


InfiniBand Network Stack

[Figure: protocol stacks of an InfiniBand node, an InfiniBand switch, a router and a legacy node. End nodes: Application, Transport Layer, Network Layer, Link Layer, Physical Layer. Switch and router: packet relay, Link, PHY, buffers. Side labels: User code, Kernel code, Hardware.]


• The kernel code is divided logically into three layers:
  • Upper level protocols
  • Core InfiniBand modules
  • HCA driver(s)

• Open Fabrics Enterprise Distribution (MELLANOX_OFED) is a complete SW stack for RDMA capable devices.

Mellanox InfiniBand Software Stack

The Kernel Code


Transport Layer: Queue Pairs

• QPs are in pairs (Send/Receive)
• Every active connection/session is assigned an individual Work Queue Pair
• The Work Queue is the consumer/producer interface to the fabric
• The consumer/producer initiates a Work Queue Element (WQE)
• The Channel Adapter executes the work request
• The Channel Adapter notifies on completion or errors by writing a Completion Queue Element (CQE) to a Completion Queue (CQ)


Transport Layer: Work Request (Work Queue Pair)

Data transfer
• Send work request
  - Local gather, remote write
  - Remote memory read
  - Atomic remote operation
• Receive work request
  - Scatter received data to local buffer(s)

Memory management operations
• Bind memory window
  - Open part of local memory for remote access
• Send & remote invalidate
  - Close remote window after the operations complete

Control operations
• Memory registration/mapping
• Open/close connection (QP)

[Figure: Host A and Host B, each with RAM, Send Queue, Receive Queue, Completion Queue and an HCA; send buffer on one side, receive buffer on the other]


Transport Layer – Send operation example

Receive side
• The receive-side application allocates a receive buffer in user-space virtual memory and registers it with the HCA
• It places a receive Work Request on the Receive Queue (ready to receive)

Send side
• The send side allocates a send buffer in user-space virtual memory and registers it with the HCA
• It places a send request on the Send Queue
• The HCA then executes the send request, reads the buffer from host RAM and sends it to the remote side (HCA)

On arrival
• When the packet arrives at the HCA, it executes the receive WQE commands
• It places the buffer content in the appropriate location
• It generates a Completion Queue Element

[Figure: Host A and Host B, each with RAM, Send Queue, Receive Queue, Completion Queue and an HCA; steps 1 to 5 move data from the send buffer to the receive buffer]
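The send flow can be mimicked with a toy Python model. This is only an illustration of the queue semantics described on the slide, not the verbs API: `QueuePair`, `post_recv`, `post_send` and `hca_process` are hypothetical names, and memory registration is left out.

```python
from collections import deque


class QueuePair:
    """Toy QP: a send queue, a receive queue and a completion queue."""

    def __init__(self) -> None:
        self.send_queue: deque = deque()
        self.recv_queue: deque = deque()
        self.completions: deque = deque()


def post_recv(qp: QueuePair, buffer: bytearray) -> None:
    qp.recv_queue.append(buffer)  # receive WQE: where incoming data may land


def post_send(qp: QueuePair, data: bytes) -> None:
    qp.send_queue.append(data)    # send WQE: what to transmit


def hca_process(sender: QueuePair, receiver: QueuePair) -> None:
    """Execute one send WQE and consume one receive WQE on the remote side."""
    if not receiver.recv_queue:
        raise RuntimeError("receiver not ready: no receive WQE posted")
    data = sender.send_queue.popleft()
    buffer = receiver.recv_queue.popleft()
    if len(data) > len(buffer):
        raise ValueError("receive buffer too small")
    buffer[: len(data)] = data                         # place content in the buffer
    receiver.completions.append(("recv", len(data)))  # completion on the receive side


host_a, host_b = QueuePair(), QueuePair()
rx_buffer = bytearray(16)
post_recv(host_b, rx_buffer)  # receive side must be ready first
post_send(host_a, b"hello")
hca_process(host_a, host_b)
print(bytes(rx_buffer[:5]), list(host_b.completions))
```

The ordering matters: in this model, as on the slide, the receiver has to post its buffer before the sender's request is executed.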


Transport Layer – RDMA Write Example

Remote (target) side
• The application performs memory registration
• It passes the address and keys to the remote side
• No HCA receive queue is assigned

Send side
• The send side allocates a send buffer in user-space virtual memory and registers it with the HCA
• It places a send request on the Send Queue with the remote side's virtual address and the remote permission key
• The HCA then executes the send request commands
• It reads the buffer and sends it to the remote side
• A send completion is generated

On arrival
• When the packet arrives at the HCA, it checks the address and memory keys
• It writes to host memory directly
• No use of HCA queues

[Figure: Host A and Host B, each with RAM, Send Queue, Receive Queue, Completion Queue and an HCA; steps 1 to 4 of the RDMA write]
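The address and key check on the target side can be sketched as follows. This is a toy model with hypothetical names (`MemoryRegion`, `rdma_write`), not the behaviour of any specific HCA; it only shows why the initiator needs both the remote virtual address and the remote key.

```python
class MemoryRegion:
    """Toy registered memory region on the target host."""

    def __init__(self, base_addr: int, size: int, rkey: int) -> None:
        self.base_addr = base_addr
        self.buffer = bytearray(size)
        self.rkey = rkey


def rdma_write(region: MemoryRegion, remote_addr: int, rkey: int, data: bytes) -> None:
    """Target HCA: validate key and bounds, then write directly into memory."""
    if rkey != region.rkey:
        raise PermissionError("remote key does not match the registration")
    offset = remote_addr - region.base_addr
    if offset < 0 or offset + len(data) > len(region.buffer):
        raise ValueError("write falls outside the registered region")
    region.buffer[offset : offset + len(data)] = data  # no receive WQE involved


region = MemoryRegion(base_addr=0x1000, size=64, rkey=0x1234)
rdma_write(region, remote_addr=0x1008, rkey=0x1234, data=b"abc")
print(bytes(region.buffer[8:11]))
```

Unlike the send example, nothing is consumed from a receive queue: the target application is not involved in the transfer once the region is registered.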


Transport Services

[Figure: QPs (send/receive queues) and CQs (cmd/cqe) for each transport service]

Transport services by connection type and reliability:
• Non-connected, unreliable: UD (e.g. multicast; not used for RDMA)
• Non-connected, reliable: RD
• Connected, unreliable: UC
• Connected, reliable: RC (e.g. RDMA), XRC


InfiniBand Fabric Topologies

HOSTS/End Nodes

Leaf/Edge/Line Switches

Spine /Core Switches


Min Hop algorithm (DEFAULT)
• Based on the minimum hops to each node where the path length is optimized.

UPDN unicast routing algorithm
• Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
  - Root GUID list file can be specified using the -a option

Fat Tree unicast routing algorithm
• This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
  - Root GUID list file can be specified using the -a option

Additional algorithms
• LASH - Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing.
• DOR - Provides deadlock-free routes for hypercube and mesh clusters.
• Table Based - A file method which can load routes from a table.

InfiniBand Route
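The idea behind minimum-hop routing can be illustrated with a breadth-first search over the switch graph. This is a generic sketch, not the Subnet Manager's implementation; the small two-leaf, two-spine fabric and the function name are assumptions made for the example.

```python
from collections import deque


def min_hops(links: dict[str, list[str]], source: str) -> dict[str, int]:
    """Breadth-first search: fewest links from `source` to every reachable switch."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


# Hypothetical fabric: two leaf switches, each cabled to both spine switches.
FABRIC = {
    "L1-1": ["L2-1", "L2-2"],
    "L1-4": ["L2-1", "L2-2"],
    "L2-1": ["L1-1", "L1-4"],
    "L2-2": ["L1-1", "L1-4"],
}

print(min_hops(FABRIC, "L1-1"))
```

Both spines give an equally short path between the two leaves; choosing among such equal-cost paths, and avoiding loops that can deadlock, is where the routing algorithms listed above differ.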


InfiniBand Topology

Topologies that are mainly in use for large clusters
• Fat-Tree
• 3D Torus
• Mesh

Fat-tree (also known as CBB, constant bisectional bandwidth)
• Flat network, can be set as an oversubscribed network or not
  - In other words, blocking or non-blocking
• Typically the lowest latency network

3D Torus
• An oversubscribed network, easier to scale
• Fits more applications with locality

[Figure: 3x3 grid of nodes labeled 0,0 through 2,2]


The IB Fabric Basic Building Block

A single 36-port IB switch chip is the basic building block for every IB switch module.
We create a multi-port switching module using multiple chips.
In this example we create a 72-port switch using 6 identical chips:
• 4 chips will function as lines
• 2 chips will function as core

[Figure: Mellanox 36-port ASIC (switch) used as Edge/Leaf/Line chips and as Spine/Core chips]


CLOS Topology

Pyramid-shaped topology
The switches at the top of the pyramid are called Spines/Core
• The Core/Spine switches are interconnected to the other switch environments
The switches at the bottom of the pyramid are called Leafs/Lines
• The Leaf/Line/Edge switches are connected to the fabric nodes/hosts
In a non-blocking CLOS fabric there is an equal number of external and internal connections
External connections:
• The connections between the Core and the Line switches
Internal connections:
• The connections of hosts to the Line switches
In a non-blocking fabric there is always balanced bidirectional bandwidth
If the number of internal connections is higher, we have a blocking configuration


CLOS - 3

The topology detailed here is called CLOS 3
The path between source and destination includes 3 hops
Example: a session between A and B
• One hop from A to switch L1-1
• Next hop from switch L1-1 to switch L2-1
• Last hop from L2-1 to L1-4

In this example we can see a 108-port non-blocking fabric
• 108 hosts are connected to the Line switches
• 108 links connect the Line switches to the Core switches to enable non-blocking interconnection of the Line switches

[Figure: Line switches (L1-1 and L1-4 labeled) with 18 hosts each, 18 * 6 = 108; Core switches (L2-1 and L2-2 labeled); bundles of 9 links between Line and Core switches]
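The port arithmetic for a non-blocking two-level fabric can be sketched as below. The sketch assumes every leaf splits its ports evenly between hosts and uplinks and that each spine is a single chip of the same radix; the function name is hypothetical, and a real chassis may group spine chips differently from what this simple count suggests.

```python
import math


def two_level_clos(leaf_count: int, radix: int = 36) -> dict[str, int]:
    """Size a non-blocking two-level fabric built from identical switch chips."""
    if not 1 <= leaf_count <= radix:
        raise ValueError("each spine port serves one leaf, so leaf_count <= radix")
    down_ports = radix // 2           # host-facing ports per leaf
    up_ports = radix - down_ports     # uplinks per leaf (equal, so non-blocking)
    uplinks = leaf_count * up_ports
    return {
        "hosts": leaf_count * down_ports,
        "uplinks": uplinks,
        "spines": math.ceil(uplinks / radix),
    }


print(two_level_clos(4))  # the 72-port module built from 6 chips
print(two_level_clos(6))  # the 108-port example
```

With 4 leaf chips this reproduces the 72-port, 6-chip module (4 line chips plus 2 core chips), and with 6 leaves it gives the 108 hosts and 108 uplinks of the example.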


CLOS - 5

The topology detailed here is called CLOS 5
The path between source and destination includes 5 hops
Example: a session between A and B
1. One hop from A to switch L1-1
2. Next hop from switch L1-1 to switch L2-1
3. Next hop from L2-1 to L3-1
4. Next hop from L3-1 to L2-4
5. Next hop from L2-4 to L1-8

[Figure: three switch levels: L1-1 to L1-8 (hosts A and B attached), L2-1 to L2-4, L3-1 and L3-2]


InfiniBand Cluster Diagnostics


Cluster utilities

Integrated diagnostic tools
• Query cluster topology and indicate any port errors, link width, or link speed mismatch
• Automate calls to many "low level" operations

Easy to use
• Similar flags, logs and reports for both tools
• Report using meaningful names when a topology file is provided


IB commands list


Determine if driver is loaded

/etc/init.d/openibd status
• HCA driver is loaded
• Configured devices
  - ib0
  - ib1
• OFED modules are loaded
  - ib_ipoib
  - ib_mthca
  - ib_core
  - ib_srp

SM status
• sminfo
# sminfo
sminfo: sm lid 1 sm guid 0x2c9030002cb6a, activity count 416348 priority 0 state 3 SMINFO_MASTER


Ibdiagnet Tool

Ibdiagnet is an integrated InfiniBand fabric diagnostics command line tool.
It scans the IB fabric using directed / LID route packets and extracts the available information regarding its connectivity and device status.
It then checks for errors in the following scopes:
• Ports (counter thresholds, port state)
• Nodes (firmware versions, LID assignments)
• Links (link speed and width, cable info)
• Fabric (topology matching, Subnet Manager, routing)

Errors are reported to screen and saved in a log file


ibdiagnet

/usr/bin/ibdiagnet

Use --help or man ibdiagnet for detailed information.
The tool runs and prints a short report. The detailed reports are under /tmp (or as specified with the -o flag).

Common usage (example):
Run ibdiagnet: expect DDR links (-ls 5); expect 4x links (-lw 4x); dump all files to /tmp/mydir
ibdiagnet -pm -ls 5 -lw 4x -o /tmp/mydir

Output files:
ibdiagnet.log - A dump of all the application reports generated according to the provided flags
ibdiagnet.lst - List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs - A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks - In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm - List of all the SMs (state and priority) in the fabric
ibdiagnet.pm - A dump of the PM counter values of the fabric links
ibdiagnet.pkey - A dump of the existing partitions and their member host ports
ibdiagnet.mcg - A dump of the multicast groups, their properties and member host ports
ibdiagnet.db - A dump of the internal subnet database. This file can be loaded in later runs using the load_db option


ibdiagnet

OFED Tools

Performs InfiniBand fabric diagnostic. Issued on the Linux InfiniBand host.
ibdiagnet [-c count][-v][-r][-o outputdir][-t topology][-s system][-i device][-p port][-wt topology][-pm][-pc][-P PM=value][-lw 1x|4x|12x][-ls 2.5|5|10][-skip checks][-load_db file][-h][-V]

-i <dev-index> -p <port-num>
• Device index (0..N) and port number connected to the network
-o <out-dir>
• Directory to output the reports to
-lw <1x|4x|12x> -ls <2.5|5|10>
• Link speed and width checked on every port on the network
-pm -pc
• Perform extensive error counter check or clear counters, respectively
-r
• Extensive additional checks performed
-P
• Sets threshold for error levels. Also checks for errors of counters based on the absolute value of the error counter. When not using the -P flag, error thresholds are only triggered based on how many errors were incremented DURING the ibdiagnet run.
-c
• Packets to be sent on each link for error level checking
-h -v -V
• Help, verbosity and revision flags, respectively


Ibdiagnet usage (Fabric Cleaning)

Ibdiagnet is particularly useful in finding misconfigured links (speed/width), topology mismatches, and marginal link/cable issues.

Typical usage:
• Clear all port counters using 'ibdiagnet -pc'
• Stress the cluster
• Check the cluster using 'ibdiagnet -lw 4x -ls 5 -P all=1'
  - Checks for link speed, link width, and port error counters greater than 1
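The clear-then-check sequence can be captured in a small helper that only builds the command lines, using the flags quoted on this slide. The helper name and its parameters are hypothetical, and the stress step is left out because it depends on the site's own workload.

```python
import shlex


def fabric_cleaning_commands(link_width: str = "4x", link_speed: str = "5",
                             threshold: int = 1) -> list[list[str]]:
    """Return the ibdiagnet invocations for the clear and check steps."""
    clear = ["ibdiagnet", "-pc"]
    check = ["ibdiagnet", "-lw", link_width, "-ls", link_speed,
             "-P", f"all={threshold}"]
    return [clear, check]


for argv in fabric_cleaning_commands():
    print(shlex.join(argv))  # pass argv to subprocess.run() on a fabric host
```

Keeping the commands as argument lists avoids shell quoting problems if they are later executed with `subprocess.run`.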


Cluster utilities - ibnetdiscover

Reports a complete topology of the cluster

Shows all interconnect connections, reporting:
• Port LIDs
• Port GUIDs
• Host names
• Link speed

A GUID-to-name file can be used for a more readable topology with regard to switch devices


ibnetdiscover

[Figure: annotated ibnetdiscover output. Annotations: Vendor ID; Device ID; Image GUID; chip GUID on the line board; line board name; card slot 4; box type 1; chip no. 1; LID 5, common to all ports of this chip; port no. 1 of a chip on this switch line board; chip GUID on the fabric board; fabric board name; fabric board spine slot 1; chip no. 1, spine 1; LID 3, common to all ports of this chip; port no. 7 of a chip on this switch fabric board (spine 1); link current status]


HCA Device information

ibstat
• Displays basic information obtained from the local IB driver.
• Normal output includes firmware version, GUIDs, LID, SMLID, port state, active link width, and port physical state.
• Has options to list CAs and/or ports.


Determine modules that are loaded

lsmod
• ib_core
• ib_mthca
• ib_mad
• ib_sa
• ib_cm
• ib_uverbs
• ib_srp
• ib_ipoib

modinfo 'module name'
• Lists all parameters accepted by the module
• Module parameters can be added to /etc/modprobe.conf


iblinkinfo

[root@raven1 ~]# iblinkinfo
(first two columns: LID, port number)
Switch 0x0008f104003f5d15 ISR2012/ISR2004 Voltaire sLB-2024:
6 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 16[ ] "ISR2004 Voltaire sFB-2004" ( )
6 2[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 17[ ] "ISR2004 Voltaire sFB-2004" ( )
6 3[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 18[ ] "ISR2004 Voltaire sFB-2004" ( )
6 4[ ] ==( Down/ Polling)==> [ ] "" ( )
6 5[ ] ==( Down/ Polling)==> [ ] "" ( )
6 6[ ] ==( Down/ Polling)==> [ ] "" ( )
6 7[ ] ==( Down/ Polling)==> [ ] "" ( )
6 8[ ] ==( Down/ Polling)==> [ ] "" ( )
6 9[ ] ==( Down/ Polling)==> [ ] "" ( )
6 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 16[ ] "ISR2004 Voltaire sFB-2004" ( )
6 11[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 17[ ] "ISR2004 Voltaire sFB-2004" ( )

Switch 0x0008f104003f5d14 ISR2012/ISR2004 Voltaire sLB-2024:
5 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 13[ ] "ISR2004 Voltaire sFB-2004" ( )
5 2[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 14[ ] "ISR2004 Voltaire sFB-2004" ( )

5 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 15[ ] "ISR2004 Voltaire sFB-2004" ( )
5 13[13] ==( 4X 5.0 Gbps Active/ LinkUp)==> 8 18[ ] "ISR9024D-M Voltaire" ( )
5 14[14] ==( Down/ Polling)==> [ ] "" ( )
5 15[15] ==( 4X 5.0 Gbps Active/ LinkUp)==> 10 1[ ] "raven5 HCA-1" ( )
5 16[16] ==( Down/ Polling)==> [ ] "" ( )
5 17[17] ==( 4X 5.0 Gbps Active/ LinkUp)==> 21 34[ ] "Voltaire 4036 # 4036-0036" ( )
5 18[18] ==( Down/ Polling)==> [ ] "" ( )
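Output like this is easy to scan by script. The sketch below extracts the ports reported as Down from lines in the layout shown on this slide. The regular expression is an assumption fitted to this sample only; iblinkinfo output differs between versions, so it would need adjusting before real use.

```python
import re

# Matches "<LID> <port>[..] ==( <state text> )==>" as it appears in the sample above.
LINK_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\[[^\]]*\]\s*==\(\s*(.*?)\s*\)==>")

SAMPLE = '''\
Switch 0x0008f104003f5d15 ISR2012/ISR2004 Voltaire sLB-2024:
6 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 16[ ] "ISR2004 Voltaire sFB-2004" ( )
6 4[ ] ==( Down/ Polling)==> [ ] "" ( )
5 14[14] ==( Down/ Polling)==> [ ] "" ( )
'''


def down_ports(text: str) -> list[tuple[int, int]]:
    """Return (LID, port) for every link line whose state starts with 'Down'."""
    found = []
    for line in text.splitlines():
        match = LINK_LINE.match(line)
        if match and match.group(3).startswith("Down"):
            found.append((int(match.group(1)), int(match.group(2))))
    return found


print(down_ports(SAMPLE))
```

A Down/Polling port is not necessarily a fault; it may simply have no cable attached, so the list is a starting point for checking against the expected topology.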


ibv_devinfo
• Reports similar information to ibstat
• Also includes PSID and an extended verbose mode (-v)

OFED Tools


ibportstate Manages the state and link speed of an InfiniBand port. Issued on the Linux InfiniBand host.

OFED Tools

ibportstate [-d][-D][-e][-G][-h][-s smlid][-v][-C ca_name][-P ca_port][-t timeout] lid|dr_path|guid port [op]


Perfquery Queries InfiniBand port counters. Issued on the Linux InfiniBand host.

OFED Tools

perfquery [-d][-e][-G][-h][-a][-l][-r][-R][-v][-V][-C ca_name][-P ca_port][-t timeout][lid|guid [[port][reset_mask]]]


Ibhosts Displays host nodes. Issued on the Linux InfiniBand host.

OFED Tools

ibhosts [-h][topology|-C ca_name][-P ca_port][-t timeout]


Ibnodes Displays InfiniBand nodes in topology. Issued on the Linux InfiniBand host.

OFED Tools

ibnodes [-h][topology|-C ca_name][-P ca_port][-t timeout]


Ibswitches Displays InfiniBand switch node in the topology. Issued on the Linux InfiniBand host.

OFED Tools

ibswitches [-h][topology|-C ca_name][-P ca_port][-t timeout]


sminfo Queries the InfiniBand SMInfo attribute. Issued on the Linux InfiniBand host.

OFED Tools

sminfo [-d][-e] -s state -p priority -a activity [-D][-G][-h][-V][-C ca_name][-P ca_port][-t timeout] smlid|smdr_path


Clear counter and error report

ibclearcounters
# ibclearcounters
# Summary: 74 nodes cleared 0 errors

ibclearerrors
# ibclearerrors
# Summary: 5 nodes cleared 0 errors


Performance tests

Run performance tests
• /usr/bin/ib_write_bw
• /usr/bin/ib_write_lat
• /usr/bin/ib_read_bw
• /usr/bin/ib_read_lat
• /usr/bin/ib_send_bw
• /usr/bin/ib_send_lat

Usage • Server: <test name> <options> • Client: <test name> <options> <server IP address>


Ib_write_bw


Fabric Cleaning & Debug


Troubleshooting (Cont)

1. Be sure your switch and hosts are powered on
2. Be sure cables are plugged in properly
3. Check that the SM is running
4. Log in to the master switch CLI
- Run the command sm-info show and make sure that sm mode is enabled and sm state is master
- Run the command sm-info show a few times and make sure the sm activity counter is progressing
- In case the sm mode is disabled, enable it by typing the sm-info mode set enable command
- In case the sm state is not master, it means that another switch or node in the fabric is running another SM that may be the master


Fabric Troubleshooting using IB Tools (Host)

1) Run ibdiagnet command to see if any errors are being reported on the fabric.

2) If errors are detected, run ibclearcounters to clear all counters

3) Run ibclearerrors to clear all reported errors (this creates a clean baseline)

4) Run ibdiagnet again. If no errors are reported, run some traffic and re-check. If errors are reported, view the system logs, isolate the problem and take corrective action.

5) Re-run through steps 1-4 until error free


Description of Port Counter Fields

Item | Description
Platform Type | ISR9288, ISR9024, HCA400
ModuleType | GER-400 (Ethernet to InfiniBand router); FCR-400 (Fibre Channel to InfiniBand router); sLB24 (ISR9288 line board); sFB12 (ISR9288 fabric board); System (ISR9024, HCA400)
ModuleIndex | The number of the module where the port exists; 0 for HCA or ISR9024
Port | External port number, or "internal#" for internal ports
Name | Node name where possible, otherwise the node's GUID
NodeIP | The IP address associated with this node where possible, otherwise 0.0.0.0
DeviceID | Mellanox product device ID
MLID(#JoinedGroups) | The multicast group ID that this port belongs to
PeerLID | The remote connected port's LID
PeerIBPort | The remote connected port's internal port number
PeerPortGUID | The remote connected port's GUID
PeerPlatformType | The remote connected port's platform (ISR9288, ISR9024, HCA400)
PeerName | The remote connected port's node name (node name or GUID)
PeerModuleType | The remote connected port's module type (see ModuleType)
PeerModuleIndex | The number of the module where the remote connected port exists (see ModuleIndex)
PeerPort | The external port number, or "internal#" for internal ports, where the remote connection exists
Status | OK or ALERT (port counter exceeds threshold, 1X link, etc.)


Description of Port Error Counters

Counter | Description | Importance
SymbolErrorCounter | Total number of symbol errors received on one or more lanes. | This counter can increase without indicating a significant problem.
LinkErrorRecoveryCounter | Total number of times the Port Training state machine has successfully completed the link error recovery process. | If SymbolErrors are increasing quickly AND this counter is increasing, it may indicate a bad link.
LinkDownedCounter | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. | This counter is typically a true indication of the number of times the port has gone down (usually for valid reasons).
PortRcvErrors | Total number of packets containing an error that were received on a port. These errors include: local physical errors (CRC, VCRC, FCCRC and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors; malformed link packet errors; packets discarded due to buffer overrun. | This counter should not be increasing; a constantly increasing number probably indicates a bad link.
PortRcvRemotePhysicalErrors | Total number of packets marked with the EBP delimiter received on the port. | This indicates that a problem is occurring ELSEWHERE in the fabric and that this port received a packet that was intentionally corrupted by another switch in the fabric.
PortRcvSwitchRelayErrors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. Reasons for this include: DLID mapping; VL mapping; looping (output port = input port). | This counter can increase due to valid events occurring in the network.


Description of Port Error Counters

Counter | Description | Importance
PortXmitDiscards | Total number of outbound packets discarded by the port because the port is down or congested. Reasons for this include: output port is in the inactive state; packet length exceeded neighbor MTU; switch lifetime limit exceeded; switch HOQ limit exceeded. | Typically will not increase. If it does, it may indicate that HOQ or another parameter should be tuned. Please contact Mellanox Customer Support.
PortXmitConstraintErrors | Total number of packets not transmitted from the port for the following reasons: FilterRawOutbound is true and packet is raw; PartitionEnforcementOutbound is true and packet fails partition check, IP version check, or transport header version check. | Typically will not increase. If it does, it may indicate that a parameter should be tuned. Please contact Mellanox Customer Support.
PortRcvConstraintErrors | Total number of packets received on the port that are discarded for the following reasons: FilterRawInbound is true and packet is raw; PartitionEnforcementInbound is true and packet fails partition check, IP version check, or transport header version check. | Typically will not increase. If it does, it may indicate that a parameter should be tuned. Please contact Mellanox Customer Support.
LocalLinkIntegrityErrors | The number of times that the frequency of packets containing local physical errors exceeded local_phy_errors. | This counter increasing usually indicates a bad link.
ExcessiveBufferOverrunErrors | The number of times that overrun_errors consecutive flow control update periods occurred with at least one overrun error in each period (see Table 126 PortInfo on page 665 of IB spec). | Typically will not increase. If it does, it may indicate that a parameter should be tuned. Please contact Mellanox Customer Support.


Description of Port Error Counters

Counter | Description | Importance
VL15Dropped | Number of incoming VL15 packets dropped due to resource limitations (lack of buffers) on the port selected by PortSelect. | This counter increasing in small increments is not seen as a problem.
PortXmitData | Total number of data octets, divided by 4, transmitted on all VLs from the port selected by PortSelect. This includes all octets between (and not including) the start of packet delimiter and VCRC. It excludes all link packets. |
PortRcvData | Total number of data octets, divided by 4, received on all VLs from the port selected by PortSelect. This includes all octets between (and not including) the start of packet delimiter and VCRC. It excludes all link packets. |
PortXmitPackets | Total number of packets, excluding link packets, transmitted on all VLs from the port. |
PortRcvPackets | Total number of packets, excluding link packets, received on all VLs from the port. |
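Because PortXmitData and PortRcvData count octets divided by 4, a raw reading has to be multiplied by 4 before it is treated as bytes. The sketch below turns two samples of either counter into an average throughput. The function names are illustrative, and the sketch assumes the counter was neither reset nor saturated between the two samples.

```python
def octets_from_counter(data_counter):
    """PortXmitData/PortRcvData hold octets divided by 4, so multiply back."""
    return data_counter * 4


def throughput_gbps(counter_start, counter_end, interval_s):
    """Average throughput in Gb/s between two samples of a data counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    octets = octets_from_counter(counter_end - counter_start)
    return octets * 8 / interval_s / 1e9
```

For example, a counter that advances by 1,250,000,000 in one second corresponds to 5 GB of payload, or 40 Gb/s.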


How to Identify Problems

Status = ALERT
Width = 1X
Increasing error counters

Counter | Importance
SymbolError | Can increase without a significant problem present
LinkErrorRecovery | Increasing SymbolErrors and LinkErrorRecovery errors may indicate a bad link
LinkDowned | Indicates number of times the port has gone down (usually for valid reasons)
PortRcvErrors | This counter should not be increasing. Increasing number indicates a bad link
PortRcvRemotePhysicalErrors | This indicates that a problem is occurring ELSEWHERE in the fabric and that this port received a packet that was intentionally corrupted by another switch in the fabric
PortRcvSwitchRelayErrors | Does not indicate a problem
PortXmitDiscards | May indicate HOQ or other parameter should be tweaked
PortXmitConstraintErrors | May indicate that a parameter should be tweaked
PortRcvConstraintErrors | May indicate that a parameter should be tweaked
LocalLinkIntegrityErrors | Counter should not be increasing. Increasing number indicates a bad link
ExcessiveBufferOverrunErrors | May indicate that a parameter should be tweaked
VL15Dropped | This counter increasing in small increments is not seen as a problem.
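The rules in this table can be encoded as a simple triage function. The sketch below is one possible reading of the table, not a Mellanox tool: it takes the increase in each counter since the previous sample (a plain dict, keyed by the short counter names used above) together with the port status and width, and returns human-readable findings. The function name, the default arguments, and the message wording are all assumptions made for illustration.

```python
BAD_LINK = {"PortRcvErrors", "LocalLinkIntegrityErrors"}
TUNING = {
    "PortXmitDiscards",
    "PortXmitConstraintErrors",
    "PortRcvConstraintErrors",
    "ExcessiveBufferOverrunErrors",
}


def assess_port(deltas, status="OK", width="4X"):
    """Return findings for one port given counter increases since last sample."""
    findings = []
    if status == "ALERT":
        findings.append("status is ALERT")
    if width == "1X":
        findings.append("link width is 1X")
    grew = {name for name, delta in deltas.items() if delta > 0}
    for name in sorted(grew & BAD_LINK):
        findings.append(f"{name} increasing: possible bad link")
    # SymbolError alone is tolerated; together with LinkErrorRecovery it is not.
    if {"SymbolError", "LinkErrorRecovery"} <= grew:
        findings.append(
            "SymbolError and LinkErrorRecovery both increasing: possible bad link"
        )
    if "PortRcvRemotePhysicalErrors" in grew:
        findings.append(
            "PortRcvRemotePhysicalErrors increasing: problem elsewhere in the fabric"
        )
    for name in sorted(grew & TUNING):
        findings.append(f"{name} increasing: a parameter may need tuning")
    return findings
```

Counters the table treats as benign (SymbolError on its own, LinkDowned, PortRcvSwitchRelayErrors, small VL15Dropped increments) produce no finding.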


Common Host Problems

Bad cabling

HCA problem

IPoIB interface problem

Missing Configuration

SM problem

Let’s start by checking the basics


Unified Fabric Manager (UFM)


What Is UFM?

Monitor and Troubleshoot
• Monitor and analyze traffic and fabric behavior
• Detect and report problems automatically, suggest corrective actions

Operate
• Centralize and simplify fabric and devices operation, show summarized and analyzed information, and conduct group operations

Optimize Fabric performance and utilization
• Apply optimal routing based on application requirements, topology, and load
• Most mature and optimized routing algorithms
• Manage and visualize congestion and QoS

Provision and Automate
• Expose the entire functionality via an extensible API, used for 3rd party integration or for automation and scripting
• Provide fabric and I/O partitioning, and application specific QoS


Fabric Logical vs. Physical layers

[Figure: layered view of the fabric. Labels: Application Layer (Application A, Application B, Application C); Logical Layer (Fabric Policy, Monitoring); Physical & topology Layer]


UFM Architecture

[Figure: UFM Server block diagram. Components: CLI, GUI (Java), Web Services, IB-SM (OpenSM), Perf Mng Providers, Device Mng Providers, SQL DB, HA Daemon, User and Application Interfaces, Access Control, Plug-ins]

Callouts:
• Central administration of multiple switches (or hosts)
• Hierarchical performance monitoring, variety of sources
• Leverage open source SM engine
• Transparent fail-over
• Fast retrieval, historical data
• Manage complex relations and workflows
• Policy and role based access control
• Convenient access to fabric data


High Availability Throughout the Fabric

Seamless Subnet Manager failover
Synchronization mechanism of SM and UFM Database
Virtual IP for seamless failover of user interfaces

[Figure: Active UFM Server and Standby UFM Server, linked by Synchronization and Heartbeat]
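The heartbeat between the active and standby servers follows a common pattern: the standby promotes itself when heartbeats stop arriving for longer than a timeout. The sketch below illustrates that generic pattern only. It does not describe UFM's actual HA daemon, and the class name, the timeout, and the role strings are hypothetical.

```python
import time


class StandbyMonitor:
    """Promote the standby when no heartbeat arrives within `timeout_s`."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_beat = clock()
        self.role = "standby"

    def on_heartbeat(self):
        """Record a heartbeat received from the active server."""
        self.last_beat = self.clock()

    def poll(self):
        """Check the deadline; switch role to 'active' if it has passed."""
        overdue = self.clock() - self.last_beat > self.timeout_s
        if self.role == "standby" and overdue:
            self.role = "active"
        return self.role
```

In a real deployment the promotion step would also have to take over the virtual IP and start from the synchronized database, which this sketch leaves out.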


Fabric Optimization Cycle with UFM

[Figure: optimization cycle. Labels: Application Requirements, Optional Orchestrators & Schedulers, UFM Optimization, UFM Monitoring]

Characterize traffic pattern and priorities
Fabric virtualization and QoS
Optimize routing and job placement
Show traffic and congestion information
Feedback and Analysis


UFM Installation Procedures

Read UFM release notes
Obtain a License
Download the UFM Software
Software Installation Prerequisites
Installing UFM Stand alone
Installing UFM with High Availability
Initial Configuration
• Software Activation
• Update UFM Configuration files
Running the UFM Software
• Launch UFM GUI
Optional Software Components
• UFM Agent Installation Prerequisites
• Installing UFM Agent Software
• Running the UFM Agent Software

[Flow diagram labels: Download Software; License; SW Install Prerequisites; Install SA Software; SW Activation; Running SW; Initial Configuration; Define HTTP/S, Browser, Java; Launch GUI Session]


Obtaining the license

1. Go to Voltaire's Licensing and Download Portal:
   a. http://license.voltaire.com/LMManage/login.aspx

2. Log in as specified in the licensing email you received.

3. If you did not receive your Voltaire Licensing and Download Portal login information, contact your product reseller.
   a. If you purchased UFM directly from Voltaire and you did not receive the login information, contact [email protected].

4. Click the License tab. The list of software product serial licenses you own is displayed, as well as software product license information and status.

5. Select the serial number of the product license you want to activate.


Launching UFM GUI session

1. To launch a UFM GUI session, browse to http://<UFM_server_IP>/ufmui or https://<UFM_server_IP>/ufmui

2. On the UFM Welcome Page, click Launch Unified Fabric Manager.

3. In the Login Window, enter the user name (default: admin) and the user password (default: 123456).

4. Click OK.

Once you have entered your user name and password, the main window opens, showing the UFM Dashboard.


UFM Agent

Performs discovery of:
• Host IP address, CPU, memory, and other parameters

Statistics collection of:
• Host CPU, memory, disk performance, port counters

Remote upgrade of the HCA firmware and OFED

IP interface creation per InfiniBand partition


Task oriented top menu

Dashboard - Fabric central
Design - Application correlation
View - Discovery & topology context
Manage Devices - Sortable online views
Config - Event management policy
Monitor - Real-time Monitoring
Logs - Online search


Central Dashboard Fabric Central

Alarms Capacity Map

Application “share” of resources

Root Cause Immediate suspect list

Top loaded servers

Oversubscribed Ports

The Entire Fabric in the Palm of Your Hand


Design


Create a Logical Network


Create a Logical Server Group


View


Manage Devices


Config


TARA Example - UFM Reduces Congestion

Before: High ingress congestion, low bandwidth (high latency)

After: NO congestion, high bandwidth (low latency)

UFM Enables You to Maximize Hardware Utilization


Monitor


Advanced Monitoring Engine

Multiple sessions on demand
Aggregation per Logical Groups – no need to know physical nodes
Aggregation per Multiple Devices
Various graph options (Linear, Bar, Histogram, Pie Chart)
Correlate switch and host info
Formulas (AVG, Max, SUM, Min)
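Aggregation per logical group amounts to applying one of the listed formulas to the samples of the nodes in each group, so that the user works with group names rather than physical nodes. The sketch below illustrates the idea with plain dictionaries. It is not UFM's API: the function name, the data layout, and the node and group names are assumptions.

```python
from statistics import mean

# The four formulas named on the slide, mapped to Python callables.
FORMULAS = {"AVG": mean, "MAX": max, "SUM": sum, "MIN": min}


def aggregate(samples, groups, formula):
    """Apply one formula to each logical group's per-node samples."""
    fn = FORMULAS[formula.upper()]
    result = {}
    for group, nodes in groups.items():
        # Nodes without a sample are skipped rather than counted as zero.
        values = [samples[n] for n in nodes if n in samples]
        if values:
            result[group] = fn(values)
    return result
```

Given `samples = {"n1": 10, "n2": 30}` and `groups = {"db": ["n1", "n2"]}`, the AVG formula yields 20 for the `db` group and SUM yields 40.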


Log


The Origin of GPUDirect

The GPUDirect project was announced Nov 2009
• "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks", http://www.nvidia.com/object/io_1258539409179.html

GPUDirect was developed together by Mellanox and NVIDIA
• New interface (API) within the Tesla GPU driver
• New interface within the Mellanox InfiniBand drivers
• Linux kernel modification to allow direct communication between drivers

GPUDirect availability was announced May 2010
• "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
• "Mellanox was the lead partner in the development of NVIDIA GPUDirect"


GPUDirect – The Interface Between the GPU and the Network

[Figure: Transmit and Receive data paths among CPU, chip set, GPU, GPU memory, system memory, and InfiniBand. The first pair of diagrams shows two steps (1, 2) through system memory; the pair labeled GPUDirect shows a single step (1).]


GPUDirect – Application Performance

LAMMPS
• 3 nodes, 10% gain

Amber – Cellulose
• 8 nodes, 32% gain

Amber – FactorX
• 8 nodes, 27% gain

[Chart panels: 3 nodes, 1 GPU per node; 3 nodes, 3 GPUs per node]


rCUDA – GPU as a Service

[Figure: "Servers with GPUs" (each CPU paired with local GPUs) compared with "GPU as a Service" (CPUs with virtual GPUs, VGPU, backed by a separate pool of GPUs)]

PCIe-equivalent performance

• 56Gb/s bandwidth

• 0.7usec latency

RDMA dwarfs overhead

• Maintains local access model

• Supports memory management

Independent GPU management

• GPU as network-resident service
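The bandwidth and latency figures above can be combined into a first-order estimate of how long a remote GPU transfer takes: a fixed latency plus the time to serialize the payload onto the link. The sketch below is a rough model only. It treats the quoted 56Gb/s as usable bandwidth and ignores encoding, protocol, and software overhead, so real transfers will be slower. The function name and defaults are illustrative.

```python
def transfer_time_us(size_bytes, bandwidth_gbps=56.0, latency_us=0.7):
    """Estimate one-way transfer time: fixed latency plus serialization time."""
    serialization_us = size_bytes * 8 / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + serialization_us
```

Under this model the fixed 0.7 µs latency dominates only for payloads of a few kilobytes or less; for large buffers the transfer time is set almost entirely by bandwidth.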


Thank You