TRANSCRIPT
Mellanox End to End Solution And InfiniBand Fabric Application introduction
Liu Bo, Mellanox System Engineer
September 19, 2012
Leading provider of high-throughput, low-latency server and storage interconnect • FDR 56Gb/s InfiniBand and 10/40GbE • Reduces application wait-time for data • Dramatically increases ROI on data center infrastructure
Company headquarters:
• Yokneam, Israel; Sunnyvale, California • ~950 employees; worldwide
Solid financial position
• Record revenue in FY’11; $259.3M • Q2’12 revenue = $133.5M; up 110.7% Y-o-Y • Q3’12 guidance ~$150.0M to $155.0M • Cash + investments @ 6/30/12 = $327.8M
Company Overview Ticker: MLNX
Fortune 100 Penetration:
• 5 of top 10 Global Banks
• 10 of top 10 Automotive Manufacturers
• 5 of top 10 Pharmaceutical Companies
• 9 of top 10 Oil and Gas Companies
Positioned to Capture the “Big Data” Opportunity
High-Performance Computing
Proliferation of Data will be a Catalyst for Growth
The Digital Universe (Source: IDC Digital Universe study, June 2011): 1.8 zettabytes in 2011, growing to 7.9 zettabytes by 2015 (2011 – 2015 CAGR = 45%). Drivers: Web 2.0, Cloud Computing, Enterprise Data Center.
Markets that Require Fast Interconnect
Web 2.0 DB/Enterprise HPC
Up to 10X Performance and Simulation Runtime
33% Higher GPU Performance
Unlimited Scalability
Lowest Latency
62% Better Execution Time
42% Faster Messages Per Second
Financial Services Cloud
12X More Throughput
Support More Users at Higher Bandwidth
Improve and Guarantee SLAs
10X Database Query Performance
4X Faster VM Migration
More VMs per Server and More Bandwidth per VM
Storage
2X Hadoop Performance
13X Memcached Performance
4X Price/Perf
Mellanox storage acceleration software provides >80% more IOPS (I/O operations per second)
Mellanox Interconnect Products Enable Customer Choice
10Gb/s, 40Gb/s, 56Gb/s Ethernet and InfiniBand
Application Acceleration
• Big Data • Storage • TCP/UDP
Adapters Switches Software
• Database • Trading • HPC
Virtual Protocol Interconnect (VPI) Provides High Performance Over any Converged Interconnect with Same Software Infrastructure
Cables
10Gb/s, 40Gb/s, 56Gb/s Ethernet and InfiniBand
Host/Fabric Software
Leading Supplier of End-to-End Connectivity Solutions for Servers and Storage
Virtual Protocol Interconnect
Storage Front / Back-End Server / Compute Switch / Gateway
56G IB & FCoIB 56G InfiniBand
10/40GigE & FCoE 10/40GigE
Industry's Only End-to-End InfiniBand and Ethernet Portfolio
ICs Switches/Gateways Adapter Cards Cables
Fibre Channel
Virtual Protocol Interconnect
Mellanox Multi-Protocol/VPI Connectivity Technology Efficient, Flexible and Scalable for Maximum ROI
Applications Transparency enables Data Center Agility
Application domains: Financial (trading, analytics), Cloud Computing, Cloud & Web 2.0, Clustered Database, Web Services, HPC Applications, Database Apps, CRM/ERP Apps, Business Logic (.NET, JAVA), Web 2.0 Storage
Mellanox VPI Connectivity Solution
Service-Oriented Cloud Infrastructure
Storage NFS, CIFS, iSCSI
NFS-RDMA, SRP, iSER, Fibre Channel, Clustered
Networking TCP/IP/UDP
Sockets
Clustering MPI, DAPL, RDS, Sockets
Management SNMP, SMI-S
OpenView, Tivoli, BMC, Computer Associates
App1 App2 App3 App4 AppX …
Acceleration Engines
Protocols
Applications
Networking Virtualization Clustering Storage RDMA
Ethernet FC
NAS
InfiniBand
IB Storage
Storage Storage LAN SAN
Converged Fabric 40Gig Ethernet 56Gig InfiniBand
Fabric
Running Any Protocol over Any Convergence Fabric
Mellanox Advanced InfiniBand Solutions
- Collectives Accelerations (FCA/CORE-Direct) - GPU Accelerations (GPUDirect) - MPI/SHMEM - RDMA - Quality of Service
- Adaptive Routing - Congestion Management - Traffic aware Routing (TARA)
- UFM, FabricIT - Integration with job schedulers - Inbox Drivers
Server and Storage High-Speed Connectivity
Networking Efficiency/Scalability
Application Accelerations
Host/Fabric Software Management
- Latency - Bandwidth
- CPU Utilization - Message rate
Cables
648 ports
324 ports
216 ports 108 ports
20 and 40Gb/s Modular Switches
Switch Silicon
Gateway Silicon
Software
Adapters
Adapter Cards Adapter Silicon
Systems
InfiniBand Switches Gateway
IS5025, IS5030, IS5031, IS5035
IS5100 IS5200 IS5300 IS5600 BridgeX® BX5020 InfiniScale IV
ConnectX® -2 InfiniBand
BridgeX ® VPI
ConnectX®-2 InfiniBand Dual Port
QSFP w/ PCIe 2.0
ConnectX®-2 VPI QSFP IB and SFP+
10GigE
ConnectX®-2 Ethernet Dual Port SFP+ w/ PCIe
2.0
ConnectX®-2 InfiniBand Single Port
QSFP w/ PCIe 2.0
Delivering Unified I/O - 10/20/40G InfiniBand to 10GigE and 1/2/4/8G FC
36-port 40Gb/s Switch Systems
Fabric Management
Fabric IT
IS5022
8-port Non-blocking Remotely-managed
40Gb/s Switch System
Mellanox Product Line
4036SM
Comprehensive End-to-End Ethernet Product Portfolio
SX6536 - 648p
Vantage 6024 – 24p
NICs
Cables
SX1036 - 36p
SX6518 - 324p
SX6512 - 216p SX1016 – 64P
Management software
Switches SX1024 48x10G + 12x40G
IS5022 IS5023 IS5024 IS5025 IS5030 IS5035 4036
Ports 8 18 36 36 36 36 36
Switch Capacity 640Gb/s 1.44Tb/s 2.88Tb/s 2.88Tb/s 2.88Tb/s 2.88Tb/s 2.88Tb/s
Management - - - - Chassis management
Full management
Full management
- - - - SM 108 nodes
SM 648 nodes
SM 648 nodes
In-band FW update
In-band FW update
In-band FW update
In-band FW update
FabricITTM
UFMTM FabricITTM
UFMTM UFMTM
CPU - - - - PPC405 PPC460 PPC460
Management Eth Ports - - - - 1 2 1
Management USB port - - - - Yes Yes Yes
LEDs Status,UnitIDFan, PortErr
Status,UnitIDFan, PortErr
Status,UnitIDFan, PortErr
Status,Fan, PS1, PS2
Status,Fan, PS1, PS2
Status,Fan, PS1, PS2
Info, Fan, PS, SM
Design No FRUs No FRUs No FRUs Fan/PS FRU Fan/PS FRU Fan/PS FRU Fan/PS FRU
AC power inlet location Connector side panel
Connector side panel
Connector side panel
P/S side panel
P/S side panel
P/S side panel
P/S side panel
# of Power Supplies 1 1 1 1 (2nd optional)
1 (2nd optional)
1 (2nd optional) 2
InfiniBand QDR Edge Switch Comparison
InfiniBand QDR Director Switch Comparison
IS5100 IS5200 IS5300 IS5600 4200 4700
Ports 108 216 324 648 144/162 324/648
Height (Shelf will add 1U-2U) 6U 9U 16U 29U 11U 19U
Switch Capacity 8.64Tb/s 17.28Tb/s 25.9Tb/s 51.8Tb/s 11.52Tb/s 25.92Tb/s
Spine modules 3 6 9 18 4 9
Leaf modules (max) 6 12 18 36 8 18
Management FabricITTM
UFMTM (May’11)
FabricITTM
UFMTM (May’11)
FabricITTM
UFMTM (May’11)
FabricITTM
UFMTM (May’11) UFMTM UFMTM
SM 648 nodes
SM 648 nodes
SM 648 nodes
SM 648 nodes
SM 648 nodes
SM 648 nodes
Power Supplies (Hot swappable, redundant) 2 + 1 3 + 1 4 + 2 8 + 2 Up to 4
(N + N) Up to 6 (N + N)
Fans (Hot swappable, redundant)
4 Type4 chassis 1 Type3 per spine
4 Type4 chassis 1 Type3 per spine
2 Type1 + 2 Type2 1 Type3 per spine
4 Type1 + 4 Type2 1 Type3 per spine
One Horizontal One Vertical
One Horizontal One Vertical
SX6025 SX6036 SX6506 SX6512 SX6518 SX6536
Ports 36 36 108 216 324 648
Switch Capacity 4.032Tb/s 4.032Tb/s
Height (Shelf will add 1U-2U) 1U 1U 6U 9U 16U 29U
Spine modules - - 3 6 9 18
Leaf modules (max) - - 6 12 18 36
Management - SM 648 nodes
SM 648 nodes
SM 648 nodes
SM 648 nodes
SM 648 nodes
In-band FW update
MX-OSTM
UFMTM MX-OSTM
UFMTM MX-OSTM
UFMTM MX-OSTM
UFMTM MX-OSTM
UFMTM
CPU - PPC460 PPC460 PPC460 PPC460 PPC460
LEDs Status,Fan, PS1, PS2
PortERR, UnitID
Status,Fan, PS1, PS2
PortERR, UnitID
IS5X00 like + PortERR, UnitID
IS5X00 like + PortERR, UnitID
IS5X00 like + PortERR, UnitID
IS5X00 like + PortERR, UnitID
# of Power Supplies 1 (2nd optional)
1 (2nd optional) 2 + 1 3 + 1 4 + 2 8 + 2
New features VPI, FEC,
power governor, IB router
VPI, FEC, power governor,
IB router
VPI, FEC, power governor,
IB router
VPI, FEC, power governor,
IB router
VPI, FEC, power governor,
IB router
VPI, FEC, power governor,
IB router
FDR Switch Comparison
Mellanox Grid Director 4036E Physical Specs
Based on the Mellanox Grid Director 4036 plus IB-ETH bridging silicon. 34 x QDR/DDR/SDR (auto-negotiating) InfiniBand ports (QSFP). 2 x 1/10GbE ports (SFP+). Redundant management module. Shares FRUs with the 4036:
• Rail kits, PS, Fan units
Measurements: • 19''/1U high / 21'' deep
Mellanox IB-ETH Bridge Silicon
2 x 1/10GbE SFP+
34 x 40Gb/s IB QSFP
1U High
IB Gateway System: BX5020
Dual hot-swappable redundant power supplies Replaceable fan drawer Embedded management
• PowerPC CPU, GigE and RS232 out-of-band management port
• Uplinks: 4 QSFP (IB) • Downlinks: 16 SFP+
• 12 1/10 GigE combination EN ports • Requires CX/CX2 HCAs
InfiniBand Foundations
InfiniBand Trade Association (IBTA)
Founded in 1999
Actively markets and promotes InfiniBand from an industry perspective through public relations engagements, developer conferences and workshops
Steering Committee Members:
InfiniBand is a Switch Fabric Architecture
► Interconnect technology connecting CPUs and I/O ► Super high performance
High bandwidth (starting at 10Gbps and up to 60Gbps) – Lots of head room!
Low latency – Fast application response across the cluster. Low CPU utilization with RDMA (Remote Direct Memory Access) – unlike Ethernet, communication bypasses the OS and the CPUs.
► Increased application performance ► Single port solution for all LAN, SAN, and application communication ► Highly reliable Subnet Manager with redundancy ► InfiniBand is a technology that was designed for large scale grids and clusters
First industry standard high speed interconnect!
InfiniBand Roadmap
SDR - Single Data Rate DDR - Double Data Rate QDR - Quad Data Rate FDR - Fourteen Data Rate EDR - Enhanced Data Rate HDR - High Data Rate NDR - Next Data Rate
InfiniBand Resources
InfiniBand software is developed under OpenFabrics Open Source Alliance
http://www.openfabrics.org/index.html InfiniBand standard is developed by the InfiniBand Trade
Association http://www.infinibandta.org/home
Industry standard defined by the InfiniBand Trade Association
• Originated in 1999
The InfiniBand™ specification defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems.
InfiniBand is a pervasive, low-latency, high-bandwidth interconnect which requires low processing overhead and is ideal to carry multiple traffic types (clustering, communications, storage, management) over a single connection.
As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications that scale from small scale to large scale.
What is InfiniBand?
Source: InfiniBand® Trade Association (IBTA) www.infinibandta.org
InfiniBand Components Overview
Host Channel Adapter (HCA) • Device that terminates an IB link, executes transport-level functions and supports the verbs interface
Switch
• A device that routes packets from one link to another of the same IB Subnet
Router
• A device that transports packets between different IBA subnets
Bridge • InfiniBand to Ethernet
[Diagram: an InfiniBand subnet - processor nodes with HCAs, interconnected switches, a Subnet Manager, a storage subsystem (RAID), consoles, and gateways to Ethernet and Fibre Channel.]
Host Channel Adapters (HCA)
Equivalent to a NIC (Ethernet) - GUID (Global Unique ID = MAC)
Converts PCI to InfiniBand CPU offload of transport operations End-to-end QoS and congestion control Communicate via Queue Pairs (QPs) HCA Options:
• Single Data Rate (SDR): 2.5Gb/s * 4 = 10Gb/s • Double Data Rate (DDR): 5Gb/s * 4 = 20Gb/s • Quad Data Rate (QDR): 10Gb/s * 4 = 40Gb/s • Fourteen Data Rate (FDR): 14Gb/s * 4 = 56Gb/s
HCA Physical Address: Global Unique Identifier (GUID), 64 bits
Host Channel Adapters (HCAs) & all switches require GUID & LID addresses
3 types of GUIDs per ASIC:
- Node = identifies the HCA as an entity
- Port = identifies the port as a port
- System = allows combining multiple GUIDs, creating one entity
Global Unique Identifier - "like an Ethernet MAC address"
- Assigned by the IB vendor
- Persistent through reboots
IB Fabric L2 Switching Addressing Local Identifier (LID)
Local Identifier “Like a dynamic IP address”
LID - 16 bit
Host Channel Adapters (HCAs) & switches all require GUID & LID addresses
• Assigned by the SM when the port becomes active
• Not persistent through reboots • Address ranges:
0x0000 = reserved 0x0001 – 0xBFFF = unicast 0xC001 – 0xFFFE = multicast 0xFFFF = reserved for special use
Node & Switch Main identifiers
IB Port Basic Identifiers • Port number • Host Channel Adapter – HCA (the IB "NIC") • Globally unique ID – GUID, 64 bit (like a MAC address),
e.g. 00:02:C9:02:00:41:38:30 - Each 36-port "basic" switch has its own switch & system GUID - All ports belonging to the same "basic" switch share the switch GUID
• Local Identifier - LID • Virtual Lane – VL: used to separate traffic of different bandwidth & QoS classes over the same physical port
LID • A Local Identifier assigned to every IB device by the SM and used for packet routing within an IB fabric
• All ports of the same ASIC use the same LID
[Diagram: a host HCA and switches, each with its own GUID (e.g. 00:02:C9:02:00:41:27:12 for a switch, 00:02:C9:02:00:41:38:35 for an HCA port) and an SM-assigned LID per ASIC (e.g. LIDs 8, 12, 14, 37); on each InfiniBand link, traffic packets use VLs 0-7 and link control uses VL-15, multiplexed/de-multiplexed onto the physical port.]
Partitioning - Pkey to VLAN mapping
Define up to 64 partitions by mapping a port and Ethernet VLAN to an InfiniBand PKey
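To make the PKey side of such a mapping concrete, the sketch below shows how partitions are commonly declared in OpenSM's partitions.conf; this is a hedged example, and the partition names, PKey values and port GUID are placeholders, not values from this deck.
# Hypothetical /etc/opensm/partitions.conf entries (names, PKeys, GUID are examples)
Default=0x7fff, ipoib : ALL=full ;
vlan10=0x800a, ipoib : ALL=limited, 0x0002c903000e1234=full ;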
Packet Flow QOS Management
Low-priority and high-priority VL arbiters
Criteria types:
• Groups of source ports
• Groups of destination ports
• Partitions
• QoS classes
• Application Service IDs
[Diagram: packets from fabric nodes/users are categorized by the criteria above into service levels (e.g. SL 0-3 default, SL 4 GPFS, SL 6 IPoIB, SL 7 SDP, SL 8 RDS, SL 10 native multicast, SL 12 clock sync) and mapped onto virtual lanes VL-0 through VL-5 with arbitration weights (32 or 64) over the physical link; example traffic classes include health, bonds, private, stocks, and government.]
Introducing
Subnet Manager (SM)
Subnet Manager (SM) Rules & Roles
Every subnet must have at least one - Manages all elements in the IB fabric - Discover subnet topology - Assign LIDs to devices - Calculate and program switch chip forwarding tables (LFT pathing) - Monitor changes in subnet
Implemented anywhere in the fabric - Node, Switch, Specialized device
No more than one active SM allowed - 1 Active (Master) and remaining are Standby (HA)
Subnet Administrator (SA)
The SA is typically an extension of the SM A passive entity that provides a database
- Subnet topology - Device types - Device characteristics
Responds to queries
- Paths between HCAs - Event notification - Persistent information - Switch forwarding tables
Used to keep multiple SMs in sync
InfiniBand Switch Operation
InfiniBand packets are ‘destination routed’ based on the Destination Logical ID (DLID) field in the header DLID is 16 bit address
- 48K values are used for unicast - 16K values are used for multicast
At each switch ASIC, the incoming unicast DLID is used as
an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number - E.g. the InfiniScale III ASIC supports all 48K possible LFT entries
Subnet Management
Each subnet must have a Subnet Manager (SM); additional SMs act as standbys.
Every entity (CA, switch, router) must support a Subnet Management Agent (SMA).
The SM performs topology discovery and fabric maintenance.
Multipathing: the LID Mask Control (LMC) lets a port own multiple LIDs (e.g. LMC = 1 gives LIDs 6 and 7).
[Diagram: a subnet of hosts (CPU, system memory, HCA with SMA) and IB switches, with one master Subnet Manager and several standby SMs.]
OpenSM (osm) is an InfiniBand-compliant subnet manager, included in the Linux OpenFabrics Enterprise Distribution. Several instances of osm can be run on the cluster in a Master/Slave(s) configuration for redundancy.
Partitions (p-key) support
QoS support
Enhanced routing algorithms:
• Min-hop • Up-down • Fat-tree • LASH • DOR
OpenSM - Features
Management Model
QP1 (virtualized per port) Uses any VL except 15 MADs called GMPs - LID-Routed Subject to Flow Control
Baseboard Management Agent
Communication Mgmt (Mgr/Agent)
Performance Management Agent
Device Management Agent
Vendor-Specific Agent
Application-Specific Agent
SNMP Tunneling Agent
Subnet Administration (an Agent)
General Service Interface
Subnet Manager (SM) Agent Subnet Manager
Subnet Management Interface
QP0 (virtualized per port) Always uses VL15 MADs called SMPs – LID or Direct-Routed No Flow Control
Pure InfiniBand Management Other Management Features
Command line • Default (no parameters)
Scans and initializes the IB fabric and occasionally sweeps for changes • opensm -h for usage flags
E.g. to start with up-down routing: opensm --routing_engine updn • The run is logged to two files:
- /var/log/messages – opensm messages, registers only general major events - /var/log/opensm.log - details of reported errors
Start on boot
• As a daemon: - /etc/init.d/opensmd start|stop|restart|status - /etc/opensm.conf for default parameters
# ONBOOT # To start OpenSM automatically set ONBOOT=yes ONBOOT=yes
SM detection • /etc/init.d/opensmd status
- Shows opensm runtime status on a machine • sminfo
- Shows the master and standby SMs running on the cluster
Running OpenSm
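As a minimal sketch of the above, assuming the OFED init scripts are installed (the routing-engine choice is only an example):
# start the SM as a daemon and confirm which SM is master
/etc/init.d/opensmd start
/etc/init.d/opensmd status        # should report that opensm is running
sminfo                            # shows the master SM LID, GUID, priority and state
# alternatively, run opensm in the foreground with a non-default routing engine
opensm --routing_engine updn
tail /var/log/opensm.log          # detailed errors are reported here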
A few important command line parameters: -c, --cache-options. Write out a list of all tunable OpenSM parameters, including
their current values from the command line as well as defaults for others, into the file /var/cache/opensm. This file can then be modified to change OSM parameters, such as HOQ (Head of Queue timer).
-g, --guid This option specifies the local port GUID value with which OpenSM
should bind. OpenSM may be bound to 1 port at a time. This option is used if the SM needs to bind to Port 2 of an HCA.
-R, --routing_engine This option chooses routing engine instead of Min Hop
algorithm (default). Supported engines: updn, file, ftree, lash -x, --honor_guid2lid. This option forces OpenSM to honor the guid2lid file, when
it comes out of Standby state, if such file exists under /var/cache/opensm -V This option sets the maximum verbosity level and forces log flushing.
OpenSM Command Line parameters
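A hedged illustration of the flags described above (the port GUID value is a placeholder):
# dump all tunable parameters to /var/cache/opensm, edit the file, then re-run
opensm -c
# bind OpenSM to a specific local port GUID (placeholder GUID) and honor guid2lid
opensm -g 0x0002c903000e1234 -x
# select the fat-tree routing engine with maximum verbosity
opensm -R ftree -V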
Min Hop algorithm (DEFAULT) • Based on the minimum hops to each node where the path length is optimized. UPDN unicast routing algorithm
• Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet. - Root GUID list file can be specified using the –a option
Fat Tree unicast routing algorithm
• This algorithm optimizes routing for a congestion-free “shift” communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules. - Root GUID list file can be specified using the –a option
Additional algorithms • LASH - Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing. • DOR - Provides deadlock-free routes for hypercube and mesh clusters • Table Based - A file method which can load routes from a table.
Routing Algorithms
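A short sketch of selecting the routing engines described above from the command line; the root-GUID file path and its contents are illustrative assumptions.
# up/down routing with an explicit list of root (spine) node GUIDs
cat > /etc/opensm/root_guids.conf <<EOF
0x0002c902004685f8
0x0002c90200468610
EOF
opensm -R updn -a /etc/opensm/root_guids.conf
# fat-tree routing on a symmetrical fat-tree fabric
opensm -R ftree -a /etc/opensm/root_guids.conf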
IB Fabric Protocol Layers
Software Transport Verbs and Upper Layer Protocols: - Interface between application programs and hardware - Allows support of legacy protocols such as TCP/IP - Defines methodology for management functions
Transport: - Delivers packets to the appropriate Queue Pair; message assembly/de-assembly, access rights, etc.
Network: - How packets are routed between different partitions/subnets
Data Link (symbols and framing): - Flow control (credit-based); how packets are routed from source to destination on the same partition/subnet
Physical: - Signal levels and frequency; media; connectors
IB Architecture Layers
Distributed Computing using IB
[Diagram: key fields of an IB packet - DLID, VL (defines QoS), PKey, destination QP, hop limit, payload length; transport services include RDMA and reliable delivery.]
IB Packet
Software Stack
Physical Layer – Link Rate
InfiniBand uses serial stream of bits for data transfer Link Speed
• Single Data Rate (SDR) - 2.5Gb/s per lane (10Gb/s for 4x) • Double Data Rate (DDR) - 5Gb/s per lane (20Gb/s for 4x) • Quad Data Rate (QDR) - 10Gb/s per lane (40Gb/s for 4x) • Fourteen Data Rate (FDR) - 14Gb/s per lane (56Gb/s for 4x) • Enhanced Data rate (EDR) - 25Gb/s per lane (100Gb/s for 4x)
Link width • 1x – One differential pair per Tx/Rx • 4x – Four differential pairs per Tx/Rx • 12x - Twelve differential pairs per
Tx and per Rx Link rate
• Multiplication of the link width and link speed • Most common shipping today is 4x ports
[Diagram: a 4X QDR cable carries four 1x links of 10Gb/s each - 40Gb/s transmit and 40Gb/s receive.]
InfiniBand Electrical Interface (Physical Layer Link Rate)
1X Link is the basic building block
- Differential pair of conductors for RX - Differential pair of conductors for TX - Link Rate per type
- Timed at 2.5GHz with SDR - Doubled to 5GHz with DDR - Quadrupled to 10GHz with QDR
[Diagram: a 1x link - one differential pair for TX and one for RX between transmitter and receiver.]
Physical Layer Cont’
Media types • Printed Circuit Board : several inches • Copper: 20m SDR, 10m DDR, 7m QDR • Fiber: 300m SDR, 150m DDR, 100/300m QDR
64/66 encoding on FDR links • Encoding makes it possible to send digital high-speed signals over a longer distance • Every x actual data bits are sent on the line as y bits • 64/66 × 56 ≈ 54.3Gb/s
8/10 bit encoding (SDR, DDR, and QDR) • x/y line efficiency (example: 80% × 40 = 32Gb/s)
Industry standard components • Copper cables / Connectors • Optical cables • Backplane connectors
FR4 PCB
4X CX4
4x CX4 Fiber
4X QSFP Copper
IB Headers
LRH: Local Routing Header – Includes LIDs, SL, etc
BTH: Base Transport Header – includes opcode, destination QP, partition, etc.
[Diagram: IB packet formats - IB headers (LRH, GRH, BTH, DETH) followed by an encapsulation header and payload (e.g. an IP datagram or ARP), then ICRC and VCRC; the LRH belongs to the link layer, the GRH to the network layer, and the BTH/DETH to the transport layer.]
Link Layer Message Flow Example
An incoming message can be up to 2GBytes; the hardware disassembles it into routable units (packets) with a valid size of 256 bytes to 4KBytes.
[Diagram: the application accesses the HW to post a message request (a transaction of one or more messages); the HW schedules execution, disassembles each message into packets, and sends the packets on the serial link.]
Link Layer Priority Implementation SL to VL Mapping
[Diagram: each packet specifies a service level (SL) in its LRH; transactions are broken into messages and packets and sent on the serial link over virtual lanes. Each link in the fabric may support a different number of VLs.]
LRH: Local Routing Header – includes LIDs, SL, etc.
IB Headers: LRH GRH BTH DETH ICRC VCRC
Link Layer Priority Implementation SL to VL Mapping
[Diagram: packets arriving on the physical link specify a service level; the service level is mapped to a virtual lane. Each link in the fabric may support a different number of VLs, and credit-based flow control is applied per VL before data is sent on the serial link.]
Link Layer Message Flow Example
[Diagram: on the receive side, packets arriving on the link are written into HCA input buffers per virtual lane; the HW schedules execution of each message, and the data is written to / read from system memory by the HW, reassembling messages and transactions.]
Arbitration
[Diagram: per-VL transmit arbitration - a de-mux/mux with link control on the transmit side, receive buffers per VL on the receive side, and flow-control credits returned to the transmitter.]
Credit-based link-level flow control • Link Flow control , assures NO packet loss within fabric even in the presence of
congestion • Link Receivers grant packet receive buffer space credits per Virtual Lane • Flow control credits are issued in 64 byte units
Separate flow control per Virtual Lanes provides: • Alleviation of head-of-line blocking
Virtual Fabrics – Congestion and latency on one VL , does not impact traffic with guaranteed QOS on another VL , even though they share the same physical link
Link Layer – Flow Control
InfiniBand Network Stack
User code
Kernel code
Hardware
InfiniBand node
InfiniBand Switch
Legacy node
Application
Network Layer
Link Layer
Physical Layer
Transport Layer
Network Layer
Link Layer
Physical Layer
[Diagram: an InfiniBand node implements application, transport, network, link, and physical layers; an InfiniBand switch relays packets at the link layer, a router relays packets at the network layer, and a legacy node is reached through a router/gateway, with buffering at each relay stage.]
• The kernel code is divided logically into three layers:
• Upper level protocols
• Core InfiniBand modules
• HCA driver(s). The OpenFabrics Enterprise Distribution (Mellanox OFED) is a complete SW stack for RDMA-capable devices.
Mellanox InfiniBand Software Stack
The Kernel Code
Transport Layer: Queue Pairs
• QPs are in pairs (Send/Receive) • Every active connection/session is assigned its own Work Queue Pair • The Work Queue is the consumer/producer interface to the fabric
• The consumer/producer initiates a Work Queue Element (WQE) • The Channel Adapter executes the work request • The Channel Adapter notifies on completion or errors by writing a Completion Queue Element (CQE) to a Completion Queue (CQ)
Transport Layer: Work Request (Work Queue Pair)
Data transfer • Send work request - Local gather – remote write - Remote memory read - Atomic remote operation
• Receive work request - Scatter received data to local buffer(s)
Memory management operations • Bind memory window - Open part of local memory for remote access
• Send & remote invalidate - Close remote window after operations’ completion
Control operations • Memory registration/mapping • Open/close connection (QP)
[Diagram: Host A and Host B, each with send, receive, and completion queues in RAM and an HCA; a send buffer on one host and a receive buffer on the other.]
Transport Layer – Send operation example
[Diagram: Host A and Host B, each with send, receive, and completion queues and an HCA; numbered steps 1-5, with a send buffer on Host A, a receive buffer on Host B, and a "ready to receive" notification.]
• The receive-side application allocates a receive buffer in user-space virtual memory, registers it with the HCA, and places a receive work request on the receive queue.
• The send side allocates a send buffer in user-space virtual memory, registers it with the HCA, and places a send request on the send queue.
• The HCA then executes the send request: it reads the buffer from host RAM and sends it to the remote HCA.
• When the packet arrives at the remote HCA, it executes the receive WQE, places the buffer content in the appropriate location, and generates a completion queue entry.
Transport Layer – RDMA Write Example
[Diagram: Host A and Host B, each with send, receive, and completion queues and an HCA; numbered steps 1-4.]
• The application performs memory registration and passes the address and keys to the remote side; no HCA receive queue is assigned.
• The send side allocates a send buffer in user-space virtual memory, registers it with the HCA, and places a send request on the send queue with the remote side's virtual address and the remote permission key.
• The HCA then executes the send request commands: it reads the buffer, sends it to the remote side, and generates a send completion.
• When the packet arrives at the remote HCA, it checks the address and memory keys and writes to host memory directly; no HCA queues are used.
Transport Services
[Diagram: transport service types with their send/receive QPs and completion queues (CQs), arranged by connected vs. non-connected and reliable vs. unreliable:]
• Unreliable, non-connected: UD (e.g. multicast; not used for RDMA)
• Reliable, non-connected: RD
• Unreliable, connected: UC
• Reliable, connected: RC (e.g. RDMA) and XRC
InfiniBand Fabric Topologies
HOSTS/End Nodes
Leaf/Edge/Line Switches
Spine /Core Switches
Min Hop algorithm (DEFAULT) • Based on the minimum hops to each node where the path length is optimized. UPDN unicast routing algorithm
• Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet. - Root GUID list file can be specified using the –a option
Fat Tree unicast routing algorithm
• This algorithm optimizes routing for a congestion-free “shift” communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules. - Root GUID list file can be specified using the –a option
Additional algorithms • LASH - Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing. • DOR - Provides deadlock-free routes for hypercube and mesh clusters • Table Based - A file method which can load routes from a table.
InfiniBand Route
InfiniBand Topology
Topologies that are mainly in use for large clusters • Fat-Tree • 3D Torus • Mesh
Fat-tree (also known as CBB)
• Flat network, can be set as oversubscribed network or not - In other words, blocking or non blocking
• Typically the lowest latency network
3D Torus
• An oversubscribed network, easier to scale • Fit more applications with locality
[Diagram: a torus arrangement of switches labeled by (x,y) coordinates.]
The IB Fabric Basic Building Block
A single 36-port IB switch chip is the basic building block for every IB switch module. A multi-port switching module is created from multiple chips. In this example a 72-port switch is built from 6 identical chips:
• 4 chips function as lines • 2 chips function as the core
[Diagram: Mellanox 36-port switch ASICs arranged as edge/leaf/line chips and spine/core chips.]
CLOS Topology
Pyramid-shaped topology. The switches at the top of the pyramid are called spines/cores; the core/spine switches interconnect the other switches. The switches at the bottom of the pyramid are called leafs/lines (edge); the leaf/line/edge switches connect to the fabric nodes/hosts.
In a non-blocking CLOS fabric there is an equal number of external and internal connections:
• External connections: the connections between the core and the line switches
• Internal connections: the connections of hosts to the line switches
In a non-blocking fabric there is always balanced bidirectional bandwidth. If the number of internal connections is higher, we have a blocking configuration.
CLOS - 3
The topology detailed here is called CLOS-3; the path between source and destination includes 3 hops. Example - a session between A and B:
• One hop from A to switch L1-1
• Next hop from switch L1-1 to switch L2-1
• Last hop from L2-1 to L1-4
In this example we see a 108-port non-blocking fabric:
• 108 hosts are connected to the line switches
• 108 links connect the line switches to the core switches, enabling non-blocking interconnection of the line switches
[Diagram: CLOS-3 example - leaf switches L1-1 through L1-6 each connect 18 hosts and split their uplinks (9 + 9) across spine switches L2-1 and L2-2; 18 × 6 = 108.]
CLOS - 5
The topology detailed here is called CLOS-5; the path between source and destination includes 5 hops. Example - a session between A and B:
1. One hop from A to switch L1-1 2. Next hop from switch L1-1 to switch L2-1 3. Next hop from L2-1 to L3-1 4. Next hop from L3-1 to L2-4 5. Next hop from L2-4 to L1-8
[Diagram: CLOS-5 example - host A on leaf L1-1 reaches host B on leaf L1-8 through spine L2-1, core L3-1, and spine L2-4.]
InfiniBand Cluster Diagnostics
Cluster utilities
Integrated diagnostic tools • Queries cluster topology and indicates any port errors, link width, or link speed
mismatch.
• Automates calls to many “low level” operations
Easy to use • Similar flags, logs and reports for both tools
• Report using meaningful names when topology file is provided
IB commands list
Determine if driver is loaded
/etc/init.d/openibd status • HCA driver is loaded • Configured devices - Ib0 - Ib1 - OFED modules are loaded ib_ipoib ib_mthca ib_core ib_srp
SM status • sminfo # sminfo sminfo: sm lid 1 sm guid 0x2c9030002cb6a, activity count 416348 priority 0 state 3 SMINFO_MASTER
Ibdiagnet Tool
ibdiagnet is an integrated InfiniBand fabric diagnostics command line tool. It scans the IB fabric using directed-route / LID-routed packets and
extracts the available information regarding its connectivity and device status. It then checks for errors in the following scopes:
• Ports (counter thresholds, port state) • Nodes (firmware versions, LID assignments) • Links (link speed and width, cable info) • Fabric (topology matching, Subnet Manager, routing)
Errors are reported to screen and saved in a log file
ibdiagnet
/usr/bin/ibdiagnet
Use --help or man ibdiagnet for detailed information. The tool runs and prints a short report; the detailed reports are written under /tmp (or as specified with the -o flag).
Common usage (example): run ibdiagnet expecting DDR links (-ls 5) and 4x links (-lw 4x), dumping all files to /tmp/mydir:
ibdiagnet -pm -ls 5 -lw 4x -o /tmp/mydir
Output files:
ibdiagnet.log - A dump of all the application reports generated according to the provided flags
ibdiagnet.lst - List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs - A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks - In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm - List of all the SMs (state and priority) in the fabric
ibdiagnet.pm - A dump of the PM counter values of the fabric links
ibdiagnet.pkey - A dump of the existing partitions and their member host ports
ibdiagnet.mcg - A dump of the multicast groups, their properties and member host ports
ibdiagnet.db - A dump of the internal subnet database; this file can be loaded in later runs using the load_db option
ibdiagnet -i <dev-index> -p <port-num>
• Device index (0..N) and port number connected to the network
-o <out-dir>
• Directory to output the reports to
-lw <1x|4x|12x> -ls <2.5|5|10>
• Link width and speed checked on every port on the network
-pm -pc
• Perform an extensive error-counter check or clear the counters, respectively
-r
• Extensive additional checks performed
-P
• Sets thresholds for error levels; also checks counters based on the absolute value of the error counter. When not using the -P flag, error thresholds are only triggered based on how many errors were incremented DURING the ibdiagnet run.
-c
• Packets to be sent on each link for error-level checking
-h -V -v
• Help, verbosity and revision flags respectively
OFED Tools
Performs InfiniBand fabric diagnostics. Issued on the Linux InfiniBand host. ibdiagnet [-c count][-v][-r][-o outputdir][-t topology][-s system][-i device][-p port][-wt topology][-pm][-pc][-P PM=value][-lw 1x|4x|12x][-ls 2.5|5|10][-skip checks][-load_db file][-h][-V]
Ibdiagnet usage (Fabric Cleaning)
ibdiagnet is particularly useful in finding misconfigured links (speed/width), topology mismatches, and marginal link/cable issues.
Typical usage: • Clear all port counters using 'ibdiagnet -pc' • Stress the cluster • Check the cluster using 'ibdiagnet -lw 4x -ls 5 -P all=1'
- Checks for link speed, link width, and port error counters greater than 1
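A consolidated sketch of that cleaning cycle as it might be run from a host (the output directory is an example):
# 1. clear all port counters to get a clean baseline
ibdiagnet -pc
# 2. stress the cluster (run application or benchmark traffic here)
# 3. re-check: expect DDR speed (-ls 5), 4x width (-lw 4x), flag any counter >= 1
ibdiagnet -lw 4x -ls 5 -P all=1 -o /tmp/cleaning
# 4. inspect the detailed reports
less /tmp/cleaning/ibdiagnet.log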
Cluster utilities - ibnetdiscover
Reports a complete topology of cluster
Shows all interconnect connections reporting: • Port LIDs
• Port GUIDs
• Host names
• Link Speed
GUID to name file can be used for more readable topology in regards to switch devices
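A brief, hedged example of generating a readable topology dump; the node-name-map file and its contents are assumptions for illustration:
# dump the full topology to a file
ibnetdiscover > /tmp/topology.out
# optional: map switch GUIDs to friendly names for readability
cat > /tmp/names.map <<EOF
0x0008f104003f5d15 "edge-sw-01"
EOF
ibnetdiscover --node-name-map /tmp/names.map > /tmp/topology_named.out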
ibnetdiscover
Vendor Id
Device ID
Image Guid
Chip Guid on Fabric board
Line Board Name
Card Slot 4
Box Type 1
Lid 5 common to all port of this chip
Port no. 1 of a chip on this switch Line board
Chip no.1
Box Type 1
Chip Guid on line board
Port no. 7 of a chip on this switch Fabric board (spine 1)
Fabric Board Name
Fabric Board Spine slot 1
Chip no.1 spine 1
Lid 3 common to all port of this chip
Link Current status
HCA Device information
ibstat • displays basic information obtained from the local IB driver. • Normal output includes Firmware version, GUIDS, LID, SMLID, port state, link width active, and port physical state. • Has options to list CAs and/or Ports.
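For illustration, a typical ibstat session (the device name mlx4_0 is an assumption for a ConnectX HCA):
ibstat -l            # list the local CA names
ibstat mlx4_0        # full report: firmware version, GUIDs, ports
ibstat mlx4_0 1      # port 1 only: state, rate, LID, SM LID, physical state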
Determine modules that are loaded
lsmod • ib_core • ib_mthca • ib_mad • ib_sa • ib_cm • ib_uverbs • ib_srp • ib_ipoib modinfo ‘module name’
• List all parameters accepted by the module • Module parameter can be added to /etc/modprobe.conf
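A short sketch of checking and tuning a module; the parameter shown is only an example of the mechanism, not a recommendation:
lsmod | grep ib_                      # confirm the IB modules are loaded
modinfo ib_ipoib                      # list parameters accepted by the module
# persist a module parameter (parameter name is illustrative)
echo "options ib_ipoib send_queue_size=128" >> /etc/modprobe.conf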
iblinkinfo
[root@raven1 ~]# iblinkinfo Switch 0x0008f104003f5d15 ISR2012/ISR2004 Voltaire sLB-2024: LID Port Number 6 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 16[ ] "ISR2004 Voltaire sFB-2004" ( ) 6 2[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 17[ ] "ISR2004 Voltaire sFB-2004" ( ) 6 3[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 18[ ] "ISR2004 Voltaire sFB-2004" ( ) 6 4[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 5[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 6[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 7[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 8[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 9[ ] ==( Down/ Polling)==> [ ] "" ( ) 6 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 16[ ] "ISR2004 Voltaire sFB-2004" ( ) 6 11[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 17[ ] "ISR2004 Voltaire sFB-2004" ( )
Switch 0x0008f104003f5d14 ISR2012/ISR2004 Voltaire sLB-2024: 5 1[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 13[ ] "ISR2004 Voltaire sFB-2004" ( ) 5 2[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7 14[ ] "ISR2004 Voltaire sFB-2004" ( )
5 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 1 15[ ] "ISR2004 Voltaire sFB-2004" ( ) 5 13[13] ==( 4X 5.0 Gbps Active/ LinkUp)==> 8 18[ ] "ISR9024D-M Voltaire" ( ) 5 14[14] ==( Down/ Polling)==> [ ] "" ( ) 5 15[15] ==( 4X 5.0 Gbps Active/ LinkUp)==> 10 1[ ] "raven5 HCA-1" ( ) 5 16[16] ==( Down/ Polling)==> [ ] "" ( ) 5 17[17] ==( 4X 5.0 Gbps Active/ LinkUp)==> 21 34[ ] "Voltaire 4036 # 4036-0036" ( ) 5 18[18] ==( Down/ Polling)==> [ ] "" ( )
ibv_devinfo • Reports similar information to ibstat • Also includes PSID and an extended verbose mode (-v).
OFED Tools
ibportstate Manages the state and link speed of an InfiniBand port. Issued on the Linux InfiniBand host.
OFED Tools
ibportstate [-d][-D][-e][-G][-h][-s smlid][-v][-C ca_name][-Pca_port][-t timeout] lid|dr_path|guid port [op]
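A hedged usage sketch; the LID and port number are placeholders:
ibportstate 6 1 query       # query the state and speed of port 1 on the node with LID 6
ibportstate 6 1 disable     # take the port down
ibportstate 6 1 enable      # bring it back up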
Perfquery Queries InfiniBand port counters. Issued on the Linux InfiniBand host.
OFED Tools
perfquery [-d][-e][-G][-h][-a][-l][-r][-R][-v][-V][-C ca_name][-P ca_port][-t timeout][lid|guid [[port][reset_mask]]]
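As a hedged example (LID and port are placeholders):
perfquery 6 1               # read the port counters of port 1 on LID 6
perfquery -a 6              # aggregate counters over all ports of LID 6
perfquery -R 6 1            # reset the counters after reading them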
Ibhosts Displays host nodes. Issued on the Linux InfiniBand host.
OFED Tools
ibhosts [-h][topology|-C ca_name][-P ca_port][-t timeout]
Ibnodes Displays InfiniBand nodes in topology. Issued on the Linux InfiniBand host.
OFED Tools
ibnodes [-h][topology|-C ca_name][-P ca_port][-t timeout]
Ibswitches Displays InfiniBand switch node in the topology. Issued on the Linux InfiniBand host.
OFED Tools
ibswitches [-h][topology|-C ca_name][-P ca_port][-t timeout]
sminfo Queries the InfiniBand SMInfo attribute. Issued on the Linux InfiniBand host.
OFED Tools
sminfo [-d][-e] -s state -p priority -a activity [-D][-G][-h][-V][-C ca_name][-P ca_port][-t timeout] smlid|smdr_path
Clear counter and error report
ibclearcounters # ibclearcounters # Summary: 74 nodes cleared 0 errors
ibclearerrors # ibclearerrors # Summary: 5 nodes cleared 0 errors
Performance tests
Run performance tests • /usr/bin/ib_write_bw • /usr/bin/ib_write_lat • /usr/bin/ib_read_bw • /usr/bin/ib_read_lat • /usr/bin/ib_send_bw • /usr/bin/ib_send_lat
Usage • Server: <test name> <options> • Client: <test name> <options> <server IP address>
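A minimal example run of one of these tests; the server's IP address and the message size are placeholders:
# on the server node
ib_write_bw
# on the client node, pointing at the server's IP
ib_write_bw 192.168.1.10
# latency variant with an explicit message size
ib_send_lat -s 64 192.168.1.10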
Ib_write_bw
Fabric Cleaning & Debug
Troubleshooting (Cont)
1. Be sure your switch and hosts are powered on.
2. Be sure cables are plugged in properly.
3. Check that the SM is running.
4. Log in to the master switch CLI:
- Run the command sm-info show and make sure that sm mode is enabled and sm state is master
- Run the command sm-info show a few times and make sure the sm activity counter is progressing
- In case the sm mode is disabled, enable it by typing the sm sm-info mode set enable command
- In case the sm state is not master, it means that another switch or node in the fabric is running another SM that may be the master
Fabric Troubleshooting using IB Tools (Host)
1) Run ibdiagnet command to see if any errors are being reported on the fabric.
2) If errors are detected, run ibclearcounters to clear all counters
3) Run ibclearerrors to clear all reported errors (this creates a clean baseline)
4) Run ibdiagnet again, if no errors are reported run some traffic and re-check. If errors are reported view system logs, isolate and take corrective action.
5) Re-run through steps 1-4 until error free
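The same steps, written as a hedged host-side command sequence:
# step 1: look for fabric-wide errors
ibdiagnet
# steps 2-3: establish a clean baseline
ibclearcounters
ibclearerrors
# step 4: run traffic, then re-check and review the system logs
ibdiagnet
grep -i err /var/log/messages | tail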
Description of Port Counter Fields
Item Description Platform Type ISR9288 ISR9024, HCA400
ModuleType GER-400 – Ethernet to InfiniBand router FCR-400 – Fiber channel to InfiniBand router sLB24 – ISR9288 line board sFB12 – ISR9288 Fabric board System –ISR9024, HCA400
ModuleIndex The number of the module where the port exists. 0 for HCA or ISR9024
Port External port number or “internal#” for internal ports
Name Node name where possible, otherwise the nodes GUID .
NodeIP The IP address associated with this node where possible, otherwise 0.0.0.0
DeviceID Mellanox product device ID
MLID(#JoinedGroups) The multicast group ID that this port belongs to
PeerLID The remote connected Ports LID
PeerIBPort The remote connected Ports internal Port number
PeerPortGUID The remote connected Ports GUID
PeerPlatformType The remote connected Ports platform (ISR9288 ISR9024, HCA400)
PeerName The remote connected Ports node name (node name or GUID)
PeerModuleType This remote connected Ports Module Type (see ModuleType)
PeerModuleIndex The number of the module where the remote connected Port exists (see ModuleIndex)
PeerPort The external port number or “internal#” for internal ports where the remote connection exists.
Status OK or ALERT (port counter exceed threshold, 1X link, etc…)
Description of Port Error Counters
Counter Description Importance
SymbolErrorCounter Total number of symbol errors received on one or more lanes. This counter can increase without indicating a significant problem.
LinkErrorRecoveryCounter Total number of times the Port Training state machine has successfully completed the link error recovery process.
If SymbolErrors are increasing quickly AND this counter is increasing, it may be indicating a bad link
LinkDownedCounter Total number of times the Port Training state machine has failed the link error recovery process and downed the link
This counter is typically a true indication of the number of times the port has gone down (usually for valid reasons)
PortRcvErrors Total number of packets containing an error that were received on a port. These errors include:
- Local physical errors (CRC, VCRC, FCCRC and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)
- Malformed data packet errors - Malformed link packet errors - Packets discarded due to buffer overrun
This counter should not be increasing and a constantly increasing number probably indicates a bad link.
PortRcvRemotePhysicalErrors Total number of packets marked with the EBP delimiter received on the port.
This indicates that a problem is occurring ELSEWHERE in the fabric and that this port received a packet that was intentionally corrupted by another switch in the fabric.
PortRcvSwitchRelayErrors Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. Reasons for this include:
DLID mapping VL mapping Looping (output port = input port)
This counter can increase due to valid event occurring in the network.
Description of Port Error Counters
Counter Description Importance PortXmitDiscards Total number of outbound packets discarded by the port
because the port is down or congested. Reasons for this include:
Output port is in the inactive state Packet length exceeded neighbor MTU Switch lifetime limit exceeded Switch HOQ limit exceeded
Typically will not increase. If it is, may be an indicator that HOQ or other parameter should be tweaked. Please contact Mellanox Customer Support.
PortXmitConstraintErrors Total number of packets not transmitted from the port for the following reasons:
FilterRawOutbound is true and packet is raw PartitionEnforcementOutbound is true and packet fails
partition check, IP version check, or transport header version check.
Typically will not increase. If it is, may be an indicator that a parameter should be tweaked. Please contact Mellanox Customer Support.
PortRcvConstraintErrors Total number of packets received on the port that are discarded for the following reasons:
FilterRawOutbound is true and packet is raw PartitionEnforcementOutbound is true and packet fails
partition check, IP version check, or transport header version check.
Typically will not increase. If it is, may be an indicator that a parameter should be tweaked. Please contact Mellanox Customer Support
LocalLinkIntegrityErrors The number of times that the frequency of packets containing local physical errors exceeded local_phy_errors.
This counter increasing in number usually indicates a bad link.
ExcessiveBufferOverrunErrors The number of times that overrun_errors consecutive flow control update periods occurred with at least one overrun error in each period (see Table 126 PortInfo on page 665 of IB spec).
Typically will not increase. If it is, may be an indicator that a parameter should be tweaked. Please contact Mellanox Customer Support
Description of Port Error Counters
Counter Description Importance VL15Dropped Number of incoming VL15 packets dropped
due to resource limitations on port selected by PortSelect (due to lack of buffers)
This counter increasing in small increments is not seen as a problem.
PortXmitData Total number of data octets, divided by 4, transmitted on all VLs from the port selected by PortSelect. This includes all octets between (and not including) the start of packet delimiter and VCRC. It excludes all link packets.
PortRcvData Total number of data octets, divided by 4, received on all VLs from the port selected by PortSelect. This includes all octets between (and not including) the start of packet delimiter and VCRC. It excludes all link packets.
PortXmitPackets Total number of packets, excluding link packets, transmitted on all VLs from the port.
PortRcvPackets Total number of packets, excluding link packets, received on all VLs from the port.
How to Identify Problems
Status = ALERT Width = 1X Increasing error counters
Counter Importance
SymbolError Can increase without a significant problem present
LinkErrorRecovery Increasing SymbolErrors and LinkErrorRecovery errors may indicate a bad link
LinkDowned Indicates number of times the port has gone down (usually for valid reasons)
PortRcvErrors This counter should not be increasing. Increasing number indicates a bad link
PortRcvRemotePhysicalErrors This indicates that a problem is occurring ELSEWHERE in the fabric and that this port received a packet that was intentionally corrupted by another switch in the fabric
PortRcvSwitchRelayErrors Does not indicate a problem
PortXmitDiscards May indicate HOQ or other parameter should be tweaked
PortXmitConstraintErrors May indicate that a parameter should be tweaked
PortRcvConstraintErrors May indicate that a parameter should be tweaked
LocalLinkIntegrityErrors Counter should not be increasing. Increasing number indicates a bad link
ExcessiveBufferOverrunErrors May indicate that a parameter should be tweaked
VL15Dropped This counter increasing in small increments is not seen as a problem.
Common Host Problems
Bad cabling
HCA problem
IPOIB Interface problem
Missing Configuration
SM problem
Let’s start by checking the basics
Unified Fabric manager (UFM)
What Is UFM?
Monitor and Troubleshoot
• Monitor and analyze traffic and fabric behavior
• Detect and report problems automatically, suggest corrective actions
Operate
• Centralize and simplify fabric and device operation, show summarized and analyzed information, and conduct group operations
Optimize Fabric performance and utilization
• Apply optimal routing based on application requirements, topology, and load
• Most mature and optimized routing algorithms
• Manage and visualize congestion and QoS
Provision and Automate
• Expose the entire functionality via an extensible API, used for 3rd party integration or for automation and scripting
• Provide fabric and I/O partitioning, and application specific QoS
Fabric Logical vs. Physical layers
[Diagram: the fabric logical layer (fabric policy, monitoring, and the application layer with Applications A, B, and C) mapped onto the physical & topology layer.]
UFM Architecture
UFM Server
CLI GUI (Java)
Web Services
IB-SM (OpenSM)
Perf Mng Providers
Device Mng Providers
SQL DB
HA Daemon
User and application interfaces, access control
Central administration of multiple switches (or hosts)
Hierarchical performance monitoring from a variety of sources
Leverages the open source SM engine
Transparent fail-over
Fast retrieval, historical data
Manages complex relations and workflows
Policy and role based access control
Convenient access to fabric data
Plug-ins
High Availability Throughout the Fabric
Seamless Subnet Manager failover
Synchronization mechanism for the SM and UFM database
Virtual IP for seamless failover of user interfaces
Synchronization
Heartbeat
Active UFM Server Standby UFM Server
Fabric Optimization Cycle with UFM
Characterize traffic pattern and priorities
Fabric virtualization and QoS Optimize routing and job placement
Show traffic and congestion information
Feedback and Analysis
Optional Orchestrators & Schedulers
Application Requirements
UFM Optimization UFM Monitoring
UFM Installation Procedures
Read UFM release notes Obtain a License Download the UFM Software Software Installation Prerequisites Installing UFM Stand alone Installing UFM with High Availability Initial Configuration
• Software Activation • Update UFM Configuration files
Running the UFM Software • Launch UFM GUI
Optional Software Components • UFM Agent Installation Prerequisites • Installing UFM Agent Software • Running the UFM Agent Software
Download Software
License
SW Install Prereq
Install SA Software
SW Activation
Running SW
Initial Configuration
Define Http/s, Browser ,Java
Launch Gui Session
Obtaining the license
1. Go to Voltaire's Licensing and Download Portal a. http://license.voltaire.com/LMManage/login.aspx
2. Log in as specified in the licensing email you received.
3. If you did not receive your Voltaire Licensing and Download Portal login
information, contact your product reseller. a. If you purchased UFM directly from Voltaire and you did not receive the login
information, contact [email protected].
4. Click the License tab. The list of software product serial licenses you own is displayed , as well as software product license information and status.
5. Select the serial number of the product license you want to activate.
Launching UFM GUI session
1. To launch a UFM GUI session http://<UFM_server_IP>/ufmui or https://<UFM_server_IP>/ufmui
2. On the UFM Welcome Page, click Launch Unified Fabric Manager.
3. In the Login window, enter the user name (default: admin) and password (default: 123456).
4. Click OK.
Once you have entered your user name and password, the main window opens, showing the UFM Dashboard.
UFM Agent
Performs discovery of: host IP address, CPU, memory, and other parameters.
Statistics collection of: host CPU, memory, disk performance, and port counters.
Remote upgrade of the HCA firmware and OFED.
IP interface creation per InfiniBand partition.
Task oriented top menu
Dashboard - Fabric central Design - Application correlation View - Discovery & topology context Manage Devices - Sortable online views Config - Event management policy Monitor - Real-time Monitoring Logs - Online search
Central Dashboard Fabric Central
Alarms Capacity Map
Application “share” of resources
Root Cause Immediate suspect list
Top loaded servers
Oversubscribed Ports
The Entire Fabric in the Palm of Your Hand
Design
Create a Logical Network
Create a Logical Network
Create a Logical Server Group
View
View
Manage Devices
Manage Devices
Manage Devices
Manage Devices
Manage Devices
Config
Config
TARA Example - UFM Reduces Congestion
Before: high ingress congestion, low bandwidth (high latency)
After: no congestion, high bandwidth (low latency)
UFM Enables to Maximize Hardware Utilization
Config
Config
Monitor
Monitor
Monitor
Monitor
Advanced Monitoring Engine
Multiple sessions on demand
Aggregation per logical group (no need to know physical nodes)
Aggregation across multiple devices
Various graph options (linear, bar, histogram, pie chart)
Correlation of switch and host info
Formulas (AVG, MAX, SUM, MIN)
Log
The Origins of GPUDirect
The GPUDirect project was announced Nov 2009 • “NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand
Networks”, http://www.nvidia.com/object/io_1258539409179.html
GPUDirect was developed together by Mellanox and NVIDIA
• New interface (API) within the Tesla GPU driver • New interface within the Mellanox InfiniBand drivers • Linux kernel modification to allow direct communication between drivers
GPUDirect availability was announced May 2010
• “Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency”
• “Mellanox was the lead partner in the development of NVIDIA GPUDirect”
GPUDirect – The Interface Between the GPU and the Network
[Diagram: transmit and receive data paths between GPU memory, system memory, the chipset, the CPU, and InfiniBand, without and with GPUDirect.]
LAMMPS • 3 nodes, 10% gain
Amber – Cellulose
• 8 nodes, 32% gain
Amber – FactorX
• 8 nodes, 27% gain
GPUDirect – Application Performance
3 nodes, 1 GPU per node 3 nodes, 3 GPUs per node
rCUDA – GPU as a Service
[Diagram: servers with local GPUs compared to GPU as a Service - CPU-only nodes with virtual GPUs (vGPU) accessing a shared pool of GPUs over the network.]
PCIe-equivalent performance
• 56Gb/s bandwidth
• 0.7usec latency
RDMA dwarfs overhead
• Maintains local access model
• Supports memory management
Independent GPU management
• GPU as network-resident service
Thank You