July 2020
INFINIBAND IN-NETWORK COMPUTING TECHNOLOGY
THE NEW SCIENTIFIC COMPUTING WORLD
NETWORK
EDGE
APPLIANCE
SUPERCOMPUTER
STORAGE
HDR 200G INFINIBAND ACCELERATES NEXT GENERATION HPC AND AI SUPERCOMPUTERS (EXAMPLES)
8K HDR Nodes, Dragonfly+ Topology
9 PetaFLOPS, 3K HDR Nodes, Dragonfly+ Topology
7.5 PetaFLOPS, 2K HDR Nodes, Dragonfly+ Topology
35.5 PetaFLOPS, 2K HDR Nodes, Fat-Tree Topology
23 PetaFLOPS, 5.6K HDR Nodes, Dragonfly+ Topology
HPC/AI Cloud, HDR InfiniBand
HDR Supercomputers
23.5 PetaFLOPS, 8K HDR Nodes, Fat-Tree Topology
27.6 PetaFLOPS, 3K HDR Nodes, Fat-Tree Topology
16 PetaFLOPS, 3K HDR Nodes, Dragonfly+ Topology
INFINIBAND NETWORK TECHNOLOGY FUNDAMENTALS
GPU
CPU
DPU
Smart End-Point
Architected to Scale
Centralized Management
Standard
INFINIBAND ACCELERATED SUPERCOMPUTING
SHARP AI Technology
AI Acceleration Engines
2.5X Higher AI Performance
UFM Cyber AI
Data Center Cyber Intelligence and Analytics
Speed of Light
200Gb/s Data Throughput
RDMA and GPUDirect RDMA
3X Better (Lower) Latency
SHIELD AI Technology
Self Healing Network
1000X Faster Recovery Time
THE NEW DATA CENTER
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
CPU-Centric (Onload)
Must Wait for the Data
Creates Performance Bottlenecks
Security Limitations
Onload Network
Data-Centric (Offload)
Analyze Data as it Moves!
Higher Performance and Scale
Secured Supercomputing
In-Network Computing
THE SMARTEST INTERCONNECT
Application: High Performance Computing, Data Analysis, Deep Learning, Cyber Security
Communication: In-Network Computing; Full Network Transport Offload; RDMA and GPU-Direct RDMA; NVMe, Containers, OpenStack (Storage / Other Resource Disaggregation)
Network: SHIELD (Self-Healing Network); Enhanced Adaptive Routing and Congestion Control
Connectivity: Ultimate Software Defined Network; Multi-Host and Socket-Direct Technology; Enhanced and Flexible Topologies
SCALABLE HIERARCHICAL AGGREGATION AND REDUCTION PROTOCOL (SHARP)
In-network tree-based aggregation mechanism
Multiple simultaneous outstanding operations
For HPC (MPI / SHMEM) and Distributed Machine Learning applications
Scalable High Performance Collective Offload
Barrier, Reduce, All-Reduce, Broadcast and more
Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
Integer and Floating-Point, 16/32/64 bits
[Diagram: SHARP aggregation tree; hosts send data up through switches, each switch aggregates and forwards toward the root, and the aggregated result is returned to every host]
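The aggregation flow described above can be sketched in a few lines of Python. This is an illustrative model only: real SHARP runs inside the switch ASICs, and the tree layout and host names here are invented for the example.

```python
# Illustrative sketch of SHARP-style in-network tree aggregation.
# Real SHARP executes in switch hardware; this model only shows the
# data flow. Tree layout and host names are invented for the example.
import operator
from functools import reduce

def sharp_allreduce(tree, host_data, op=operator.add):
    """Reduce host values up a switch tree, then broadcast the result.

    tree: maps each switch to its children (switches or hosts).
    host_data: maps each host name to its local contribution.
    """
    def aggregate(node):
        if node in host_data:          # leaf: a host contributes its data
            return host_data[node]
        return reduce(op, (aggregate(child) for child in tree[node]))

    result = aggregate("root")             # one reduction pass up the tree
    return {h: result for h in host_data}  # result broadcast back down

# Two leaf switches, two hosts each, under one root switch.
tree = {"root": ["sw0", "sw1"], "sw0": ["h0", "h1"], "sw1": ["h2", "h3"]}
data = {"h0": 1, "h1": 2, "h2": 3, "h3": 4}
print(sharp_allreduce(tree, data))          # every host sees the sum, 10
print(sharp_allreduce(tree, data, op=min))  # every host sees the minimum, 1
```

Because each switch forwards a single aggregated value instead of every host's data, the volume crossing each upstream link stays constant as the node count grows.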
SHARP ALLREDUCE PERFORMANCE ADVANTAGES
Providing Flat Latency, 7X Higher Performance
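The flat-latency behavior follows from a simple hop-count argument: a host-based recursive-doubling allreduce performs roughly log2(N) sequential message exchanges, so its latency grows with node count, while SHARP needs only one pass up and down the physical switch tree, whose depth is fixed by the topology. A toy latency model (the 1 µs per-step cost and three-level tree are illustrative assumptions, not measured figures):

```python
import math

ALPHA_US = 1.0  # assumed per-step network latency in microseconds (illustrative)

def host_based_allreduce_latency(n_nodes, alpha=ALPHA_US):
    # Recursive-doubling allreduce: ~log2(N) sequential message exchanges,
    # so latency grows with the number of nodes.
    return alpha * math.ceil(math.log2(n_nodes))

def sharp_allreduce_latency(tree_depth=3, alpha=ALPHA_US):
    # One reduction pass up the switch tree plus one broadcast pass down;
    # the depth is set by the topology, not by the number of end hosts.
    return alpha * 2 * tree_depth

for n in (64, 1024, 16384):
    print(n, host_based_allreduce_latency(n), sharp_allreduce_latency())
```

In this model the host-based latency rises from 6 to 14 steps between 64 and 16K nodes, while the in-network path stays at 6 steps regardless of scale.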
SHARP PERFORMANCE ADVANTAGE OVER ROCE
4X Higher Performance
INFINIBAND SHARP AI PERFORMANCE ADVANTAGE
2.5X Higher Performance
INFINIBAND ACCELERATED AI PLATFORMS
NVIDIA DGX A100 SuperPOD: World's Most Advanced AI System
AIST: The AI Bridging Cloud Infrastructure
Microsoft Azure: 200 Gigabit HDR InfiniBand Boosts Microsoft Azure High-Performance Computing Cloud Instances
Continental: Advanced Driver Assistance Systems (ADAS)
HDR INFINIBAND
HDR 200G INFINIBAND SOLUTIONS (NOW)
Transceivers
Active Optical and Copper Cables
40 HDR (200Gb/s) InfiniBand Ports
80 HDR100 InfiniBand Ports
Modular Switch - 800 HDR (1600 HDR100) Ports
200Gb/s Adapter
PCIe Gen4
Drivers, Management, Frameworks and Accelerations
UFM, UCX, MPI, SHMEM/PGAS, UPC
System on Chip and SmartNIC
Programmable adapter, Smart Offloads
MELLANOX SKYWAY™ INFINIBAND TO ETHERNET GATEWAY
100G EDR / 200G HDR InfiniBand to 100G and 200G Ethernet gateway
400G NDR / 800G XDR InfiniBand speeds ready
Eight EDR/HDR100/HDR InfiniBand ports to eight 100/200G Ethernet
Maximum throughput of 1.6 terabits per second
High availability and load balancing
Mellanox Gateway operating system
Scalable and efficient
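The quoted maximum is simply the eight ports running at full line rate; a quick sanity check of the arithmetic:

```python
# Sanity check of the Skyway throughput figure: eight ports, each at
# the 200 Gb/s HDR InfiniBand / 200G Ethernet line rate.
ports = 8
port_speed_gbps = 200
total_tbps = ports * port_speed_gbps / 1000
print(total_tbps)  # 1.6
```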
METROX®-2
Seamlessly connects InfiniBand data centers up to 40 kilometers apart
Scalability and load balancing across data centers
Continued compute service in case of data-center failures
Standard HDR and EDR InfiniBand end-to-end
Advanced In-Network Computing
Extending InfiniBand to 40km Reach
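At 40 km, fiber propagation delay becomes the dominant latency term. Using the common rule of thumb of about 5 µs per kilometer for light in glass (an approximation, not a vendor specification):

```python
# One-way propagation delay over 40 km of fiber, using the common
# ~5 us/km rule of thumb for light in glass (an approximation,
# not a vendor specification).
DELAY_US_PER_KM = 5.0
distance_km = 40
one_way_us = DELAY_US_PER_KM * distance_km
print(one_way_us)  # 200.0
```

That is roughly 200 µs each way, orders of magnitude above in-fabric switch latency, which is why long-haul links need deep buffering to keep the pipe full.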
UFM INFINIBAND CYBER INTELLIGENCE AND ANALYTICS PLATFORM
REVOLUTIONIZING SUPERCOMPUTING
AI-Powered InfiniBand Cyber Intelligence and Analytics Platform
Management and Orchestration
Predictive and Preventive Maintenance
Telemetry and Monitoring
Cyber-Security and Anomaly Detection
Integration of Real-Time Telemetry with AI Algorithms to Secure Supercomputers, and Enable Predictive Maintenance for OPEX Optimizations
UFM PLATFORMS PORTFOLIO
UFM Cyber-AI: Cyber Intelligence and Analytics (includes UFM Enterprise)
UFM Enterprise: Management, Monitoring and Orchestration (includes UFM Telemetry)
UFM Telemetry: Real-Time Monitoring
UFM DASHBOARD
Secure Cable Management
Network Validation
Performance Monitoring
Real-Time Analysis
Prediction Dashboard
Congestion Mapping
Health Reports
Inventory Mapping
SUMMARY
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
INFINIBAND DELIVERS HIGHEST PERFORMANCE AND ROI
200G end-to-end, extremely low latency, high message rate, RDMA and GPUDirect
Advanced adaptive routing, congestion control and quality of service for highest network efficiency
In-Network Computing engines for accelerating applications performance and scalability
Self Healing Network with SHIELD for highest network resiliency
Standard, with backward and forward compatibility, protecting data-center investments
[Diagram: compute servers on an InfiniBand high-speed network with advanced In-Network Computing and extremely low latency; NVMe storage on the InfiniBand side; a high-speed InfiniBand-to-Ethernet gateway bridging to Ethernet-attached NVMe storage; long-haul InfiniBand extending the fabric between sites]