interconnect your future - prace agenda systems (indico) · © 2019 mellanox technologies |...
TRANSCRIPT
© 2019 Mellanox Technologies | Confidential 1
Paving the Road to ExascaleMarch 2019
Interconnect Your Future
© 2019 Mellanox Technologies | Confidential 2
Higher Data SpeedsHigher Data Speeds
Faster Data ProcessingFaster Data Processing
Better Data SecurityBetter Data Security
Adapters SwitchesCables &
Transceivers
SmartNIC System on a Chip
HPC and AI Needs the Most Intelligent Interconnect
© 2019 Mellanox Technologies | Confidential 3
The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload) Data-Centric (Offload)
Must Wait for the DataCreates Performance Bottlenecks
Faster Data Speeds and In-Network Computing
Enable Higher Performance and Scale
GPU
CPU
GPU
CPU
Onload Network In-Network Computing
GPU
CPU
CPU
GPU
GPU
CPU
GPU
CPU
GPU
CPU
CPU
GPU
Analyze Data as it Moves!Higher Performance and Scale
© 2019 Mellanox Technologies | Confidential 4
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload) Data-Centric (Offload)
Communications Latencies of 30-40us
Intelligent Interconnect Paves the Road to Exascale Performance
GPU
CPU
GPU
CPU
GPU
CPU
CPU
GPU
GPU
CPU
GPU
CPU
GPU
CPU
CPU
GPU
Communications Latencies of 3-4us
© 2019 Mellanox Technologies | Confidential 5
Accelerating All Levels of HPC/AI Frameworks
GPUDirect
RDMA
Network Framework
Communication Framework
Application Framework
Data Analysis Data Analysis
SHARP MPI Tag Matching MPI Rendezvous Software Defined Virtual Devices
SHARP MPI Tag Matching MPI Rendezvous Software Defined Virtual Devices
Network Transport Offload RDMA GPU-Direct SHIELD (self-healing network)
Network Transport Offload RDMA GPU-Direct SHIELD (self-healing network)
© 2019 Mellanox Technologies | Confidential 6
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Reliable Scalable General Purpose Primitive In-network Tree based aggregation mechanism Large number of groups Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases HPC Applications using MPI / SHMEM Distributed Machine Learning applications
Scalable High Performance Collective Offload Barrier, Reduce, All-Reduce, Broadcast and more Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND Integer and Floating-Point, 16/32/64 bits
SHArP Tree
SHARP Tree Aggregation Node (Process running on HCA)
SHARP Tree Endnode(Process running on HCA)
SHARP Tree Root
© 2019 Mellanox Technologies | Confidential 7
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables 75% Reduction in LatencyProviding Scalable Flat Latency
SHARP enables 75% Reduction in LatencyProviding Scalable Flat Latency
Scalable Hierarchical Aggregation and
Reduction Protocol
© 2019 Mellanox Technologies | Confidential 8
SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology
SHARP Enables Highest PerformanceSHARP Enables Highest PerformanceScalable Hierarchical Aggregation and
Reduction Protocol
© 2019 Mellanox Technologies | Confidential 9
SHARP Performance Advantage for AI
SHARP provides 16% Performance Increase for deep learning, initial results TensorFlow with Horovod running ResNet50 benchmark, HDR InfiniBand (ConnectX-6, Quantum)
16%
© 2019 Mellanox Technologies | Confidential10
InfiniBand Roadmap
© 2019 Mellanox Technologies | Confidential11
Thank You