
Mellanox – The Artificial Intelligence Interconnect Company

June 2017

Building the Most Efficient Machine Learning System

Mellanox Confidential

© 2017 Mellanox Technologies

Mellanox Overview

Company Headquarters

Yokneam, Israel

Sunnyvale, California

Worldwide Offices

~2,900 Employees worldwide

Ticker: MLNX



Adapters

Switches

Cables & Transceivers

System on a Chip

Higher

Faster

Better

Data Speeds

Data Processing

Data Security

SmartNIC

Exponential Data Growth Everywhere



Enabling the Future of Machine Learning Applications

HPC and Machine Learning Share Same Interconnect Needs

Storage

High Performance Computing

Financial

Embedded Appliances

Database

Hyperscale

Machine Learning

IoT

Healthcare

Manufacturing

Retail

Self-Driving Vehicles



Highest Performance 100 and 200Gb/s Interconnect Solutions

Transceivers, Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

Switch: 40 HDR (200Gb/s) Ports, 80 HDR100 (100Gb/s) Ports, 16Tb/s Throughput, 15.6 Billion msg/sec

Adapters: 200Gb/s, 0.6us Latency, 200 Million Messages per Second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

Switch: 32 100GbE Ports, 64 25/50GbE Ports (10 / 25 / 40 / 50 / 100GbE), Throughput of 6.4Tb/s
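
The switch throughput figures above line up with the port counts if throughput is counted in both directions. That reading is an assumption, since the slide does not say how throughput is measured. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def aggregate_tbps(ports: int, gbps_per_port: int, bidirectional: bool = True) -> float:
    """Aggregate switch throughput in Tb/s.

    Assumption (not stated on the slide): the headline number counts
    traffic in both directions, so each port contributes twice its line rate.
    """
    factor = 2 if bidirectional else 1
    return ports * gbps_per_port * factor / 1000


# Port configurations quoted on the slide.
hdr_switch = aggregate_tbps(40, 200)     # 40 HDR ports at 200Gb/s
hdr100_switch = aggregate_tbps(80, 100)  # 80 HDR100 ports at 100Gb/s
eth_switch = aggregate_tbps(32, 100)     # 32 ports at 100GbE
eth_split = aggregate_tbps(64, 50)       # 64 ports at 50GbE

print(hdr_switch, hdr100_switch, eth_switch, eth_split)
```

Under that assumption both port configurations of each switch give the same total, which is consistent with the 16Tb/s and 6.4Tb/s values quoted.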

Today’s Datacenters Need the Most Intelligent Interconnect



Mellanox Delivers Best Return on Investment

60% Higher Return on Investment

Up to 50% Savings on Capital and Operation Expenses

World’s Highest Performance, Scalability and Productivity for Deep Learning

Mellanox Unlocks the Power of AI

Chainer

Cognitive Toolkit

http://chainer.org/




Mellanox is Leading Artificial Intelligence (AI)

Health Care, Business Integrity, Business Intelligence

Knowledge Discovery, Security, Customer Support and more

Advancing Technology to Affect Science, Business, and Society

By Enabling Critical and Timely Decision Making

More Data Better Models Faster Interconnect

GPUs

CPUs

FPGAs

Storage

More Data → Faster Interconnect → Better Insight → Competitive Advantage



Enabling Most Efficient Machine Learning Platforms (Examples)

Highest Performance, Scalability and Productivity for Deep Learning




World’s First PCIe Gen 4 Public Cloud Server for Cognitive Computing

Enabling Analytics in Cloud

Sets TeraSort 2016 Benchmark Record

5x Faster, 3x More Energy Efficient than 2015 Record

http://sortbenchmark.org/TencentSort2016.pdf

Smart Network for Azure

Cloud Server

Designed for Big Data Analytics & AI

Mellanox Accelerates Machine Learning and Big Data




Mellanox Accelerates Machine Learning and Big Data

Big Sur & Big Basin Facebook Open Source AI Hardware Platform

Only ONE Network of Choice - Mellanox

Powering Self Driving Car

2X Faster Training with PaddlePaddle

“…We rely on fast interconnect technologies and RDMA.”

Andrew Ng, Chief Scientist, Baidu

https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design

https://github.com/caffe2/caffe2/tree/master/caffe2/contrib/fbcollective/vendor/fbcollective

Real Time Fraud Detection

14 Million Transactions per Day

4 Billion Database Inserts

Image Recognition

~90% Prediction Accuracy

RDMA in Tensorflow and Caffe

Caffe

Caffe2




AI is Changing the Way We Interact with Computers

Automotive and Transportation

• Autonomous driving

• Pedestrian detection

• Accident avoidance

Security and Public Safety

• Surveillance

• Image analysis

• Facial recognition and detection

Consumer Web, Mobile, Retail

• Image tagging

• Speech recognition

• Natural language processing

• Recommendation and sentiment analysis

Medicine and Biology

• Drug discovery

• Diagnostic assistance

• Cancer cell detection

Broadcast, Media and Entertainment

• Captioning

• Search

• Recommendations

• Real time translation

Finance, Fraud and Insurance

• Real Time Trade

• Credit / Risk Analysis

• Fraud Detection and Prevention

Efficient Deep Learning Depends on Mellanox



Deep Learning Demands Highest Performance

TRAINING DATASET

NEW DATA

TRAINING

• Scalability requires ultra-fast networking

• Same hardware needs as HPC

• Faster access to storage

• RDMA

• SHARP

• PeerDirect™, GPUDirect™, ROCm, others

INFERENCING

• Highly transactional / supports many users

• Mellanox ultra-low latency

• Instant network response

• RDMA

• PeerDirect™, GPUDirect™, ROCm, others

Billions of TFLOPS

Billions of FLOPS

Images

Video

Text

Speech

Tabular

Time Series



Exponential Data Growth – The Need for Intelligent and Faster Interconnect

CPU-Centric (Onload) Data-Centric (Offload)

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

Must Wait for the Data

Creates Performance Bottlenecks

Analyze Data as it Moves!



Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload): HPC / Machine Learning Communications Latencies of 30-40us

Data-Centric (Offload): HPC / Machine Learning Communications Latencies of 3-4us

Network

In-Network Computing

Intelligent Interconnect Paves the Road to Exascale Performance



Mellanox Technology Accelerations for Machine Learning


In-Network Computing Key for Highest Return on Investment

RDMA

GPUDirect

NVMe over Fabrics

SHARP

Security



In-Network Computing Enables Deep Learning Frameworks

CUDA

Mellanox Interconnect Solutions

Mellanox Accelerations for Machine Learning and Big Data

SHARP

Middleware (MPI, gRPC) - Optional

GPUDirect RDMA

NVMe over Fabrics

rCUDA



Mellanox SHARP for Gradient Computation

CPU in a parameter server becomes the bottleneck quickly (roughly 4 nodes)

TCP adds a lot of overhead and the traffic pattern is bursty

• SHARP performs the gradient averaging

• Removes the need for physical parameter server

• Removes all parameter server overhead

SHARP Provides Better Scalability and Reduced Network Traffic
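
The difference between a parameter server and in-network aggregation can be illustrated with a small simulation. This is only a sketch of the idea in plain Python: it does not use any SHARP API, the function names are hypothetical, and the "switches" are simply levels of a reduction tree with an assumed radix of 2.

```python
from typing import List, Tuple

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum gradient vectors component by component."""
    return [sum(vals) for vals in zip(*vectors)]


def parameter_server_average(worker_grads: List[Vector]) -> Tuple[Vector, int]:
    """One server receives every gradient; its fan-in grows with worker count."""
    n = len(worker_grads)
    total = elementwise_sum(worker_grads)
    return [v / n for v in total], n  # n inbound messages hit a single node


def tree_average(worker_grads: List[Vector], radix: int = 2) -> Tuple[Vector, int]:
    """Aggregate partial sums level by level, as a reduction tree would."""
    n = len(worker_grads)
    level = worker_grads
    max_fan_in = 0
    while len(level) > 1:
        groups = [level[i:i + radix] for i in range(0, len(level), radix)]
        max_fan_in = max(max_fan_in, max(len(g) for g in groups))
        level = [elementwise_sum(g) for g in groups]  # each group = one tree node
    return [v / n for v in level[0]], max_fan_in


# Eight hypothetical workers, each holding a two-element gradient.
grads = [[float(w), float(2 * w)] for w in range(8)]
ps_avg, ps_fan_in = parameter_server_average(grads)
tree_avg, tree_fan_in = tree_average(grads)
print(ps_avg, ps_fan_in, tree_avg, tree_fan_in)
```

Both approaches produce the same averaged gradient, but in the tree no single node ever receives more than `radix` messages, whereas the parameter server's inbound traffic scales with the number of workers.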



Purpose-built for Acceleration of Deep Learning

PeerDirect™, GPUDirect® RDMA and ASYNC



What is GPUDirect™

Provides significant decrease in communication latency for acceleration devices

Natively supported by Mellanox OFED

Supports peer-to-peer communications between Mellanox adapters and third-party devices

No unnecessary system memory copies & CPU overhead

Enables GPUDirect™ RDMA, GPUDirect™ ASYNC, ROCm and others

InfiniBand and Ethernet


Designed for Deep Learning Acceleration



GPUDirect™ RDMA and GPUDirect ASYNC™

Direct Connectivity GPU - Interconnect



GPUDirect™ RDMA Performance

GPU-GPU Internode Latency (lower is better): 9.3X Better Latency, 2.18 usec

GPU-GPU Internode Bandwidth (higher is better): 10X Better Throughput

Source: Prof. DK Panda



NVIDIA® NCCL 2.0 Near-Linear Scalability

Optimized collective communication library

• Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather

Inter-node communication using InfiniBand verbs and GPUDirect™ RDMA

Multi-rail support, Topology detection

50% performance improvement with NVIDIA® DGX-1™ across 32 NVIDIA Tesla® V100 GPUs

NVIDIA Accelerates Scalable Deep Learning with Mellanox
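
The collectives listed above are related: an Allreduce can be built from a Reduce-scatter followed by an Allgather. The sketch below simulates that composition on a ring in plain Python. It illustrates one well-known scheme and is not NCCL's API or a description of its internals; the function name and the one-number-per-chunk layout are assumptions made for brevity.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a ring allreduce (sum) over n ranks.

    Each rank's buffer must hold exactly n chunks (one number per chunk here).
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each buffer must contain one chunk per rank")
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                for r in range(n)]  # all ranks send simultaneously
        for dst, idx, val in msgs:
            data[dst][idx] += val

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                for r in range(n)]
        for dst, idx, val in msgs:
            data[dst][idx] = val

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Every rank ends with the same summed buffer, and each rank only ever exchanges data with its ring neighbor, which is why per-link bandwidth and latency matter for this communication pattern.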



Performance and Scalability Examples



TensorFlow with Mellanox RDMA

Unmatched Linear Scalability, No Additional Cost

Up to 76% Efficiency and 50% Better Performance versus TCP
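
The slide does not define "efficiency". One common reading is scaling efficiency: measured cluster throughput divided by ideal linear throughput. The sketch below assumes that definition; the function name and the sample numbers are hypothetical and are not measurements from the deck.

```python
def scaling_efficiency(single_node_rate: float, cluster_rate: float, nodes: int) -> float:
    """Fraction of ideal linear speedup actually achieved.

    Assumed definition: cluster_rate / (nodes * single_node_rate).
    """
    if nodes <= 0 or single_node_rate <= 0:
        raise ValueError("nodes and single_node_rate must be positive")
    return cluster_rate / (single_node_rate * nodes)


# Hypothetical example: 100 images/s on one node, 608 images/s on 8 nodes.
eff = scaling_efficiency(single_node_rate=100.0, cluster_rate=608.0, nodes=8)
print(f"{eff:.0%}")
```

With these made-up inputs the result is 76%, which is the kind of figure the headline refers to under this definition.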

Reference Deployment Guide

https://community.mellanox.com/docs/DOC-2852




Accelerating TensorFlow™ with gRPC over RDMA

Open source Machine Learning from Google

Distributed training with gRPC framework

• Google’s Optimized RPC for distributed network

RDMA Acceleration over UCX

• Unified Communication X (UCX)

• Integration with upstream TensorFlow

2X higher Performance with RDMA

Chart: >2x Faster (lower is better)

~2X Acceleration for TensorFlow with RDMA



TensorFlow™ over RDMA in Apache® Spark™ Environment

Yahoo enhanced the TensorFlow C++ layer to enable RDMA over InfiniBand

InfiniBand provides faster connectivity and supports accelerated offload capability

Source: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep

InfiniBand Provides Near Linear Scalability for Inception Model Training




2X Acceleration for Baidu

Machine Learning Software from Baidu

• Usage: word prediction, translation, image processing

RDMA (GPUDirect) speeds training

• Lowers latency, increases throughput

• More cores for training

• Even better results with optimized RDMA

~2X Acceleration for Paddle Training with RDMA



ChainerMN Depends on InfiniBand

ChainerMN depends on MPI for inter-node communication

NVIDIA® NCCL library is then used for intra-node communication between GPUs

Leveraging InfiniBand results in near linear performance

Mellanox InfiniBand allows ChainerMN to achieve ~72% accuracy.
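
The split between intra-node and inter-node communication described above can be pictured as a two-level allreduce. The following is a plain-Python sketch of that structure only. It makes no MPI or NCCL calls, the function name is hypothetical, and the three-stage ordering is an assumption for illustration rather than ChainerMN's actual implementation.

```python
from typing import List

Vector = List[float]


def hierarchical_allreduce(cluster: List[List[Vector]]) -> List[List[Vector]]:
    """cluster[node][gpu] is one GPU's gradient vector; returns summed copies."""
    # Stage 1: reduce across the GPUs inside each node (the intra-node step).
    node_sums = [[sum(vals) for vals in zip(*gpus)] for gpus in cluster]
    # Stage 2: allreduce the per-node sums across nodes (the inter-node step).
    global_sum = [sum(vals) for vals in zip(*node_sums)]
    # Stage 3: broadcast the result back to every GPU in every node.
    return [[list(global_sum) for _ in gpus] for gpus in cluster]


# Two hypothetical nodes with two GPUs each.
cluster = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
reduced = hierarchical_allreduce(cluster)
print(reduced[0][0])
```

Only stage 2 crosses the network, so the inter-node fabric carries one vector per node rather than one per GPU.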

Source: http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html




Machine Learning Performance Comparison

InfiniBand Delivers 60% Better Performance with 2X Less Infrastructure

DeepBench measures the performance of basic operations involved in training deep neural networks.

Chart: results for 8, 16 and 32 Accelerators (Lower is Better), with a 60.3% difference highlighted



Scalable Deep Learning Depends on Mellanox

A Few Solution Examples



NVIDIA® DGX-1™

World’s first purpose-built system for deep learning

• SaturnV is #28 on the Top500, 3.3Pf with 124 nodes

• SaturnV is also #1 on the Green500

Fully integrated hardware

• 8x Tesla™ P100 (Pascal) w/16GB per GPU

• 28672 CUDA® Cores

• 4x ConnectX-4 EDR 100Gb/s HCAs

Fully integrated software stack

• Major deep learning frameworks

• Drivers, NVIDIA CUDA, NVIDIA Deep Learning SDK

• GPUDirect™ RDMA



NVIDIA® DGX-1™ Deep Learning Server

Deep Learning Supercomputer in a Box

8 x NVIDIA® Tesla® P100 GPUs

5.3TFlops

16nm FinFET

NVLINK

4 x ConnectX®-4 EDR 100G InfiniBand Adapters

NVIDIA® “SaturnV”

NVIDIA® Machine Learning Supercomputer

#28 on the Top500

3.3Pf with 124 DGX-1 nodes

#1 on the Green500



End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms

X86, OpenPOWER, GPU, ARM, FPGA

Smart Interconnect to Unleash The Power of All Compute Architectures



Proven Advantages

RDMA delivers 2X performance advantage over traditional TCP

Machine Learning and HPC platforms share the same interconnect needs

Scalable, flexible, high performance, high bandwidth, end-to-end connectivity

Standards-based and supported by the largest eco-system

Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.

Native Offloading architecture

RDMA, GPUDirect, SHARP and other core accelerations

Backward and future compatible

Scalable Machine Learning Depends on Mellanox

Thank You
