Page 1: Building the Most Efficient Machine Learning System · on-demand.gputechconf.com/gtc-eu/2017/presentation/... · RDMA delivers 2X performance advantage over traditional TCP Machine Learning

Mellanox – The Artificial Intelligence Interconnect Company

June 2017

Building the Most Efficient Machine Learning System

Page 2

2- Mellanox Confidential -2

© 2017 Mellanox Technologies 2

Mellanox Overview

Company Headquarters

Yokneam, Israel

Sunnyvale, California

Worldwide Offices

~2,900 Employees worldwide

Ticker: MLNX

Page 3

Adapters

Switches

Cables & Transceivers

System on a Chip

SmartNIC

Higher Data Speeds

Faster Data Processing

Better Data Security

Exponential Data Growth Everywhere

Page 4

Enabling the Future of Machine Learning Applications

HPC and Machine Learning Share Same Interconnect Needs

Storage

High Performance Computing

Financial

Embedded Appliances

Database

Hyperscale

Machine Learning

IoT

Healthcare

Manufacturing

Retail

Self-Driving Vehicles

Page 5

Highest Performance 100 and 200Gb/s Interconnect Solutions

Transceivers

Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

40 HDR (200Gb/s) Ports, 80 HDR100 (100Gb/s) Ports, 16Tb/s Throughput, 15.6 Billion msg/sec

Interconnect

Switch

Adapters: 200Gb/s, 0.6us Latency, 200 Million Messages per Second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

Switch: 32 100GbE Ports, 64 25/50GbE Ports (10 / 25 / 40 / 50 / 100GbE), Throughput of 6.4Tb/s

Today’s Datacenters Need the Most Intelligent Interconnect
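The switch throughput figures on this slide appear to follow from port count times port speed, counted in both directions. That is an assumption about how the numbers were derived, not something the slide states. The function name and the `directions` parameter below are illustrative only.

```python
def aggregate_tbps(ports: int, gbps_per_port: int, directions: int = 2) -> float:
    """Aggregate switch throughput in Tb/s.

    Assumption (not stated on the slide): throughput counts traffic in
    both directions on every port, hence directions=2.
    """
    return ports * gbps_per_port * directions / 1000


hdr_switch = aggregate_tbps(ports=40, gbps_per_port=200)       # 40 HDR ports
hdr100_switch = aggregate_tbps(ports=80, gbps_per_port=100)    # 80 HDR100 ports
ethernet_switch = aggregate_tbps(ports=32, gbps_per_port=100)  # 32 x 100GbE ports

print(hdr_switch, hdr100_switch, ethernet_switch)
```

Under that assumption, both HDR port configurations give the 16Tb/s quoted for the HDR switch, and the 32-port 100GbE configuration gives the quoted 6.4Tb/s.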

Page 6

Mellanox Delivers Best Return on Investment

60% Higher Return on Investment

Up to 50% Savings on Capital and Operation Expenses

World’s Highest Performance, Scalability and Productivity for Deep Learning

Mellanox Unlocks the Power of AI

Chainer

Cognitive Toolkit

Page 7

Mellanox is Leading Artificial Intelligence (AI)

Health Care, Business Integrity, Business Intelligence

Knowledge Discovery, Security, Customer Support and more

Advancing Technology to Affect Science, Business, and Society

By Enabling Critical and Timely Decision Making

More Data Better Models Faster Interconnect

GPUs

CPUs

FPGAs

Storage

More Data → Faster Interconnect → Better Insight → Competitive Advantage

Page 9

World’s First PCIe Gen 4

Public Cloud Server for

Cognitive Computing

Enabling Analytics

in Cloud

Sets TeraSort 2016 Benchmark Record

5x Faster, 3x More Energy Efficient than 2015 Record

http://sortbenchmark.org/TencentSort2016.pdf

Smart Network for Azure

Cloud Server Designed for Big Data Analytics & AI

Mellanox Accelerates Machine Learning and Big Data

Page 10

Mellanox Accelerates Machine Learning and Big Data

Big Sur & Big Basin Facebook Open Source AI Hardware Platform

Only ONE Network of Choice - Mellanox

Powering Self Driving Car

2X Faster Training with PaddlePaddle

“…We rely on fast interconnect technologies and RDMA.”

Andrew Ng, Chief Scientist, Baidu

https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design

https://github.com/caffe2/caffe2/tree/master/caffe2/contrib/fbcollective/vendor/fbcollective

Real Time Fraud Detection

14 Million Transactions per Day

4 Billion Database Inserts

Image Recognition

~90% Prediction Accuracy

RDMA in Tensorflow and Caffe

Caffe

Caffe2

Page 11

AI is Changing the Way We Interact with Computers

Automotive and Transportation

• Autonomous driving
• Pedestrian detection
• Accident avoidance

Security and Public Safety

• Surveillance
• Image analysis
• Facial recognition and detection

Consumer Web, Mobile, Retail

• Image tagging
• Speech recognition
• Natural language processing
• Recommendation and sentiment analysis

Medicine and Biology

• Drug discovery
• Diagnostic assistance
• Cancer cell detection

Broadcast, Media and Entertainment

• Captioning
• Search
• Recommendations
• Real time translation

Finance, Fraud and Insurance

• Real Time Trade
• Credit / Risk Analysis
• Fraud Detection and Prevention

Efficient Deep Learning Depends on Mellanox

Page 12

Deep Learning Demands Highest Performance

TRAINING DATASET

NEW DATA

TRAINING

• Scalability requires ultra-fast networking

• Same hardware needs as HPC

• Faster access to storage

• RDMA

• SHARP

• PeerDirect™, GPUDirect™, ROCm, others

INFERENCING

• Highly transactional / supports many users

• Mellanox ultra-low latency

• Instant network response

• RDMA

• PeerDirect™, GPUDirect™, ROCm, others

Billions of TFLOPS

Billions of FLOPS

Images

Video

Text

Speech

Tabular

Time Series

Page 13

Exponential Data Growth – The Need for Intelligent and Faster Interconnect

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

CPU-Centric (Onload): Must Wait for the Data, Creates Performance Bottlenecks

Data-Centric (Offload): Analyze Data as it Moves!

Page 14

Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload): Network, HPC / Machine Learning Communications Latencies of 30-40us

Data-Centric (Offload): In-Network Computing, HPC / Machine Learning Communications Latencies of 3-4us

Intelligent Interconnect Paves the Road to Exascale Performance

Page 15

Mellanox Technology Accelerations for Machine Learning


In-Network Computing Key for Highest Return on Investment

RDMA

GPUDirect

NVMe over Fabrics

SHARP

Security

Page 16

In-Network Computing Enables Deep Learning Frameworks

CUDA

Mellanox Interconnect Solutions

Mellanox Accelerations for Machine Learning and Big Data

SHARP

Middleware (MPI, gRPC) - Optional

GPUDirect, RDMA, NVMe over Fabrics, rCUDA

Page 17

Mellanox SHARP for Gradient Computation

CPU in a parameter server becomes the bottleneck quickly (roughly 4 nodes)

TCP adds a lot of overhead and the traffic pattern is bursty

• SHARP performs the gradient averaging

• Removes the need for physical parameter server

• Removes all parameter server overhead

SHARP Provides Better Scalability and Reduced Network Traffic
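The gradient averaging this slide describes can be pictured with a small in-process simulation. This is a minimal sketch, not SHARP's interface: `parameter_server_average` and `tree_average` are hypothetical names, and the staged summation only stands in for aggregation performed inside the network. It assumes `radix` is at least 2.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum a list of equal-length vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def parameter_server_average(gradients: List[Vector]) -> Vector:
    """One central node receives every worker's gradient and averages them."""
    total = elementwise_sum(gradients)
    return [value / len(gradients) for value in total]


def tree_average(gradients: List[Vector], radix: int = 2) -> Vector:
    """Average by summing in stages, `radix` inputs at a time.

    Each stage stands in for one level of a hypothetical aggregation tree,
    so no single participant receives more than `radix` messages.
    """
    level = list(gradients)
    while len(level) > 1:
        level = [elementwise_sum(level[i:i + radix])
                 for i in range(0, len(level), radix)]
    return [value / len(gradients) for value in level[0]]


worker_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(parameter_server_average(worker_gradients))  # [4.0, 5.0]
print(tree_average(worker_gradients))              # [4.0, 5.0]
```

Both paths produce the same average. The difference is where the work lands: the central version funnels every gradient through one node, while the staged version spreads the summation across levels.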

Page 18

Purpose-built for Acceleration of Deep Learning

PeerDirect™, GPUDirect® RDMA and ASYNC

Page 19

What is GPUDirect™

Provides significant decrease in communication latency for acceleration devices

Natively supported by Mellanox OFED

Supports peer-to-peer communications between Mellanox adapters and third-party devices

No unnecessary system memory copies & CPU overhead

Enables GPUDirect™ RDMA, GPUDirect™ ASYNC, ROCm and others

InfiniBand and Ethernet

[Diagram: CPU, chipset and vendor device data paths]

Designed for Deep Learning Acceleration

Page 20

GPUDirect™ RDMA and GPUDirect ASYNC™

Direct Connectivity GPU - Interconnect

Page 22

NVIDIA® NCCL 2.0 Near-Linear Scalability

Optimized collective communication library

• Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather

Inter-node communication using InfiniBand verbs and GPUDirect™ RDMA

Multi-rail support, Topology detection

50% performance improvement with NVIDIA® DGX-1™ across 32 NVIDIA Tesla® V100 GPUs

NVIDIA Accelerates Scalable Deep Learning with Mellanox
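For readers unfamiliar with the collectives named on this slide, their semantics can be sketched on plain Python lists, one list per rank. This is an illustration of what each operation computes, not NCCL's API; the function names are chosen here for readability, and `reduce_scatter` assumes the buffer length divides evenly by the number of ranks.

```python
from typing import List, Optional

Buffers = List[List[float]]  # one buffer per rank


def _sum(bufs: Buffers) -> List[float]:
    """Elementwise sum across all ranks."""
    return [sum(column) for column in zip(*bufs)]


def allreduce(bufs: Buffers) -> Buffers:
    """Every rank ends up with the elementwise sum."""
    total = _sum(bufs)
    return [list(total) for _ in bufs]


def reduce(bufs: Buffers, root: int = 0) -> List[Optional[List[float]]]:
    """Only the root rank receives the elementwise sum."""
    total = _sum(bufs)
    return [total if rank == root else None for rank in range(len(bufs))]


def broadcast(bufs: Buffers, root: int = 0) -> Buffers:
    """Every rank receives a copy of the root's buffer."""
    return [list(bufs[root]) for _ in bufs]


def reduce_scatter(bufs: Buffers) -> Buffers:
    """Rank r receives the r-th equal chunk of the elementwise sum."""
    total = _sum(bufs)
    chunk = len(total) // len(bufs)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(bufs))]


def allgather(bufs: Buffers) -> Buffers:
    """Every rank receives the concatenation of all ranks' buffers."""
    gathered = [value for buf in bufs for value in buf]
    return [list(gathered) for _ in bufs]


ranks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(allreduce(ranks))
print(allgather(reduce_scatter(ranks)))
```

The last two lines print the same result, which shows the common identity that an allreduce can be composed from a reduce-scatter followed by an allgather.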

Page 23

Performance and Scalability Examples

Page 24

TensorFlow with Mellanox RDMA

Unmatched Linear Scalability, No Additional Cost

Up to 76% Efficiency and 50% Better Performance versus TCP
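The slide does not define "efficiency". A common reading is scaling efficiency: measured throughput on N devices divided by N times the single-device throughput. The sketch below assumes that definition. All input numbers are made up for illustration and are not measurements from this deck.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Fraction of ideal linear scaling achieved on n devices.

    Assumed definition: throughput_n / (n * throughput_1).
    """
    return throughput_n / (n * throughput_1)


def relative_gain(throughput_a: float, throughput_b: float) -> float:
    """How much faster configuration A is than B (0.5 means 50% better)."""
    return throughput_a / throughput_b - 1.0


# Hypothetical inputs, in images per second.
single_device = 100.0
rdma_8_devices = 608.0
tcp_8_devices = 400.0

print(round(scaling_efficiency(rdma_8_devices, single_device, 8), 2))
print(round(relative_gain(rdma_8_devices, tcp_8_devices), 2))
```

With these illustrative inputs, the RDMA run reaches 0.76 of ideal linear scaling and is 0.52 (52%) faster than the TCP run.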

Reference Deployment Guide

Page 25

Accelerating TensorFlow™ with gRPC over RDMA

Open source Machine Learning from Google

Distributed training with gRPC framework

• Google’s Optimized RPC for distributed network

RDMA Acceleration over UCX

• Unified Communication X (UCX)

• Integration with upstream TensorFlow

2X higher Performance with RDMA

[Chart: >2x faster; lower is better]

~2X Acceleration for TensorFlow with RDMA

Page 26

TensorFlow™ over RDMA in Apache® Spark™ Environment

Yahoo enhanced the TensorFlow C++ layer to enable RDMA over InfiniBand

InfiniBand provides faster connectivity and supports accelerated offload capability

Source: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep

InfiniBand Provides Near Linear Scalability for Inception Model Training

Page 27

2X Acceleration for Baidu

Machine Learning Software from Baidu

• Usage: word prediction, translation, image processing

RDMA (GPUDirect) speeds training

• Lowers latency, increases throughput

• More cores for training

• Even better results with optimized RDMA

~2X Acceleration for Paddle Training with RDMA

Page 28

ChainerMN Depends on InfiniBand

ChainerMN depends on MPI for inter-node communication

NVIDIA® NCCL library is then used for intra-node communication between GPUs

Leveraging InfiniBand results in near linear performance

Mellanox InfiniBand allows ChainerMN to achieve ~72% accuracy.

Source: http://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
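The two-level communication pattern described on this slide (one library between GPUs inside a node, another between nodes) can be sketched as a hierarchical allreduce. This is a simulation on nested lists under that reading, not ChainerMN's implementation; `hierarchical_allreduce` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum a list of equal-length vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def hierarchical_allreduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """nodes[i][j] is the gradient held by GPU j on node i.

    Returns the same nested shape, with every GPU holding the global sum.
    """
    # Step 1: reduce inside each node (the role the slide assigns to NCCL).
    node_sums = [elementwise_sum(gpus) for gpus in nodes]
    # Step 2: reduce across nodes, one partial sum per node (the MPI role).
    global_sum = elementwise_sum(node_sums)
    # Step 3: hand the result back to every GPU on every node.
    return [[list(global_sum) for _ in gpus] for gpus in nodes]


cluster = [
    [[1.0, 2.0], [3.0, 4.0]],  # node 0, two GPUs
    [[5.0, 6.0], [7.0, 8.0]],  # node 1, two GPUs
]
print(hierarchical_allreduce(cluster))
```

Only one partial sum per node crosses the inter-node fabric in step 2, which is why the speed of that fabric matters most as the node count grows.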

Page 29

Machine Learning Performance Comparison

InfiniBand Delivers 60% Better Performance with 2X Less Infrastructure

DeepBench measures the performance of basic operations involved in training deep neural networks.

[Chart: 8, 16 and 32 accelerators; lower is better; 60.3% callout]

Page 30

Scalable Deep Learning Depends on Mellanox

A Few Solution Examples

Page 31

NVIDIA® DGX-1™

World’s first purpose-built system for deep learning

• SaturnV is #28 on the Top500, 3.3Pf with 124 nodes

• SaturnV is also #1 on the Green500

Fully integrated hardware

• 8x Tesla™ P100 (Pascal) w/16GB per GPU

• 28672 CUDA® Cores

• 4x ConnectX-4 EDR 100Gb/s HCAs

Fully integrated software stack

• Major deep learning frameworks

• Drivers, NVIDIA CUDA, NVIDIA Deep Learning SDK

• GPUDirect™ RDMA
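A few per-node and per-system figures can be derived from the numbers on this slide. The sketch below is simple arithmetic on the slide's own values; the derived quantities and their names are illustrative and do not appear in the deck, and the per-node figure assumes the 3.3Pf total is spread evenly over the 124 nodes.

```python
# Values taken from the slide.
GPUS_PER_NODE = 8
CUDA_CORES_PER_NODE = 28672
HCAS_PER_NODE = 4
GBPS_PER_HCA = 100
SATURNV_NODES = 124
SATURNV_PFLOPS = 3.3

# Derived values (not stated on the slide).
cores_per_gpu = CUDA_CORES_PER_NODE // GPUS_PER_NODE  # CUDA cores per GPU
node_network_gbps = HCAS_PER_NODE * GBPS_PER_HCA      # aggregate EDR bandwidth, one direction
saturnv_gpus = SATURNV_NODES * GPUS_PER_NODE          # GPUs in the whole system
tflops_per_node = SATURNV_PFLOPS * 1000 / SATURNV_NODES

print(cores_per_gpu, node_network_gbps, saturnv_gpus, round(tflops_per_node, 1))
```

The 3584 cores per GPU that fall out of the division match the published CUDA core count of the Tesla P100.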

Page 32

NVIDIA® DGX-1™ Deep Learning Server

Deep Learning Supercomputer in a Box

8 x NVIDIA® Tesla® P100 GPUs

5.3TFlops

16nm FinFET

NVLINK

4 x ConnectX®-4 EDR 100G InfiniBand Adapters

NVIDIA® “SaturnV”

NVIDIA® Machine Learning Supercomputer

#28 on the Top500

3.3Pf with 124 DGX-1 nodes

#1 on the Green500

Page 33

End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms

X86, OpenPOWER, GPU, ARM, FPGA

Smart Interconnect to Unleash The Power of All Compute Architectures

Page 34

Proven Advantages

RDMA delivers 2X performance advantage over traditional TCP

Machine Learning and HPC platforms share the same interconnect needs

Scalable, flexible, high performance, high bandwidth, end-to-end connectivity

Standards-based and supported by the largest eco-system

Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.

Native Offloading architecture

RDMA, GPUDirect, SHARP and other core accelerations

Backward and future compatible

Scalable Machine Learning Depends on Mellanox

Page 35

Thank You