#vmworld #vmworld Interconnect Acceleration for Machine Learning, Big Data, and HPC Adit Ranadive, VMware, Inc. Aviad Shaul Yehezkel, Mellanox VAP2807BU #VAP2807BU VMworld 2018 Content: Not for publication or distribution

Post on 17-Jun-2020

TRANSCRIPT

Page 1: Interconnect Acceleration for Machine Learning, Big Data, and HPC

Adit Ranadive, VMware, Inc.

Aviad Shaul Yehezkel, Mellanox

#vmworld #VAP2807BU

Page 2: Disclaimer

©2018 VMware, Inc.

This presentation may contain product features or functionality that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new features/functionality/technology discussed or presented have not been determined.

Page 3: Preamble

Accelerate my Interconnect - What and Why?

Datacenter interconnect bandwidths are growing: 100 Gbps and beyond

Network devices with hardware-offloaded protocols
• Reduce CPU usage
• Access application data directly without involving the OS/hypervisor, which lowers latencies

In effect, these devices are interconnect accelerators

Big Data, Machine Learning, and High Performance Computing apps
• Speed up by benefiting from increased network bandwidths and lower latencies
• Distributed workloads across VMs access a shared high-performance fabric

This talk aims to answer the following questions in a vSphere environment
• What are the hardware protocols for interconnect acceleration?
• How do we enable VM access to interconnect acceleration protocols?
• How do interconnect accelerators help with application performance?

Page 4: Agenda

Introduction to RDMA
Direct Device Access Technologies
Paravirtual RDMA (PVRDMA) in vSphere
Machine Learning over RDMA
Big Data over RDMA
HPC Applications over RDMA
Summary / Key Takeaways

Page 5: Introduction to Remote Direct Memory Access (RDMA)

Hardware protocol to accelerate data

Page 6: Remote Direct Memory Access (RDMA)

A hardware transport protocol
o Optimized for moving data to/from memory and across interconnects
o Industry standard: InfiniBand Trade Association (IBTA)
o Examples: InfiniBand, RoCE (talk focus)

Extreme performance
o 600 ns application-to-application latencies
o 100 Gbps throughput
o Negligible CPU overheads

RDMA use cases
o Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
o HPC (MPI, SHMEM)
o Big data and analytics (Hadoop, Spark)
o Machine learning (TensorFlow, Horovod)

Page 7: How does RDMA achieve high performance?

Traditional network stack challenges
o Per message / packet / byte overheads
o User-kernel crossings
o Memory copies

RDMA provides in hardware:
o Isolation between applications
o Transport
  o Packetizing messages
  o Reliable delivery
o Address translation

User-level networking
o Direct hardware access for data path
o Hardware performs DMA to/from memory

[Figure: User and Kernel layers above the RDMA-capable hardware*; AppA and AppB (user) and NVMeF and iSER (kernel), each with buffers (Buf)]

* Host Channel Adapter (HCA)

Page 8: RDMA over Converged Ethernet (RoCE)

o InfiniBand is a centrally managed, lossless, high-performance network architecture
o Provides the specification for the RDMA transport
o RoCE adapts the efficient RDMA transport to run over Ethernet networks
o Standard Ethernet management

RoCEv2 stack: Ethernet link layer, IP, UDP, RDMA transport; Ethernet/IP management

InfiniBand stack: InfiniBand link layer, IB network layer, RDMA transport; InfiniBand management

RoCEv2 packet: Eth L2 | IP | UDP | BTH+ | Payload | iCRC | FCS

InfiniBand packet: LRH | GRH | BTH+ | Payload | iCRC | vCRC
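
The two packet layouts above can be compared with a little arithmetic. The sketch below is illustrative only and is not from the talk: the header sizes are the commonly cited ones and are assumptions here (untagged Ethernet, IPv4 without options, a bare BTH with no extended headers, and a GRH on every InfiniBand packet as drawn on the slide).

```python
# Assumed per-packet header sizes in bytes (not stated on the slide).
ROCEV2_HEADERS = {"Eth L2": 14, "IPv4": 20, "UDP": 8, "BTH": 12, "iCRC": 4, "FCS": 4}
INFINIBAND_HEADERS = {"LRH": 8, "GRH": 40, "BTH": 12, "iCRC": 4, "vCRC": 2}

# RoCEv2 packets are identified by this well-known UDP destination port.
ROCEV2_UDP_DST_PORT = 4791


def overhead(headers):
    """Total header and trailer bytes added to each packet."""
    return sum(headers.values())


def wire_efficiency(headers, payload_bytes):
    """Fraction of the bytes on the wire that are application payload."""
    return payload_bytes / (payload_bytes + overhead(headers))


if __name__ == "__main__":
    for name, hdrs in (("RoCEv2", ROCEV2_HEADERS), ("InfiniBand", INFINIBAND_HEADERS)):
        for payload in (256, 1024, 4096):
            print(f"{name:10s} payload={payload:4d} B "
                  f"overhead={overhead(hdrs)} B "
                  f"efficiency={wire_efficiency(hdrs, payload):.1%}")
```

Under these assumptions the per-packet overhead of the two formats is similar, so the choice between them is driven by the fabric (Ethernet or InfiniBand) rather than by framing cost.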

Page 9: Direct Device Access Technologies

Accessing PCI devices from VMs with maximum performance

Page 10: VM DirectPath I/O

Allows PCI devices to be accessed directly by the guest OS
• Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects like InfiniBand and RoCE

Downsides: no vMotion, no snapshots, etc. (PVRDMA in ESXi 6.5, discussed later, supports vMotion)

The full device is made available to a single VM, with no sharing (PVRDMA allows sharing)

No ESXi driver required, just the standard vendor device driver

[Figure: application and guest OS kernel in a virtual machine on VMware ESXi, reaching the device through DirectPath I/O]

Page 11: Device Partitioning (SR-IOV)

The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV)

A single PCI device can present as multiple logical devices (Virtual Functions or VFs) to ESXi and to VMs

Downsides: no vMotion, no snapshots (PVRDMA helps again!)

An ESXi driver and a guest driver are required for SR-IOV

Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RoCE interconnects

[Figure: SR-IOV. Application and guest OS kernel in a virtual machine; PF and VF on the device; vSwitch; VMXNET3]

Page 12: Paravirtual RDMA (PVRDMA)

Accelerating VM data transfers while retaining all virtualization benefits

Page 13: PVRDMA Architecture

o VM
  o Exposes a virtual PCIe device
  o Guest support: driver and library
    o Implements RDMA APIs
o PVRDMA backend
  o Mediates guest access to the HCA
  o Exposes an RDMA resource space for each guest
  o Invokes ESXi RDMA APIs in response to guest RDMA operations
  o Physical RDMA resources are created for each guest
o Physical HCA services all VMs on the host

[Figure: VM1 and VM2 each run App / RDMA stack / PVRDMA driver; ESXi runs the PVRDMA backend, RDMA stack, and HCA driver on top of the HCA]

Page 14: Accelerating VM Data Transfers

o Applications register buffers to be used
  o PVRDMA registers these buffers with the HCA
o PVRDMA takes the application request (Work Request) to read/write from a particular guest address and size
  o Issues the request to the HCA
  o No data is included here
o HCA does all the work of packetizing data to/from application memory
  o Bypasses the guest OS and hypervisor
  o Enables direct zero-copy data transfers in hardware

[Figure: API calls travel from the application in Guest OS 1 through the PVRDMA NIC to the ESXi RDMA stack and HCA device driver in the VMkernel; the data transfer goes between the application buffer and the HCA, and on to the RDMA peer]
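
The register-then-request flow above can be sketched as a toy model. Everything here (`ToyHCA`, `WorkRequest`, the key scheme) is hypothetical and only illustrates the idea that a work request names a registered region by handle, offset, and length, while the "hardware" moves the bytes itself. It is not the PVRDMA or verbs API.

```python
from dataclasses import dataclass


@dataclass
class WorkRequest:
    """Describes a transfer: an address range and a size, never the data."""
    lkey: int     # handle returned when the local buffer was registered
    offset: int   # start of the region inside the registered buffer
    length: int   # number of bytes to transfer


class ToyHCA:
    """Hypothetical stand-in for an HCA that owns a registration table."""

    def __init__(self):
        self._regions = {}
        self._next_key = 1

    def register(self, buf: bytearray) -> int:
        """Register application memory and return a handle to it."""
        key = self._next_key
        self._next_key += 1
        # A memoryview references the application's memory without copying it.
        self._regions[key] = memoryview(buf)
        return key

    def rdma_write(self, wr: WorkRequest, remote: "ToyHCA", rkey: int,
                   remote_offset: int) -> None:
        """Move bytes from a local registered region into a remote one."""
        local = self._regions[wr.lkey]
        target = remote._regions[rkey]
        if wr.offset + wr.length > len(local):
            raise ValueError("work request exceeds the local registered buffer")
        if remote_offset + wr.length > len(target):
            raise ValueError("work request exceeds the remote registered buffer")
        # The only copy is the transfer itself, straight between app buffers.
        target[remote_offset:remote_offset + wr.length] = \
            local[wr.offset:wr.offset + wr.length]


if __name__ == "__main__":
    sender, receiver = ToyHCA(), ToyHCA()
    send_buf, recv_buf = bytearray(b"hello rdma"), bytearray(16)
    lkey, rkey = sender.register(send_buf), receiver.register(recv_buf)
    sender.rdma_write(WorkRequest(lkey, 0, 5), receiver, rkey, 0)
    print(bytes(recv_buf[:5]))
```

In this model the application never hands its payload to an intermediate layer, which mirrors the "no data included here" point on the slide.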

Page 15: Supporting Virtualization Benefits

o Challenges
  o Considerable connection state is contained within RDMA hardware
  o The RDMA connection needs to be maintained before, during, and after vMotion
o PVRDMA backend creates a management network across the cluster
  o Exchanges RDMA metadata between RDMA peers
  o Maintains a virtual RDMA connection between peers
  o Re-establishes the virtual and physical RDMA connection after the VM has moved to a new host
o Support
  o vMotion
  o Snapshots
  o Suspend/Resume
  o High Availability

Page 16: Improving PVRDMA Scalability

Drawbacks of current approach
o Cannot connect to bare-metal hosts
o Works well only for small-scale systems
o Increased metadata exchanged
o Decreased performance for the application

Future improvements
o Shrink PVRDMA backend management functionality
  • Stop sending metadata updates between peers (evaluated in this talk)
  • Unify resource spaces between backend and HCA
o Create the exact hardware RDMA state for the VM after vMotion
  • Specific RDMA APIs to be added
  • RDMA hardware must keep the connection alive for the duration of vMotion
o Allows connection to bare-metal hosts
  • While supporting vMotion, Snapshots, High Availability!

Page 17: Enabling PVRDMA for a VM

vSphere 6.5 and later

o Configure ESXi host for PVRDMA
o Add PVRDMA device to Linux VM through VC
o Install PVRDMA driver and library
  o OFED 4.8.2
  o Upstream
    o Linux 4.10 or later
    o rdma-core v13 or later
  o Inbox
    o RHEL 7.5
    o SLES 15
    o Ubuntu 18.04
o PVRDMA support is now part of most Linux distributions

Page 18: Machine Learning

Accelerate training time with a scale-out architecture over RDMA networks

Page 19: Machine Learning Is Everywhere!

[Figure: example applications, including Fraud Detection]

Page 20: What Is Machine Learning

Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, “uses statistical techniques to give computers the ability to learn with data without being explicitly programmed.”

Source: https://en.wikipedia.org/wiki/Machine_learning

Page 21: Deep Learning

Driven by Deep Neural Networks (DNN)
• Subset of Artificial Neural Networks (ANN)

Deep Learning is a subfield of machine learning concerned with algorithms, inspired by the structure and function of the brain, called artificial neural networks

Source: http://machinelearningmastery.com/what-is-deep-learning/

Page 22: Training and Inference

[Figure: training and inference]

Source: Mellanox Technologies

Page 23: Why Deep Learning And Why Now?

Deep Learning allows difficult problems to be solved
o In some cases, problems that can’t be solved in other ways

Deep Learning is not new
o 1943: “A logical calculus of the ideas immanent in nervous activity”, McCulloch & Pitts

So why now?

Infrastructure
o Recent developments in GPU and network technology make the approach practical

Data
o More data is generated than ever, and it is critical for the training process

Software
o Wave of open source machine learning frameworks

[Figure: framework logos, including Cognitive Toolkit]

Page 24: Data is Growing Faster than Ever

An autonomous vehicle generates 4,000 GB per day

Sensor data rates:
• Sonar: ~10-100 KB per second
• Radar: ~10-100 KB per second
• GPS: ~50 KB per second
• Camera: ~20-40 MB per second
• Light Detection & Ranging: ~10-70 MB per second

Data will grow by a factor of 10 over the next decade, to 160 zettabytes in 2025 (source: IDC)

Faster data processing requires faster interconnect speeds

Source: Cruise RP-1/YouTube
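
To put the per-vehicle figure in network terms, here is a small back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and an ideal link running at full line rate with no protocol overhead; both are simplifications and not claims from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(gbytes_per_day):
    """Average data rate in MB/s for a given daily volume in GB."""
    return gbytes_per_day * 1000 / SECONDS_PER_DAY


def transfer_seconds(gbytes, link_gbps):
    """Ideal time to move `gbytes` gigabytes over a link of `link_gbps` Gbit/s."""
    return gbytes * 8 / link_gbps


if __name__ == "__main__":
    daily_gb = 4000  # per-vehicle figure from the slide
    print(f"average rate: {average_rate_mb_per_s(daily_gb):.1f} MB/s")
    for gbps in (10, 25, 100):
        print(f"{gbps:3d} Gbps link: {transfer_seconds(daily_gb, gbps):.0f} s "
              f"to move one day of data")
```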

Page 25: Neural Networks Complexity Growth

[Figure: growth in network complexity over time. Image recognition, 2013-2016: AlexNet, GoogleNet, ResNet, Inception-V2, Inception-V4, PolyNet (350X). Speech recognition, 2014-2017: DeepSpeech, DeepSpeech-2, DeepSpeech-3 (30X).]

Source: Mellanox Technologies

Page 26: Training Challenges

Training with large data sets and networks can take a long time
o In some cases, even weeks

In many cases training needs to happen frequently
o Model development and tuning
o Real-life use cases may require regular retraining

Accelerate training by scaling out the architecture
o Add workers (nodes) to reduce training time

Popular types of parallelism
o Data parallelism
o Model parallelism

The network is a critical element in accelerating distributed training!

Page 27: Model and Data Parallelism

[Figure: Data Parallelism. The data is split into mini batches, each feeding a local model; the local models synchronize through a main model / parameter server / allreduce]

[Figure: Model Parallelism]

Source: Mellanox Technologies
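
Data parallelism as drawn above can be sketched in a few lines. This is a minimal, single-process illustration under stated assumptions: a one-parameter model `y = w * x`, each "worker" using its whole shard as one mini batch per step, and a plain Python function standing in for the parameter server or allreduce step that would run over the network in practice.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def allreduce_mean(values):
    """Stand-in for the network step: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(data, num_workers, steps=200, lr=0.01):
    """Train one model replica per worker with synchronized gradients."""
    # Shard the data set: worker i sees every num_workers-th sample.
    shards = [data[i::num_workers] for i in range(num_workers)]
    weights = [0.0] * num_workers  # one local model per worker
    for _ in range(steps):
        local_grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
        synced = allreduce_mean(local_grads)  # the communication-heavy step
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights


if __name__ == "__main__":
    samples = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3
    print(train(samples, num_workers=4))
```

Because every replica applies the same averaged gradient, the local models stay identical. The synchronization happens on every step, which is why interconnect latency and bandwidth matter for scale-out training.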

Page 28: RDMA and GPUDirect Accelerate Distributed Training

GPUDirect RDMA Technology

With GPUDirect RDMA in vSphere we get ~100 Gbps of RDMA bandwidth

Page 29: All Major Machine Learning Frameworks Support RDMA

TensorFlow: several implementations upstream
o Native (verbs): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
o MPI, Horovod: donated by Uber, among others

Caffe2: over MPI or the Gloo library

Microsoft Cognitive Toolkit: native support

NVIDIA NCCL2: native support in NCCL

Page 30: Horovod

Distributed training framework for TensorFlow

Inspired by work of Baidu, Facebook, et al.

Uses bandwidth-optimal communication protocols
o Makes use of RDMA (RoCE, InfiniBand) if available

Seamlessly installs on top of TensorFlow via ‘pip install horovod’

Source: Horovod: fast and easy distributed deep learning in TensorFlow: https://arxiv.org/pdf/1802.05799.pdf
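
One well-known bandwidth-optimal collective is ring allreduce, which the cited Horovod paper discusses. The sketch below is a single-process simulation written for illustration, not Horovod's implementation: workers are list indices, "sending" is a list copy, and the vector length is assumed to be divisible by the number of workers.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors so that every worker ends with the total.

    Each worker sends 2 * (n - 1) chunks of size len / n, independent of
    how the ring is scaled, which is what makes the scheme bandwidth-optimal.
    """
    n = len(vectors)
    length = len(vectors[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    chunk = length // n
    data = [list(v) for v in vectors]  # each worker's private copy

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: circulate the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6],
             [10, 20, 30, 40, 50, 60],
             [100, 200, 300, 400, 500, 600]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(worker, result)
```

Every step moves one chunk per worker to its neighbour, so the exchange is a steady stream of medium-sized messages, a pattern that benefits from low-latency, high-bandwidth transports such as RDMA.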

Page 31: Horovod Setup - SR-IOV

8 hosts, 1 VM per host, 1 GPU (full passthrough) per VM

Mellanox SN2700 100GigE Spectrum Switch

Host:
o 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (12 cores per socket)
o 256 GB RAM
o Mellanox ConnectX-5 100GbE (SR-IOV RoCE capable)
o 8 x Nvidia Tesla P100 GPU (PCI, 16 GB)
o ESXi 6.7 GA
o ConnectX Driver 4.17.13.8

VM:
o 12 vCPU
o 48 GB memory
o Ubuntu 16.04 x64
o Passthrough: SR-IOV RoCE VF + Nvidia Tesla P100 GPU
o MLNX OFED 4.4-2.0.7.0
o TensorFlow 1.9
o CUDA SDK 9.0
o cuDNN 7.0
o Horovod: master from GitHub
o Docker v18.09

Page 32: Horovod Performance

VGG 16 Results

[Figure: speedup (x), 0 to 9, versus number of GPUs, 2 to 8; series: TCP, SR-IOV TCP, SR-IOV RoCE, Ideal]

Page 33: SPARK Big Data Analytics

Accelerating time to solution with shared, high-performance interconnect

Page 34: Apache Spark: Quick Facts

Spark is a data analysis platform with implicit data parallelism and fault tolerance

Initial release: May 2014

Originally developed at UC Berkeley’s AMPLab

Donated as open source to the Apache Software Foundation

Most active Apache open source project

50% of production systems are in public clouds

Notable users: [logos]

Page 35: Map-Reduce

A programming model for processing big data sets in a distributed manner

Comprises 3 stages
o Map
o Shuffle
o Reduce
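
The three stages can be illustrated with a word count, the customary introductory example. This is a minimal single-process sketch with hypothetical function names; in a real cluster the shuffle stage is the one that moves data between nodes over the network.

```python
from collections import defaultdict


def map_stage(document):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_stage(mapped_partitions):
    """Shuffle: group all values by key across every mapper's output."""
    groups = defaultdict(list)
    for partition in mapped_partitions:
        for key, value in partition:
            groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_stage(shuffle_stage(map_stage(doc) for doc in documents))


if __name__ == "__main__":
    print(word_count(["big data", "Big fabric", "data data"]))
```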

Page 36: Spark and Map-Reduce

Applications frequently reuse data in a Map-Reduce pipeline
o Iterative algorithms (e.g., machine learning, graphs)
o Interactive data mining and streaming

Persisting each iteration in stable storage is inefficient

The Spark solution: Resilient Distributed Datasets (RDDs)
o In-memory data representation
o Preserves and enhances the appealing properties of Map-Reduce:
  o Fault tolerance
  o Data locality
  o Scalability
o Reuses the in-memory data set in each iteration
o Mostly network I/O only: a perfect match for RDMA!

[Figure: two pipelines of three steps each]

Page 37: Spark Test Setup - RoCE SR-IOV/PVRDMA

8 ESXi hosts, 1 Spark VM per host

1 server used as Name Node

Mellanox SN2700 100GigE Spectrum Switch

90 GB TeraSort

Host:
o 2x Intel Xeon E5-2697v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters: Mellanox ConnectX-5 100GbE (PFC enabled)
o ESXi Experimental Build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE Passthrough VF / PVRDMA
o Mellanox OFED Linux 4.4-1.0.0.0
o Spark benchmark: HiBench/TeraSort over Mellanox SparkRDMA
o Hadoop 2.4.0

Name Node Host:
o 2x Intel(R) Xeon(R) CPU E5-2680 0
o Mellanox ConnectX-5 100GbE

Name Node Server VM:
o 12 vCPU, 64 GB RAM, CentOS 7.4
o SR-IOV RoCE Passthrough VF / PVRDMA

Page 38: Results - vSphere with RoCE SR-IOV

Runtime samples   SR-IOV TCP    SR-IOV RDMA   Improvement
Average           127 seconds   91 seconds    28%
Min               126 seconds   88 seconds    30%
Max               130 seconds   96 seconds    26%

Lower is better

[Figure: bar chart of runtime in seconds (0 to 140) for Average, Min, and Max; series: SR-IOV TCP, SR-IOV RoCE; 28% improvement annotated]

Page 39: Results - vSphere with PVRDMA

Runtime samples   TCP (VMXNet3)   PVRDMA        Improvement
Average           129 seconds     99 seconds    23%
Min               127 seconds     96 seconds    24%
Max               132 seconds     101 seconds   23%

Lower is better

[Figure: bar chart of runtime in seconds (0 to 140) for Average, Min, and Max; series: TCP (VMXNet3), PVRDMA; 23% improvement annotated]
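
The improvement columns on the two TeraSort results slides follow directly from the runtimes. The sketch below recomputes them, assuming "improvement" means the relative reduction in runtime versus TCP; it also derives how close PVRDMA comes to SR-IOV RoCE from the two average runtimes. The function names are illustrative.

```python
# Runtimes in seconds as (average, min, max), copied from the results slides.
RUNTIMES = {
    "SR-IOV RoCE": {"tcp": (127, 126, 130), "rdma": (91, 88, 96)},
    "PVRDMA": {"tcp": (129, 127, 132), "rdma": (99, 96, 101)},
}
LABELS = ("Average", "Min", "Max")


def improvement(tcp_seconds, rdma_seconds):
    """Relative runtime reduction of RDMA versus TCP."""
    return (tcp_seconds - rdma_seconds) / tcp_seconds


def relative_performance(runtime, baseline_runtime):
    """Performance relative to a baseline, where lower runtime is better."""
    return baseline_runtime / runtime


if __name__ == "__main__":
    for setup, runs in RUNTIMES.items():
        for label, tcp, rdma in zip(LABELS, runs["tcp"], runs["rdma"]):
            print(f"{setup:12s} {label:8s} {improvement(tcp, rdma):.0%}")
    pvrdma_avg = RUNTIMES["PVRDMA"]["rdma"][0]
    sriov_avg = RUNTIMES["SR-IOV RoCE"]["rdma"][0]
    print(f"PVRDMA vs SR-IOV RoCE: {relative_performance(pvrdma_avg, sriov_avg):.0%}")
```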

Page 40: High Performance Computing

Accelerating virtualized compute-intensive applications over a shared, high-performance fabric

Page 41: HPC Workloads

o Scientific or technical workloads
  o Often floating-point intensive
  o Often storage intensive
  o Often parallel
  o Run on server-class systems

o Mechanical Design / Drafting
o Chemical Engineering
o Economics/Financial
o Weather
o Electronic Design Automation (EDA)
o Geosciences
o Defense
o Computer-Aided Engineering (CAE)
o Bioscience: Molecular Dynamics
o Government Lab
o University/Academic

MPI (Message Passing Interface)

Page 42: GROMACS

GROningen MAchine for Chemical Simulations

o Molecular dynamics package designed for simulations of proteins, lipids, and nucleic acids
  • Simulates the Newtonian equations of motion for systems with hundreds to millions of particles
o Useful for simulating the interaction of chemicals/polymers before actually mixing them!
o Runs on CPUs and GPUs
o Parallel execution via Message Passing Interface (MPI)
o Loads the molecular configuration from an initial file
  • Simulates the trajectory or movement of the atoms over time

Source: https://en.m.wikipedia.org/wiki/GROMACS, https://wiki.archlinux.org/index.php/GROMACS

Source: www.researchgate.net

Page 43: HPC Test Setup - RoCE SR-IOV/PVRDMA

8 ESXi hosts, 1 VM per host

Mellanox SN2700 100GigE Spectrum Switch

Host:
o 2x Intel Xeon E5-2697v3 @ 2.60 GHz (14 cores per socket)
o 256 GB RAM
o RDMA adapters: Mellanox ConnectX-5 100GbE
o ESXi Experimental Build (includes PVRDMA optimization #1)

VM:
o 20 vCPU, 200 GB RAM, CentOS 7.4 x64
o SR-IOV RoCE Passthrough VF / PVRDMA
o Intel MPI
o Mellanox OFED Linux 4.4-1.0.0.0

GROMACS input
o Ion_channel: pentameric chloride channel embedded in a lipid bilayer
o Simulation of 150,000 atoms interacting
o ns/day: nanoseconds of simulation time computed in 1 day
o Importance in pharmaceutical applications

Page 44: Results - vSphere with RoCE SR-IOV

#nodes   #processes   SR-IOV TCP (ns/day)   SR-IOV RoCE (ns/day)
1        20           10.13                 10.13
2        40           17.25                 18.44
4        80           27.80                 32.81
8        160          44.49                 71.60

Higher is better

[Figure: ns/day (0 to 80) versus number of nodes (2, 4, 8); series: SR-IOV TCP, SR-IOV RoCE]

Page 45: Results - vSphere with PVRDMA

#nodes   #processes   TCP VMXNet3 (ns/day)   PVRDMA (ns/day)
1        20           9.85                   9.85
2        40           9.28                   12.34
4        80           11.51                  24.22
8        160          14.29                  49.72

Higher is better

[Figure: ns/day (0 to 60) versus number of nodes (2, 4, 8); series: TCP (VMXNet3), PVRDMA; annotations: 2X, 3X]
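
The PVRDMA table can be turned into speedup and scaling figures. The sketch below uses the ns/day values from the slide; the metric definitions (speedup as the ratio of PVRDMA to TCP throughput, parallel efficiency as throughput divided by node count times the single-node throughput) are the usual ones and are assumptions about how the chart annotations were derived.

```python
# (nodes, TCP VMXNet3 ns/day, PVRDMA ns/day), copied from the results slide.
PVRDMA_RESULTS = [
    (1, 9.85, 9.85),
    (2, 9.28, 12.34),
    (4, 11.51, 24.22),
    (8, 14.29, 49.72),
]


def speedup_over_tcp(tcp_ns_day, rdma_ns_day):
    """How many times more simulation time per day RDMA delivers."""
    return rdma_ns_day / tcp_ns_day


def parallel_efficiency(nodes, ns_day, single_node_ns_day):
    """Achieved throughput as a fraction of perfect linear scaling."""
    return ns_day / (nodes * single_node_ns_day)


if __name__ == "__main__":
    base = PVRDMA_RESULTS[0][2]
    for nodes, tcp, rdma in PVRDMA_RESULTS:
        print(f"{nodes} nodes: {speedup_over_tcp(tcp, rdma):.2f}x over TCP, "
              f"PVRDMA efficiency {parallel_efficiency(nodes, rdma, base):.0%}, "
              f"TCP efficiency {parallel_efficiency(nodes, tcp, base):.0%}")
```

The TCP column barely grows with node count, so most of the gap at 8 nodes comes from TCP failing to scale rather than from a single-node difference.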

Page 46: Summary

New categories of applications demand more network I/O
• Big Data, High Performance Computing, Machine Learning
• 100 GigE RDMA networks and beyond

VMware vSphere provides flexibility and high performance in sharing such fabrics
• Paravirtual RDMA
• Full Passthrough
• SR-IOV

Page 47: Key Takeaways

Virtualized data-intensive applications are accelerated through these technologies
• RDMA outperforms standard TCP for all these workloads
• Machine Learning workloads scale linearly with SR-IOV RoCE
• PVRDMA gives more than 90% of SR-IOV RoCE performance for Big Data Analytics
• HPC workloads on PVRDMA vs TCP improve by almost 3x

[Figure: performance normalized to SR-IOV RoCE (0 to 100) for Spark and GROMACS; series: PVRDMA, SR-IOV TCP, TCP (VMXNet3)]

Page 48: Extreme RDMA Series - Las Vegas

CTO2390BU Virtualize and Accelerate HPC/Big Data with SR-IOV, vGPU and RDMA

VIN2085BU Accelerating Performance of Mission-Critical Workloads with PVRDMA

HCI2476BU Tech Preview: RDMA and Next-Gen Storage Technologies for vSAN

VIN2062BU vSphere Networking: What’s New and What’s Next

CTO3693BUS Optimize your Virtualized Environment with Hardware Accelerators

Page 49: PLEASE FILL OUT YOUR SURVEY.

Take a survey and enter a drawing for a VMware company store gift card.

#vmworld #VAP2807BU

Page 50: THANK YOU!

#vmworld #VAP2807BU

Page 51: RDMA Operation

Asynchronous I/O
• Application memory management
• Zero-copy I/O for all operations

Main transport objects
• Queue Pair (QP)
  - Comprises Send and Receive queues
  - Services Work Queue Entries (WQEs)
• Completion Queue (CQ)

Semantics
• Channel (message passing)
• RDMA (Write / Read / Atomics)

[Figure, channel semantics: a Send WQE on one side's SendQ references the send buffer in its address space; a Recv WQE on the other side's RecvQ references the receive buffer; each side gets a CQE on its CQ]

[Figure, RDMA semantics: an RDMA-W WQE on the initiator's SendQ references the initiator buffer and the target buffer in the remote address space; a CQE reports completion on the CQ]
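
The channel (message passing) semantics can be sketched as a toy model of the transport objects named above. The classes and the `process` function are hypothetical and written for illustration only: `process` plays the role of the hardware, and real applications would use a verbs library instead.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """Toy QP: a send queue and a receive queue, plus a completion queue."""
    send_q: deque = field(default_factory=deque)
    recv_q: deque = field(default_factory=deque)
    cq: deque = field(default_factory=deque)

    def post_recv(self, wr_id, buffer: bytearray):
        """Queue a receive WQE that points at an application buffer."""
        self.recv_q.append((wr_id, buffer))

    def post_send(self, wr_id, payload: bytes):
        """Queue a send WQE; the call returns before any data moves."""
        self.send_q.append((wr_id, payload))


def process(sender: QueuePair, receiver: QueuePair):
    """Match one send WQE with one pre-posted receive WQE, then complete both."""
    if not receiver.recv_q:
        raise RuntimeError("channel semantics need a receive posted in advance")
    send_id, payload = sender.send_q.popleft()
    recv_id, target = receiver.recv_q.popleft()
    if len(payload) > len(target):
        raise ValueError("receive buffer is too small for the message")
    target[:len(payload)] = payload  # data lands directly in the app buffer
    sender.cq.append((send_id, "send", len(payload)))
    receiver.cq.append((recv_id, "recv", len(payload)))


if __name__ == "__main__":
    a, b = QueuePair(), QueuePair()
    inbox = bytearray(8)
    b.post_recv(wr_id=1, buffer=inbox)
    a.post_send(wr_id=7, payload=b"ping")
    process(a, b)
    # The application learns about completion by polling its CQ.
    print(a.cq.popleft(), b.cq.popleft(), bytes(inbox[:4]))
```

Posting work and polling for completions are separate steps, which is what makes the I/O asynchronous: the application can keep many WQEs in flight and collect CQEs later.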