SYNERGIE VON HPC & DEEP LEARNING MIT NVIDIA GPUS (Synergy of HPC & Deep Learning with NVIDIA GPUs)
Axel Koehler, Principal Solution Architect
ZKI AK Supercomputing, 22 March 2018, Muenster


Page 1:

SYNERGIE VON HPC & DEEP LEARNING MIT NVIDIA GPUS

Axel Koehler, Principal Solution Architect

ZKI AK Supercomputing, 22 March 2018, Muenster

Page 2:

FACTORS DRIVING CHANGES IN HPC

End of Dennard Scaling places a cap on single-threaded performance
• Increasing application performance will require fine-grain parallel code with significant computational intensity

AI and Data Science emerging as important new components of scientific discovery
• Dramatic improvements in accuracy, completeness, and response time yield increased insight from huge volumes of data

Cloud-based usage models, in-situ execution, and visualization emerging as new workflows critical to the science process and productivity
• Tight coupling of interactive simulation, visualization, and data analysis/AI
• Service-Oriented Architectures (SOA)

Page 3:

TESLA PLATFORM: One Data Center Platform for Accelerating HPC and AI

[Platform stack diagram, flattened:]
• TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX-1 / DGX Station, NVIDIA HGX-1, system OEMs, cloud
• NVIDIA SDK: ComputeWorks (CUDA C/C++, Fortran), Deep Learning SDK (cuDNN, cuBLAS, cuSPARSE, NCCL, TensorRT, DeepStream SDK)
• INDUSTRY FRAMEWORKS & TOOLS: deep learning frameworks, ecosystem tools
• APPLICATIONS: HPC (+450 applications), enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense, ...), internet services

Page 4:

CONTINUED DEMAND FOR COMPUTE POWER

Ever-increasing compute power demand in HPC:
• Comprehensive Earth System Model
• Coupled simulation of entire cells
• Simulation of combustion for new high-efficiency, low-emission engines
• Predictive calculations for supernovae

Neural network complexity is exploding:
• 2015: Microsoft ResNet, superhuman image recognition (7 ExaFLOPS, 60 million parameters)
• 2016: Baidu Deep Speech 2, superhuman voice recognition (20 ExaFLOPS, 300 million parameters)
• 2017: Google Neural Machine Translation, near-human language translation (100 ExaFLOPS, 8700 million parameters)

Page 5:

GPUS FOR HPC AND DEEP LEARNING

Huge requirement on compute power (FLOPS):
• NVIDIA Tesla V100: 5120 energy-efficient cores + Tensor Cores; 7.8 TFLOP/s double precision (FP64), 15.6 TFLOP/s single precision (FP32), 125 Tensor TFLOP/s mixed precision

Huge requirement on communication and memory bandwidth:
• NVLink: 6 links per GPU at 50 GB/s bidirectional each, for maximum scalability between GPUs
• CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute and memory in a single package
• NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
• GPUDirect / GPUDirect RDMA: direct communication between GPUs, eliminating the CPU from the critical path

Page 6:

TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices

• New CUDA TensorOp instructions & data formats
• 4x4 matrix processing array
• D[FP32] = A[FP16] * B[FP16] + C[FP32]

Using Tensor Cores via:
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ warp-level matrix operations
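The fused operation above, D[FP32] = A[FP16] * B[FP16] + C[FP32], can be emulated on the CPU to make the precision semantics concrete: inputs are rounded to FP16, but products are accumulated in FP32. A minimal NumPy sketch (the function name is illustrative, not a CUDA API):

```python
import numpy as np

def tensor_core_mma(a, b, c):
    """Emulate one Tensor Core multiply-accumulate:
    D[FP32] = A[FP16] * B[FP16] + C[FP32]."""
    a16 = a.astype(np.float16)            # inputs rounded to half precision
    b16 = b.astype(np.float16)
    # Promote to FP32 before the multiply so the accumulation happens
    # in single precision, as on the hardware.
    return a16.astype(np.float32) @ b16.astype(np.float32) + c.astype(np.float32)

a = np.random.rand(4, 4)                  # one 4x4 tile, as on Volta
b = np.random.rand(4, 4)
c = np.zeros((4, 4))
d = tensor_core_mma(a, b, c)              # FP32 result; FP16 rounding only on inputs
```

The only precision lost is the initial FP16 rounding of A and B; the accumulation itself carries full FP32 accuracy, which is why mixed-precision training can approach FP32-quality results.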

Page 7:

cuBLAS GEMMS FOR DEEP LEARNING: V100 Tensor Cores + CUDA 9 deliver over 9x faster matrix-matrix multiply

[Charts: relative GEMM performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 (CUDA 9)]
• Mixed precision (FP16 input, FP32 compute): V100 Tensor Cores up to 9.3x faster than P100
• Single precision (FP32): V100 up to 1.8x faster than P100

Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.

Page 8:

COMMUNICATION BETWEEN GPUS

Large-scale models:
• Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
• The size of models will further increase in the future

Data-parallel training:
• Each worker trains the same layers on a different data batch
• NVLink allows the separation of data loading and gradient averaging

Model-parallel training:
• All workers train on the same batch; workers communicate as frequently as the network allows
• NVLink allows the separation of data loading and exchanges for activations

http://mxnet.io/how_to/multi_devices.html
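Data-parallel training works because, for a loss averaged over samples, averaging equal-sized per-worker gradients reproduces the full-batch gradient exactly; that average is what the gradient-averaging step exchanges over NVLink. A small NumPy sketch with a linear model and MSE loss (all names illustrative):

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of the mean squared error 0.5 * mean((x @ w - y)**2) w.r.t. w."""
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))     # global batch of 8 samples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

# Data-parallel step: 4 workers each take 2 samples, compute a local
# gradient, then the gradients are averaged (the all-reduce step).
local = [grad_mse(w, x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]
averaged = np.mean(local, axis=0)

full = grad_mse(w, x, y)        # what one worker with the whole batch computes
```

Because the shards are equal-sized, `averaged` matches `full` to floating-point accuracy, so every worker can apply the same update and stay in sync.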

Page 9:

NVLINK AND MULTI-GPU SCALING (for data-parallel training)

[Diagrams: two 8-GPU server topologies, GPUs 0-7 attached to two CPUs via PCIe switches]

PCIe-based system:
• Data loading over PCIe
• Gradient averaging over PCIe and the QPI link
• Data loading and gradient averaging share communication resources: congestion

NVLink-based system:
• Data loading over PCIe
• Gradient averaging over NVLink
• No sharing of communication resources: no congestion

Page 10:

NVLINK AND CNTK MULTI-GPU SCALING

[Chart: CNTK multi-GPU scaling results]

Page 11:

NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL) 2
Multi-GPU and multi-node collective communication primitives

developer.nvidia.com/nccl

• High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
• Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
• Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
• Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more

Multi-node: InfiniBand verbs, IP sockets
Multi-GPU: NVLink, PCIe
Automatic topology detection
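NCCL's collectives are commonly implemented with bandwidth-optimal ring algorithms. The sketch below simulates a ring all-reduce (reduce-scatter, then all-gather) in pure Python to show the structure: 2*(n-1) steps, each moving only 1/n of the data per worker. This is a simplified model of the algorithm, not the NCCL API:

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulate a ring all-reduce (sum) over per-"GPU" arrays.
    Phase 1 (reduce-scatter): each worker ends up owning one fully
    reduced chunk. Phase 2 (all-gather): the reduced chunks circulate
    until every worker has all of them."""
    n = len(arrays)
    data = [np.array_split(np.asarray(a, dtype=float), n) for a in arrays]

    # Reduce-scatter: in step s, worker i sends chunk (i - s) mod n to
    # its ring neighbour i + 1, which accumulates it.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] = data[(i + 1) % n][c] + data[i][c]

    # All-gather: in step s, worker i forwards chunk (i + 1 - s) mod n
    # (already fully reduced) to worker i + 1, which overwrites its copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]

    return [np.concatenate(chunks) for chunks in data]

# 4 hypothetical GPUs, each holding a gradient vector of constant value i
xs = [np.full(6, float(i)) for i in range(4)]
out = ring_allreduce(xs)   # every worker now holds the elementwise sum
```

Each worker only ever talks to its ring neighbour, so per-step traffic is independent of the number of GPUs, which is what makes the algorithm bandwidth-optimal at scale.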

Page 12:

VOLTA MULTI-PROCESS SERVICE

[Diagram: CPU processes A, B, and C submit work through the CUDA Multi-Process Service to a Volta GV100 GPU, with hardware-accelerated work submission and hardware isolation]

Volta MPS enhancements:
• MPS clients submit work directly to the work queues within the GPU
• Reduced launch latency, improved launch throughput
• Improved isolation among MPS clients
• Address isolation with independent address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal

Page 13:

VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system

[Chart: ResNet-50 images/sec at 7 ms latency]
• Single Volta client, no batching, no MPS: baseline
• Multiple Volta clients, no batching, using MPS: 7x faster than the single client, reaching 60% of the performance of a batching system
• Volta with batching system

V100 measured on pre-production hardware.

Page 14:

DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success

• HPC and deep learning require a huge amount of compute power (FLOPS)
• Mainly double-precision arithmetic for HPC
• Single, half, or 8-bit precision for deep learning training/inference
• HPC and deep learning use inherently parallel algorithms
• HPC needs less memory per FLOPS than deep learning
• HPC is more demanding on network bandwidth than deep learning
• Data scientists like GPU-dense systems (as many GPUs as possible per node)
• HPC has had more demand for scalability than deep learning up to now
• Distributed training frameworks like Horovod (Uber) are now available
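The precision ladder in the bullets above (FP64 for HPC, FP32/FP16 for training, 8-bit for inference) trades accuracy for throughput, memory, and bandwidth; a quick NumPy check makes the rounding error at each level concrete:

```python
import numpy as np

x = 1.0 / 3.0                              # a value with no exact binary form
err32 = abs(float(np.float32(x)) - x)      # single-precision rounding error
err16 = abs(float(np.float16(x)) - x)      # half-precision rounding error

# A naive 8-bit quantization of x onto a [-1, 1] grid, as used (in far more
# refined form) for inference.
q8 = np.int8(round(x * 127))
err8 = abs(q8 / 127.0 - x)

# Machine epsilon shrinks as precision grows.
eps = {np.float16: np.finfo(np.float16).eps,
       np.float32: np.finfo(np.float32).eps,
       np.float64: np.finfo(np.float64).eps}
```

Each halving of width also halves the memory traffic per value, which is why inference at reduced precision can run markedly faster at acceptable accuracy.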

Page 15:

SOFTWARE CHALLENGES

• Current DIY deep learning environments are complex and time-consuming to build, test, and maintain
• The same issues affect HPC and other accelerated applications
• Multiple jobs from different users need to co-exist on the same servers

[Software stack diagram: Open Source Frameworks on top of NVIDIA Libraries, NVIDIA Docker, the NVIDIA Driver, and the NVIDIA GPU]

Page 16:

NVIDIA GPU CLOUD REGISTRY

NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge

• Deep Learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange; multi-threaded I/O to feed the GPUs (Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch)
• HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION
• HPC Visualization: ParaView with OptiX, IndeX, and Holodeck, with OpenGL visualization based on NVIDIA Docker 2.0
• Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com
• Common software stack across NVIDIA GPUs

Page 17:

NVIDIA SATURNV
AI supercomputer with 660 x DGX-1V
40 PF peak FP64 performance, 660 PF DL tensor performance

• Primarily research focused
• Used internally for deep learning applied research
• Many users testing algorithms, networks, and new approaches (embedded, robotics, automotive, hyperscale, HPC)
• Partner with university research and industry collaborations
• Study the convergence of data science and HPC

Page 18:

DEEP LEARNING DATA CENTER

Reference Architecture

http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html

Page 19:

COMBINING THE STRENGTHS OF HPC AND AI

HPC:
• 40+ years of algorithms based on first-principles theory
• Proven statistical models for accurate results in multiple science domains
• Develop training data sets using first-principles models
• Incorporate AI models in semi-empirical-style applications to improve throughput
• Validate new findings from AI

AI:
• Implement inference models with real-time interactivity
• Train inference models to improve accuracy and comprehend more of the physical parameter space
• Analyze data sets that are simply intractable with classic statistical models
• Control and manage complex scientific experiments
• New methods to improve predictive accuracy, insight into new phenomena, and response time

Page 20:

MULTI-MESSENGER ASTROPHYSICS

"Despite the latest developments in computational power, there is still a large gap in linking relativistic theoretical models to observations." (Max Planck Institute)

Background: The aLIGO (Advanced Laser Interferometer Gravitational-Wave Observatory) experiment successfully discovered signals confirming Einstein's theory of General Relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena.

Challenge: The initial aLIGO discoveries were completed using classic data analytics. The processing pipeline used hundreds of CPUs, with the bulk of the detection processing done offline. That latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear".

Solution: A DNN was developed and trained on a data set derived from the Cactus simulation using the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x better than the original CPU-based waveform detection.

Impact: Faster and more accurate detection of gravitational waves, with the potential to steer other observational data sources.

Page 21:

AI QUANTUM BREAKTHROUGH

Background: Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry (QC) simulations are important to accurately screen millions of potential drugs down to the few most promising candidates.

Challenge: QC simulation is computationally expensive, so researchers use approximations, compromising on accuracy. Screening 10M drug candidates takes 5 years of compute on CPUs.

Solution: Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces at super speed (microseconds versus several minutes), with extremely high (DFT) accuracy, at 1-10 millionths of the cost of current computational methods.

Impact: Faster, more accurate screening at far lower cost.

Page 22:

SUMMARY

• The same GPU technology enabling powerful science is also enabling the revolution in deep learning
• Deep learning is enabling many usages in science (e.g. image recognition, classification, ...)
• Applications can train neural networks on already-simulated data; the trained network can then predict outputs directly
• The GPU is the right technology for both HPC and DL

Page 23:

Axel Koehler ([email protected])

SYNERGIE VON HPC & DEEP LEARNING MIT NVIDIA GPUS