THE CONVERGENCE OF HPC AND DEEP LEARNING
Axel Koehler, Principal Solution Architect
HPC Advisory Council 2018, April 10th 2018, Lugano


  • THE CONVERGENCE OF HPC AND DEEP LEARNING

    Axel Koehler, Principal Solution Architect

    HPC Advisory Council 2018, April 10th 2018, Lugano

  • 3

    FACTORS DRIVING CHANGES IN HPC

    End of Dennard Scaling places a cap on single-threaded performance

    Increasing application performance will require fine-grain parallel code with significant computational intensity

    AI and Data Science emerging as important new components of scientific discovery

    Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data

    Cloud-based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity

    Tight coupling of interactive simulation, visualization, data analysis/AI

    Service Oriented Architectures (SOA)

  • 4

    Multiple Experiments Coming or Upgrading In the Next 10 Years

    15 TB/Day

    10X Increase in Data Volume

    Exabyte/Day

    30X Increase in Power

    Personal Genomics

    Cryo-EM

  • 5

    TESLA PLATFORM: One Data Center Platform for Accelerating HPC and AI

    TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, System OEM, Cloud

    NVIDIA SDK: Deep Learning SDK (DeepStream SDK, NCCL, cuBLAS, cuSPARSE, cuDNN, TensorRT), ComputeWorks (CUDA, C/C++, FORTRAN)

    INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools

    APPLICATIONS: Internet Services, Enterprise Applications (Manufacturing, Automotive, Healthcare, Finance, Retail, Defense, ...), HPC (+450 applications)

  • 6

    GPUS FOR HPC AND DEEP LEARNING

    NVIDIA Tesla V100

    Huge requirement on compute power (FLOPS):
    5120 energy-efficient cores + Tensor Cores; 7.8 TF double precision (FP64), 15.6 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision

    Huge requirement on communication and memory bandwidth:

    NVLink: 6 links per GPU at 50 GB/s bi-directional for maximum scalability between GPUs

    CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute & memory in a single package

    NCCL: High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs

    GPUDirect / GPUDirect RDMA: Direct communication between GPUs, eliminating the CPU from the critical path
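
    The slides contain no code, but a minimal sketch of this peer-to-peer path from the CUDA runtime side might look as follows; the 64 MB buffer size and the fixed device indices 0 and 1 are illustrative assumptions, not taken from the talk:

        #include <cstdio>
        #include <cuda_runtime.h>

        // Reduce error handling to a macro so the peer-to-peer calls stay visible.
        #define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
            printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

        int main() {
            const size_t bytes = 64 << 20;              // 64 MB test buffer (assumption)
            float *buf0 = nullptr, *buf1 = nullptr;

            int canAccess = 0;
            CHECK(cudaDeviceCanAccessPeer(&canAccess, 0, 1));   // can GPU 0 reach GPU 1?

            CHECK(cudaSetDevice(0));
            if (canAccess)
                CHECK(cudaDeviceEnablePeerAccess(1, 0));        // map GPU 1's memory into GPU 0
            CHECK(cudaMalloc(&buf0, bytes));

            CHECK(cudaSetDevice(1));
            CHECK(cudaMalloc(&buf1, bytes));

            // Direct GPU 0 -> GPU 1 copy; with peer access enabled over NVLink/PCIe
            // the data is not staged through host memory.
            CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
            CHECK(cudaDeviceSynchronize());
            printf("peer copy done\n");
            return 0;
        }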

  • 7

    TENSOR CORE: Mixed Precision Matrix Math on 4x4 matrices

    New CUDA TensorOp instructions & data formats

    4x4x4 matrix processing array

    D[FP32] = A[FP16] * B[FP16] + C[FP32]

    Using Tensor Cores via:

    • Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)

    • CUDA C++ warp-level matrix operations (see the sketch below)
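
    As a minimal sketch of the warp-level path (an assumption, not code from the talk): one warp computes a single 16x16x16 tile of D = A*B + C with the CUDA C++ WMMA API. The kernel name wmma_tile and the fixed 16x16 tile are illustrative, and the kernel needs compute capability 7.0 (sm_70) or newer:

        #include <mma.h>
        #include <cuda_fp16.h>
        using namespace nvcuda;

        // One warp (32 threads) computes one 16x16 output tile on Tensor Cores:
        // D[FP32] = A[FP16] * B[FP16] + C[FP32]
        __global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

            wmma::load_matrix_sync(a, A, 16);                        // leading dimension 16
            wmma::load_matrix_sync(b, B, 16);
            wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);

            wmma::mma_sync(acc, a, b, acc);                          // Tensor Core MMA

            wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
        }

        // Launch with exactly one warp for this single tile:
        //   wmma_tile<<<1, 32>>>(dA, dB, dC, dD);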

  • 8

    cuBLAS GEMMs FOR DEEP LEARNING: V100 Tensor Cores + CUDA 9 deliver over 9x faster matrix-matrix multiply

    [Charts: relative cuBLAS GEMM performance over matrix size (M=N=K = 512, 1024, 2048, 4096). Mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to 9.3x faster than P100 (CUDA 8). Single precision (FP32): V100 (CUDA 9) up to 1.8x faster than P100 (CUDA 8).]

    Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.
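
    A hedged sketch of how such a mixed-precision GEMM is requested from the CUDA 9-era cuBLAS API (the helper name and the column-major, non-transposed layout are my assumptions; the tensor-op math mode lets the library route eligible problem sizes to Tensor Cores):

        #include <cublas_v2.h>
        #include <cuda_fp16.h>

        // C (MxN, FP32) = A (MxK, FP16) * B (KxN, FP16), column-major, no transpose.
        void gemm_fp16_in_fp32_out(cublasHandle_t handle, int M, int N, int K,
                                   const half *A, const half *B, float *C) {
            const float alpha = 1.0f, beta = 0.0f;

            // Allow cuBLAS to use Tensor Cores where the problem shape permits.
            cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

            cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         M, N, K,
                         &alpha,
                         A, CUDA_R_16F, M,        // FP16 inputs
                         B, CUDA_R_16F, K,
                         &beta,
                         C, CUDA_R_32F, M,        // FP32 output
                         CUDA_R_32F,              // FP32 accumulation
                         CUBLAS_GEMM_DEFAULT_TENSOR_OP);
        }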

  • 9

    COMMUNICATION BETWEEN GPUS

    Large-scale models:
    • Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
    • The size of the models will further increase in the future

    Data parallel training:
    • Each worker trains the same layers on a different data batch
    • NVLINK allows the separation of data loading and gradient averaging

    Model parallel training:
    • All workers train on the same batch; workers communicate as frequently as the network allows
    • NVLINK allows the separation of data loading and exchanges for activations

    http://mxnet.io/how_to/multi_devices.html

  • 10

    NVLINK AND MULTI-GPU SCALING
    For Data Parallel Training

    [Diagrams: two 8-GPU servers, GPUs 0-7 attached to two CPUs via PCIe switches.]

    NVLINK-based system:
    • Data loading over PCIe (red)
    • Gradient averaging over NVLink (blue)
    • No sharing of communication resources: no congestion

    PCIe-based system (CPUs connected by a QPI link):
    • Data loading over PCIe
    • Gradient averaging over PCIe and QPI
    • Data loading and gradient averaging share communication resources: congestion

  • 11

    NVLINK AND CNTK MULTI-GPU SCALING

  • 12

    NVIDIA Collective Communications Library (NCCL) 2: Multi-GPU and multi-node collective communication primitives

    developer.nvidia.com/nccl

    High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs

    Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization

    Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink

    Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more

    Multi-node: InfiniBand verbs, IP sockets

    Multi-GPU: NVLink, PCIe

    Automatic topology detection
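
    A minimal single-process sketch of the collective that a data-parallel training step relies on, assuming one gradient buffer of the same length already resident on each local GPU; the helper name allreduce_gradients is mine, not NCCL's:

        #include <nccl.h>
        #include <cuda_runtime.h>
        #include <vector>

        // Sum one gradient buffer of `count` floats across `nGPUs` local GPUs in place.
        // Frameworks divide by the number of workers afterwards to get the average.
        void allreduce_gradients(std::vector<float*>& grads, size_t count, int nGPUs) {
            std::vector<ncclComm_t>   comms(nGPUs);
            std::vector<cudaStream_t> streams(nGPUs);
            std::vector<int>          devs(nGPUs);
            for (int i = 0; i < nGPUs; ++i) devs[i] = i;

            // One communicator per local GPU; NCCL detects the NVLink/PCIe topology.
            ncclCommInitAll(comms.data(), nGPUs, devs.data());
            for (int i = 0; i < nGPUs; ++i) {
                cudaSetDevice(i);
                cudaStreamCreate(&streams[i]);
            }

            // Group the per-GPU calls so NCCL launches one collective across devices.
            ncclGroupStart();
            for (int i = 0; i < nGPUs; ++i)
                ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                              comms[i], streams[i]);
            ncclGroupEnd();

            for (int i = 0; i < nGPUs; ++i) {
                cudaSetDevice(i);
                cudaStreamSynchronize(streams[i]);   // buffers now hold the global sum
                cudaStreamDestroy(streams[i]);
                ncclCommDestroy(comms[i]);
            }
        }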

  • 13

    NVIDIA DGX-2

    • NVIDIA Tesla V100 32GB
    • Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by a plane card
    • Twelve NVSwitches: 2.4 TB/s bi-section bandwidth
    • Eight EDR InfiniBand / 100 GigE: 1600 Gb/s total bi-directional bandwidth
    • PCIe switch complex
    • Two Intel Xeon Platinum CPUs
    • 1.5 TB system memory
    • Dual 10/25 Gb/s Ethernet
    • 30 TB NVMe SSD internal storage

  • 14

    NVSWITCH

    • 18 NVLink ports
    • 50 GB/s per port bi-directional
    • 900 GB/s total bi-directional
    • Fully connected crossbar
    • x4 PCIe Gen2 management port
    • GPIO
    • I2C
    • 2 billion transistors

  • 15

    FULL NON-BLOCKING BANDWIDTH

  • 16

    NVSWITCH

    UNIFIED MEMORY PROVIDES
    • Single memory view shared by all GPUs
    • Automatic migration of data between GPUs
    • User control of data locality

    NVLINK PROVIDES
    • All-to-all high-bandwidth peer mapping between GPUs
    • Full inter-GPU memory interconnect (incl. atomics)
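
    A minimal sketch of what this looks like from CUDA (an assumption, not slide code): one cudaMallocManaged allocation is visible to every GPU, and prefetch/advise hints give the user control over locality. The buffer size and the device indices 0 and 1 are illustrative:

        #include <cuda_runtime.h>

        int main() {
            const size_t n = 1 << 26;                       // 64M floats (assumption)
            float *data = nullptr;

            // Single pointer shared by all GPUs (and the CPU); pages migrate on demand.
            cudaMallocManaged(&data, n * sizeof(float));

            int nGPUs = 0;
            cudaGetDeviceCount(&nGPUs);

            // Locality control: place the buffer on GPU 0, and tell the driver that
            // GPU 1 will also access it so it can be mapped peer-to-peer over NVLink.
            cudaMemPrefetchAsync(data, n * sizeof(float), 0, 0);
            if (nGPUs > 1)
                cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetAccessedBy, 1);

            cudaDeviceSynchronize();
            cudaFree(data);
            return 0;
        }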

  • VOLTA MULTI-PROCESS SERVICE

    [Diagram: CPU processes A, B and C submit work through the CUDA Multi-Process Service control to a Volta GV100 GPU, with hardware-accelerated work submission and hardware isolation of GPU execution.]

    Volta MPS enhancements:
    • MPS clients submit work directly to the work queues within the GPU
    • Reduced launch latency
    • Improved launch throughput
    • Improved isolation amongst MPS clients
    • Address isolation with independent address spaces
    • Improved quality of service (QoS)
    • 3x more clients than Pascal

  • VOLTA MPS FOR INFERENCE
    Efficient inference deployment without a batching system

    [Bar chart: ResNet-50 images/sec at 7 ms latency for three configurations: a single Volta client with no batching and no MPS; multiple Volta clients with no batching using MPS (7x faster, about 60% of the performance with batching); and Volta with a batching system.]

    V100 measured on pre-production hardware.

  • 20

    DEEP LEARNING IS AN HPC WORKLOAD: HPC expertise is important for success

    • HPC and Deep Learning require a huge amount of compute power (FLOPS)

    • Mainly double-precision arithmetic for HPC

    • Single, half or 8-bit precision for Deep Learning training/inference

    • HPC and Deep Learning use inherently parallel algorithms

    • HPC needs less memory per FLOPS than Deep Learning

    • HPC is more demanding on network bandwidth than Deep Learning

    • Data scientists like GPU-dense systems (as many GPUs as possible per node)

    • Up to now, HPC has had more demand for scalability than Deep Learning

    • Distributed training frameworks like Horovod (Uber) are now available

  • 21

    SOFTWARE CHALLENGES

    • Current DIY deep learning environments are complex and time-consuming to build, test and maintain

    • The same issues affect HPC and other accelerated applications

    • Multiple jobs from different users need to co-exist on the same servers

    Software stack: Open Source Frameworks, NVIDIA Libraries, NVIDIA Docker, NVIDIA Driver, NVIDIA GPU

  • 22

    NVIDIA GPU CLOUD REGISTRY

    Deep Learning: All major frameworks with multi-GPU optimizations; uses NCCL for NVLINK data exchange; multi-threaded I/O to feed the GPUs. Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch

    HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC

    HPC Visualization: ParaView with OptiX, IndeX and Holodeck; OpenGL visualization based on NVIDIA Docker 2.0; IndeX; VMD

    Single NGC Account: For use on GPUs everywhere - https://ngc.nvidia.com

    Common software stack across NVIDIA GPUs

    NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and the operating system, available at no charge

  • 23

    NVIDIA SATURN V: AI supercomputer with 660 x DGX-1V

    40 PF peak FP64 performance, 660 PF DL Tensor performance

    • Primarily research focused
    • Used internally for applied Deep Learning research
    • Many uses: testing algorithms, networks, new approaches
    • Embedded, robotics, automotive, hyperscale, HPC
    • Partner with university research and industry collaborations
    • Study the convergence of data science and HPC
    • All jobs are containerized

  • 24

    DEEP LEARNING DATA CENTER: Reference Architecture

    http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html

  • 25

    COMBINING THE STRENGTHS OF HPC AND AI

    • Implement inference models with real-time interactivity

    • Train inference models to improve accuracy and comprehend more of the physical parameter space

    • Analyze data sets that are simply intractable with classic statistical models

    • Control and manage complex scientific experiments

    HPC

    • Proven algorithms based on first-principles theory
    • Proven statistical models for accurate results in multiple science domains

    • Develop training data sets using first-principles models

    • Incorporate AI models in semi-empirical style applications to improve throughput

    • Validate new findings from AI

    • New methods to improve predictive accuracy, insight into new phenomena and response time

    AI

  • 26

    MULTI-MESSENGER ASTROPHYSICS

    "Despite the latest developments in computational power, there is still a large gap in linking relativistic theoretical models to observations." Max Planck Institute

    Background: The aLIGO (Advanced Laser Interferometer Gravitational-Wave Observatory) experiment successfully discovered signals proving Einstein's theory of General Relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena.

    Challenge: The initial aLIGO discoveries were completed using classic data analytics. The processing pipeline used hundreds of CPUs, and the bulk of the detection processing was done offline. That latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear".

    Solution: A DNN was developed and trained on a data set derived from the CACTUS simulation using the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x better than the original CPU-based waveform detection.

    Impact: Faster and more accurate detection of gravitational waves, with the potential to steer other observational data sources.

  • 27

    AI Quantum Breakthrough

    Background: Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry (QC) simulations are important to accurately screen millions of potential drugs down to a few of the most promising candidates.

    Challenge: QC simulation is computationally expensive, so researchers use approximations, compromising on accuracy. Screening 10M drug candidates takes 5 years to compute on CPUs.

    Solution: Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces with super speed (microseconds versus several minutes), extremely high (DFT) accuracy, and at 1-10 millionths of the cost of current computational methods.

    Impact: Faster, more accurate screening at far lower cost.

  • 28

    SUMMARY

    • The same GPU technology enabling powerful science is also enabling the revolution in deep learning

    • Deep learning is enabling many usages in science (e.g. image recognition, classification, ...)

    • Applications can use DL to train neural networks on already-simulated data, and the DL network can then make predictions about the output

    • The GPU is the right technology for HPC and DL

  • Axel Koehler ([email protected])

    THE CONVERGENCE OF HPC AND DEEP LEARNING