deep learning on the saturnv cluster

The World’s Greenest Supercomputer

HPC Advisory Council Lugano 2017, Gunter Roeth, SA

DL ON SATURNV CLUSTER


A DECADE OF SCIENTIFIC COMPUTING WITH GPUS

2006 2008 2010 2012 2014 2016

Fermi: World’s First HPC GPU

Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs

World’s First Atomic Model of HIV Capsid

GPU-Trained AI Machine Beats World Champion in Go

Stanford Builds AI Machine using GPUs

World’s First 3-D Mapping of Human Genome

CUDA Launched

World’s First GPU Top500 System

Google Outperforms Humans in ImageNet

Discovered How H1N1 Mutates to Resist Drugs

AlexNet beats expert code by huge margin using GPUs


“…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.” –Andrew Ng, Feb 2016

Neural Network Design Expertise + HPC Design Expertise + GPU Programming Expertise + Image & Video Recognition Expertise

DEEP LEARNING is the Next Frontier for HPC


MCBDA2016 – Louis Capps, [email protected]

Hyperscale

Collects and analyzes data

HPC

Parallel simulations produce huge data

BIG DATA – FROM HPC TO HYPERSCALE TO ...

Insight Autonomy

Diagram labels: Small data in, Huge data out (HPC); Big data in, Small data out (Hyperscale)

Discovery Prediction

Create data Analyze data

Similar engine

Two different workloads

- HPC is parallel simulation that produces huge datasets

- Hyperscale collects data and produces analytics

Can they converge?

- One possible future

- HPC feeds Hyperscale

- Hyperscale analysis affects HPC model/data

- Move workloads onto same system (stacks side by side)

- Eventually merge stack

Diagram labels: Big Compute, Huge Compute, Huge Storage, Big Storage


http://nvidianews.nvidia.com/news/nvidia-and-microsoft-boost-ai-cloud-computing-with-launch-of-industry-standard-hyperscale-gpu-accelerator

MICROSOFT AND NVIDIA HGX-1 ANNOUNCEMENT


CLOUD CHALLENGES

1 SKU, Multiple Instances

Integration into Existing Datacenter

INSTANCES

Granular, Latency Sensitive

High Throughput Batch

HPC: different CPU:GPU ratios

DevOps / Development

Production Deployment

PROJECT OLYMPUS HGX-1 HYPERSCALE GPU ACCELERATOR PARTNERSHIP + INTEROPERABILITY


Project Olympus

HGX-1 Hyperscale GPU Accelerator

Configurable PCIe Cable to host + Expansion slots

NVIDIA P100 GPU

NVLink Hybrid Cube Mesh Fabric

20 Gbyte/sec per link Duplex

Adapters for other GPUs
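
The NVLink hybrid cube mesh can be sketched as a small graph. The snippet below is a minimal illustration, not a description of the HGX-1 board layout: it assumes the commonly described 8-GPU wiring in which GPUs 0-3 and 4-7 each form a fully connected quad and GPU i is also linked to GPU i + 4. The function names are hypothetical; only the 20 Gbyte/sec per-link figure comes from the slide.

```python
from itertools import combinations

LINK_GBYTES_PER_DIRECTION = 20  # from the slide: 20 Gbyte/sec per link, duplex


def hybrid_cube_mesh(num_gpus=8):
    """Return the set of GPU pairs joined by one NVLink link.

    Assumed wiring (not taken from the slide): two fully connected
    quads, plus one link from GPU i to GPU i + 4 in the other quad.
    """
    half = num_gpus // 2
    links = set()
    for quad in (range(0, half), range(half, num_gpus)):
        links.update(combinations(quad, 2))
    links.update((i, i + half) for i in range(half))
    return links


def links_per_gpu(links, num_gpus=8):
    """Count how many links terminate at each GPU."""
    counts = [0] * num_gpus
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts


def max_hops(links, num_gpus=8):
    """Longest shortest path between any two GPUs (breadth-first search)."""
    neighbours = {g: set() for g in range(num_gpus)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    worst = 0
    for start in range(num_gpus):
        dist = {start: 0}
        frontier = [start]
        while frontier:
            next_frontier = []
            for g in frontier:
                for n in neighbours[g]:
                    if n not in dist:
                        dist[n] = dist[g] + 1
                        next_frontier.append(n)
            frontier = next_frontier
        worst = max(worst, max(dist.values()))
    return worst


links = hybrid_cube_mesh()
degree = links_per_gpu(links)
# Aggregate NVLink bandwidth per GPU in one direction, under the assumed wiring.
per_gpu_gbytes = degree[0] * LINK_GBYTES_PER_DIRECTION
print(len(links), degree, max_hops(links), per_gpu_gbytes)
```

Under these assumptions every GPU uses four links and any GPU reaches any other in at most two hops, which is the property that makes the mesh attractive for multi-GPU training traffic.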


DEEP LEARNING

2 CPU : 8 GPU

8x P100 SXM2 | 4x x16 PCIe



HPC





8 CPU : 32 GPU

32x P4 PCIe | 8x x16 PCIe

INFERENCE


[Chart: AlexNet Training Performance, 2013 to 2016 (axis 0x to 70x). Bars: K40, K80 + cuDNN1, M40 + cuDNN4, P100 + cuDNN5. Callout: 65x in 3 Years]

WHY THE EXCITEMENT?

GPUs as Enablers of Breakthrough Results

Paper: H.Zhang et al. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv:1612.03242

We can generate photorealistic images from textual descriptions now!

Achieve super-human accuracy in classification

And we are getting faster fast
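
To put the "65x in 3 Years" figure in perspective, the compound speedup per year can be worked out directly. This is a back-of-the-envelope sketch that assumes the gain was spread evenly over the three years, which the chart does not claim; `annual_factor` is a hypothetical helper name.

```python
def annual_factor(total_speedup, years):
    """Constant per-year factor that compounds to total_speedup."""
    return total_speedup ** (1.0 / years)


per_year = annual_factor(65, 3)
# Roughly a fourfold improvement each year under the even-growth assumption.
print(f"{per_year:.2f}x per year")
```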


“SUPERHUMAN” RESULTS SPARK HYPERSCALE ADOPTION

[Chart: ImageNet accuracy %, 2010 to 2015. Values: 72%, 74%, 84%, 88%, 93%, 96%. Series: Hand-coded CV, Deep Learning. Reference level: Human. Additional labels: 74%, 76%]

Cloud Services with AI Powered by NVIDIA

Alibaba/Aliyun, Amazon, Baidu, eBay, Facebook, Flickr, Google, iFLYTEK, iQIYI, JD.com, Orange, Periscope, Pinterest, Qihoo 360, Shazam, Skype, Sogou, Twitter, Yahoo Supermarket, Yandex, Yelp


DEEP LEARNING EVERYWHERE

INTERNET & CLOUD

Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation

MEDIA & ENTERTAINMENT

Video Captioning, Video Search, Real Time Translation

AUTONOMOUS MACHINES

Pedestrian Detection, Lane Tracking, Recognize Traffic Sign

SECURITY & DEFENSE

Face Detection, Video Surveillance, Satellite Imagery

MEDICINE & BIOLOGY

Cancer Cell Detection, Diabetic Grading, Drug Discovery


GPUS IN ARTIFICIAL INTELLIGENCE

Replace hand-tuned parameters of the feature extraction steps (e.g. in voice and image recognition)

Deep learning is a subset of machine learning that refers to artificial neural networks that are composed of many layers.

Artificial neural networks are inspired by the human brain and need lots of training data (ideal for Big Data).

NVIDIA GPUs and cuDNN software broadly adopted for machine learning.

[Diagram: nested sets, outermost to innermost: Machine Learning, Neural Networks, Deep Learning]


NVIDIA DEEP LEARNING SDK

High Performance GPU-Acceleration for Deep Learning

APPLICATIONS

COMPUTER VISION: Object Detection, Image Classification

SPEECH AND AUDIO: Voice Recognition, Translation

BEHAVIOR: Recommendation Engines, Sentiment Analysis

FRAMEWORKS

Mocha.jl

DEEP LEARNING SDK

DEEP LEARNING: cuDNN

MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT

MULTI-GPU: NCCL


Platform | Tensorflow | CNTK | MXNet | Caffe | Theano | Torch

Release Date | 2016 | 2016 | 2015 | 2014 | 2010 | 2011

Core Language | C++ | C++ | C++ | C++ | C++ | C

API | C++, Python | NDL | C++, Python, R, Scala, Matlab, Javascript, Go, Julia | Python, Matlab | Python | Lua

Synchronisation Model | Sync or async | Sync | Sync or async | Sync | Async | Sync

Communication Mode | Parameter server | MPI | Parameter server | N/A (Spark/Custom) | N/A | N/A

Multi-GPU | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Multi-node | ✓ | ✓ | ✓ | ✗ | ✗ | ✗

Data Parallelism | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Model Parallelism | ✓ | N/A | ✓ | ✗ | ✓ | ✓

Fault Tolerance | Checkpoint and recovery | Checkpoint and resume | Checkpoint and resume | N/A | Checkpoint and resume | Checkpoint and resume

Visualisation | Graph (interactive), training monitor, debugging tools | Graph (static) | None | Summary Statistics (custom) | Graph (static) | Plots

Fox, James, Yiming Zou, and Judy Qiu. "Software Frameworks for Deep Learning at Scale."
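
The "Data Parallelism" and "Sync" entries in the comparison can be illustrated with a toy example. The sketch below is framework-independent and is not how any of the listed frameworks implements it: each worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce or parameter server would play), and every replica applies the same update. The model (y = w * x), the data and all function names are assumptions made for illustration.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def sync_data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients, averaged, identical update."""
    local_grads = [gradient(w, shard) for shard in shards]
    averaged = all_reduce_mean(local_grads)
    # Every replica applies the same averaged gradient, so weights stay in sync.
    return [w - lr * g for g in averaged]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # two equal-sized shards, one per worker

replicas = sync_data_parallel_step(w=0.0, shards=shards, lr=0.01)
single_worker = 0.0 - 0.01 * gradient(0.0, data)
print(replicas, single_worker)
```

With equal-sized shards the averaged gradient equals the full-batch gradient, so the replicas end up where a single worker processing the whole batch would. An asynchronous scheme drops the averaging barrier and lets workers update shared parameters as they finish, trading that equivalence for less waiting.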


NVIDIA

DEEP LEARNING PERFORMANCE GUIDE


TensorFlow: Deep Learning Training

An open-source software library for numerical computation using data flow graphs.

VERSION: 1.0

ACCELERATED FEATURES: Full framework accelerated

SCALABILITY: Multi-GPU and multi-node

More Information: https://www.tensorflow.org/

TensorFlow Deep Learning Framework: Training on 8x P100 GPU Server vs 8 x K80 GPU Server

[Chart: speedup vs. server with 8 x K80 (axis 0 to 5.0) for AlexNet, GoogleNet, ResNet-50, ResNet-152, VGG16. Series: Server with 8x P100 16GB NVLink; Server with 8x P100 PCIe 16GB. Callouts: 2.5x Avg. Speedup; 3x Avg. Speedup]

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown

Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), GoogleNet (256), ResNet-50 (64), ResNet-152 (32), VGG-16 (32)


CNTK: Deep Learning Training

A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

VERSION: 1.0

ACCELERATED FEATURES: Full framework accelerated

SCALABILITY: Multi-GPU and multi-node

More Information: www.microsoft.com/en-us/research/product/cognitive-toolkit/

CNTK Deep Learning Framework: Training on 8x P100 GPU Server vs 8 x K80 GPU Server

[Chart: speedup vs. server with 8 x K80 (axis 0 to 4.0) for AlexNet, ResNet-50. Series: Server with 8x P100 16GB NVLink; Server with 8x P100 PCIe 16GB. Callouts: 1.6x Avg. Speedup; 2.7x Avg. Speedup]

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown

Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), ResNet-50 (64)


Deep Learning Performance

INFERENCE


Deep Learning Inference using TensorRT on CAFFE

Popular GPU-accelerated framework using NVIDIA’s TensorRT 1.0 Inference Engine

VERSION: CAFFE 1.0, TensorRT 1.0

ACCELERATED FEATURES: CPU & GPU versions available

SCALABILITY: Multi-GPU

More Information: http://caffe.berkeleyvision.org/

https://developer.nvidia.com/tensorrt

Deep Learning Inference: 1 x GPU Server Throughput Performance vs. Single-Socket CPU Server

[Chart: speedup vs. single-socket CPU server (axis 0 to 40.0) for AlexNet, GoogleNet, ResNet-152, VGG-19. Series: 1xP100 (FP32); 1xP100 (FP16); 1xP4 (INT8); 1xP4 (FP32). Callouts: 6x Avg. Speedup; 20x Avg. Speedup; 13x Avg. Speedup; 25x Avg. Speedup]

CPU Server: Single Xeon E5-2690 [email protected]

GPU Servers: Single Xeon E5-2690 [email protected] with x1 P100 16GB or x1 P4 GPU

Ubuntu 14.04.5, TensorRT 2.0, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, batch size 128, precision as indicated.
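
The INT8 results rely on running the network with 8-bit integer values instead of floating point. The sketch below shows the general idea with a plain symmetric, max-based quantizer. It is an assumption-laden illustration and not TensorRT's calibration procedure; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why reduced precision can keep accuracy close to FP32 while allowing much higher throughput on hardware with fast integer paths.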


GPU DEEP LEARNING IS A NEW COMPUTING MODEL

TRAINING

Billions of Trillions of Operations

GPUs train larger models, accelerate time to market

DATACENTER INFERENCING

10s of billions of image, voice, video queries per day

GPU inference for fast response, maximize datacenter throughput

[Diagram labels: Training, Datacenter, Device]


WHAT IS SATURNV?


NVIDIA DGX SATURNV

Giant Leap Towards Exascale AI

Fastest AI Supercomputer in TOP500

4.9 Petaflops Peak FP64

19.6 Petaflops Peak FP16

13 DGX-1 to get into Top500

Most Energy Efficient Supercomputer

#1 Green500

9.5 GFLOPS per Watt

Rocket for Cancer Moonshot

CANDLE Development Platform

Common platform with DOE labs – ANL, LLNL, ORNL, LANL
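
The peak figures can be broken down per GPU using the cluster size given elsewhere in this deck (124 DGX-1 nodes with 8 P100 GPUs each). The sketch below is rough arithmetic only: it assumes the quoted peaks are a simple sum over GPUs and ignores any CPU contribution, neither of which the slides state.

```python
NODES = 124              # DGX-1 nodes in SATURNV (from the slides)
GPUS_PER_NODE = 8        # P100 GPUs per DGX-1 (from the slides)
PEAK_FP64_PFLOPS = 4.9   # from the slide
PEAK_FP16_PFLOPS = 19.6  # from the slide

total_gpus = NODES * GPUS_PER_NODE

# Implied per-GPU peak, assuming the cluster peak is purely a sum over GPUs.
fp64_tflops_per_gpu = PEAK_FP64_PFLOPS * 1000 / total_gpus
fp16_tflops_per_gpu = PEAK_FP16_PFLOPS * 1000 / total_gpus
fp16_to_fp64_ratio = PEAK_FP16_PFLOPS / PEAK_FP64_PFLOPS

print(total_gpus)
print(f"{fp64_tflops_per_gpu:.2f} TFLOPS FP64 per GPU")
print(f"{fp16_tflops_per_gpu:.2f} TFLOPS FP16 per GPU")
print(f"FP16 peak is {fp16_to_fp64_ratio:.1f}x the FP64 peak")
```

The fourfold gap between the FP16 and FP64 peaks is the reason the same machine is presented both as an HPC system and as an AI system: deep learning workloads that tolerate half precision see a much higher ceiling.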


HOW DID WE BUILD SATURNV?


Innovation needs a deep learning supercomputer!

Deep Learning scalability; move outside the box

Focus on research

Used internally for Deep Learning applied research

Multiple users, algorithms, networks, new approaches

Embedded, robotic, auto, hyperscale, HPC

Partner with university research, government and industry collaborations

Study convergence of data science and HPC

WHY NVIDIA DGX SATURNV

124 Node Supercomputing Cluster


NVIDIA DGX-1 DEEP LEARNING SYSTEM


ONE ARCHITECTURE BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE

GPU-Accelerated Server

AlexNet Training: DGX-1 Faster than 128 Knights Landing Servers

[Chart: speed-up vs 1x KNL server (axis 0x to 40x) for 1, 4, 8, 16, 32, 64, 128 Knights Landing servers, compared with 1x DGX-1]

Based on AlexNet Batch size 256, weak scaling up to 32 KNL servers, 64 & 128 estimated based on ideal scaling, Xeon Phi 7250 Nodes

GTC-P: Plasma Turbulence: DGX-1 Faster than 64 Knights Landing Servers

[Chart: speed-up vs 1x KNL server (axis 0x to 9x) for 1, 4, 8, 16, 32, 64 Knights Landing servers, compared with 1x DGX-1]

GTC-P, Grid Size A, Systems: NVIDIA DGX-1, 8xP100, Intel KNL 7250 68 core Flat-Quadrant mode, Omnipath

NVIDIA DGX-1


124 NVIDIA DGX-1 Nodes – 992 P100 GPUs

8x NVIDIA Tesla P100 SXM GPUs – NVLINK CubeMesh

2x Intel Xeon 20-core CPUs

512 GB DDR4 System Memory

SSD – 7 TB scratch + 0.5 TB OS

Mellanox 36 port EDR L1 and L2 switches

4 ports per system

Partial Fat tree topology

Ubuntu 14.04, CUDA 8, OpenMPI 1.10.3

NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)

Deep Learning applied research

Many users, frameworks, algorithms, networks, new approaches

Embedded, robotic, auto, hyperscale, HPC

NVIDIA DGX SATURNV

124 node Cluster

nvidia.com/dgx1
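
The node, port and switch figures above allow a rough sizing of the InfiniBand fabric. The sketch below is illustrative only: the slides say "partial fat tree" but do not give the split between host-facing and upward-facing ports on each L1 switch, so the `downlinks_per_leaf` values tried here (18 and 24) are hypothetical, as are the function names.

```python
import math

NODES = 124            # DGX-1 nodes (from the slide)
GPUS_PER_NODE = 8      # P100 GPUs per node (from the slide)
IB_PORTS_PER_NODE = 4  # EDR ports per system (from the slide)
SWITCH_PORTS = 36      # ports per Mellanox EDR switch (from the slide)

total_gpus = NODES * GPUS_PER_NODE
host_ports = NODES * IB_PORTS_PER_NODE


def leaf_switches_needed(host_ports, downlinks_per_leaf):
    """Minimum number of L1 switches to attach every host port."""
    return math.ceil(host_ports / downlinks_per_leaf)


def oversubscription(downlinks_per_leaf, switch_ports=SWITCH_PORTS):
    """Ratio of host-facing ports to uplinks on one L1 switch."""
    uplinks = switch_ports - downlinks_per_leaf
    return downlinks_per_leaf / uplinks


print(total_gpus, host_ports)
for downlinks in (18, 24):  # hypothetical splits, not from the slides
    print(downlinks,
          leaf_switches_needed(host_ports, downlinks),
          oversubscription(downlinks))
```

A ratio of 1.0 corresponds to a full (non-blocking) fat tree; a ratio above 1.0 is one way a fat tree can be "partial", using fewer switches at the cost of some bisection bandwidth.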


DEEP LEARNING CLUSTER REFERENCE ARCHITECTURE


NOV2016 TOP GREEN500 SYSTEM

Green500.org Top500.org

SATURNV produced groundbreaking 9.4 GF/W at full scale

--> Sets the stage for future Exascale class computing


Monitoring Effects of Carbon and Greenhouse Gas Emissions

Minute-by-minute AI Weather Forecasting

insideHPC.com Survey, November 2016

92% believe AI will impact their work

93% using deep learning seeing positive results

DEEP LEARNING IS VITAL TO HPC


RESOURCES

For Executives, Developers and Data Scientists

CASE STUDIES

PARTNER COURSES

INTRO MATERIALS

ON-SITE WORKSHOPS

SELF-PACED LABS

TECHNICAL BLOGS


NVIDIA DEEP LEARNING INSTITUTE

Training organizations and individuals to solve challenging problems using Deep Learning

On-site workshops and online courses presented by certified experts

Covering complete workflows for proven application use cases

Self-driving cars, recommendation engines, medical image classification, intelligent video analytics and more

www.nvidia.com/dli

https://www.nvidia.com/en-us/deep-learning-ai/education/

Hands-on Training for Data Scientists and Software Engineers


QUESTIONS?
