![Page 1: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/1.jpg)
OCP MEET UP CERN
Low-Latency Accelerated Computing on GPUs
Dr. Christoph Angerer
DevTech, NVIDIA
![Page 2: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/2.jpg)
Accelerated Computing
High Performance & High Energy Efficiency for Throughput Tasks
CPU: Serial Tasks
GPU Accelerator: Parallel Tasks
![Page 3: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/3.jpg)
Accelerating Insights
“Now You Can Build Google’s $1M Artificial Brain on the Cheap”
GOOGLE DATACENTER
1,000 CPU Servers
2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
STANFORD AI LAB
3 GPU-Accelerated Servers
12 GPUs • 18,432 cores
4 kWatts
$33,000
Deep learning with COTS HPC systems, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013
![Page 4: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/4.jpg)
From HPC to Enterprise Data Center
Government • Supercomputing • Finance • Higher Ed • Oil & Gas • Consumer Web
Air Force Research Laboratory
Naval Research Laboratory
Tokyo Institute of Technology
![Page 5: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/5.jpg)
Tesla: Platform for Accelerated Datacenters
Tesla Accelerated Computing Platform
Data Center Infrastructure: Optimized Systems, Communication Solutions, Infrastructure Management
Development: Programming Languages, Development Tools, Software Applications
GPU Accelerators, Interconnect, System Management
Compiler Solutions, Profile and Debug, Libraries
Partner Ecosystem, Enterprise Services
![Page 6: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/6.jpg)
Common Programming Models Across Multiple CPUs
X86
Libraries: AmgX, cuBLAS
Programming Languages
Compiler Directives
![Page 7: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/7.jpg)
GPU Roadmap
[Chart: Normalized Performance (y-axis, 0 to 20) by year (x-axis, 2008 to 2016)]
Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: DX12
Pascal: Unified Memory, 3D Memory, NVLink
![Page 8: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/8.jpg)
GPUDirect
![Page 9: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/9.jpg)
Multi-GPU: Unified Virtual Addressing
Single Partitioned Address Space
[Diagram: System Memory (CPU), GPU0 Memory (GPU0) and GPU1 Memory (GPU1), connected over PCI-e, occupy one address range from 0x0000 to 0xFFFF]
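With a single partitioned address space, a pointer value alone identifies which memory it lives in. The sketch below illustrates this with `cudaPointerGetAttributes`. It is not from the slides: the buffer names and the `describe` helper are hypothetical, and it assumes CUDA 10 or newer (for the `type` field) and at least one CUDA device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Hypothetical helper: report which memory space a UVA pointer belongs to.
static void describe(const char *label, const void *ptr) {
    cudaPointerAttributes attr;
    CHECK(cudaPointerGetAttributes(&attr, ptr));
    const char *kind = "unregistered host";
    if (attr.type == cudaMemoryTypeDevice)  kind = "device";
    if (attr.type == cudaMemoryTypeHost)    kind = "pinned host";
    if (attr.type == cudaMemoryTypeManaged) kind = "managed";
    std::printf("%s: %s memory, device %d\n", label, kind, attr.device);
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));

    // Pinned host memory is part of the same unified address space.
    double *hostBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, 1024 * sizeof(double)));
    describe("hostBuf", hostBuf);

    // One allocation per GPU; each pointer resolves to its owning device.
    for (int dev = 0; dev < deviceCount; ++dev) {
        double *devBuf = nullptr;
        CHECK(cudaSetDevice(dev));
        CHECK(cudaMalloc(&devBuf, 1024 * sizeof(double)));
        describe("devBuf", devBuf);
        CHECK(cudaFree(devBuf));
    }

    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Because the runtime can resolve the owner of any such pointer, copies can also pass `cudaMemcpyDefault` and let the direction be inferred from the addresses.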
![Page 10: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/10.jpg)
GPUDirect Technologies
GPUDirect Shared GPU-Sysmem for inter-node copy optimization
How: Use GPUDirect-aware 3rd party network drivers
GPUDirect P2P Transfers for on-node GPU-GPU memcpy
How: Use CUDA APIs directly in application
How: Use P2P-aware MPI implementation
GPUDirect P2P Access for on-node inter-GPU LD/ST access
How: Access remote data by address directly in GPU device code
GPUDirect RDMA for inter-node copy optimization
What: 3rd party PCIe devices can read and write GPU memory
How: Use GPUDirect RDMA-aware 3rd party network drivers and MPI implementations, or custom device drivers for other hardware
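GPUDirect P2P Access is listed above as accessing remote data by address directly in device code. A minimal sketch of that idea follows, assuming a node with at least two P2P-capable GPUs. The kernel name `scaleFromPeer`, the buffer size, and the fill value are hypothetical and not taken from the slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Launched on GPU 1; peerIn points into GPU 0's memory (hypothetical kernel).
__global__ void scaleFromPeer(const double *peerIn, double *localOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) localOut[i] = 2.0 * peerIn[i];
}

int main() {
    int deviceCount = 0, canAccess = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount >= 2) CHECK(cudaDeviceCanAccessPeer(&canAccess, 1, 0));
    if (!canAccess) {
        std::printf("P2P access from GPU 1 to GPU 0 is not available\n");
        return 0;
    }

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    std::vector<double> host(n, 1.5);
    double *gpu0Data = nullptr, *gpu1Data = nullptr;

    // Place the input on GPU 0 and wait until it is fully resident there.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&gpu0Data, bytes));
    CHECK(cudaMemcpy(gpu0Data, host.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());

    // Allow kernels on GPU 1 to dereference pointers into GPU 0's memory.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&gpu1Data, bytes));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    // The kernel runs on GPU 1 and loads its input straight from GPU 0.
    scaleFromPeer<<<(n + 255) / 256, 256>>>(gpu0Data, gpu1Data, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(host.data(), gpu1Data, bytes, cudaMemcpyDeviceToHost));
    std::printf("host[0] after peer read: %f\n", host[0]);

    CHECK(cudaFree(gpu1Data));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(gpu0Data));
    return 0;
}
```

Peer access is enabled in one direction only here, since only GPU 1 reads remote memory. Whether the pair can be enabled at all depends on the PCIe topology, which is why the sketch queries `cudaDeviceCanAccessPeer` first.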
![Page 11: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/11.jpg)
GPUDirect P2P: GPU-GPU Direct Access
[Diagram: two GPUs, each with its own GPU Memory, connected over PCIe to the CPU/IOH and CPU Memory]
![Page 12: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/12.jpg)
P2P Goal: Improve intra-node programming model
Improve CUDA programming model
How?
Transfer data between two GPUs quickly/easily
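int main() {
    double *cpuTmp, *gpu0Data, *gpu1Data;
    setup (gpu0Data, gpu1Data);
    cudaSetDevice (0);
    kernel <<< … >>> (gpu0Data);
    // Without P2P the result is staged through host memory: GPU0 -> CPU -> GPU1.
    // Slide shorthand: the size and direction arguments of cudaMemcpy are omitted.
    cudaMemcpy (cpuTmp, gpu0Data);
    cudaMemcpy (gpu1Data, cpuTmp);
    cudaSetDevice (1);
    // With GPUDirect P2P Access, GPU1 could instead read gpu0Data directly.
    kernel <<< … >>> (gpu1Data);
}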
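For comparison with the host-staged copy above, here is a hedged sketch of the same GPU0-to-GPU1 transfer using `cudaMemcpyPeer`. The buffer size and fill value are assumptions made for illustration. The runtime copies directly between the GPUs when peer access is enabled, and otherwise stages the data through the host itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        std::printf("this sketch needs two GPUs\n");
        return 0;
    }

    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    std::vector<double> host(n, 1.5), back(n, 0.0);
    double *gpu0Data = nullptr, *gpu1Data = nullptr;

    // Allocate one buffer per GPU and fill the one on GPU 0.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&gpu1Data, bytes));
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&gpu0Data, bytes));
    CHECK(cudaMemcpy(gpu0Data, host.data(), bytes, cudaMemcpyHostToDevice));

    // Enabling peer access lets the copy go GPU to GPU without host staging.
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, 0, 1));
    if (canAccess) CHECK(cudaDeviceEnablePeerAccess(1, 0));

    // One call replaces the two cudaMemcpy calls through cpuTmp.
    CHECK(cudaMemcpyPeer(gpu1Data, 1, gpu0Data, 0, bytes));

    // Read the data back from GPU 1 and compare it with what was sent.
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(back.data(), gpu1Data, bytes, cudaMemcpyDeviceToHost));
    std::printf("%s\n", back == host ? "peer copy verified" : "mismatch");

    CHECK(cudaFree(gpu1Data));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(gpu0Data));
    return 0;
}
```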
![Page 13: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/13.jpg)
GPUDirect P2P: Common use cases
MPI implementation optimization for Intra-Node communication
HPC applications that fit on a single node
You get much better efficiency with GPUDirect P2P (compared to MPI)
Can use between ranks with cudaIpc* APIs
![Page 14: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/14.jpg)
GPUDirect RDMA
[Diagram: a GPU with its GPU Memory and Third Party Hardware, both connected over PCIe to the CPU/IOH and CPU Memory]
![Page 15: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/15.jpg)
GPUDirect RDMA Goal
Inter-node Latency and Bandwidth
How?
Transfer data between GPU and third party device (e.g. NIC) with possibly zero host-side copies
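int main() {
    double *cpuData, *cpuTmp, *gpuData;
    setup (gpuData);
    kernel <<< … >>> (gpuData);
    cudaDeviceSynchronize ();
    // Without GPUDirect RDMA the data takes two host-side copies before the send.
    // Slide shorthand: the size, direction and MPI arguments are omitted.
    cudaMemcpy (cpuTmp, gpuData);
    memcpy (cpuData, cpuTmp);
    // With GPUDirect RDMA and a CUDA-aware MPI, gpuData can be sent directly.
    MPI_Send (cpuData);
}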
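The zero-copy variant can be sketched as follows, assuming an MPI library built with CUDA support (passing a device pointer to a non-CUDA-aware MPI is invalid and typically crashes). The `fill` kernel, buffer size, and message tag are hypothetical. Every rank uses its default GPU, and at least two ranks are needed. Building it requires nvcc plus the MPI include and library paths of the local installation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Hypothetical kernel standing in for the real computation on the sender.
__global__ void fill(double *data, int n, double value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) std::printf("run with at least two ranks\n");
        MPI_Finalize();
        return 0;
    }

    const int n = 1 << 20;
    double *gpuData = nullptr;
    CHECK(cudaMalloc(&gpuData, n * sizeof(double)));

    if (rank == 0) {
        fill<<<(n + 255) / 256, 256>>>(gpuData, n, 1.0);
        CHECK(cudaGetLastError());
        // The kernel must finish before MPI reads the device buffer.
        CHECK(cudaDeviceSynchronize());
        // Device pointer passed straight to MPI: no cudaMemcpy, no memcpy.
        MPI_Send(gpuData, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Receive directly into GPU memory on the remote rank.
        MPI_Recv(gpuData, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double first = 0.0;
        CHECK(cudaMemcpy(&first, gpuData, sizeof(double),
                         cudaMemcpyDeviceToHost));
        std::printf("rank 1 received, first element: %f\n", first);
    }

    CHECK(cudaFree(gpuData));
    MPI_Finalize();
    return 0;
}
```

Whether the transfer actually uses GPUDirect RDMA is decided by the MPI library, the network driver, and the hardware. The application code is the same either way.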
![Page 16: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/16.jpg)
GPUDirect RDMA: What does it get you?
Latency Reduction
MPI_Send latency of 25µs with Shared GPU-Sysmem*
No overlap possible
Bidirectional transfer is difficult
MPI_Send latency of 5µs with RDMA
Does not affect running kernels
Unlimited concurrency
RDMA possible!
MPI-3 one-sided latency of 3µs
![Page 17: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/17.jpg)
GPUDirect RDMA: Common use cases
Inter-Node MPI communication
Transfer data between GPU and a remote Node
Use CUDA-aware MPI
Interface with third party hardware
Requires adopting GPUDirect-Interop API in vendor driver stack
Limitation
GPUDirect RDMA does not work with CUDA Unified Memory today
![Page 18: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/18.jpg)
NVLINK
![Page 19: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/19.jpg)
Pascal GPU Features
NVLINK and Stacked Memory
NVLINK
GPU high speed interconnect
80-200 GB/s
3D Stacked Memory
4x Higher Bandwidth (~1 TB/s)
3x Larger Capacity
4x More Energy Efficient per bit
![Page 20: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/20.jpg)
NVLink Unleashes Multi-GPU Performance
GPUs Interconnected with NVLink
5x Faster than PCIe Gen3 x16
[Diagram: two TESLA GPUs, a CPU, and a PCIe Switch]
Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe
[Chart: Speedup vs PCIe based Server (y-axis, 1.00x to 2.00x) for ANSYS Fluent, AMBER, and 3D FFT]
3D FFT, ANSYS: 2 GPU configuration, AMBER Cellulose (256x128x128), FFT problem size (256^3), 4 GPU configuration
![Page 21: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/21.jpg)
NVLink High-Speed GPU Interconnect
Example: 8-GPU Server with NVLink
![Page 22: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/22.jpg)
Machine Learning
![Page 23: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/23.jpg)
Input Result
Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009
Visual Object Recognition Using Deep Convolutional Neural Networks
Rob Fergus (New York University / Facebook) http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php#2985
Machine Learning using Deep Neural Networks
![Page 24: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/24.jpg)
3 Drivers for Deep Learning
More Data • Better Models • Powerful GPU Accelerators
![Page 25: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/25.jpg)
Broad use of GPUs in Deep learning
Use Cases
Image Detection
Face Recognition
Gesture Recognition
Video Search & Analytics
Speech Recognition & Translation
Recommendation Engines
Indexing & Search
Early Adopters
Image Analytics for Creative Cloud
Image Classification
Speech/Image Recognition
Recommendation
Hadoop
Search Rankings
Talks @ GTC
![Page 26: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/26.jpg)
What is Next?
Analyzing Unstructured Data
Anomaly Detection
Behavior Prediction
Diagnostic Support
….
Language Analysis
“Any product that excites you over the next five years and makes you think: ‘That is magical, how did they do that?’, is probably based on this [deep learning].”
Steve Jurvetson, Partner DFJ Venture
![Page 27: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/27.jpg)
Beyond Deep Learning
Graph Analytics • Database Acceleration • Real Time Analytics
Visualization of social network analysis by Martin Grandjean is licensed under CC BY-SA 3.0
![Page 28: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/28.jpg)
GTC 2015 had many Deep Learning Sessions
Adobe Google
Alibaba iFlytek, Ltd
Baidu NUANCE
Carnegie Mellon Stanford Univ
Facebook UC Berkeley
Flickr / Yahoo Univ of Toronto
Check GTC on-demand: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
![Page 29: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/29.jpg)
NVIDIA in OCP
![Page 30: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/30.jpg)
Unlocking Access to the Tesla Platform
Engaging with OEMs, end customers and technology partners to include NVIDIA Accelerators in the OCP Platform
NVLINK based designs for maximum performance
Standard PCIe designs for scale out
![Page 31: OCP MEET UP CERN Low-Latency Accelerated Computing on GPUs Dr. Christoph Angerer DevTech, NVIDIA](https://reader033.vdocuments.net/reader033/viewer/2022051401/56649e845503460f94b85670/html5/thumbnails/31.jpg)
NVLink High-Speed GPU Interconnect
Enabling NVLink GPU-CPU connections
[Diagram: KEPLER GPU (2014) and PASCAL GPU (2016); PCIe links to X86, ARM64, POWER CPU; NVLink link to POWER CPU]