Accélération de calculs de simulations des matériaux sur GPU (Accelerating materials-simulation computations on GPUs)
François Courteille, Senior Solution Architect, Accelerated Computing, NVIDIA
AGENDA
Tesla, the NVIDIA compute platform
Quantum Chemistry Applications Overview
Molecular Dynamics Applications Overview
Material Science Applications at the start of the Pascal Era
THE WORLD LEADER IN VISUAL COMPUTING
GAMING ENTERPRISE TECHNOLOGY HPC & CLOUD AUTO
OUR TEN YEARS IN HPC (2006-2016)
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- Discovered how H1N1 mutates to resist drugs
- Stanford builds AI machine using GPUs
- Oak Ridge deploys world's fastest supercomputer with GPUs
- AlexNet beats expert code by a huge margin using GPUs
- World's first atomic model of the HIV capsid
- World's first 3-D mapping of the human genome
- Google outperforms humans on ImageNet
- GPU-trained AI machine beats the world champion in Go
CREDENTIALS BUILT OVER TIME
- 300K CUDA developers, 4x growth in 4 years
- The majority of HPC applications are GPU-accelerated: 410 and growing
- 100% of deep learning frameworks are accelerated

GPU-accelerated applications by year (chart): 2011: 113, 2012: 206, 2013: 242, 2014: 287, 2015: 370, 2016: 410, spanning academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, and M&E.

Accelerated deep learning frameworks include Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Big Sur, TensorFlow, Watson, and CNTK.

www.nvidia.com/appscatalog
END-TO-END TESLA PRODUCT FAMILY

| Product | Segment | Target use |
|---|---|---|
| Tesla M4, M40 | Hyperscale HPC | Hyperscale deployment for DL training, inference, video & image processing |
| Tesla K80 | Mixed-apps HPC | HPC data centers running a mix of CPU and GPU workloads |
| Tesla P100 | Strong-scaling HPC | Hyperscale & HPC data centers running apps that scale to multiple GPUs |
| DGX-1 | Fully integrated DL supercomputer | For customers who need to get going now with a fully integrated solution |
TESLA FOR SIMULATION
Tesla Accelerated Computing: libraries, languages, and directives (the Accelerated Computing Toolkit).
OVERVIEW OF LIFE & MATERIAL ACCELERATED APPS

MD: all key codes are GPU-accelerated
- Great multi-GPU performance
- Focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes
- ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: all key codes are ported or being optimized
- Focus on GPU-accelerated math libraries and OpenACC directives
- GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, Q-Chem, QMCPack, Quantum Espresso/PWscf, QUICK, TeraChem*
- Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

* (shown in green on the slide) = application where >90% of the workload runs on the GPU
MD VS. QC ON GPUS

| "Classical" Molecular Dynamics | Quantum Chemistry (MO, PW, DFT, Semi-Emp) |
|---|---|
| Simulates positions of atoms over time; chemical-biological or chemical-material behaviors | Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties |
| Forces calculated from simple empirical formulas (bond rearrangement generally forbidden) | Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies) |
| Up to millions of atoms | Up to a few thousand atoms |
| Solvent included without difficulty | Generally in vacuum; if needed, solvent is treated classically (QM/MM) or with implicit methods |
| Single precision dominates | Double precision is important |
| Uses cuBLAS, cuFFT, CUDA | Uses cuBLAS, cuFFT, OpenACC |
| GeForce (academics), Tesla (servers) | Tesla recommended |
| ECC off | ECC on |
GPU-ACCELERATED QUANTUM CHEMISTRY APPS
Abinit, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS

Green lettering indicates performance slides included. GPU performance is compared against a dual-socket multi-core x86 CPU.
ABINIT
USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)
FROM RESEARCH TO INDUSTRY
6th International ABINIT Developer Workshop, April 15-18, 2013, Dinard, France
Marc Torrent, CEA, DAM, DIF, F-91297 Arpajon, France
Florent Dahm, C-S, 22 av. Galilée, 92350 Le Plessis-Robinson
With a contribution from Yann Pouillon for the build system
www.cea.fr
ABINIT ON GPU
OUTLINE
Introduction
- Introduction to GPU computing: architecture, programming model
- Density-functional theory with plane waves: how to port it to GPU
ABINIT on GPU
- Implementation
- Performance of ABINIT v7.0
How To…
- Compile
- Use
ABINIT on GPU | April 15, 2013 | PAGE 13
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

In reciprocal space the local potential couples plane waves $\mathbf{g}$ and $\mathbf{g}'$ through its Fourier components,

$$v_{\mathrm{eff}}(\mathbf{g},\mathbf{g}') = \int d\mathbf{r}\; e^{-i\mathbf{g}\cdot\mathbf{r}}\, v_{\mathrm{eff}}(\mathbf{r})\, e^{i\mathbf{g}'\cdot\mathbf{r}},$$

so applying $v_{\mathrm{eff}}$ to $\tilde\psi(\mathbf{g})$ is a convolution over $\mathbf{g}'$. Using FFTs it becomes a pointwise product: transform $\tilde\psi(\mathbf{g})$ to real space, multiply by $v_{\mathrm{eff}}(\mathbf{r})$, and transform back.

- Included in the CUDA package: the cuFFT library
- Interfaced with C and Fortran
- A system-dependent efficiency is expected: small FFTs underuse the GPU and are dominated by the time required to transfer the data to/from the GPU
- Within PAW, small plane-wave bases are used

Our choice (for a first version): replace the MPI plane-wave level by an FFT computation on the GPU.
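The FFT trick above can be sketched in a few lines of NumPy. This is a CPU stand-in for the cuFFT path, not ABINIT's code; the function name and grid are illustrative:

```python
import numpy as np

def apply_local_potential(psi_g, v_eff_r):
    """Apply a local potential v_eff to a plane-wave wavefunction.

    Rather than the dense convolution sum_{g'} v_eff(g - g') psi(g') in
    reciprocal space, transform to real space (FFT), multiply pointwise
    (the local operator is diagonal there), and transform back.
    """
    psi_r = np.fft.ifftn(psi_g)           # reciprocal -> real space
    return np.fft.fftn(v_eff_r * psi_r)   # real -> reciprocal space

# Toy check on an 8x8x8 grid: a constant potential just scales psi.
n = 8
psi_g = np.zeros((n, n, n), dtype=complex)
psi_g[1, 0, 0] = 1.0                      # a single plane wave
out = apply_local_potential(psi_g, 2.0 * np.ones((n, n, n)))
```

Two FFTs plus a pointwise multiply cost O(N log N) instead of the O(N^2) reciprocal-space sum, which is why the FFT size (and the transfer time for small grids) dominates the GPU efficiency.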
LOCAL OPERATOR: FAST FOURIER TRANSFORMS (fourwf, option 2)

Performance is highly dependent on the FFT size (plot: speedup vs. FFT size).

TEST CASE: 107-gold-atom cell, on TGCC-Curie (Intel Westmere + NVIDIA M2090)

Implementation:
- CUDA/Fortran interface
- New routines that encapsulate cuFFT calls
- Development of 3 small CUDA kernels
- Optimization of memory copies to overlap computation and data transfer
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Improvement: multiple-band optimization. TEST CASE: TGCC-Curie (Intel Westmere + NVIDIA M2090); fourwf time on GPU (s) vs. bandpp.

107 gold atoms:

| bandpp | fourwf on GPU (s) |
|---|---|
| 1 | 112.94 |
| 2 | 84.55 |
| 4 | 70.55 |
| 8 | 61.91 |
| 18 | 54.79 |
| 72 | 50.55 |

31 copper atoms:

| bandpp | fourwf on GPU (s) |
|---|---|
| 1 | 36.02 |
| 2 | 20.79 |
| 4 | 14.14 |
| 10 | 9.16 |
| 20 | 9.50 |
| 100 | 7.94 |

- The CUDA kernels are used more intensively, and fewer transfers are required.
- See the bandpp variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is employed.
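Why blocking bands helps can be illustrated with a toy cost model (all numbers hypothetical, not measurements): each host-to-device transfer pays a fixed latency plus a payload cost, so grouping bandpp bands per transfer amortizes the latency, consistent with the measured trend above.

```python
def fourwf_transfer_cost(n_bands, bandpp, latency=1.0, per_band=0.1):
    """Toy cost model for host<->device traffic (hypothetical constants).

    Each transfer costs a fixed latency plus a per-band payload term;
    sending bandpp bands at a time divides the number of transfers.
    """
    n_transfers = n_bands // bandpp
    return n_transfers * (latency + bandpp * per_band)

# Larger blocks -> fewer transfers -> lower total cost:
costs = [fourwf_transfer_cost(720, bpp) for bpp in (1, 2, 4, 8, 18, 72)]
```

In this model the payload term is irreducible, so the gain saturates once the latency is amortized, much like the diminishing returns seen in the tables above for large bandpp.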
NON-LOCAL OPERATOR

In PAW, the non-local operator is applied through projectors $\tilde p_i$:

$$\langle \mathbf{g} | V_{NL} | \tilde\psi \rangle = \sum_{i,j} \langle \mathbf{g} | \tilde p_i \rangle\, D_{ij}\, \langle \tilde p_j | \tilde\psi \rangle$$

Implementation:
- Development of 2 CUDA kernels, for the projections $\langle \tilde p_j | \tilde\psi \rangle$ and for $\langle \mathbf{g} | V_{NL} | \tilde\psi \rangle$
- Collaborative memory zones are used as much as possible (threads linked to the same plane wave lie in the same collaborative zone)
- Used for: the non-local operator, the overlap operator, forces and the stress tensor, and the PAW occupation matrix
- In addition to all MPI parallelization levels!
- Needs a large number of plane waves and a large number of non-local projectors

Performance (NL operator only), TGCC-Curie (Intel Westmere + NVIDIA M2090):

| System | nonlop CPU time (s) | nonlop GPU time (s) | Speedup GPU vs. CPU |
|---|---|---|---|
| 107 gold atoms | 3142 | 1122 | 2.8 |
| 31 copper atoms | 85 | 42 | 2.0 |
| 3 UO2 atoms | 63 | 36 | 1.8 |
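A serial NumPy sketch of the projector algebra behind the formula above (the GPU kernels batch these products over bands; the array names are illustrative, not ABINIT's):

```python
import numpy as np

def apply_nonlocal(psi_g, proj_g, dij):
    """V_NL |psi> = sum_{i,j} |p_i> D_ij <p_j|psi>.

    psi_g:  (n_pw,) wavefunction coefficients on the plane-wave grid
    proj_g: (n_proj, n_pw) projectors <g|p_i>
    dij:    (n_proj, n_proj) PAW coupling matrix
    """
    c = proj_g.conj() @ psi_g        # projections <p_j|psi>
    return proj_g.T @ (dij @ c)      # recombination on the plane-wave grid

# Small random demo:
rng = np.random.default_rng(0)
n_pw, n_proj = 64, 4
proj_g = rng.standard_normal((n_proj, n_pw)) + 1j * rng.standard_normal((n_proj, n_pw))
dij = np.diag(rng.standard_normal(n_proj))
psi_g = rng.standard_normal(n_pw) + 1j * rng.standard_normal(n_pw)
vnl_psi = apply_nonlocal(psi_g, proj_g, dij)
```

Both steps are dense matrix-vector products, which is why the operator pays off on the GPU only with many plane waves and many projectors.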
MATRIX-MATRIX AND MATRIX-VECTOR PRODUCTS
Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG)

- These products are not parallelized; the size of the vectors/matrices increases with the number of bands/plane waves per CPU core. Our choice: use cuBLAS.
- The matrix size also increases when band parallelization is used, and the diagonalization/orthogonalization within blocks needs a free LAPACK package on the GPU. Our choice: use MAGMA.

Algorithm 1 (LOBPCG):

Require: $\Psi^{(0)} = \{\psi_1^{(0)}, \ldots, \psi_m^{(0)}\}$ close to the minimum and a preconditioner $K$; $P^{(0)} = \{P_1^{(0)}, \ldots, P_m^{(0)}\}$ is initialized to 0.
1: for $i = 0, 1, \ldots$ do
2:   $\Lambda^{(i)} = \Lambda(\Psi^{(i)})$
3:   $R^{(i)} = H\Psi^{(i)} - \Lambda^{(i)} O \Psi^{(i)}$
4:   $W^{(i)} = K R^{(i)}$
5:   the Rayleigh-Ritz method is applied within the subspace $\Xi = \{P_1^{(i)}, \ldots, P_m^{(i)}, \psi_1^{(i)}, \ldots, \psi_m^{(i)}, W_1^{(i)}, \ldots, W_m^{(i)}\}$
6:   $\Psi^{(i+1)} = \alpha^{(i)}\Psi^{(i)} + \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
7:   $P^{(i+1)} = \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
8: end for

Timing vs. bandpp (s):

| bandpp | Total time | Time in linear algebra |
|---|---|---|
| 1 | 674 | 119 |
| 2 | 612 | 116 |
| 10 | 1077 | 428 |
| 50 | 5816 | 5167 |
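As an illustration of the blocked eigensolver, here is LOBPCG run on a toy symmetric matrix via SciPy's reference implementation (CPU-only; a stand-in for the ABINIT solver whose dense products go through cuBLAS/MAGMA):

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

# LOBPCG finds the lowest m eigenpairs of a symmetric matrix while
# iterating on a whole block of vectors at once (the "bands").
rng = np.random.default_rng(0)
n, m = 200, 4                         # problem size, block size
A = np.diag(np.arange(1.0, n + 1))    # spectrum is known: 1, 2, ..., n
X = rng.standard_normal((n, m))       # initial block of vectors
eigvals, eigvecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=500)
```

The block structure is what makes the method GPU-friendly: each iteration is dominated by dense matrix products over the whole block rather than many small vector operations.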
MAGMA: MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES
- A project by the Innovative Computing Laboratory, University of Tennessee
- Free to use
- First stable release 25-08-2011; current version 1.3
- The computation is split so that the most time-consuming parts are offloaded to the GPU
- Goal: address hybrid environments. The MAGMA project aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures, enabling applications to fully exploit the power that each of the hybrid components offers, starting with current multicore+GPU systems.
ITERATIVE EIGENSOLVER

Performance, TGCC-Curie (Intel Westmere + NVIDIA M2090):
- Highly dependent on the physical system
- Depends only on the size of the blocks used in the LOBPCG algorithm
- (Plot: speedup of the cuBLAS/MAGMA code on a single MPI process, roughly 0.5x to 4x, vs. bandpp = 2, 4, 10, the factor related to the block (bands) size, for 215 silicon atoms, 107 gold atoms, 31 copper atoms, and 3 UO2 atoms.)
PERFORMANCE ON A SINGLE GPU

Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi)

Results are strongly dependent on the architecture (Titane: fast CPU + old GPU), and the BaTiO3 case is too small to benefit from the GPU.

Time on proc 0 (s):

| System | Titane CPU | Titane CPU+GPU | Speedup | Curie CPU | Curie CPU+GPU | Speedup |
|---|---|---|---|---|---|---|
| 107 gold atoms | 4230 | 1857 | 2.3 | 5154 | 621 | 8.3 |
| 31 copper atoms | 153 | 102 | 1.5 | 235 | 64 | 3.7 |
| 5 BaTiO3 atoms | 85 | 98 | 0.9 | 86 | 73 | 1.2 |
PERFORMANCE ON MULTIPLE GPUS

TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi). The GPU code becomes less efficient when the number of MPI processes increases (the load per GPU decreases).

8 CPU (MPI only) + 8 GPU, time on proc 0 (s):

| System | CPU | CPU+GPU | Speedup |
|---|---|---|---|
| 215 silicon atoms (45 iter.) | 25856 | 6309 | 4.1 |

200 CPU (MPI only) + 200 GPU, time on proc 0 (s):

| System | CPU | CPU+GPU | Speedup |
|---|---|---|---|
| 215 silicon atoms (1 iter.) | 16160 | 3680 | 4.4 |
| 107 gold atoms | 866 | 621 | 4.7 |
| 31 copper atoms | 49 | 64 | 1.5 |
PERFORMANCE VS. OTHER CODES

Other DFT codes have also been ported to GPU, and the speedups of plane-wave codes are similar:
- Quantum ESPRESSO (plane waves): 3-4x (sequential), 2-3x (parallel) [1,2]
- VASP (plane waves): 3-4x [3,4]
- BigDFT (wavelets): 5-7x [5]
- GPAW (real space): 10-15x [1,6]

[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, LNCS vol. 7782, p. 63, Springer, Heidelberg (2013)
BigDFT
Gaussian
ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC
April 4-7, 2016 | Silicon Valley
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)
EARLIER PRESENTATIONS
- GRC Poster 2012
- ACS Spring 2014
- GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
- WATOC Fall 2014
- GTC Spring 2016 (full recording at http://mygtc.gputechconf.com/quicklink/4r13O5r; requires registration)
CURRENT STATUS: SINGLE NODE

Implemented:
- Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)
- First derivatives for the same as above
- Second derivatives for the same as above

Using only:
- OpenACC
- CUDA library calls (BLAS)

(6/29/2016)
GAUSSIAN PARALLELISM MODEL (diagram)
- CPU cluster / CPU node: OpenMP
- GPU: OpenACC
CLOSING REMARKS
- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- Expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections
EARLY PERFORMANCE RESULTS (CCSD(T))

Ibuprofen CCSD(T) calculation, speedups relative to a CPU-only full node (20 threads, 0 GPUs vs. 6 threads, 6 GPUs). (Chart: speedup bars for Total, l913 "triples", and iterative CCSD; bar labels give times in hours: 24.8, 24.8, 22.7, 2.1, 3.7, 3.6, 1.3, 2.3.)

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600); 108 GB used
GPUs: 6x Tesla K40m (15 SMs @ 875 MHz); 12 GB global memory

| Method | CCSD(T) |
|---|---|
| No. of atoms | 33 |
| Basis set | 6-31G(d,p) |
| No. of basis functions | 315 |
| No. of occupied orbitals | 41 |
| No. of virtual orbitals | 259 |
| No. of cycles | 15 |
| No. of CCSD iterations | 16 |
VASP
GPU VASP COLLABORATION

Collaborators include U of Chicago.

2013-2014 project scope:
- Minimization algorithms to calculate the electronic ground state:
  - Blocked Davidson (ALGO = NORMAL & FAST)
  - RMM-DIIS (ALGO = VERYFAST & FAST)
  - K-points
- Optimization of a critical step in exact-exchange calculations

Earlier work:
- "Speeding up plane-wave electronic-structure calculations using graphics-processing units", Maintz, Eck, Dronskowski
- "VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron", Hutchinson, Widom
- "Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units", Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard
TARGET WORKLOADS
- Silica ("medium"): 7 Å-thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VERYFAST)
- NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species in total; blocked Davidson (ALGO = NORMAL)
- INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 water molecules (468 atoms); blocked Davidson & RMM-DIIS (ALGO = FAST)
RUNTIME DISTRIBUTION FOR SILICA
Time in seconds for 1 K40 GPU + 1 IvyBridge core. (Chart: 0-3500 s runtime bars for CPU, the original GPU port, and the optimized GPU port, broken down into Memcopy, GEMM, FFT, and Other.)
March 2016: VASP 5.4.1 with Patch #1 (Haswell & K80)
VASP INTERFACE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO=Fast). The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix: patch #1 for vasp.5.4.1.05Feb15 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 828.27 | - |
| 1 node + 1x K80 | 580.63 | 1.4x |
| 1 node + 2x K80 [Old Data] | 506.25 | 1.6x |
| 1 node + 2x K80 | 387.26 | 2.1x |
| 1 node + 4x K80 | 284.06 | 3.0x |
VASP SILICA IFPEN BENCHMARK

Running VASP version 5.4.1, RMM-DIIS (ALGO=Veryfast); hardware and "[Old Data]" definition as in the Interface benchmark above. Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 542.87 | - |
| 1 node + 1x K80 | 432.48 | 1.3x |
| 1 node + 2x K80 [Old Data] | 372.74 | 1.5x |
| 1 node + 2x K80 | 258.14 | 2.1x |
| 1 node + 4x K80 | 210.47 | 2.6x |
VASP SI-HUGE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO=Fast); hardware and "[Old Data]" definition as in the Interface benchmark above. Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 6464.42 | - |
| 1 node + 1x K80 | 4296.15 | 1.5x |
| 1 node + 2x K80 [Old Data] | 2871.09 | 2.3x |
| 1 node + 2x K80 | 2865.92 | 2.3x |
| 1 node + 4x K80 | 1987.94 | 3.3x |
VASP SUPPORTEDSYSTEMS BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO=Fast); hardware and "[Old Data]" definition as in the Interface benchmark above. Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 310.63 | - |
| 1 node + 1x K80 | 252.43 | 1.2x |
| 1 node + 2x K80 [Old Data] | 228.90 | 1.4x |
| 1 node + 2x K80 | 164.21 | 1.9x |
| 1 node + 4x K80 | 129.99 | 2.4x |
VASP NIAL-MD BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO=Fast); hardware and "[Old Data]" definition as in the Interface benchmark above. Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 722.82 | - |
| 1 node + 1x K80 | 237.02 | 3.0x |
| 1 node + 2x K80 [Old Data] | 197.19 | 3.7x |
| 1 node + 2x K80 | 184.93 | 3.9x |
| 1 node + 4x K80 | 142.98 | 5.1x |
VASP B.HR105 BENCHMARK

Running VASP version 5.4.1, hybrid functional with blocked Davidson (ALGO=Normal); hardware and "[Old Data]" definition as in the Interface benchmark above. Elapsed time (s), lower is better:

| Configuration | Time (s) | Speedup |
|---|---|---|
| 1 node | 1196.86 | - |
| 1 node + 1x K80 | 604.05 | 2.0x |
| 1 node + 2x K80 [Old Data] | 334.44 | 3.6x |
| 1 node + 2x K80 | 343.69 | 3.5x |
| 1 node + 4x K80 | 236.13 | 5.1x |
49
VASP B.aP107 Benchmark

| Configuration | Elapsed time (s) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Node (CPU only) | 27993.62 | 1.0X |
| 1 Node + 1x K80 | 7494.36 | 3.7X |
| 1 Node + 2x K80 [Old Data] | 7632.31 | 3.7X |
| 1 Node + 2x K80 | 4432.31 | 6.3X |
| 1 Node + 4x K80 | 4513.40 | 6.2X |

*Lower elapsed time is better. Running VASP version 5.4.1.
The blue node contains Dual Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPUs; the green nodes contain Dual Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = results before patch #1 for vasp.5.4.1.05Feb16, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Hybrid functional calculation (exact exchange) with blocked Davidson (ALGO=Normal). No k-point parallelization.
50
GPU-ACCELERATED MOLECULAR DYNAMICS APPS

ACEMD, AMBER, CHARMM, DESMOND, ESPResSO, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering indicates performance slides included.
GPU performance compared against dual multi-core x86 CPU socket.
NAMD 2.11
53
NEW GPU FEATURES IN NAMD 2.11
• GPU-accelerated simulations up to twice as fast as NAMD 2.10
• Pressure calculation with fixed atoms on GPU works as on CPU
• Improved scaling for GPU-accelerated particle-mesh Ewald calculation
  • CPU-side operations overlap better and are parallelized across cores
• Improved scaling for GPU-accelerated simulations
  • Nonbonded force calculation results are streamed from the GPU for better overlap
• NVIDIA CUDA GPU-acceleration binaries for Mac OS X

Selected text from the NAMD website.
54
NAMD 2.11 IS UP TO 2X FASTER

[Chart: APoA1 (92,224 atoms), simulated time in ns/day for NAMD 2.10 vs NAMD 2.11. NAMD 2.11 is 1.2X faster on 1 node, 1.6X on 2 nodes, and 2.0X on 4 nodes.]

NAMD 2.10 and NAMD 2.11 runs use nodes with Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs.
55
NAMD 2.11 APOA1 ON 1 AND 2 NODES

APoA1 (92,224 atoms); higher ns/day is better; speedups are relative to the CPU-only result at the same node count.

| Configuration | Simulated time (ns/day) | Speedup |
| --- | --- | --- |
| 1 Node (CPU only) | 2.77 | 1.0X |
| 1 Node + 1x K80 | 11.67 | 4.2X |
| 1 Node + 2x K80 | 16.99 | 6.1X |
| 2 Nodes (CPU only) | 5.22 | 1.0X |
| 2 Nodes + 1x K80 | 19.73 | 3.8X |
| 2 Nodes + 2x K80 | 24.31 | 4.7X |

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
56
NAMD 2.11 APOA1 ON 4 AND 8 NODES

APoA1 (92,224 atoms); higher ns/day is better; speedups are relative to the CPU-only result at the same node count.

| Configuration | Simulated time (ns/day) | Speedup |
| --- | --- | --- |
| 4 Nodes (CPU only) | 10.27 | 1.0X |
| 4 Nodes + 1x K80 | 20.64 | 2.0X |
| 4 Nodes + 2x K80 | 23.52 | 2.3X |
| 8 Nodes (CPU only) | 16.85 | 1.0X |
| 8 Nodes + 1x K80 | 27.83 | 1.7X |
| 8 Nodes + 2x K80 | 27.74 | 1.6X |

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
57
NAMD 2.11 STMV ON 1 AND 2 NODES

STMV (1,066,628 atoms); higher ns/day is better; speedups are relative to the CPU-only result at the same node count.

| Configuration | Simulated time (ns/day) | Speedup |
| --- | --- | --- |
| 1 Node (CPU only) | 0.23 | 1.0X |
| 1 Node + 1x K80 | 1.03 | 4.5X |
| 1 Node + 2x K80 | 1.75 | 7.6X |
| 2 Nodes (CPU only) | 0.46 | 1.0X |
| 2 Nodes + 1x K80 | 1.98 | 4.3X |
| 2 Nodes + 2x K80 | 3.27 | 7.1X |

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
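Throughput in ns/day translates directly into wall-clock time for a target trajectory length, which is what makes these speedups matter in practice. A small sketch using the STMV single-node numbers above:

```python
# Wall-clock days needed to reach a target trajectory length,
# given throughput in ns/day (STMV, 1,066,628 atoms, single node).
def days_to_simulate(target_ns, ns_per_day):
    return target_ns / ns_per_day

target = 1000.0  # a 1-microsecond trajectory

cpu_days = days_to_simulate(target, 0.23)   # 1 Node, CPU only
gpu_days = days_to_simulate(target, 1.75)   # 1 Node + 2x K80

print(f"CPU only: {cpu_days:.0f} days (~{cpu_days / 365:.1f} years)")
print(f"With 2x K80: {gpu_days:.0f} days")
# CPU only: ~4348 days (~11.9 years); with 2x K80: ~571 days
```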
58
NAMD 2.11 STMV ON 4 AND 8 NODES

STMV (1,066,628 atoms); higher ns/day is better; speedups are relative to the CPU-only result at the same node count.

| Configuration | Simulated time (ns/day) | Speedup |
| --- | --- | --- |
| 4 Nodes (CPU only) | 0.90 | 1.0X |
| 4 Nodes + 1x K80 | 3.61 | 4.0X |
| 4 Nodes + 2x K80 | 4.54 | 5.0X |
| 8 Nodes (CPU only) | 1.74 | 1.0X |
| 8 Nodes + 1x K80 | 5.86 | 3.4X |
| 8 Nodes + 2x K80 | 6.24 | 3.6X |

Running NAMD version 2.11.
The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
AMBER 14
60
JAC on K40s, K80s and M6000s

PME-JAC_NVE; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Haswell Node (CPU only) | 37.38 | 1.0X |
| 1 CPU Node + 1x K40 | 134.82 | 3.6X |
| 1 CPU Node + 0.5x K80 | 121.30 | 3.2X |
| 1 CPU Node + 1x K80 | 174.34 | 4.7X |
| 1 CPU Node + 1x M6000 | 161.53 | 4.3X |
| 1 CPU Node + 2x K40 | 200.34 | 5.4X |
| 1 CPU Node + 2x K80 | 225.34 | 6.0X |
| 1 CPU Node + 2x M6000 | 219.83 | 5.9X |

Running AMBER version 14.
The blue node contains Dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain the same CPUs + either NVIDIA Tesla K40@875Mhz, Tesla K80@562Mhz (autoboost), or Quadro M6000@987Mhz GPUs.
61
Cellulose on K40s, K80s and M6000s

PME-Cellulose_NVE; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Haswell Node (CPU only) | 1.93 | 1.0X |
| 1 CPU Node + 1x K40 | 8.96 | 4.6X |
| 1 CPU Node + 0.5x K80 | 7.87 | 4.1X |
| 1 CPU Node + 1x K80 | 11.76 | 6.1X |
| 1 CPU Node + 1x M6000 | 10.49 | 5.4X |
| 1 CPU Node + 2x K40 | 13.67 | 7.1X |
| 1 CPU Node + 2x K80 | 15.38 | 8.0X |
| 1 CPU Node + 2x M6000 | 14.90 | 7.7X |

Running AMBER version 14.
The blue node contains Dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain the same CPUs + either NVIDIA Tesla K40@875Mhz, Tesla K80@562Mhz (autoboost), or Quadro M6000@987Mhz GPUs.
62
Factor IX on K40s, K80s and M6000s

PME-FactorIX_NVE; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Haswell Node (CPU only) | 9.68 | 1.0X |
| 1 CPU Node + 1x K40 | 40.48 | 4.2X |
| 1 CPU Node + 0.5x K80 | 33.59 | 3.5X |
| 1 CPU Node + 1x K80 | 50.70 | 5.2X |
| 1 CPU Node + 1x M6000 | 47.80 | 5.0X |
| 1 CPU Node + 2x K40 | 61.18 | 6.4X |
| 1 CPU Node + 2x K80 | 60.93 | 6.3X |
| 1 CPU Node + 2x M6000 | 66.89 | 7.0X |

Running AMBER version 14.
The blue node contains Dual Intel E5-2698 v3@2.3GHz (3.6GHz Turbo) CPUs; the green nodes contain the same CPUs + either NVIDIA Tesla K40@875Mhz, Tesla K80@562Mhz (autoboost), or Quadro M6000@987Mhz GPUs.
63
JAC on M40s

PME - JAC_NVE; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Node (CPU only) | 21.11 | 1.0X |
| 1 Node + 1x M40 | 157.68 | 7.5X |
| 1 Node + 2x M40 | 230.18 | 10.9X |
| 1 Node + 4x M40 | 246.15 | 11.7X |

Running AMBER version 14.
The blue node contains a Single Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPU; the green nodes contain Single Intel Xeon E5-2697 v2@2.7GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs.
64
Cellulose on M40s

PME - Cellulose_NPT; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Node (CPU only) | 1.07 | 1.0X |
| 1 Node + 1x M40 | 10.12 | 9.5X |
| 1 Node + 2x M40 | 14.40 | 13.5X |
| 1 Node + 4x M40 | 15.90 | 14.9X |

Running AMBER version 14.
The blue node contains a Single Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPU; the green nodes contain Single Intel Xeon E5-2697 v2@2.7GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs.
65
Myoglobin on M40s

GB - Myoglobin; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Node (CPU only) | 9.83 | 1.0X |
| 1 Node + 1x M40 | 232.20 | 23.6X |
| 1 Node + 2x M40 | 300.86 | 30.6X |
| 1 Node + 4x M40 | 322.09 | 32.8X |

Running AMBER version 14.
The blue node contains a Single Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPU; the green nodes contain Single Intel Xeon E5-2697 v2@2.7GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs.
66
Nucleosome on M40s

GB - Nucleosome; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Node (CPU only) | 0.13 | 1.0X |
| 1 Node + 1x M40 | 4.67 | 35.9X |
| 1 Node + 2x M40 | 9.05 | 69.6X |
| 1 Node + 4x M40 | 16.11 | 123.9X |

Running AMBER version 14.
The blue node contains a Single Intel Xeon E5-2698 v3@2.3GHz (Haswell) CPU; the green nodes contain Single Intel Xeon E5-2697 v2@2.7GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs.
October 2015
GROMACS 5.1
68
3M Waters on K40s and K80s

Water [PME] 3M; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Haswell Node (CPU only) | 0.81 | 1.0X |
| 1 CPU Node + 1x K40 | 1.32 | 1.6X |
| 1 CPU Node + 1x K80 | 1.88 | 2.3X |
| 1 CPU Node + 2x K40 | 1.85 | 2.3X |
| 1 CPU Node + 2x K80 | 2.72 | 3.4X |
| 1 CPU Node + 4x K40 | 2.76 | 3.4X |
| 1 CPU Node + 4x K80 | 3.23 | 4.0X |

Running GROMACS version 5.1.
The blue node contains Dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz CPUs + either NVIDIA Tesla K40@875Mhz or Tesla K80@562Mhz (autoboost) GPUs.
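Multi-GPU scaling on this PME water benchmark is far from linear, and the parallel efficiency can be read straight off the ns/day numbers above. A sketch (efficiency taken relative to a single GPU of the same type):

```python
# Multi-GPU parallel efficiency = throughput(n GPUs) / (n * throughput(1 GPU)),
# using the Water [PME] 3M ns/day numbers from the chart.
def efficiency(n_gpus, ns_day_n, ns_day_1):
    return ns_day_n / (n_gpus * ns_day_1)

k40 = {1: 1.32, 2: 1.85, 4: 2.76}
k80 = {1: 1.88, 2: 2.72, 4: 3.23}

for name, perf in (("K40", k40), ("K80", k80)):
    for n in (2, 4):
        print(f"{name} x{n}: {efficiency(n, perf[n], perf[1]):.0%} efficient")
# K40 drops to ~52% at 4 GPUs and K80 to ~43%: PME communication limits scaling
```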
69
3M Waters on Titan X

Water [PME] 3M; higher ns/day is better.

| Configuration | Simulated time (ns/day) | Speedup vs CPU-only |
| --- | --- | --- |
| 1 Haswell Node (CPU only) | 0.81 | 1.0X |
| 1 CPU Node + 1x TitanX | 1.53 | 1.9X |
| 1 CPU Node + 2x TitanX | 2.36 | 2.9X |
| 1 CPU Node + 4x TitanX | 2.99 | 3.7X |

Running GROMACS version 5.1.
The blue node contains Dual Intel E5-2698 v3@2.3GHz CPUs; the green nodes contain Dual Intel E5-2698 v3@2.3GHz CPUs + GeForce GTX TitanX@1000Mhz GPUs.
70
TESLA P100 ACCELERATORS

Tesla P100 for NVLink-enabled Servers:
5.3 TF DP ∙ 10.6 TF SP ∙ 21 TF HP
720 GB/sec Memory Bandwidth, 16 GB

Tesla P100 for PCIe-Based Servers:
4.7 TF DP ∙ 9.3 TF SP ∙ 18.7 TF HP
Config 1: 16 GB, 720 GB/sec
Config 2: 12 GB, 540 GB/sec

Page Migration Engine (Unified Memory between CPU and Tesla P100):
Simpler parallel programming ∙ Virtually unlimited data size ∙ Performance with data locality
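The three throughput figures per card are not independent: on P100, half precision runs at twice the single-precision rate, which in turn is twice the double-precision rate. A quick consistency check on the slide's (rounded) numbers:

```python
# P100 throughput ratios: HP = 2 x SP = 4 x DP (slide figures, rounded).
nvlink = {"DP": 5.3, "SP": 10.6, "HP": 21.0}
pcie = {"DP": 4.7, "SP": 9.3, "HP": 18.7}

for name, tf in (("NVLink", nvlink), ("PCIe", pcie)):
    sp_over_dp = tf["SP"] / tf["DP"]
    hp_over_sp = tf["HP"] / tf["SP"]
    print(f"{name}: SP/DP = {sp_over_dp:.2f}, HP/SP = {hp_over_sp:.2f}")
# Every ratio comes out at ~2.0
```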
72
EXTRAORDINARY STRONG SCALING
One Strong Node Faster Than Lots of Weak Nodes

[Chart 1: VASP performance, Si-Huge dataset. Speedup vs 1 CPU server node, plotted against 0 to 32 CPU server nodes. A single PCIe node with 2x, 4x, or 8x P100 is compared against up to 32 CPU nodes; speedup axis runs 0x to 12x.]

[Chart 2: AMBER simulation of CRISPR, nature's tool for genome editing. ns/day vs number of processors (CPUs and GPUs): 1 node with 4 P100 GPUs outperforms 48 CPU nodes of the Comet supercomputer.]

CPU: Dual Socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB.
VASP 5.4.1_05Feb16, Si-Huge dataset; 16 and 32 nodes estimated from the 4-to-8-node scaling.
AMBER 16 pre-release, CRISPR based on PDB ID 5f9r, 336,898 atoms.
74
MASSIVE LEAP IN PERFORMANCE

[Chart: speedup vs a dual-socket Broadwell CPU node for NAMD, VASP, HOOMD-Blue, and AMBER, with 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe); speedup axis runs 0x to 30x.]

CPU: Xeon E5-2697 v4, 2.3 GHz, 3.6 GHz Turbo. Caffe AlexNet: batch size of 256 images. VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.
76
TCO OVERVIEW

EQUAL PERFORMANCE IN FEWER RACKS:
[Chart: racks and power required for equal performance. Caffe AlexNet: 960 CPUs / 260 kW in 7 racks; VASP: 1280 CPUs / 350 kW in 9 racks; versus a single rack of 20 CPUs + 80 P100s at 32 kW. Axis runs from 2 racks (80kW) to 10 racks (400kW).]

BUDGET: SMALLER, MORE EFFICIENT:
[Pie charts: datacenter budget split. One breakdown shows compute servers at 85% of cost and non-compute at 15%; the other shows compute servers at 39% and non-compute (rack, cabling, infrastructure, networking) at 61%.]

Sources: Microsoft Research on datacenter costs.
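The headline numbers on this slide reduce to two ratios: racks and power for equal performance. A small sketch using the slide's figures (rack counts as read off the chart):

```python
# Equal-performance comparison from the TCO slide: CPU-only deployments vs
# one GPU-accelerated rack (20 CPUs + 80 P100s, 32 kW).
gpu_rack_kw = 32.0

workloads = {
    "Caffe AlexNet": {"racks": 7, "kw": 260.0},   # 960 CPUs
    "VASP":          {"racks": 9, "kw": 350.0},   # 1280 CPUs
}

for name, w in workloads.items():
    print(f"{name}: {w['racks']}x fewer racks, "
          f"{w['kw'] / gpu_rack_kw:.1f}x less power")
# Caffe AlexNet: ~8.1x less power; VASP: ~10.9x less power
```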
77
NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
8x Tesla P100 16GB
NVLink Hybrid Cube Mesh
Accelerates Major AI Frameworks
Dual Xeon
7 TB SSD Deep Learning Cache
Dual 10GbE, Quad IB 100Gb
3RU – 3200W
79
THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?