Perlmutter: A 2020 Pre-Exascale GPU-Accelerated System for Simulation, Data, and Learning
iCAS 2019, September 9, 2019
Richard Gerber, NERSC Senior Science Advisor and HPC Department Head
Many thanks to Nicholas J. Wright, Perlmutter Chief Architect
NERSC: Mission HPC for DOE Office of Science Research
The DOE Office of Science is the largest funder of physical science research in the U.S.
7,000 users, 800 projects, 700 codes, 50 states, 40 countries, universities & national labs
Research areas: Bio Energy, Environment, Computing; Particle Physics, Astrophysics; Nuclear Physics; Materials, Chemistry, Geophysics; Fusion Energy, Plasma Physics
NERSC’s Users Produce Groundbreaking Science: 2,500 refereed publications per year
● Materials Science: Revealing Reclusive Mechanisms for Solar Cells. NERSC PI: C. Van de Walle, UC Santa Barbara. ACS Energy Letters
● Earth Sciences: Simulations Probe Antarctic Ice Vulnerability. NERSC PIs: D. Martin and E. Ng, Berkeley Lab; S. Price, LANL. Geophysical Research Letters
● Advanced Computing: Scalable Machine Learning in HPC. NERSC PI: L. Oliker, Berkeley Lab. 21st International Conference on AI and Statistics
● Plasma Physics: Plasma Propulsion Systems for Satellites. NERSC PI: I. Kaganovich, Princeton Plasma Physics Lab. Physics of Plasmas
● High Energy Physics: Shedding Light on Luminous Blue Variables. NERSC PI: Yan-Fei Jiang, UC Santa Barbara. Nature
● Nuclear Physics: Enabling Science Discovery for STAR. NERSC PI: J. Porter, Berkeley Lab. J. Phys.: Conference Series
NERSC Systems
Cori: 9,600 Intel Xeon Phi “KNL” manycore nodes; 2,000 Intel Xeon “Haswell” nodes; 700,000 processor cores; 1.2 PB memory; Cray XC40 with Aries Dragonfly interconnect
Storage hierarchy:
● 2 PB Burst Buffer, 1.5 TB/s
● 28 PB Scratch, 700 GB/s
● 7 PB /project (transitioning to a 60-250 PB Community File System, /cfs), 136 GB/s
● 275 TB /home, 5 GB/s
● HPSS Tape Archive, ~200 PB, 50 GB/s
Network and services: Ethernet & IB fabric; science-friendly security; production monitoring; power efficiency; WAN connectivity at 2 x 10 Gb/s and 2 x 100 Gb/s with SDN; DTNs, Spin, and gateways
Allocation of Computing Time 2019 (NERSC hours in millions)
● DOE Mission Science: 6,834 M hours (80%), distributed by DOE Office of Science program managers
● ALCC: 650 M hours (10%), competitive awards run by the DOE Advanced Scientific Computing Research office
● Director's Discretionary: 650 M hours (10%), strategic awards from NERSC
NERSC supports important mission science, not just projects that have optimized codes.
Allocation Breakdown
Top Climate Projects 2019 (PI, project, NERSC hours):
● Ruby Leung, PNNL: Energy Exascale Earth Systems Modeling (E3SM), 424 M
● Travis O'Brien, LBNL: Calibrated and Systematic Characterization, Attribution and Detection of Extremes, 51 M
● F. Hoffman, ORNL: Reducing Uncertainties in Biogeochemical Interactions through Synthesis and Computation (RUBISCO), 48 M
● Susan Bates, NCAR: Climate Change Simulations with CESM: Moderate and High Resolution Studies, 30 M
● Ruby Leung, PNNL: Water Cycle and Climate Extremes Modeling (WACCEM), 21 M
Hours by organization: PNNL, 545 M; Berkeley Lab, 73 M; Oak Ridge, 49 M; NCAR, 32 M
Also: Gil Compo, U. Colorado: Ocean Atmosphere Reanalyses for Climate Applications (OARCA) 1806-2018, 60 M hours used
Recent CMIP6 hackathon!
NERSC’s Next System: Optimized for Science
● Cray Shasta system providing 3-4x the capability of Cori
● First NERSC system designed to meet the needs of both large-scale simulation and data analysis from experimental facilities
○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
○ Cray Slingshot high-performance network will support Terabit-rate connections
○ Optimized data software stack enabling analytics and ML at scale
○ All-flash file system for I/O acceleration
● Robust readiness program for simulation, data, and learning applications and complex workflows
● Delivery in late 2020
NERSC-9 will be named after Saul Perlmutter
• Shared the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe
• The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis
• Login at “saul.nersc.gov”
GPU nodes
● 4x NVIDIA “Volta-next” GPUs (Volta specs: >7 TF, >32 GiB HBM2, NVLink)
● 1x AMD “Milan” CPU
● 4 Slingshot connections (4 x 25 GB/s)
● GPUDirect and Unified Virtual Memory (UVM) support; a UVM sketch follows below
● 2-3x Cori performance/capability in the GPU nodes
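The following is not from the original slides: a minimal C++ sketch of what UVM buys the programmer, assuming an OpenMP 5.0 compiler and runtime with unified (managed) memory support on the GPU. With the unified_shared_memory requirement, a plain host allocation can be used inside an offloaded region without explicit map clauses.

    // uvm_sketch.cpp -- illustrative only; assumes OpenMP 5.0 offload
    // with unified shared memory (i.e., UVM) support.
    #include <cstdio>
    #include <cstdlib>

    // Ask the runtime for a single unified address space (OpenMP 5.0).
    #pragma omp requires unified_shared_memory

    int main() {
        const int n = 1 << 20;
        double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
        for (int i = 0; i < n; ++i) x[i] = 1.0;

        // No map() clauses: pages migrate between CPU and GPU on demand.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) x[i] *= 2.0;

        std::printf("x[0] = %.1f\n", x[0]);   // expect 2.0
        std::free(x);
        return 0;
    }

Without unified memory, the same loop would need explicit map(tofrom: x[0:n]) data movement, as in the offload example later in this deck.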
CPU nodes
● AMD “Milan” CPU: ~64 cores; “Zen 3” cores (7nm+); AVX2 SIMD (256-bit)
● 8 channels of DDR memory; >= 256 GiB total per node
● 1 Slingshot connection (1 x 25 GB/s)
● ~1x full Cori system performance/capability in the CPU nodes
● Tuning for Cori KNL is highly relevant
Why GPUs?
[Microprocessor trends chart (image generated thanks to Jeff Preshing, https://github.com/preshing/analyze-spec-benchmarks): single-thread performance growth has slowed across eras from ~+60%/year to +20%/year to +2%/year, while transistor (processing element) and core counts keep rising and frequency and power have flattened.]
• Single-thread (single-core) performance is flat, driven by power and heat-dissipation constraints.
• Performance gains are being provided by combining light-weight, low-power cores on a single chip.
• Examples: Intel Xeon Phi “Knights Landing” and Graphics Processing Units (GPUs)
• Requires programming for fine-grained parallelism on many-core and/or GPU processors (a minimal sketch follows below).
• There is a performance advantage for codes that can take advantage of GPUs & power-efficient processors.
• For the near and intermediate future, only GPUs provide a general-purpose accelerator that continues Moore’s Law-like performance growth.
Q: Why bother? A: To address new science challenges.
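As a concrete (and purely illustrative, not from the original slides) example of the fine-grained parallelism these bullets call for, here is a SAXPY loop in C++ decomposed with OpenMP across cores and SIMD lanes, the style of tuning that targets a many-core CPU like KNL; the problem size is arbitrary.

    // saxpy_manycore.cpp -- illustrative sketch of thread + SIMD parallelism.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;                // illustrative problem size
        const float a = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);

        // Split one loop across many cores (threads), then across the
        // SIMD lanes within each core.
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        return 0;
    }

Built with, for example, g++ -O2 -fopenmp saxpy_manycore.cpp. On a GPU the same loop would instead be offloaded, as sketched later in this deck.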
Is the NERSC Workload Ready?
Breakdown of hours at NERSC, 2017-18, by GPU readiness:
● Enabled (37%): most of the code has been ported
● Kernels (10%): ports of some kernels have been documented
● Proxy (20%): kernels in related codes have been ported
● Unlikely (13%): a GPU port would require major effort
● Unknown (20%): GPU readiness cannot be assessed at this time
A good fraction of the workload has at least started.
Perlmutter Productivity Considerations
• Engage with top code teams to optimize apps for GPUs
• Communicate lessons learned to the broad community
• Support familiar, productive programming methods and tools
• Include features to support emerging data and learning needs
• CPU-only nodes for workflows and codes not amenable to GPUs
• All-flash scratch file system for bandwidth and metadata performance
• Ethernet-compatible high-speed network for external data connectivity and connection to edge services
NESAP: Application Readiness
NESAP is NERSC's Application Readiness Program for new systems. It started with Cori and continues with Perlmutter.
Strategy: Partner with application teams and vendors to optimize participating applications, and share lessons learned with the NERSC community via documentation and training.
Resources available to teams: NERSC staff development effort and expertise, performance postdocs, access to vendor application engineers, hack-a-thons, and early access to hardware (GPU nodes on Cori and Perlmutter).
Simulation: 14 application teams. Data: 9 applications. Learning: 5 applications.
NERSC is trying to replicate the success of NESAP for Cori.
Hack-a-Thons
● Quarterly GPU hackathons from 2019-2021
● ~3 apps per hackathon
● 6-week prep with performance engineers, leading up to 1 week of hackathon
● Tutorials throughout the week on different topics
○ OpenMP/OpenACC, Kokkos, CUDA, etc.
○ Profiler techniques and advanced tips
○ GPU hardware characteristics, best known practices
NERSC Liaisons
NERSC has steadily built up a team of application performance experts who are excited to work with you:
● Jack Deslippe: Apps Performance Lead, NESAP Lead
● Brandon Cook: Simulation Area Lead
● Rollin Thomas: Data Area Lead
● Thorsten Kurth: Learning Area Lead
● Brian Friesen: Cray/NVIDIA COE Coordinator
● Charlene Yang: Tools/Libraries Lead
● Zhengji Zhao, Helen He, Stephen Leak, Doug Doerfler, Woo-Sun Yang, Kevin Gott, Lisa Gerhardt, Jonathan Madsen, Rahul Gayatri, Wahid Bhimji, Mustafa Mustafa, Steve Farrell, Chris Daley, Mario Melara
Two Tiers of NESAP Support
Benefit (Tier 1 / Tier 2):
● Early access to Perlmutter: yes / yes
● Hack-a-thon with vendors: yes / eligible
● Training resources: yes / yes
● Additional NERSC hours from Director's Reserve: yes / eligible
● NERSC-funded postdoctoral fellow: eligible / no
● One-on-one staff assistance: yes / eligible
● Commitment of minimum effort from app team: yes / no
Number of applications in tier: 29 / 28
NESAP: 14 Simulation Teams (PI, institution, application, area)
● Josh Meyers, LLNL: ImSim (High Energy Physics, HEP)
● Carleton DeTar and Balint Joo, Utah / JLab: USQCD (MILC, DWF, Chroma, etc.) (HEP and Nuclear Physics, NP)
● Noel Keen and Mark Taylor, SNL / LBNL: E3SM (Climate)
● David Green, ORNL: ASGarD (Adaptive Sparse Grid Discretization) (Fusion Energy and CS)
● Mauro Del Ben, LBNL: BerkeleyGW (Materials)
● Pieter Maris, Iowa State: Many-Fermion Dynamics for nuclear physics (MFDn) (Nuclear Structure)
● Hubertus van Dam, BNL: NWChemEx (Chemistry)
● David Trebotich, LBNL: Chombo-Crunch (Subsurface Flow, Chemistry)
● Marco Govoni, ANL: WEST (Materials)
● Annabella Selloni, Robert DiStasio, and Roberto Car, Princeton / Cornell: Quantum ESPRESSO (Molecular Dynamics)
● Emad Tajkhorshid, UIUC: NAMD (Molecular Dynamics)
● CS Chang, PPPL: WDMAPP (Fusion Energy)
● Danny Perez, LANL: LAMMPS (Molecular Dynamics)
● Ann Almgren and Jean-Luc Vay, LBNL: AMReX / WarpX (Accelerator Design)
NESAP: 9 Data Applications (PI, institution, application, area)
● Maria Elena Monzani, SLAC: NextGen SW Libraries for LZ (Dark Matter)
● Kjiersten Fagnan, JGI: JGI-NERSC-KBase FICUS (Genomics)
● Kathy Yelick, LBNL: Exabiome (ECP) (Genomics)
● Stephen Bailey, LBNL: DESI (Dark Energy)
● Julian Borrill, LBNL: TOAST (Cosmic Microwave Background)
● Doga Gursoy, ANL: TomoPy (Tomography)
● Amedeo Perazzo, SLAC: ExaFEL (ECP) (Imaging)
● Paolo Calafiura, LBNL: ATLAS (High Energy Physics)
● Dirk Hufnagel, LBNL: CMS (High Energy Physics)
NESAP: 5 Learning Projects (PI, institution, application, area)
● Christine Sweeney, LANL: ExaLearn Light Source Application (Light Sources)
● Marc Day, LBNL: FlowGAN (GAN)
● Shinjae Yoo, BNL: Extreme Scale Spatio-Temporal Learning (LSTNet) (Learning)
● Benjamin Nachman and Jean-Roch Vlimant, Caltech: Accelerating High Energy Physics Simulation with Machine Learning (High Energy Physics)
● Zachary Ulissi, CMU: Deep Learning Thermochemistry for Catalyst Composition Discovery/Optimization (Chemistry)
Perlmutter has an All-Flash File System
[Diagram: CPU + GPU nodes and the logins/DTNs/workflow nodes connect to the all-flash Lustre storage (4.0 TB/s to Lustre, >10 TB/s overall), alongside a ~200 PB, ~500 GB/s Community FS, with Terabits/sec to ESnet]
• Fast across many dimensions
o 4 TB/s sustained bandwidth
o 7,000,000 IOPS
o 3,200,000 file creates/sec
• Usable for NERSC users
o 30 PB usable capacity
o Familiar Lustre interfaces
o New data movement capabilities
• Optimized for NERSC data workloads
o NEW small-file I/O improvements
o NEW features for high-IOPS, non-sequential I/O
OpenMP for Productivity
● Supports C, C++, and Fortran
○ The NERSC workload consists of ~700 applications with a relatively equal mix of C, C++, and Fortran
● Provides portability to different architectures at other DOE labs
● Works well with MPI: the hybrid MPI+OpenMP approach is successfully used in many codes that run at NERSC
● The recently released OpenMP 5.0 specification is the third version to provide features for accelerators (a minimal offload example follows below)
○ Many refinements over this five-year period
● But will OpenMP work with future accelerators?
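To make the accelerator features concrete, here is a minimal sketch (ours, not the slide authors') of OpenMP target offload with explicit data mapping; the directives shown are standard OpenMP 4.5/5.0 and should work with any conforming offload compiler.

    // offload_sketch.cpp -- minimal OpenMP GPU offload with explicit mapping.
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        const float a = 2.0f;
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Copy x and y to the device, distribute the loop over teams of
        // GPU threads, and copy y back when the region ends.
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        return 0;
    }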
NRE Partnership with PGI/NVIDIA
• Agreed upon a subset of OpenMP features to be included in the PGI compiler
• OpenMP test suite created with micro-benchmarks, mini-apps, and the ECP SOLLVE V&V suite (a sketch of this style of test follows below)
• 5 NESAP application teams will partner with PGI to add OpenMP target-offload directives
• Alpha compiler is due soon (limited access); closed beta, April 2020; open beta, October 2020 (greater access)
• We want to hear from the larger community: tell us your experience, including what OpenMP techniques worked or failed on the GPU.
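As a hedged illustration of what such a micro-benchmark can look like (a sketch in the spirit of offload verification tests, not actual SOLLVE code): run the same loop on host and device and compare the results.

    // offload_check.cpp -- sketch of a correctness micro-benchmark:
    // compute on the device, then verify against a host reference.
    #include <cstdio>

    int main() {
        const int n = 4096;
        static double host_ref[4096], dev_out[4096];

        for (int i = 0; i < n; ++i)           // host reference
            host_ref[i] = 2.5 * i + 1.0;

        #pragma omp target teams distribute parallel for map(from: dev_out[0:n])
        for (int i = 0; i < n; ++i)           // same kernel, offloaded
            dev_out[i] = 2.5 * i + 1.0;

        int errors = 0;                        // exact match expected here
        for (int i = 0; i < n; ++i)
            if (dev_out[i] != host_ref[i]) ++errors;
        std::printf("%s (%d mismatches)\n", errors ? "FAIL" : "PASS", errors);
        return errors != 0;
    }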
Engaging around Performance Portability
• NERSC recently joined as a member
• NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
• Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
• NERSC is leading development of performanceportability.org
• NERSC hosted past C++ Summit and ISO C++ meetings on HPC and math libraries
Superfacility: a network of connected scientific facilities, software, and expertise to enable new modes of discovery
[Diagram: experimental facilities coupled to computing facilities over the network, with expertise in math, fast implementation, and analysis and data management]
What NERSC brings:
● Large-scale computing and storage resources
● Expertise optimizing pipelines & workflows
● Reusable building blocks & APIs
● Advanced scheduling and resource allocation
● Scalable infrastructure to launch services
● Edge services and external connectivity
Production Software Stack for Data & Learning
[Matrix of capabilities and the supported technologies behind each: Data Transfer + Access; Workflows (e.g., TaskFarmer); Data Management; Data Analytics; Data Visualization]
NERSC Systems Roadmap
● 2013: NERSC-7 “Edison”, multicore CPU
● 2016: NERSC-8 “Cori”, manycore CPU; NESAP launched to transition applications to advanced architectures
● 2020: NERSC-9 “Perlmutter”, CPU and GPU nodes; continued transition of applications and support for complex workflows
● 2024: NERSC-10, exa system
● 2028: NERSC-11, Beyond Moore
Increasing need for energy-efficient architectures
Coming in late 2020
Questions?
iCAS 2019, September 9, 2019