Perlmutter: A 2020 Pre-Exascale GPU-Accelerated System for Simulation, Data, and Learning
iCAS 2019, September 9, 2019
Richard Gerber, NERSC Senior Science Advisor and HPC Department Head
Many thanks to Nicholas J. Wright, Perlmutter Chief Architect
NERSC: Mission HPC for DOE Office of Science Research
The DOE Office of Science is the largest funder of physical science research in the U.S.
7,000 users, 800 projects, 700 codes, 50 states, 40 countries, universities & national labs
Research areas: Bio Energy, Environment, Computing; Particle Physics, Astrophysics; Nuclear Physics; Materials, Chemistry, Geophysics; Fusion Energy, Plasma Physics
NERSC’s Users Produce Groundbreaking Science: 2,500 refereed publications per year
● Materials Science: Revealing Reclusive Mechanisms for Solar Cells. NERSC PI: C. Van de Walle, UC Santa Barbara. ACS Energy Letters
● Earth Sciences: Simulations Probe Antarctic Ice Vulnerability. NERSC PIs: D. Martin and E. Ng, Berkeley Lab; S. Price, LANL. Geophysical Research Letters
● Advanced Computing: Scalable Machine Learning in HPC. NERSC PI: L. Oliker, Berkeley Lab. 21st International Conference on AI and Statistics
● Plasma Physics: Plasma Propulsion Systems for Satellites. NERSC PI: I. Kaganovich, Princeton Plasma Physics Lab. Physics of Plasmas
● High Energy Physics: Shedding Light on Luminous Blue Variables. NERSC PI: Yan-Fei Jiang, UC Santa Barbara. Nature
● Nuclear Physics: Enabling Science Discovery for STAR. NERSC PI: J. Porter, Berkeley Lab. J. Phys.: Conference Series
NERSC Systems
Cori: 9,600 Intel Xeon Phi “KNL” manycore nodes; 2,000 Intel Xeon “Haswell” nodes; 700,000 processor cores; 1.2 PB memory; Cray XC40 with Aries Dragonfly interconnect
Storage hierarchy:
● 2 PB Burst Buffer, 1.5 TB/s
● 28 PB Scratch, 700 GB/s
● 7 PB /project (transitioning to a 60-250 PB Community File System, /cfs), 136 GB/s
● 275 TB /home, 5 GB/s
● HPSS Tape Archive, ~200 PB, 50 GB/s
Network and services: Ethernet & IB fabric; science-friendly security; production monitoring; power efficiency; WAN connectivity at 2 x 10 Gb/s and 2 x 100 Gb/s with SDN; DTNs, Spin, and gateways
Allocation of Computing Time 2019 (NERSC hours in millions)
● DOE Mission Science: 6,834 M hours (80%), distributed by DOE Office of Science program managers
● ALCC: 650 M hours (10%), competitive awards run by the DOE Advanced Scientific Computing Research office
● Director's Discretionary: 650 M hours (10%), strategic awards from NERSC
NERSC supports important mission science, not just projects that have optimized codes.
Allocation Breakdown
Top Climate Projects 2019 (PI, project, NERSC hours):
● Ruby Leung, PNNL: Energy Exascale Earth Systems Modeling (E3SM), 424 M
● Travis O'Brien, LBNL: Calibrated and Systematic Characterization, Attribution and Detection of Extremes, 51 M
● F. Hoffman, ORNL: Reducing Uncertainties in Biogeochemical Interactions through Synthesis and Computation (RUBISCO), 48 M
● Susan Bates, NCAR: Climate Change Simulations with CESM: Moderate and High Resolution Studies, 30 M
● Ruby Leung, PNNL: Water Cycle and Climate Extremes Modeling (WACCEM), 21 M
Hours by organization: PNNL, 545 M; Berkeley Lab, 73 M; Oak Ridge, 49 M; NCAR, 32 M
Also: Gil Compo, U. Colorado: Ocean Atmosphere Reanalyses for Climate Applications (OARCA) 1806-2018, 60 M hours used
Recent CMIP6 hackathon!
NERSC’s Next System: Optimized for Science
● Cray Shasta system providing 3-4x the capability of Cori
● First NERSC system designed to meet the needs of both large-scale simulation and data analysis from experimental facilities
○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
○ Cray Slingshot high-performance network will support Terabit-rate connections
○ Optimized data software stack enabling analytics and ML at scale
○ All-flash file system for I/O acceleration
● Robust readiness program for simulation, data, and learning applications and complex workflows
● Delivery in late 2020
NERSC-9 will be named after Saul Perlmutter
• Shared the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe
• The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis
• Login at “saul.nersc.gov”
GPU nodes
● 4x NVIDIA “Volta-next” GPUs (Volta specs: >7 TF, >32 GiB HBM2, NVLink)
● 1x AMD “Milan” CPU
● 4 Slingshot connections (4 x 25 GB/s)
● GPUDirect and Unified Virtual Memory (UVM) support; a UVM sketch follows below
● 2-3x Cori performance/capability in the GPU nodes
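The following is not from the original slides: a minimal C++ sketch of what UVM buys the programmer, assuming an OpenMP 5.0 compiler and runtime with unified (managed) memory support on the GPU. With the unified_shared_memory requirement, a plain host allocation can be used inside an offloaded region without explicit map clauses.

    // uvm_sketch.cpp -- illustrative only; assumes OpenMP 5.0 offload
    // with unified shared memory (i.e., UVM) support.
    #include <cstdio>
    #include <cstdlib>

    // Ask the runtime for a single unified address space (OpenMP 5.0).
    #pragma omp requires unified_shared_memory

    int main() {
        const int n = 1 << 20;
        double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
        for (int i = 0; i < n; ++i) x[i] = 1.0;

        // No map() clauses: pages migrate between CPU and GPU on demand.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) x[i] *= 2.0;

        std::printf("x[0] = %.1f\n", x[0]);   // expect 2.0
        std::free(x);
        return 0;
    }

Without unified memory, the same loop would need explicit map(tofrom: x[0:n]) data movement, as in the offload example later in this deck.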
CPU nodes
● AMD “Milan” CPU: ~64 cores; “Zen 3” cores (7nm+); AVX2 SIMD (256-bit)
● 8 channels of DDR memory; >= 256 GiB total per node
● 1 Slingshot connection (1 x 25 GB/s)
● ~1x full Cori system performance/capability in the CPU nodes
● Tuning for Cori KNL is highly relevant
Why GPUs?
[Microprocessor trends chart (image generated thanks to Jeff Preshing, https://github.com/preshing/analyze-spec-benchmarks): single-thread performance growth has slowed across eras from ~+60%/year to +20%/year to +2%/year, while transistor (processing element) and core counts keep rising and frequency and power have flattened.]
• Single-thread (single-core) performance is flat, driven by power and heat-dissipation constraints.
• Performance gains are being provided by combining light-weight, low-power cores on a single chip.
• Examples: Intel Xeon Phi “Knights Landing” and Graphics Processing Units (GPUs)
• Requires programming for fine-grained parallelism on many-core and/or GPU processors (a minimal sketch follows below).
• There is a performance advantage for codes that can take advantage of GPUs & power-efficient processors.
• For the near and intermediate future, only GPUs provide a general-purpose accelerator that continues Moore’s Law-like performance growth.
Q: Why bother? A: To address new science challenges.
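As a concrete (and purely illustrative, not from the original slides) example of the fine-grained parallelism these bullets call for, here is a SAXPY loop in C++ decomposed with OpenMP across cores and SIMD lanes, the style of tuning that targets a many-core CPU like KNL; the problem size is arbitrary.

    // saxpy_manycore.cpp -- illustrative sketch of thread + SIMD parallelism.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;                // illustrative problem size
        const float a = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);

        // Split one loop across many cores (threads), then across the
        // SIMD lanes within each core.
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        return 0;
    }

Built with, for example, g++ -O2 -fopenmp saxpy_manycore.cpp. On a GPU the same loop would instead be offloaded, as sketched later in this deck.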
Is the NERSC Workload Ready?
Breakdown of hours at NERSC, 2017-18, by GPU readiness:
● Enabled (37%): most of the code has been ported
● Kernels (10%): ports of some kernels have been documented
● Proxy (20%): kernels in related codes have been ported
● Unlikely (13%): a GPU port would require major effort
● Unknown (20%): GPU readiness cannot be assessed at this time
A good fraction of the workload has at least started.
Perlmutter Productivity Considerations
• Engage with top code teams to optimize apps for GPUs
• Communicate lessons learned to the broad community
• Support familiar, productive programming methods and tools
• Include features to support emerging data and learning needs
• CPU-only nodes for workflows and codes not amenable to GPUs
• All-flash scratch file system for bandwidth and metadata performance
• Ethernet-compatible high-speed network for external data connectivity and connection to edge services
NESAP: Application Readiness
NESAP is NERSC's Application Readiness Program for new systems. It started with Cori and continues with Perlmutter.
Strategy: Partner with application teams and vendors to optimize participating applications, and share lessons learned with the NERSC community via documentation and training.
Resources available to teams: NERSC staff development effort and expertise, performance postdocs, access to vendor application engineers, hack-a-thons, and early access to hardware (GPU nodes on Cori and Perlmutter).
Simulation: 14 application teams. Data: 9 applications. Learning: 5 applications.
NERSC is trying to replicate the success of NESAP for Cori.
Hack-a-Thons
● Quarterly GPU hackathons from 2019-2021
● ~3 apps per hackathon
● 6-week prep with performance engineers, leading up to 1 week of hackathon
● Tutorials throughout the week on different topics
○ OpenMP/OpenACC, Kokkos, CUDA, etc.
○ Profiler techniques and advanced tips
○ GPU hardware characteristics, best known practices
NERSC Liaisons
NERSC has steadily built up a team of application performance experts who are excited to work with you:
● Jack Deslippe: Apps Performance Lead, NESAP Lead
● Brandon Cook: Simulation Area Lead
● Rollin Thomas: Data Area Lead
● Thorsten Kurth: Learning Area Lead
● Brian Friesen: Cray/NVIDIA COE Coordinator
● Charlene Yang: Tools/Libraries Lead
● Zhengji Zhao, Helen He, Stephen Leak, Doug Doerfler, Woo-Sun Yang, Kevin Gott, Lisa Gerhardt, Jonathan Madsen, Rahul Gayatri, Wahid Bhimji, Mustafa Mustafa, Steve Farrell, Chris Daley, Mario Melara
Two Tiers of NESAP Support
Benefit (Tier 1 / Tier 2):
● Early access to Perlmutter: yes / yes
● Hack-a-thon with vendors: yes / eligible
● Training resources: yes / yes
● Additional NERSC hours from Director's Reserve: yes / eligible
● NERSC-funded postdoctoral fellow: eligible / no
● One-on-one staff assistance: yes / eligible
● Commitment of minimum effort from app team: yes / no
Number of applications in tier: 29 / 28
NESAP: 14 Simulation Teams (PI, institution, application, area)
● Josh Meyers, LLNL: ImSim (High Energy Physics, HEP)
● Carleton DeTar and Balint Joo, Utah / JLab: USQCD (MILC, DWF, Chroma, etc.) (HEP and Nuclear Physics, NP)
● Noel Keen and Mark Taylor, SNL / LBNL: E3SM (Climate)
● David Green, ORNL: ASGarD (Adaptive Sparse Grid Discretization) (Fusion Energy and CS)
● Mauro Del Ben, LBNL: BerkeleyGW (Materials)
● Pieter Maris, Iowa State: Many-Fermion Dynamics for nuclear physics (MFDn) (Nuclear Structure)
● Hubertus van Dam, BNL: NWChemEx (Chemistry)
● David Trebotich, LBNL: Chombo-Crunch (Subsurface Flow, Chemistry)
● Marco Govoni, ANL: WEST (Materials)
● Annabella Selloni, Robert DiStasio, and Roberto Car, Princeton / Cornell: Quantum ESPRESSO (Molecular Dynamics)
● Emad Tajkhorshid, UIUC: NAMD (Molecular Dynamics)
● CS Chang, PPPL: WDMAPP (Fusion Energy)
● Danny Perez, LANL: LAMMPS (Molecular Dynamics)
● Ann Almgren and Jean-Luc Vay, LBNL: AMReX / WarpX (Accelerator Design)
NESAP: 9 Data Applications (PI, institution, application, area)
● Maria Elena Monzani, SLAC: NextGen SW Libraries for LZ (Dark Matter)
● Kjiersten Fagnan, JGI: JGI-NERSC-KBase FICUS (Genomics)
● Kathy Yelick, LBNL: Exabiome (ECP) (Genomics)
● Stephen Bailey, LBNL: DESI (Dark Energy)
● Julian Borrill, LBNL: TOAST (Cosmic Microwave Background)
● Doga Gursoy, ANL: TomoPy (Tomography)
● Amedeo Perazzo, SLAC: ExaFEL (ECP) (Imaging)
● Paolo Calafiura, LBNL: ATLAS (High Energy Physics)
● Dirk Hufnagel, LBNL: CMS (High Energy Physics)
NESAP: 5 Learning Projects (PI, institution, application, area)
● Christine Sweeney, LANL: ExaLearn Light Source Application (Light Sources)
● Marc Day, LBNL: FlowGAN (GAN)
● Shinjae Yoo, BNL: Extreme Scale Spatio-Temporal Learning (LSTNet) (Learning)
● Benjamin Nachman and Jean-Roch Vlimant, Caltech: Accelerating High Energy Physics Simulation with Machine Learning (High Energy Physics)
● Zachary Ulissi, CMU: Deep Learning Thermochemistry for Catalyst Composition Discovery/Optimization (Chemistry)
Perlmutter has an All-Flash File System
[Diagram: CPU + GPU nodes and the logins/DTNs/workflow nodes connect to the all-flash Lustre storage (4.0 TB/s to Lustre, >10 TB/s overall), alongside a ~200 PB, ~500 GB/s Community FS, with Terabits/sec to ESnet]
• Fast across many dimensions
o 4 TB/s sustained bandwidth
o 7,000,000 IOPS
o 3,200,000 file creates/sec
• Usable for NERSC users
o 30 PB usable capacity
o Familiar Lustre interfaces
o New data movement capabilities
• Optimized for NERSC data workloads
o NEW small-file I/O improvements
o NEW features for high-IOPS, non-sequential I/O
OpenMP for Productivity
● Supports C, C++, and Fortran
○ The NERSC workload consists of ~700 applications with a relatively equal mix of C, C++, and Fortran
● Provides portability to different architectures at other DOE labs
● Works well with MPI: the hybrid MPI+OpenMP approach is successfully used in many codes that run at NERSC
● The recently released OpenMP 5.0 specification is the third version to provide features for accelerators (a minimal offload example follows below)
○ Many refinements over this five-year period
● But will OpenMP work with future accelerators?
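To make the accelerator features concrete, here is a minimal sketch (ours, not the slide authors') of OpenMP target offload with explicit data mapping; the directives shown are standard OpenMP 4.5/5.0 and should work with any conforming offload compiler.

    // offload_sketch.cpp -- minimal OpenMP GPU offload with explicit mapping.
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        const float a = 2.0f;
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Copy x and y to the device, distribute the loop over teams of
        // GPU threads, and copy y back when the region ends.
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        std::printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        return 0;
    }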
NRE Partnership with PGI/NVIDIA
• Agreed upon a subset of OpenMP features to be included in the PGI compiler
• OpenMP test suite created with micro-benchmarks, mini-apps, and the ECP SOLLVE V&V suite (a sketch of this style of test follows below)
• 5 NESAP application teams will partner with PGI to add OpenMP target-offload directives
• Alpha compiler is due soon (limited access); closed beta, April 2020; open beta, October 2020 (greater access)
• We want to hear from the larger community: tell us your experience, including what OpenMP techniques worked or failed on the GPU.
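As a hedged illustration of what such a micro-benchmark can look like (a sketch in the spirit of offload verification tests, not actual SOLLVE code): run the same loop on host and device and compare the results.

    // offload_check.cpp -- sketch of a correctness micro-benchmark:
    // compute on the device, then verify against a host reference.
    #include <cstdio>

    int main() {
        const int n = 4096;
        static double host_ref[4096], dev_out[4096];

        for (int i = 0; i < n; ++i)           // host reference
            host_ref[i] = 2.5 * i + 1.0;

        #pragma omp target teams distribute parallel for map(from: dev_out[0:n])
        for (int i = 0; i < n; ++i)           // same kernel, offloaded
            dev_out[i] = 2.5 * i + 1.0;

        int errors = 0;                        // exact match expected here
        for (int i = 0; i < n; ++i)
            if (dev_out[i] != host_ref[i]) ++errors;
        std::printf("%s (%d mismatches)\n", errors ? "FAIL" : "PASS", errors);
        return errors != 0;
    }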
Engaging around Performance Portability
• NERSC recently joined as a member
• NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
• Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
• NERSC is leading development of performanceportability.org
• NERSC hosted past C++ Summit and ISO C++ meetings on HPC and math libraries
Superfacility: a network of connected scientific facilities, software, and expertise to enable new modes of discovery
[Diagram: experimental facilities coupled to computing facilities over the network, with expertise in math, fast implementation, and analysis and data management]
What NERSC brings:
● Large-scale computing and storage resources
● Expertise optimizing pipelines & workflows
● Reusable building blocks & APIs
● Advanced scheduling and resource allocation
● Scalable infrastructure to launch services
● Edge services and external connectivity
Production Software Stack for Data & Learning
[Matrix of capabilities and the supported technologies behind each: Data Transfer + Access; Workflows (e.g., TaskFarmer); Data Management; Data Analytics; Data Visualization]
NERSC Systems Roadmap
● 2013: NERSC-7 “Edison”, multicore CPU
● 2016: NERSC-8 “Cori”, manycore CPU; NESAP launched to transition applications to advanced architectures
● 2020: NERSC-9 “Perlmutter”, CPU and GPU nodes; continued transition of applications and support for complex workflows
● 2024: NERSC-10, exa system
● 2028: NERSC-11, Beyond Moore
Increasing need for energy-efficient architectures
Coming in late 2020
Questions?
iCAS 2019, September 9, 2019