
Page 1:

Approved for public release

Spack and the U.S. Exascale Computing Project

Todd Gamblin
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory

HPC Knowledge Meeting (HPCKP’19)
Barcelona, Spain
June 14, 2019

Page 2:

What is the Exascale Computing Project (ECP)?

ECP is an accelerated research and development project funded by the US Department of Energy (DOE) to ensure all the necessary pieces are in place to deliver the nation’s first capable exascale ecosystem, including mission-critical applications, an integrated software stack, and advanced computer system engineering and hardware components.

Page 3:

ECP by the Numbers

• A seven-year, $1.7 billion R&D effort that launched in 2016
• 6 core DOE national laboratories: Argonne, Oak Ridge, Berkeley, Lawrence Livermore, Los Alamos, Sandia
  – Staff from most of the 17 DOE national laboratories take part in the project
• More than 100 top-notch R&D teams
• 3 technical focus areas: Application Development, Software Technology, and Hardware and Integration

[Infographic: 7 years · $1.7B · 6 core DOE labs · 3 technical focus areas · 100 R&D teams · 1,000 researchers]

Page 4:

Exascale machines will support a wide range of science applications

• Additive Manufacturing (ExaAM)
• Climate (E3SM)
• Magnetic Fusion (WDMApp)
• Modular Nuclear Reactors (ExaSMR)
• Wind Energy (ExaWind)
• Subsurface (GEOS)
• Urban Systems (Urban)
• Compressible Flow (MARBL)
• Combustion (Nek5000)

Page 5:

Department of Energy (DOE) Roadmap to Exascale
(Top500 ranks shown in parentheses. All use accelerated nodes: GPU, Xeon Phi, etc.)

• 2012: Sequoia (10) at LLNL (IBM BG/Q); Mira (21) at ANL (IBM BG/Q); Titan (9) at ORNL (Cray/AMD/NVIDIA)
• 2016: Cori (12) at LBNL (Cray/Intel Xeon/KNL); Trinity (6) at LANL/SNL (Cray/Intel Xeon/KNL); Theta (24) at ANL (Cray/Intel KNL)
• 2018: Summit (1) at ORNL (IBM/NVIDIA); Sierra (2) at LLNL (IBM/NVIDIA)
• 2020: NERSC-9 Perlmutter at LBNL (Cray/AMD/NVIDIA)
• 2021–2023, the first U.S. exascale systems: Aurora at ANL (Intel/Cray); systems at ORNL, LLNL, and LANL/SNL (TBD)

The pre-exascale systems above have an aggregate Linpack (Rmax) of 323 PF!

Page 6:

The Summit System @ Oak Ridge National Laboratory (#1 on Top 500)

System specs:
• Peak of 200 petaflops (FP64) for modeling & simulation
• Peak of 3.3 ExaOps (FP16) for data analytics and artificial intelligence
• Max power 13 MW

Each node has:
• 2 IBM POWER9 processors
• 6 NVIDIA Tesla V100 GPUs
• 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
• 1.6 TB of NV memory

The system includes:
• 4,608 nodes
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM file system transferring data at 2.5 TB/s

Page 7:

The Sierra System @ LLNL (#2 on Top 500)

System specs:
• Peak performance of 125 petaflops for modeling and simulation
• Memory: 1.38 petabytes
• 8,640 Central Processing Units (CPUs)
• 17,280 Graphics Processing Units (GPUs)
• Power consumption: 11 megawatts

Each node has:
• 2 IBM POWER9 processors
• 4 NVIDIA Tesla V100 GPUs
• 320 GiB of fast memory (256 GiB DDR4 + 64 GiB HBM2)
• 1.6 TB of NVMe memory

The system includes:
• 4,320 nodes
• 2:1 tapered Mellanox EDR InfiniBand tree topology (50% global bandwidth) with a dual-port HCA per node
• 154 PB IBM Spectrum Scale file system with 1.54 TB/s R/W bandwidth

Sierra supports LLNL’s national security mission and its ability to advance science in the public interest.

Page 8:

Open source projects lay the foundation for DOE simulations

[Diagram of categories and example projects:]
• Parallel Programming Models: OpenMPI, MPICH, Kokkos, RAJA
• Packaging/Build: Spack
• Meshing / Finite Elements: MFEM, CHOMBO
• Also: Parallel Solvers, Filesystems & I/O, Resource Managers, Scientific Visualization

Page 9:

Several challenges make the HPC software ecosystem a bit thornier than the mainstream:

1. Software complexity
   – Many languages: C, C++, Fortran, Python, Lua, others
   – Many low-level libraries (MPI, BLAS, LAPACK)
2. Architectural diversity
   – Living on the bleeding edge means more native code, constant porting and tuning
   – Simulation codes can live for decades!
3. Security requirements
   – Strict rules around user accounts and two-factor authentication
   – Difficult to deploy services to automate tasks

Page 10:

Software complexity in HPC is growing

Nalu: Generalized Unstructured Massively Parallel Low Mach Flow

Page 11:

Software complexity in HPC is growing

dealii: C++ Finite Element Library
Nalu: Generalized Unstructured Massively Parallel Low Mach Flow

Page 12:

Software complexity in HPC is growing

R Miner: R Data Mining Library
dealii: C++ Finite Element Library
Nalu: Generalized Unstructured Massively Parallel Low Mach Flow

Page 13:

Even proprietary codes are based on many open source libraries

• Half of this DAG is externally developed OSS (blue); more than half of it is open source
• Nearly all of it needs to be built specially for HPC to get the best performance
  – We need to optimize not only our own code, but also other projects’ code, for new architectures

Page 14:

The ECP software environment is enormously complex

• Every application has its own stack of dependencies.
• Developers, users, and facilities dedicate (many) FTEs to building & porting.
• Often trade reuse and usability for performance.

  15+ applications
× 80+ software packages
× 5+ target architectures (Xeon, Xeon Phi, Power, ARM, NVIDIA, laptops?)
× up to 7 compilers (Intel, GCC, Clang, XL, PGI, Cray, NAG)
× 10+ programming models (OpenMPI, MPICH, MVAPICH, OpenMP, CUDA, OpenACC, Dharma, Legion, RAJA, Kokkos)
× 2-3 versions of each package + external dependencies
= ~1,260,000 combinations (15 × 80 × 5 × 7 × 10 × 3)!

Complexity makes software reuse difficult!

Page 15:

Spack: A flexible package manager for HPC

github.com/spack/spack · @spackpm

• Over 2,000 monthly active users worldwide; over 3,100 packages
• Over 400 contributors from labs, academia, and industry
• Automates the build and installation of scientific software with hundreds of dependencies

No installation required: clone and go:

$ git clone https://github.com/spack/spack
$ spack install hdf5

Simple syntax enables complex installs:

$ spack install hdf5
$ spack install hdf5 %clang
$ spack install hdf5 +threadsafe
$ spack install hdf5 cppflags="-O3 -g3"
$ spack install hdf5 target=haswell
$ spack install hdf5 +mpi ^mpich

• Target audiences:
  – Users: easily rely on others’ applications, manage environments
  – Developers: combinatorial testing, container builds, dependency management
  – HPC facilities: build/deploy multi-compiler, multi-MPI stacks + modules (“Spack Stacks”)
• Used at many top HPC centers (DOE sites, LRZ, CEA, NERSC, others)
  – Reduced deploy time for 1,300 packages on Summit from 2 weeks to 12 hours
• Also has traction outside HPC: the HEP, bio, and R communities, others

Page 16:

Spack provides the spec syntax to describe custom configurations

• Each expression is a spec for a particular configuration
  – Each clause adds a constraint to the spec
  – Constraints are optional – specify only what you need
  – Customize an install on the command line!
• Spec syntax is recursive
  – Full control over the combinatorial build space

$ spack install mpileaks                              unconstrained
$ spack install mpileaks@3.3                          @ custom version
$ spack install mpileaks@3.3 %gcc@4.7.3               % custom compiler
$ spack install mpileaks@3.3 %gcc@4.7.3 +threads      +/- build option
$ spack install mpileaks@3.3 cxxflags="-O3 -g3"       setting compiler flags
$ spack install mpileaks@3.3 os=cnl10 target=haswell  setting target for cross-compile
$ spack install mpileaks@3.3 ^mpich@3.2 %gcc@4.9.3    ^ dependency information

Page 17:

• Spack has over 3,200 built-in package recipes.

`spack list` shows what packages are available

$ spack list
==> 3041 packages.
abinit         glew           nalu        py-fastaindex     r-cairo     r-viridislite
abyss          glfmultiples   nalu-wind   py-fasteners      r-callr     r-visnetwork
accfft         glib           namd        py-faststructure  r-car       r-vsn
ack            glibmm         nano        py-filelock       r-caret     r-webshot
activeharmony  glimmer        nanoflann   py-fiona          r-category  r-whisker
. . .
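You can also filter that list, or inspect one package’s metadata; for example (output omitted):

$ spack list py-        # only packages whose names match "py-"
$ spack info kripke     # versions, variants, and dependencies of one package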

Page 18:

`spack find` shows what is installed

• All the versions coexist! Multiple versions of the same package are OK.
• Packages are installed so that they automatically find their correct dependencies.
• Binaries work regardless of the user’s environment.
• Spack also generates module files, but you don’t have to use them.

$ spack find
==> 103 installed packages.
[Output: the 103 package@version entries, grouped into five sections of the form
"-- linux-rhel7-x86_64 / <compiler>@<version> --"; entries include jpeg@9a,
libdwarf@20130729, and ocr@2015-02-16]
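`spack find` also accepts spec constraints as queries, so you can slice the installed set; a few illustrative queries:

$ spack find mpich              # restrict to one package
$ spack find %gcc               # everything built with gcc
$ spack find libdwarf@20130729  # a specific version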

Page 19:

Users can query the full dependency configuration of installed packages.

• Architecture, compiler, versions, and variants may differ between builds.

$ spack find callpath
==> 2 installed packages.
[Output: two callpath installations shown side by side, one per compiler, under
headers of the form "-- linux-rhel7-x86_64 / <compiler>@<version> --"]

Expand dependencies with spack find -d:

$ spack find -dl callpath
==> 2 installed packages.
[Output: the same two installations, with -l showing each package’s hash prefix
(e.g. xv2clz2, udltshs) and -d showing the full dependency tree
(e.g. g2mxrl2 ^libdwarf@20130729)]

Page 20:

Spack packages are templates: they use a simple Python DSL to define how to build.

Callouts in the code below: metadata at the class level; base package (CMake support); versions; variants (build options); dependencies (note: same spec syntax); install logic in instance methods. Not shown: patches, resources, conflicts, and other directives.

from spack import *

class Kripke(CMakePackage):
    """Kripke is a simple, scalable, 3D Sn deterministic particle
    transport proxy/mini app."""

    homepage = "https://computation.llnl.gov/projects/co-design/kripke"
    url = "https://computation.llnl.gov/projects/co-design/download/kripke-openmp-1.1.tar.gz"

    version('1.2.3', sha256='3f7f2eef0d1ba5825780d626741eb0b3f026a096048d7ec4794d2a7dfbe2b8a6')
    version('1.2.2', sha256='eaf9ddf562416974157b34d00c3a1c880fc5296fce2aa2efa039a86e0976f3a3')
    version('1.1',   sha256='232d74072fc7b848fa2adc8a1bc839ae8fb5f96d50224186601f55554a25f64a')

    variant('mpi',    default=True, description='Build with MPI.')
    variant('openmp', default=True, description='Build with OpenMP enabled.')

    depends_on('mpi', when='+mpi')
    depends_on('cmake@3.0:', type='build')

    def cmake_args(self):
        # Translate variants into CMake options.
        return [
            '-DENABLE_OPENMP=%s' % ('+openmp' in self.spec),
            '-DENABLE_MPI=%s' % ('+mpi' in self.spec),
        ]

    def install(self, spec, prefix):
        # Kripke does not provide an install target, so we have to
        # copy things into place ourselves.
        mkdirp(prefix.bin)
        install('../spack-build/kripke', prefix.bin)

You don’t typically need install() for a CMakePackage, but it lets us work around codes that don’t have an install target.
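Variants flow straight through to the build: toggling one on the install command flips the corresponding CMake argument. A usage sketch:

$ spack install kripke~mpi       # '+mpi' not in the spec, so cmake_args() yields -DENABLE_MPI=False
$ spack install kripke ^mpich    # satisfy the virtual 'mpi' dependency with MPICH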

Page 21:

Concretization fills in missing parts of requested specs.

• Workflow:
  1. Users input only an abstract spec with some constraints
  2. Spack makes choices according to policies (site/user/etc.)
  3. Spack installs concrete configurations of the package + its dependencies
• Concrete specs have unique hashes representing their configurations
  – Combinatorially many configurations can be generated from the same set of package recipes
  – Gives users the freedom to experiment with options on different platforms
  – Allows users to tune performance without having to build manually

Example abstract spec: mpileaks ^callpath@1.0+debug ^libelf@0.8.11
The concretized spec is fully constrained and can be passed to install.
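You can inspect the concretizer’s choices without installing anything: `spack spec` prints the fully concrete spec (the exact versions chosen depend on site policy):

$ spack spec mpileaks ^callpath@1.0+debug
[Prints the abstract input, then the concrete tree, with every package pinned to a
version, compiler, architecture, and set of variants]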

Page 22:

Spack is being used on many of the top HPC systems

• At HPC sites for software stacks + modules
  – Reduced Summit deploy time from 2 weeks to 12 hrs.
  – EPFL deploys its software stack with Jenkins + Spack
  – NERSC, LLNL, ANL, and other US DOE sites
  – SJTU in China
• Within ECP as part of its software release process
  – ECP-wide software distribution
  – SDK workflows
• Within the High Energy Physics (HEP) community
  – HEP sites (Fermilab, CERN) have contributed many features to support their workflows
• Many others

[Pictured: Summit (ORNL), Sierra (LLNL), Cori (NERSC), SuperMUC-NG (LRZ), and 富岳 Fugaku, formerly Post-K (RIKEN)]

Page 23:

Spack has benefitted tremendously from its growing community of contributors

• We try to make it easy to modify a package
  – spack edit <package>
  – Submit a pull request
• Contributors are HPC software developers as well as user support teams and admins
• We get contributions in the core as well as in packages
• LLNL still has a majority of the core contributions, with significant help from others.

Page 24:

GitHub stars: Spack and some other popular HPC projects

[Plots: stars over time, and stars per day (same data, 60-day window)]

Page 25:

The Spack team gave a tutorial at RIKEN this April

• First Spack tutorial in Japan (full day!)
• 45 attendees (35 local, 10 remote)
  – Many traveled to Kobe to attend this tutorial
  – RIKEN, universities, and industry were represented
• Also met with RIKEN and Fujitsu staff to discuss collaborations
• We hope this and future events help to increase our contributor base in Japan

Page 26:

We have set up infrastructure to crowd-source Spack documentation translation

• We want to lower the barriers for Japanese contributors
  – Not just researchers: also staff at Fujitsu and other companies
• Need to make the documentation accessible to onboard new users
• We’ve set up a repo that implements continuous translation
  – Translations in any number of languages can be maintained alongside Spack
  – Translations are fine-grained, so if one part of the English version is updated, only that part of the translation goes stale.
• Since the tutorial, we have pushed our first translated page to the new, internationalized Spack docs site: spack.readthedocs.io/ja/latest
  – Anyone can contribute to the repo
  – We hope this helps Spack succeed in Japan!

Page 27:

Spack v0.12.1 was released in November 2018

• Major new features:
  1. Spack environments
  2. spack.yaml and spack.lock files for tracking dependencies (covered today)
  3. Custom configurations via the command line (covered today)
  4. Better support for linking Python packages into view directories (pip in views)
  5. Support for uploading build logs to CDash
  6. Packages have more control over compiler flags via flag handlers
  7. Better support for module file generation
  8. Better support for Intel compilers, Intel MPI, etc.
  9. Many performance improvements, improved startup time
• Spack is now permissively licensed under Apache-2.0 or MIT (previously LGPL)
• Over 2,900 packages (as of November; over 3,200 in the latest develop branch)

Page 28:

Spack has added environments and spack.yaml / spack.lock

• Allows developers to bundle Spack configuration with their repository
• Can also be used to maintain configuration together with Spack packages
  – E.g., versioning your own local software stack with consistent compilers/MPI implementations
• The manifest/lockfile model pioneered by Bundler is becoming standard
  – spack.yaml describes the project’s requirements
  – spack.lock describes exactly what versions/configurations were installed, and allows them to be reproduced

[Diagram: a project’s spack.yaml names its required dependencies; building the environment installs the dependency packages and generates a concrete spack.lock recording the exact versions installed]
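A minimal manifest might look like the sketch below (the package choices are illustrative); installing the environment produces the matching lockfile:

$ cat spack.yaml
spack:
  specs:
  - hdf5+mpi
  - zlib
$ spack -e . install    # concretizes and builds the specs, writing spack.lock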

Page 29:

Spack environments help with building containers

• We recently started providing base images on DockerHub with Spack preinstalled.
• It is very easy to build a container with some Spack packages in it (see the sketch below):

[Diagram of spack-docker-demo/: a Dockerfile starts from a base image with Spack in PATH, copies in spack.yaml (the list of packages to install, with constraints), then runs spack install. Build with docker build .; run with Singularity (or some other tool).]
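A sketch of the Dockerfile this pattern implies (the base-image tag and paths are illustrative, not from the slide):

$ cat Dockerfile
# Base image with Spack preinstalled and in PATH (tag illustrative)
FROM spack/ubuntu-bionic
# Copy in the environment manifest listing the packages to build
COPY spack.yaml /app/spack.yaml
# Concretize and install everything in the manifest
RUN spack -e /app install
$ docker build -t my-spack-stack .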

Page 30:

Spack is the delivery platform for the ECP software stack

• The U.S. Exascale Computing Project (ECP) will release software through Spack
• Software in the ECP stack needs to run on ECP platforms, testbeds, clusters, and laptops
  – Each new environment requires effort.
• ECP asks us to build a robust, reliable, and easy-to-use software stack
• We will provide the infrastructure necessary to make this tractable:
  1. A dependency model that can handle HPC software
  2. A hub for coordinated software releases (like xSDK)
  3. Build and test automation for large packages across facilities
  4. Hosted binary and source software distributions for all ECP HPC platforms

Page 31:

There are many activities around Spack within ECP

[Diagram: Spack at the center of ECP software integration, connecting Software Technologies (Software Dev Kits (SDKs), the Extreme-scale Scientific Software Stack (E4S)), software packaging technologies (containers), applications, and Hardware Integration at the facilities (ECP continuous integration, facility deployment, build pipelines, Spack stacks)]

Page 32:

ECP is working towards a periodic, hierarchical release process

• In ECP, teams increasingly need to ensure that their libraries and components work together
  – Historically, HPC codes used very few dependencies
• Now, groups of teams work together on small releases of “Software Development Kits” (SDKs)
• SDKs will be rolled into a larger, periodic release.

[Diagram: each SDK (Math Libraries, Visualization, Programming Models, ...) runs its own Develop → Package → Build → Test → Deploy pipeline; these are integrated into E4S, the ECP-wide software release (https://e4s.io), with its own Build → Test → Deploy cycle]

Page 33:

Builds under ECP will be automated with continuous integration

[Diagram: regular binary builds + tests in GitLab CI at LLNL, LANL, SNL, ANL, ORNL, and NERSC feed a public binary package repo that serves Spack users]

Page 34:

Security poses challenges for automation at large, multi-user HPC centers.

1. Difficult to run persistent services (like CI systems)
   – HPC workloads are mostly batch jobs with fixed time limits, and HPC centers are built around batch
   – Persistent services are difficult to deploy due to data security requirements
2. CI-like automation requires running arbitrary code
   – Often in response to external repository check-ins
   – How do we know who ran the code?
   – How do we trust users, and whom do we blame if the code is malicious?
   – Two-factor authentication prevents automated ingress from outside
3. All tasks at most HPC centers need to run as some user
   – Can’t allow different users’ jobs to share data
   – Need isolation between jobs run by user A and jobs run by user B
   – Can’t have unauthenticated services listening on arbitrary ports

Page 35:

Through ECP, we are working with Onyx Point to deliver continuous integration for HPC centers

• CI at HPC centers is notoriously difficult
  – Security concerns prevent most CI tools from being run by staff or by users
  – HPC centers really need to deploy trusted CI services for this to work
• We are developing a secure CI system for HPC centers:
  – Setuid runners (run CI jobs as users); batch integration (similar, but for parallel jobs); multi-center runner support
• Onyx Point will upstream this support into GitLab CI
  – Initial rollout in FY19 at ECP labs: ANL, ORNL, NERSC, LLNL, LANL, SNL
  – Upstreamed GitLab features can be used by anyone!

[Diagram: a user checks out / commits code through two-factor authentication; fast mirroring feeds trusted setuid and batch runners at the HPC facility]

Page 36:

Milestone 1: SetUID runner (completed)

GitLab and runners are trusted; runners run CI jobs as users.

Enhancements over current GitLab CI:
• Facilities deploy and maintain trusted runners (not users)
• Runners run as the user who pushed / submitted the PR
  – Can also specify team users to run as
• Facilities set whitelists and blacklists for both users and groups, per runner
  – Final authority on whom to run as rests with the runner

On a normal GitLab instance, there would not be sufficient isolation between runners to meet the needs of HPC sites.

(Diagram source: www.gitlab.com)

Page 37:

Milestone 2: Batch runner (completed)

Batch runners are special SetUID runners: they run as users, submit jobs to the batch system, and don’t block while jobs run.

Enhancements over regular SetUID runners:
• Runners use the batch system, and do not block while running jobs
  – Allows sites to leverage all their HPC resources
• Integration with SLURM, LSF, and Cobalt batch systems
• Users can specify parallel resource requests in .gitlab-ci.yml

(Diagram source: www.gitlab.com)

Page 38:

Final form of DOE GitLab runners will be multi-site

• The current implementation requires a trust relationship between GitLab and the runners
  – Requires GitLab and runner hosts to have the same usernames
• We will allow runners to be associated with OAuth domains
  – Allows SetUID to work across domains
  – Users will be able to run CI jobs across their many DOE facility accounts
• Also looking into how we can allow mirroring from GitHub

[Diagram: a DOE-wide GitLab server mirrors repositories to runners at LLNL, ANL, and ORNL]

Page 39:

Onyx Point is working to upstream these features to GitLab

• SetUID runners are generally usable at other sites
  – GitLab is interested in integrating this feature into their product
  – Other features TBD
• We have tried to make the ECP work general enough to release
  – Target simplicity: try not to be HPC-specific unless we have to
• We expect open source contributions from ECP to have a lasting effect
  – Any HPC site will be able to do this with GitLab, not just the labs

Page 40:

Spack stacks: combinatorial environments for entire facility deployments

• Allow users to easily express a huge cross-product of specs
  – All the packages needed for a facility
  – Generate modules tailored to the site
  – Generate a directory layout to browse the packages
• Build on the environments workflow (see the sketch after this list)
  – Manifest + lockfile
  – Lockfile enables reproducibility
• Relocatable binaries allow the same binary to be used in a stack, a regular install, or a container build
  – The difference is how the user interacts with the stack: single-PATH stack vs. modules
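As a sketch of the kind of manifest this enables, an environment can define named lists of specs and expand them as a matrix (the package choices and compiler versions here are illustrative):

$ cat spack.yaml
spack:
  definitions:
  - packages: [hdf5+mpi, zlib]
  - compilers: ['%gcc@8.3.0', '%clang@8.0.0']
  specs:
  # cross-product: every package built with every compiler
  - matrix:
    - [$packages]
    - [$compilers]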

Page 41:

Specific architecture target information in specs (in progress)

• As an HPC package manager, we want to provide optimized builds
  – Code-level choices (-O2, -O3)
  – Architecture-specific choices (-mcpu=cortex-a7, -march=haswell)
• Architectures vary as to how much they expose features to users
  – x86 exposes feature sets in /proc/cpuinfo
  – Arm hides many features behind a revision number
• Methods for accessing architecture optimizations vary by both compiler and architecture
  – GCC -mcpu vs. -march, for example
  – Relies on architectures providing a programmatic way to get information
• We want to expose the names users understand
  – thunderx2, cortex-a7 for Arm
  – power8, power9 for IBM
  – haswell, skylake for Intel
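Once this lands, a user should be able to pin a microarchitecture directly in a spec and let Spack choose the matching flags per compiler; an illustrative spec:

$ spack install zlib target=thunderx2   # e.g., mapped to -mcpu=thunderx2t99 for GCC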

Page 42:

Software Deployment in ECP Includes Many Efforts

Facility integration:
• Standardizing on a common package manager (Spack)
• Implementing build automation across HPC sites
• Trying to balance simple deployment with the complexity of the ecosystem

Automating build and deployment:
• ECP is building key infrastructure
• Working to bring more cloud-like services and automation to HPC
• Continuous integration

Towards regular releases:
• Socializing a release process with researchers and scientists
• Bringing teams together to do better integration testing
• Regular ECP-wide releases

Contributions to OSS projects:
• [Pictured: Spack contributors]