TRANSCRIPT
Dr. Robert W. Wisniewski
Chief Software Architect Extreme Scale Computing, Intel
November 15, 2017
Session Agenda and Objective
• OpenHPC
– Value
– Goals
– Background
– Update
• Intel’s component work planned for submission to OpenHPC
Value from Community
• Stable HPC System Software that:
– Fuels a vibrant and efficient HPC software ecosystem
– Removes duplication of effort throughout the community
– Simplifies the installation, configuration, and ongoing maintenance of a custom software stack
– Takes advantage of hardware innovation and drives revolutionary technologies
– Eases traditional HPC application development and testing at scale
– Provides a development environment for new workloads (ML, analytics, big data, cloud)
A Shared Repository
OpenHPC - Mission and Vision
Mission: to provide a reference collection of open-source HPC software components and best practices, lowering barriers to deployment, advancement, and use of modern HPC methods and tools.
Vision: OpenHPC components and best practices will enable and accelerate innovation and discoveries by broadening access to state-of-the-art, open-source HPC methods and tools in a consistent environment, supported by a collaborative, worldwide community of HPC users, developers, researchers, administrators, and vendors.
Recent article by Adrian Reber, OpenHPC TSC and Red Hat: https://opensource.com/article/17/11/openhpc
OpenHPC: a brief History...
June 2015 – ISC’15: BoF on the merits of/interest in a community effort
Nov 2015 – SC’15: initial v1.0 release; gather interested parties to work with the Linux Foundation
June 2016 – ISC’16: v1.1.1 release; Linux Foundation announces technical leadership, founding members, and governance
Nov 2016 – SC’16: v1.2 release; BoF
June 2017 – ISC’17: v1.3.1 release; BoF
Nov 2017 – SC’17: v1.3.3 release; BoF
Current Project Members
Interested in participating as a member? Please contact Jeff ErnstFriedman
Mixture of academics, labs, and industry
WWW.OpenHPC.Community
OpenHPC Stack Overview
OpenHPC v1.3.3 – current S/W components (functional area: components)
Base OS: CentOS 7.4, SLES12 SP3
Architecture: aarch64, x86_64
Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, Spack, test-suite
Provisioning: Warewulf, xCAT
Resource Management: SLURM, Munge, PBS Professional, PMIx
Runtimes: OpenMP, OCR, Singularity
I/O Services: Lustre client, BeeGFS client
Numerical/Scientific Libraries: Boost, GSL, FFTW, Hypre, Metis, Mumps, OpenBLAS, PETSc, PLASMA, Scalapack, Scotch, SLEPc, SuperLU, SuperLU_Dist, Trilinos
I/O Libraries: HDF5 (pHDF5), NetCDF/pNetCDF (including C++ and Fortran interfaces), Adios
Compiler Families: GNU (gcc, g++, gfortran), Clang/LLVM
MPI Families: MVAPICH2, OpenMPI, MPICH
Development Tools: Autotools, cmake, hwloc, mpi4py, R, SciPy/NumPy, Valgrind
Performance Tools: PAPI, IMB, mpiP, pdtoolkit, TAU, Scalasca, ScoreP, SIONLib
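To make the development environment concrete, here is a minimal sketch of an MPI + OpenMP "hello" in C; it exercises a compiler family, an MPI family, and the OpenMP runtime from the table above. Build details (e.g., loading GNU and Open MPI modules via Lmod and compiling with mpicc -fopenmp) vary by site and are assumptions rather than part of the OpenHPC material.

/* Minimal MPI + OpenMP "hello" exercising the OpenHPC development
 * environment: compiler family, MPI family, and OpenMP runtime.
 * Example build (site-specific): mpicc -fopenmp hello.c -o hello */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Each OpenMP thread reports its placement within the MPI job. */
        printf("rank %d/%d, thread %d/%d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}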
Basic Cluster Install Example
• Starting install guide/recipe targeted for flat hierarchy
• Leverages image-based provisioner: Warewulf or xCAT
• PXE boot (stateful or stateless)
• Optionally connect external Lustre* or BeeGFS parallel file system
• Need hardware-specific information to support (remote) bare-metal provisioning
Target System Design
Large systems have a considerable number of Service Nodes (SNs)
– SMS – System Management Server
– Row/Rack Controllers
– I/O Nodes (IONs)
– Specialized servers for Fabric Manager (FM), Workload Manager/Resource Manager (RM), Database (DB)
For flexibility, the control system will target a “pool” of service nodes
– i.e., the control system has a mechanism to prefer service nodes for efficiency, affinity, or for necessary characteristics when assigning a particular function
– But, the control system has flexibility so it can spread work over many nodes for performance and resiliency
For max application performance, Compute Nodes (CNs) are avoided for execution of control system software
– Reduce noise; reduce footprint overhead in CNs
– Can leverage CNs when noise is ok or expected
– job control operations (start, kill, etc)
– activity between jobs
This approach encourages Scalable Unit packaging concepts
Focus on Core System Software Components
• mOS – Scalable operating systems
• Unified Control System – Unified, productive (single pane of glass), reliable
• DAOS – Distributed Asynchronous Object Store
• MPI – Scalable, high performance, topology optimized
• GEOPM – Global Extensible Open Power Manager
• PMIx – Process management with “Instant On”
mOS High-Level Architecture
• LWK performance for HPC applications
• Nimble to adapt to new technology
• Linux compatibility
• Better contained containers
Data Access Interface (diagram)
• DAI Framework (SCON, data tier access, daemon mgmt, resilient workitems, etc.)
• Providers: Security, Provisioner, Fabric Manager, Low-Level Control, Monitor, Workload Manager, Inventory, Service, Alerts, RAS, Operator Interface, SCON
• Operator Interface (Web & REST)
• Underlying components (SLURM, Warewulf, OPA FM, Actsys, Sensys, SCON): guided by DAI providers, but generally operate autonomously
• Offline data tier (ELK/Splunk)
Unified Control System: system data (diagram)
• Inventory: racks, drawers, chassis, blades, boards, switches, CPUs, cores, memory, devices, nodes, cables, BMCs, PDUs, CDUs, rectifiers
• Configuration: OS images, software components
• Environmental: temperature, voltage, current, coolant flow
• RAS: events, alerts
• Operations: jobs, provisioning, service ops
• Service, inventory, environmental data, alerts, and RAS flow into the online tier (VoltDB) and nearline tier (PostgreSQL)
Unified Control System
• Provide a comprehensive system view
• Advance the state of the art
• Build upon existing work
• Support the system lifecycle
• Scale-out object store designed from the ground up for next-gen storage & fabric technologies
– High throughput/IOPS @arbitrary alignment/size
– Byte addressable for better application scalability
– no read-modify-write or false block sharing
– OS bypass with lightweight client/server
– Small memory footprint & low CPU usage
• Advanced storage API
– New scalable storage model suitable for both structured & unstructured data
– Non-blocking data & metadata operations
– Metadata & data query/indexing
Distributed Asynchronous Object Storage
DAOS: Open Source, Apache 2.0 License
DAOS ecosystem (diagram): traditional HPC apps and Big Data & AI apps layer over DAOS via NetCDF, HDF5, MPI-IO, POSIX I/O, SCR, FTI, VeloC, Dataspaces, Spark RDD, Apache Arrow, and (No)SQL interfaces
• Software-defined storage platform
– Flexible storage provisioning
– Predictable performance/capacity
– Ease of management
• Seamless integration with Lustre
– Single unified namespace
DAOS architecture (diagram): storage tiers spanning NVRAM, NVMe, and HDD; Argobots thread model; Mercury function shipping
MPICH-OFI
• Open-source implementation based on MPICH
• Uses the new CH4 infrastructure
– Co-designed with the MPICH community
• Targets existing and new fabrics via the next-gen Open Fabrics Interface (OFI)
– Ethernet/sockets, Intel® Omni-Path, Cray Aries*, IBM BG/Q*, InfiniBand*
• Improving performance
– Topology-aware collectives and process mapping
– Optimizations for newer networks
– Thread performance enhancements
– Support for automated configuration
– Added support for latest libfabric 1.5 features
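One place the thread-related work becomes visible to applications is MPI_THREAD_MULTIPLE support. The sketch below is plain MPI (not an MPICH-OFI-specific API); it simply requests full thread support and checks what the library grants, which depends on how the MPI library was built and the fabric in use.

/* Request full multithreaded MPI support and report what is granted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* Concurrent MPI calls from multiple threads are not safe here. */
        printf("MPI_THREAD_MULTIPLE not available (provided = %d)\n", provided);
    } else {
        printf("Full multithreaded MPI support available\n");
    }

    MPI_Finalize();
    return 0;
}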
GEOPM – Global Extensible Open Power Manager
• Runtime for in-band power management and optimization
– On-the-fly monitoring of hardware counters and application profiling
– Feedback-guided optimization of hardware control knob settings
• Open-source software (flexible BSD three-clause license)
• Extensible and portable through a plugin architecture
– Enables portability beyond x86 architectures (truly open)
– Enables rapid prototyping of new power management strategies
– Accommodates the reality that different sites have different constraints and preferences for performance vs. energy savings
• Designed for holistic optimization across a whole job
– Job-wide global optimization of HW control knob settings
– Application awareness for maximum speedup or energy savings
– Scalable via a distributed, tree-hierarchical design and algorithms
GEOPM architecture (diagram): a power-aware RM/scheduler interacts with a GEOPM Controller tree (Root, Aggregators, Leaves) communicating over an MPI comms overlay; each Leaf shares a memory region (SHM) with the MPI ranks on its processor and accesses MSRs via msr-safe (or other drivers for non-x86 platforms)
Project URL: http://geopm.github.io/geopm
Contact: [email protected]
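As a rough illustration of how an application exposes work to the GEOPM runtime, the sketch below uses the geopm_prof_* markup calls (register a region, mark enter/exit, and report an epoch per outer iteration). The header, hint constant, and launch mechanism are assumptions that can differ between GEOPM releases; consult the project documentation above for the exact interface.

/* Hedged sketch of GEOPM application markup. Assumes the geopm_prof_*
 * interface exported by geopm.h; link with -lgeopm and launch under the
 * GEOPM controller so the runtime can act on the reported regions. */
#include <geopm.h>
#include <mpi.h>
#include <stdint.h>

static void compute_step(void)
{
    /* Placeholder for the application kernel GEOPM optimizes around. */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        x += i * 1.0e-9;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    uint64_t region_id = 0;
    geopm_prof_region("compute_step", GEOPM_REGION_HINT_COMPUTE, &region_id);

    for (int step = 0; step < 100; ++step) {
        geopm_prof_enter(region_id);
        compute_step();
        geopm_prof_exit(region_id);
        geopm_prof_epoch();   /* one epoch per outer iteration */
    }

    MPI_Finalize();
    return 0;
}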
What is PMIx? (timeline)
• PMI-1/PMI-2: wireup support, dynamic spawn, key-value publish/lookup (MPICH, SLURM, ALPS, and other RMs; OMPI, Spectrum, OSHMEM, SOS, PGAS, and others)... years go by...
• 2015: exascale systems on the horizon; launch times long; new paradigms
• 2016: PMIx v1.2; orchestration; exascale launch in < 10 s (SLURM, JSM RMs; OMPI, Spectrum, OSHMEM)
• 2017: PMIx v2.x; exascale launch in < 30 s (SLURM, JSM, other RMs)
Three Distinct Entities
• PMIx Standard
– Defined set of APIs and attribute strings
– Nothing about implementation
• PMIx Reference Library
– A full-featured implementation of the Standard
– Intended to ease adoption
• PMIx Reference Server
– Full-featured “shim” to a non-PMIx RM
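A hedged sketch of what a PMIx client looks like against the v2.x reference library: initialize, ask the resource manager for the job size via PMIx_Get, and finalize. The calls follow the standard APIs referenced above; error handling and attribute usage are simplified, and the program must be launched under a PMIx-aware RM or the reference server.

/* Minimal PMIx 2.x-style client sketch: init, query job size, finalize. */
#include <pmix.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;

    if (PMIx_Init(&myproc, NULL, 0) != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Init failed (not under a PMIx-aware RM?)\n");
        return 1;
    }

    /* Ask the RM for the number of processes in this namespace/job. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;

    if (PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val) == PMIX_SUCCESS) {
        printf("rank %u of %u\n", myproc.rank, val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}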
Backup
Hierarchical Overlay for OpenHPC software
OpenHPC Development Infrastructure
• The ‘usual’ software engineering stuff:
– GitHub (SCM and issue tracking/planning)
– Continuous Integration (CI) testing (Jenkins)
– Documentation (LaTeX)
• Capable build/packaging system
– At present, we target a common delivery/access mechanism that builds on Linux sysadmin familiarity
• Requires a flexible system to manage builds
– A system using the Open Build Service (OBS) backed by git
OpenHPC Build System: OBS
• Manage Build Process
• Drive Builds for multiple repositories
• Repeatable builds
• Generate binary and src rpms
• Publish corresponding package repositories
• Client/server architecture supports distributed build slaves and multiple architectures
OpenHPC Integration/Testing/Validation
• Install recipes
• Cross-package interaction
• Development environment
• Mimic use cases common in HPC deployments
• Upgrade mechanism
OpenHPC Integration/Test/Validation
• Standalone integration test infrastructure
• Families of tests that could be used during:
– Initial install process (can we build a system?)
– Post-install process (does it work?)
– Developing tests that touch all of the major components (can we compile against 3rd-party libraries, will they execute under the resource manager, etc.?)
• Expectation is that each new component included will need corresponding integration test collateral
• Integration tests are included in the GitHub repo
• Global testing harness includes a number of embedded subcomponents