Managing Power Efficiency of HPC Applications with Variorum and GEOPM

Tapasya Patki, Stephanie Brink, Aniruddha Marathe, Barry Rountree
David Lowenthal (U. Arizona), Jonathan Eastep (Intel)

ECP Tutorial
Feb 4, 2020, 2:30 PM-6:00 PM

LLNL-PRES-804125. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC


Agenda

• Part I: Overview of GEOPM (15 minutes)
  • High-level design
  • User-facing, application-context markup API
  • Demonstrations (10 minutes)

• Part II: Plug-ins to extend GEOPM algorithm and platform support (30 minutes)
  • Agent: Run-time tuning extension
  • PlatformIO: Platform-specific support extension
  • Demonstrations (10 minutes)

• Part III: ECP Argo Contributions (30 minutes)
  • ConductorAgent: Transparent, performance-optimizing configuration selection
  • IBM PlatformIO plugin: Port of GEOPM to IBM Power9 + Nvidia platform

• Questions/Discussion (10 minutes)


Part I: Hands-on Tutorial on GEOPM

Background: System Software Stack for Power Management

Power bounds are inherited from each level by the level below it.

• Site: Demand response, renewables. Software: Dashboards
• Cluster: Overprovisioning, job scheduling. Software: RMAP, P-SLURM, PowSched
• Job/Application: Adaptive runtimes, power balancing. Software: GEOPM, Conductor, PShifter, ...
• Node: Measurement and control (capping). Software: LibMSR, msr-safe

§ Critical contribution to the development of the HPC power-aware system software stack.


Power-Constrained Performance-Optimization Problem

Problem definition

Given a job-level power constraint and number of nodes, how do we optimize application performance?


GEOPM: Global Extensible Open Power Manager

• Power-aware runtime system for large-scale HPC systems

• Intel developed a production-grade, scalable, open-source job-level extensible runtime and framework

• Extensibility through plug-ins + advanced default functionality

• Limitations of existing runtimes
  • Research-based codes addressed specific needs and situations
  • Ad hoc, targeted specific architecture, memory model
  • Suffered scalability issues
  • Reliance on empirical data

• Funded through a contract with Argonne National Laboratory


GEOPM Project Goals

§ Managing power
  • Maximizing power efficiency or performance under a power cap

§ Managing manufacturing variation
  • Power / frequency relationship is non-uniform across different processors of same type

§ Managing workload imbalance
  • Divert power to CPUs with more work

§ Managing system jitter
  • Divert power to CPUs interrupted or stalled by system noise

§ Application profiling
  • Report application performance and power metrics

§ Runtime application tuning
  • Extensible runtime control agent with plug-in architecture

§ Integration with MPI
  • Automatic integration with MPI runtime through PMPI interface

§ Integration with OpenMP
  • Automatic integration with OpenMP through OMPT interface


GEOPM System Model

[Figure: GEOPM system model. Elements: Site Admin, User, Application Developer, Power-aware Job Scheduler, Job Monitor, Job Optimizer, User Applications, GEOPM, per-node trace file, and system hardware (sensors, controls, actuators). Legend: extensible/plug-in components; human actors; external components (h/w, s/w, files).]


GEOPM: Capabilities

§ Enables analysis and transparent tuning of distributed-memory applications

§ Feedback-guided optimization: Leverages lightweight application profiling

§ Learns application phase patterns: load imbalance across nodes, distinct computational phases within a node

§ Uses tuning parameters: processor power limit, core frequency, etc.

§ Built-in optimization algorithms: Static power capping, energy reduction, load balancing, limiting synchronization costs


GEOPM Components of Interest

[Figure: GEOPM Core (hierarchical communication + plugin infrastructure) surrounded by four components of interest, numbered 1 to 4 in the figure: Agent, PlatformIO, Markup API (Application), and Endpoint (4).]


GEOPM Infrastructure

[Figure: GEOPM Core (hierarchical communication + power-management plugin) with two plugin types: Agent Plugin and PlatformIO Plugin.]


GEOPM Infrastructure

• GEOPM source repository navigation
  • Branches, directories, releases
  • GEOPM Wiki

• Build process
  • Dependencies
  • Build configuration

• GEOPM core infrastructure source
  • Overview of important classes
  • Plug-in source
  • Tutorials and examples
  • Test coverage


GEOPM: Component Communication

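[Figure: GEOPM component communication. The user submits a job to SLURM (Resource Manager, Spank Plugin). The GEOPM Runtime contains the Endpoint, Controller, Agent, and PlatformIO, which sits above the HW Interface (OS). Legend: signal and control flow; component creation; GEOPM component.]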


GEOPM: Input/Output Files

[Figure: GEOPM input/output files. Elements: Policy, Controller, Application Profile, Agent, PlatformIO, Report/Trace, and HW Interface (OS), grouped under the GEOPM Runtime. Legend: signal and control flow; component creation; GEOPM component; I/O files.]


GEOPM Configuration, Build and Launch


Building an Application with GEOPM

Step 1: Set the environment
$> module load geopm
$> module load <intel compiler>
$> module load <MPI compiled with intel-c>

Step 2: Link the application to the GEOPM library
$> mpicc APP_SRC.c -L$GEOPM_LIB -lgeopm \
       -o APP_EXEC \
       COMPILER_FLAGS

Example:
$> mpicc helloworld.c -L$GEOPM_LIB -lgeopm -o a.out


Running an Application with GEOPM

Step 3: Generate a policy file
$> geopmagent --agent=AGENT_NAME --policy=INPUT_PARAMS > POLICY_FILE.json

Example:
$> geopmagent --agent=monitor --policy=None > monitor_policy.json

Step 4: Launch the application with the GEOPM launcher wrapper
$> geopmlaunch srun -n < > -N < > \
       --geopm-ctl=process \
       --geopm-agent=AGENT_NAME \
       --geopm-policy=POLICY_FILE.json \
       --geopm-report=REPORT_FILE.txt \
       --geopm-trace=TRACE_FILE.csv \
       -- APP_EXEC APP_OPTIONS

Example:
$> geopmlaunch srun -n 4 -N 1 \
       --geopm-ctl=process \
       --geopm-agent=monitor \
       --geopm-policy=monitor_policy.json \
       --geopm-report=report.txt \
       --geopm-trace=trace.csv \
       -- a.out
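When the Step 4 launch line is generated from a job script, it can help to assemble it programmatically. The following is a minimal sketch, assuming only the options shown in Step 4; the helper name `build_geopmlaunch_cmd` and its parameters are hypothetical and not part of GEOPM. It only builds the argument list and does not launch anything.

```python
import shlex


def build_geopmlaunch_cmd(num_ranks, num_nodes, agent, policy_file,
                          report_file, trace_file, app_exec, app_options=()):
    """Assemble the geopmlaunch/srun argument list from Step 4.

    Hypothetical helper: only the flags that appear on the slide are used.
    """
    return [
        "geopmlaunch", "srun",
        "-n", str(num_ranks),
        "-N", str(num_nodes),
        "--geopm-ctl=process",
        f"--geopm-agent={agent}",
        f"--geopm-policy={policy_file}",
        f"--geopm-report={report_file}",
        f"--geopm-trace={trace_file}",
        "--", app_exec, *app_options,
    ]


# Reproduce the monitor-agent example from the slide.
cmd = build_geopmlaunch_cmd(4, 1, "monitor", "monitor_policy.json",
                            "report.txt", "trace.csv", "a.out")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it could be handed to `subprocess.run` without extra shell quoting.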


Demo: Running Application with GEOPM


GEOPM: Components and Interfaces

Collecting Application Context
§ Application region markup API: computation/communication regions of interest
§ Epoch: end of iteration
§ OpenMP event callbacks

Power Assignment Policies
§ Governed policy: node-level assignment
§ Balanced policy: cluster-level assignment

Extension Interfaces
§ New Agent plugin: ConductorAgent
§ New PlatformIO plugin: IBM port of GEOPM


GEOPM Markup API: Purpose

• C interfaces provided in GEOPM that the application links against
  • Resemble typical profiler interfaces

• Annotation functions for programmers to provide information about application critical path and phases to GEOPM
  • Points where bulk synchronizations occur
  • Phase changes that occur in an MPI rank (i.e., phase entry and exit)
  • Hints on whether phases will be compute-, memory-, or communication-intensive
  • How much progress each MPI rank has made in the phase (critical path)


Application Markup API

MPI/Sequential Region

• Marking up regions of interest
  • geopm_prof_region(name, hint, ID)
  • geopm_prof_enter(ID)
  • geopm_prof_exit(ID)

• Marking region progress
  • geopm_prof_progress(ID, %progress)

• Marking a timestep
  • geopm_prof_epoch()

OpenMP Region

• Marking up regions of interest
  • geopm_tprof_init(num_work_unit)
  • geopm_tprof_init_loop(num_thread, thread ID, num_iter, chunk_size)

• Marking region progress
  • geopm_tprof_post()
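The call order implied by the MPI/sequential column (register a region, enter it, report progress, exit it, then mark the timestep) can be illustrated with a small stand-in. This is a sketch only: `MockProfiler` is a hypothetical Python class that records calls and is not GEOPM's API; the real functions are the C interfaces listed above, and the placement of the epoch call at the end of each iteration is an assumption.

```python
class MockProfiler:
    """Hypothetical stand-in that records markup calls; not part of GEOPM."""

    def __init__(self):
        self._regions = []     # region ID -> (name, hint)
        self.events = []       # ordered log of (call, region ID, value)
        self.epoch_count = 0

    def region(self, name, hint):
        """Register a region of interest and return its ID."""
        self._regions.append((name, hint))
        return len(self._regions) - 1

    def enter(self, region_id):
        self.events.append(("enter", region_id, None))

    def exit(self, region_id):
        self.events.append(("exit", region_id, None))

    def progress(self, region_id, fraction):
        self.events.append(("progress", region_id, fraction))

    def epoch(self):
        self.epoch_count += 1


def run_timesteps(prof, num_steps, num_chunks):
    """Mark up a toy timestep loop with one compute region."""
    compute_id = prof.region("compute", hint="compute")
    for _ in range(num_steps):
        prof.enter(compute_id)
        for chunk in range(num_chunks):
            # One chunk of real work would go here.
            prof.progress(compute_id, (chunk + 1) / num_chunks)
        prof.exit(compute_id)
        prof.epoch()           # one outer-loop iteration completed
    return compute_id


prof = MockProfiler()
run_timesteps(prof, num_steps=3, num_chunks=4)
print(prof.epoch_count, len(prof.events))
```

The point of the pattern is that every enter has a matching exit on the same rank, and that the epoch marker fires exactly once per outer-loop iteration.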


Demo: Using the GEOPM Markup API


Part II: Plug-ins to extend GEOPM algorithm and platform support


Demo: Using the Default GEOPM Policies


GEOPM Components of Interest

[Figure: GEOPM Core (hierarchical communication + plugin infrastructure) with the Agent, PlatformIO, Markup API (Application), and Endpoint components. Callouts on the Agent and PlatformIO components: MSR access, control, telemetry, application context, power management algorithm, profiling, accounting.]


GEOPM Plugin Interface

• Two types of plugins: PlatformIO and Agent plugins

• Example Agent plugins
  • MonitorAgent
  • BalancerAgent
  • GoverningAgent
  • EnergyEfficientAgent

• Example PlatformIO plugins
  • MSRIOGroup
  • KNLIOGroup

• Tutorial plugins: ExampleAgent and ExampleIOGroup
  • Key methods and code blocks
  • Policy description interface


Demo: GEOPM Agent Example


Part III: ECP Argo Contributions


ECP Argo: Selecting Power-Optimizing Configuration

§ Approach: Hardware overprovisioning with job-level power guarantees
  • More compute resources than you can power up at once

§ Objective: Optimize job performance under a power constraint

§ Solution: GEOPM, for power-constrained performance optimization

§ ECP Argo Contributions:
  • Augment GEOPM’s algorithm with performance-optimizing application configurations: # threads, frequency, etc.
  • Port GEOPM to IBM POWER9 (support for LLNL Sierra)


ECP Argo Contributions: Components and Interfaces

Collecting Application Context
§ Application region markup API: computation/communication regions of interest
§ Epoch: end of iteration
§ OpenMP event callbacks

Power Assignment Policies
§ Governed policy: node-level assignment
§ Balanced policy: cluster-level assignment

Extension Interfaces
§ New policy agent plugin: ConductorAgent
§ New PlatformIO plugin: IBM port of GEOPM


ECP Argo: How Much Do We Gain With Configuration Tuning?


Naïve Scheme: Static Power Allocation

§ Equally distribute and enforce power constraint over all nodes of a job
  • Uses Intel’s Running Average Power Limit (RAPL) interface

§ Statically select a configuration under the power constraint
  • Configuration: {Number of cores, Frequency/power limit}
  • Commonly used: Packed configuration
    • Maximum cores possible on the processor
    • Frequency or power limit as the control knob
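The naïve scheme reduces to a single division. A minimal sketch, assuming a job-level power bound in watts that is split evenly over nodes and then over the sockets of each node; the function name, the `sockets_per_node` parameter, and the numbers in the example are illustrative assumptions, and no RAPL call is made here.

```python
def static_power_allocation(job_power_watts, num_nodes, sockets_per_node=2):
    """Split a job-level power bound equally across nodes and sockets.

    Returns (per_node_watts, per_socket_watts). The result would be
    enforced once and never revisited, which is what makes it static.
    """
    if num_nodes <= 0 or sockets_per_node <= 0:
        raise ValueError("num_nodes and sockets_per_node must be positive")
    per_node = job_power_watts / num_nodes
    per_socket = per_node / sockets_per_node
    return per_node, per_socket


# Illustrative numbers only: an 8960 W job bound over 64 dual-socket nodes.
per_node, per_socket = static_power_allocation(8960.0, 64)
print(per_node, per_socket)
```

Every processor receives the same limit regardless of how much work it has, which is the root of the limitations discussed next.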

Limitations of Static Power Allocation

1. Trivial node-level configurations may be inefficient

   Input: {# cores, frequency/power limit}
   Output: {Execution time, power usage}

   • Up to 30% slower than the optimal configuration
   • Finding the optimal configuration needs a prohibitively large number of runs of the application

   [Figure: CoMD on 64 nodes; x-axis: processor power usage (watts), 50 to 90.]

2. Portion of power left unused with load-imbalanced applications (up to 40%)


Conductor: Dynamic Configuration and Power Management

§ Goals of ConductorAgent
  • Speed up computation on the critical path
  • Use power-efficient configuration

§ Need to dynamically identify
  • Computation region potentially on the critical path
  • {execution time, power usage} profile for every computation on every processor

ConductorAgent Algorithm

Step 1: Configuration Exploration

• Start
• Explore configurations: MPI processes 1, 2, 3, ..., n explore configurations k1, k2, ..., kn
• Allgather {Power, Execution Time}
• Construct Pareto frontier
• Select configuration kOPT

[Figure: plot over the explored configurations; x-axis: power usage (watts), 50 to 90.]
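The frontier construction and selection in Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the ConductorAgent source: each sample is a (configuration, power, execution time) tuple as if gathered from all processes, a configuration is a hypothetical (threads, power cap) pair, and all numbers are made up.

```python
def pareto_frontier(samples):
    """Keep the samples not dominated in both power and execution time.

    samples: list of (config, power_watts, time_seconds).
    Returns the frontier sorted by increasing power.
    """
    ordered = sorted(samples, key=lambda s: (s[1], s[2]))
    frontier = []
    best_time = float("inf")
    for sample in ordered:
        # A sample survives only if it is faster than every cheaper one.
        if sample[2] < best_time:
            frontier.append(sample)
            best_time = sample[2]
    return frontier


def select_config(frontier, power_limit_watts):
    """Pick the fastest frontier point that fits under the power limit."""
    feasible = [s for s in frontier if s[1] <= power_limit_watts]
    if not feasible:
        return None
    return min(feasible, key=lambda s: s[2])


# Made-up exploration results: ((threads, power cap), watts, seconds).
samples = [
    ((8, 50), 50.0, 120.0),
    ((16, 50), 50.0, 110.0),
    ((8, 70), 62.0, 100.0),
    ((12, 70), 68.0, 95.0),
    ((16, 70), 70.0, 80.0),
    ((8, 90), 75.0, 90.0),
    ((16, 90), 85.0, 78.0),
]
frontier = pareto_frontier(samples)
k_opt = select_config(frontier, power_limit_watts=70.0)
print(k_opt)
```

Sorting by power first means one pass is enough: a point belongs on the frontier only if it improves on the best time seen so far.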

ConductorAgent Algorithm

Step 2: Power Re-allocation

• Start
• Explore configurations
• Construct Pareto frontier
• Select configuration kOPT
• Is computation non-critical?
  • Yes: Slow down (reduce power)
  • No: Speed up (with unused power)
• Calculate new power allocation

[Figure: two histograms for ParaDiS, before and after power re-allocation, with a power limit of 70 W; x-axis: power usage (watts); y-axis: # tasks.]
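One re-allocation step from the flowchart can be sketched as below. This is a deliberately simplified illustration and not Conductor's actual policy: the criticality test (within a slack fraction of the slowest process), the fixed step size, and the per-process limits are all assumptions chosen for the example.

```python
def reallocate_power(times, alloc, job_budget, step=5.0,
                     p_min=50.0, p_max=90.0, slack=0.05):
    """Shift power from non-critical processes to critical ones.

    times: last measured execution time per process (seconds).
    alloc: current power allocation per process (watts).
    A process counts as critical if its time is within `slack` of the slowest.
    """
    if sum(alloc) > job_budget:
        raise ValueError("current allocation exceeds the job budget")
    t_max = max(times)
    critical = [i for i, t in enumerate(times) if t >= (1.0 - slack) * t_max]
    critical_set = set(critical)
    new_alloc = list(alloc)
    # Slow down non-critical processes by reducing their power.
    for i in range(len(times)):
        if i not in critical_set:
            new_alloc[i] = max(p_min, alloc[i] - step)
    # Speed up critical processes with the power that is now unused.
    share = (job_budget - sum(new_alloc)) / len(critical)
    for i in critical:
        new_alloc[i] = min(p_max, new_alloc[i] + share)
    return new_alloc


# Made-up example: four processes under a 280 W job budget (70 W each).
new_alloc = reallocate_power([10.0, 8.0, 8.0, 9.9], [70.0] * 4, 280.0)
print(new_alloc)
```

The total never exceeds the job budget, so the job-level power guarantee holds while the allocation per process changes.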


Conductor: Integration into GEOPM

§ OMPT class
  • Explore {OMP, Pcap} configurations during the exploration phase
  • Select power-efficient configuration during regular execution

§ Profile class
  • Report end of timestep (i.e., ‘epoch’), application and system telemetry to enable sweep of configuration at runtime

§ ConfigApp class
  • Perform profiling, generate Pareto-optimal configurations

§ ConfigAgent class
  • Share telemetry with PowerBalancer agent, send configuration to OMPT

ConductorAgent OMPT Profiler

Initialization: GEOPM, Application Handshake

[Figure: timeline between the GEOPM Controller and the Application Process. An init handshake takes place through a shared memory space (GEOPM::SharedMemory, GEOPM::SharedMemoryUser), followed by initialization of control and telemetry.]

ConductorAgent OMPT Profiler

Configuration Exploration: Set Configuration, Collect Telemetry

[Figure: timeline between the GEOPM Controller and the Application Process during configuration exploration. Steps shown: set configuration and set power cap (ThreadCnt, PowerCap, RegionID); set threads; run region; signal timestep; telemetry (Power, Time). Repeated to sweep all configurations.]

ConductorAgent OMPT Profiler

Configuration Selection: Pick Power-Efficient Configurations

[Figure: timeline between the GEOPM Controller and the Application Process during configuration selection. Steps shown: set configuration and set power cap (ThreadCnt, PowerCap); set threads; run region. Repeated through application completion.]


ECP Argo: End Result


GEOPM Port: Migration to New GEOPM IOGroup Interface

Purpose | Old: PlatformImp interface | New: IOGroup interface
Get platform information on POWER9 | PowerPlatformImp: extends PlatformImp | PowerIOGroup: extends IOGroup
RAPL-like monitoring and control on POWER9 | OCCPlatform: extends Platform | PowerIO: Direct CPU monitoring/control interface
Get platform information on GPUs | NVMLPlatformImp: extends PlatformImp | NVMLIOGroup: extends IOGroup
RAPL-like monitoring and control on GPUs | NVMLPlatform: extends Platform | NVMLIO: Direct GPU monitoring/control

* Additional modifications in GEOPM Agent implementations to fully support GEOPM power management on POWER9 dual socket + Nvidia Volta with NVLink


GEOPM IBM Port: IBM “Witherspoon” Node

System configuration & interfaces

§ CPU ID: PowerNV 8335-GTH, 2.2
§ Number of cores: 160 (4-way SMT), 3.7 GHz
§ System memory: 66 GB
§ GPU: Nvidia Tesla V100-SXM2
§ Software: RHEL, GNU C/C++, GNU Fortran, MPICH2

Telemetry

§ CPU frequency: /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq
§ CPU sensors:
  • /sys/firmware/opal/exports/occ_inband_sensors
  • Performance Monitoring library (perfmon2): libpfm4
§ GPU information: NVML :: *

Control

§ CPU frequency: /sys/devices/system/cpu/cpufreq/policy*/scaling_setspeed
§ GPU power limit: NVML :: nvmlDeviceSetPowerManagementLimit()
§ Node-level power capping: /sys/firmware/opal/powercap/system-powercap/powercap-current

We use a linear regression-based model to predict power usage at a given CPU frequency:

$P_{CPU} = \alpha F + C$

where
  $P_{CPU}$: P9 CPU power usage (watts)
  $F$: CPU frequency (GHz)
  $\alpha$: Coefficient of frequency scaling
  $C$: Constant offset (base frequency <-> power correlation)
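Fitting the two model parameters from measured (frequency, power) pairs is an ordinary least-squares problem. A minimal sketch in plain Python, assuming a handful of samples; the function names and the sample values are illustrative and are not measurements from the Witherspoon node.

```python
def fit_power_model(freqs_ghz, powers_watts):
    """Least-squares fit of P_CPU = alpha * F + C; returns (alpha, c)."""
    n = len(freqs_ghz)
    if n < 2 or n != len(powers_watts):
        raise ValueError("need at least two matching samples")
    mean_f = sum(freqs_ghz) / n
    mean_p = sum(powers_watts) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs_ghz)
    if sxx == 0:
        raise ValueError("frequencies must not all be equal")
    sxy = sum((f - mean_f) * (p - mean_p)
              for f, p in zip(freqs_ghz, powers_watts))
    alpha = sxy / sxx
    c = mean_p - alpha * mean_f
    return alpha, c


def predict_power(alpha, c, freq_ghz):
    """Predicted CPU power (watts) at the given frequency (GHz)."""
    return alpha * freq_ghz + c


# Made-up samples that lie exactly on P = 40 F + 30.
freqs = [2.0, 2.5, 3.0, 3.5]
powers = [110.0, 130.0, 150.0, 170.0]
alpha, c = fit_power_model(freqs, powers)
print(alpha, c, predict_power(alpha, c, 3.2))
```

Inverting the fitted model, F = (P_CPU - C) / alpha, gives the frequency to request for a target power, which is how a frequency knob can stand in for a power cap on a platform that lacks one.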


ECP Argo: Github Contributions

§ Conductor integration and IBM platform plugin: — https://github.com/geopm/geopm/pull/757

§ GEOPM integration with Caliper: — https://github.com/LLNL/Caliper/pull/213


GEOPM Team and Collaborations

GEOPM Core Team (Intel)
• Jonathan Eastep (Project Lead)
• Chris Cantalupo (Lead Developer)
• Fede Ardanaz
• Brad Geltz
• Brandon Baker
• Mohammad Ali
• Siddhartha Jana
• Diana Guttman

LLNL Team
• Aniruddha Marathe
• Tapasya Patki
• Stephanie Brink
• Barry Rountree


Questions?

Github links

Configuration Exploration: https://github.com/amarathe84/geopm/tree/master

IBM Port: https://github.com/amarathe84/geopm/tree/ibm-port

ECP Future Work

[Status labels on the slide: WIP, Done, Future]

§ Extend configuration exploration to include CPU-GPU configuration space

§ Power/Performance models for co-scheduling and workflows

§ Port Variorum to ARM, HPE, and other architectures

§ PowerStack Consortium and industry integration

§ ECP Phase I: GEOPM extensions, Power-aware SLURM, Legion extensions

§ ECP Phase II: Power control and monitoring through Variorum, co-scheduling and workflows

§ Deliver initial version of PowerStack

§ Integrate GEOPM and Variorum

§ Include node-level power capping after OPAL firmware update
