
Page 1: An Autonomic Performance Environment for Exascale

An Autonomic Performance Environment for Exascale  

Kevin A. Huck1, Nick Chaimov1, Allen D. Malony1, Sameer Shende1, Allan Porterfield2, Rob Fowler2

1University of Oregon, Eugene, Oregon, USA 2RENCI, Chapel Hill, North Carolina, USA

Page 2: An Autonomic Performance Environment for Exascale

What I am going to tell you…
• XPRESS project overview
  – OpenX, ParalleX, HPX, etc.
• Asynchronous tasking runtimes, tool challenges
• APEX overview
• Examples

Page 3: An Autonomic Performance Environment for Exascale

XPRESS
• DOE X-Stack project to develop the OpenX software stack, implementing the ParalleX execution model
  – Sandia National Laboratories (R. Brightwell)
  – Indiana University (T. Sterling, A. Lumsdaine)
  – Lawrence Berkeley National Laboratory (A. Koniges)
  – Louisiana State University (H. Kaiser)
  – Oak Ridge National Laboratory (C. Baker)
  – University of Houston (B. Chapman)
  – University of North Carolina (A. Porterfield, R. Fowler)
  – University of Oregon (A. Malony, K. Huck, S. Shende)
• http://xstack.sandia.gov/xpress/

Page 4: An Autonomic Performance Environment for Exascale

[Figure 1: Major components of the OpenX architecture. Application layer: legacy applications (MPI, OpenMP), new-model applications, domain-specific languages and active libraries, a metaprogramming framework, a compiler, and XPI. Runtime system instances: AGAS (name space, processes), LCOs (dataflow, futures, synchronization), lightweight threads with a context manager, and Parcels (message-driven computation). Operating system instances: OS threads, task recognition, address space and memory bank control, network drivers, and instrumentation, forming a distributed framework operating system. The runtime and OS layers interact through the Prime Medium interface/control layer, running on a hardware architecture of ~10^6 nodes with ~10^3 cores per node plus an integration network.]

The two major components of the XPRESS OpenX software architecture are the LXK operating system and the HPX-4 runtime system software. Between these two is a major interface protocol, the "Prime Medium," that supports bi-directional complex interoperability between the operating system and the runtime system. The interrelationship is mutually supporting, bi-directional, and dynamically adaptive.

Figure 1 shows key elements of the runtime system including lightweight user multithreading, parcel message-driven computation, LCO local control objects for sophisticated synchronization, and AGAS active global address space and processes. Each application has its own ephemeral instance of the HPX runtime system. The figure also shows the LXK operating system on the other side of the Prime Medium, consisting of a large ensemble of persistent lightweight kernel supervisors, each dedicated to a particular node of processor, memory, and network resources. It is the innovation and power of the OpenX software architecture that the runtime system and the operating system employ each other for services. Both are simple in design but achieve complexity of operation through a combination of high replication and dynamic interaction.

XPI, the low-level imperative API, will serve both as a readable interface for system software development and early experimental application programming with which to conduct experiments. It will also serve as a target for source-to-source compiler translation from high-level APIs and languages. XPI will represent the ParalleX semantics as closely as possible in a familiar form: the binding of calls to the C programming language. Part of the compiler challenge is to bridge XPI to the HPX runtime system.

Domain-specific programming will be provided through a metaprogramming framework that will permit rapid development of DSLs for diverse disciplines. An example of one such DSL derived from the meta-toolkit will be developed by the XPRESS project. Finally, targeting either XPI or native calls of the HPX runtime will be compilation strategies and systems to translate MPI and OpenMP legacy codes to a form that can be run by OpenX with performance at least as good as a native code implementation. (While it is desired that advance work in declarative programming styles, especially in the area of parallelism, be derived and …)


OpenX architecture, annotated by layer: Application Layer, Runtime Layer, OS Layer.

Page 5: An Autonomic Performance Environment for Exascale

ParalleX, HPX, and OpenX
• ParalleX is an experimental execution model
  – Theoretical foundation of task-based parallelism
  – Run in the context of distributed computing
  – Supports integration of the entire system stack
• HPX is the runtime system that implements ParalleX
  – Exposes a uniform, standards-oriented API
  – Enables writing fully asynchronous code using hundreds of millions of threads (see the sketch below)
  – Provides unified syntax and semantics for local and remote operations
  – HPX-3 (LSU, C++ language/Boost)
  – HPX-5 (IU, C language)
• XPRESS will develop an OpenX exascale software stack based on ParalleX
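To make the asynchrony bullet concrete, here is a minimal sketch of spawning an HPX lightweight task and waiting on its future. It assumes only that HPX-3's C++ API mirrors std::async/std::future (hpx::async, hpx::future, hpx::init, hpx::finalize); header paths and signatures can differ between HPX versions, so treat this as illustrative rather than as code from the deck.

```cpp
// Minimal HPX-3 sketch (assumption: hpx::async/hpx::future mirror std::async/std::future;
// header layout varies across HPX versions).
#include <hpx/hpx_init.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <iostream>

int square(int x) { return x * x; }

int hpx_main(int argc, char* argv[])
{
    // Spawn a lightweight HPX task; the returned future is an LCO that
    // suspends (rather than blocks) the calling task until the value is ready.
    hpx::future<int> f = hpx::async(square, 21);
    std::cout << "square(21) = " << f.get() << std::endl;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);   // start the HPX runtime, which runs hpx_main
}
```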

Page 6: An Autonomic Performance Environment for Exascale

ParalleX / HPX
• Governing principles (components)
  – Active global address space (AGAS) instead of PGAS
    • Enables seamless distributed load balancing and adaptive data placement and locality control
  – Message-driven via Parcels instead of message passing
    • Enables transparent latency hiding and event-driven computation
    • Parcels are a form of active messages
  – Lightweight control objects (LCOs) instead of global barriers
    • Reduces global contention and improves parallel efficiency
    • Synchronization semantics
  – Moving work to data instead of moving data to work
    • Basis for asynchronous parallelism and dataflow-style computation
  – Fine-grained parallelism of lightweight threads (vs. CSP)
    • Improves system and resource utilization

Page 7: An Autonomic Performance Environment for Exascale

HPX-3 Architecture

[Diagram: HPX-3 runtime components — process manager, local memory management, AGAS translation, interconnect, parcel port and parcel handler, action manager, thread manager and thread pool, local control objects, performance counters, and performance monitor.]

Page 8: An Autonomic Performance Environment for Exascale

HPX-5 Architecture

Page 9: An Autonomic Performance Environment for Exascale

OpenX Software Stack and APEX

[Figure 1 repeated: major components of the OpenX architecture, as shown on Page 4.]


* From XPRESS proposal

Page 10: An Autonomic Performance Environment for Exascale

Runtime Adaptation Motivation
• Controlling concurrency
  – Energy efficiency
  – Performance
• Parametric variability
  – Granularity for this machine / dataset?
• Load balancing
  – When to perform AGAS migration?
• Parallel algorithms (for_each…)
  – Separate the what from the how
• Address the "SLOWER" performance model

Page 11: An Autonomic Performance Environment for Exascale

Tool Challenges
• The runtime wants to manage pre-emption
  – A task starts on one OS thread, is de-scheduled by the task manager at some "waiting" point (which changes the stack), and resumes on another thread/core/node
  – Where/how to instrument?
• Overhead concerns
  – ParalleX goal: million-way concurrency
  – NOTHING* shared between threads
• Need access to the profile for decision control
• Two implementations of HPX

Page 12: An Autonomic Performance Environment for Exascale

APEX and Autonomics
• Performance awareness and performance adaptation
• Top-down and bottom-up performance mapping / feedback
  – Make node-wide resource utilization data and analysis, energy consumption, and health information available in real time
  – Associate performance state with policy for feedback control
• APEX introspection
  – OS (LXK): track system resource assignment, utilization, job contention, overhead
  – Runtime (HPX): track threads, queues, concurrency, remote operations, parcels, memory management
  – ParalleX, DSLs, and legacy codes allow language-level performance semantics to be measured

Page 13: An Autonomic Performance Environment for Exascale

APEX Information Flow

[Diagram: APEX information flow among the Application, HPX, APEX Introspection, APEX State, the APEX Policy Engine, and the RCR Toolkit; events are collected synchronously or asynchronously, and asynchronous collection is either triggered or periodic.]

Page 14: An Autonomic Performance Environment for Exascale

APEX Global Flow (as needed)

• Leverage the runtime (HPX) to provide global introspection, state, and policies

[Diagram: Node 1 through Node N, each running HPX with its own APEX State and Policy Engine, connected for global flow as needed.]

Page 15: An Autonomic Performance Environment for Exascale

APEX Introspection
• APEX collects data through "inspectors"
  – Synchronous: uses an event API and event "listeners" (see the sketch below)
    • Initialize, terminate, new thread: added to the HPX runtime
    • Timer start, stop, yield*, resume*: added to the HPX task scheduler
    • Sampled values (counters from HPX-5, HPX-3)
    • Custom events (meta-events)
  – Asynchronous: does not rely on events, but occurs periodically
• APEX exploits access to performance data from lower stack components
  – Reading from the RCR blackboard (e.g., power, energy)
  – "Health" data through other interfaces (/proc/stat, cpuinfo, meminfo, net/dev, self/status, lm_sensors, power*, etc.)
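As a concrete illustration of the synchronous event API, here is a minimal C++ sketch of an instrumented task. The start/stop/sample_value names follow the APEX library at https://github.com/khuck/xpress-apex, but exact headers and signatures may differ by version, and the initialization/finalization calls are omitted for that reason.

```cpp
// Hedged sketch of APEX's synchronous event API (names per the APEX project;
// exact signatures vary across versions, so treat this as illustrative C++).
// apex::init(...)/apex::finalize() calls are omitted: their signatures differ
// between APEX releases.
#include <apex_api.hpp>

void solve_timestep(double dt)
{
    // Start event: APEX looks up (or creates) a timer for this name and
    // returns a profiler handle holding the starting timestamp.
    apex::profiler* p = apex::start("solve_timestep");

    // ... application work ...

    // Sampled value: push a counter observation into the event stream.
    apex::sample_value("timestep size", dt);

    // Stop event: timestamp the end and hand the profiler object to the
    // asynchronous consumer thread, which folds it into the statistical profile.
    apex::stop(p);
}
```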

Page 16: An Autonomic Performance Environment for Exascale

RCR: Resource Centric Reflection
• Performance introspection across layers to enable dynamic, adaptive operation and decision control
• Extends previous work on building decision-support instrumentation (RCRToolkit) for introspective adaptive scheduling
• A daemon monitors shared, non-core resources
• Real-time analysis; raw/processed data are published to a shared memory region ("blackboard") to which clients subscribe (see the sketch below)
• Utilized at lower levels of the OpenX stack
• APEX introspection and policy components access and evaluate RCR values
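The publish/subscribe pattern described above can be sketched generically: the snippet below maps a POSIX shared-memory segment and reads one published power value. The segment name, struct layout, and field names are hypothetical illustrations only; the real RCR blackboard format is defined by RCRToolkit, not by this sketch.

```cpp
// Illustrative reader for a shared-memory "blackboard" (POSIX shm).
// The segment name and struct layout are hypothetical stand-ins.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

struct BlackboardEntry {             // hypothetical published record
    volatile unsigned long sequence;  // even = stable, odd = being updated
    volatile double node_power_watts;
};

int main()
{
    int fd = shm_open("/rcr_blackboard_demo", O_RDONLY, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    void* addr = mmap(nullptr, sizeof(BlackboardEntry), PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }
    const BlackboardEntry* bb = static_cast<const BlackboardEntry*>(addr);

    // Seqlock-style read: retry if the writer was mid-update.
    unsigned long s1, s2;
    double watts;
    do {
        s1 = bb->sequence;
        watts = bb->node_power_watts;
        s2 = bb->sequence;
    } while (s1 != s2 || (s1 & 1));

    std::printf("node power: %.1f W\n", watts);
    munmap(addr, sizeof(BlackboardEntry));
    close(fd);
    return 0;
}
```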

Page 17: An Autonomic Performance Environment for Exascale

APEX Event Listeners
• Profiling listener (a sketch follows this list)
  – Start event: input name/address, get timestamp, return profiler handle
  – Stop event: get timestamp, put the profiler object in a queue for back-end processing, return
  – Sample event: put the name and value in the queue
  – Asynchronous consumer thread: process profiler objects and samples to build a statistical profile (in HPX-3, processed/scheduled as a thread/task)
• Concurrency listener (postmortem analysis)
  – Start event: push timer ID on a stack
  – Stop event: pop timer ID off the stack
  – Asynchronous consumer thread: periodically log the current timer for each thread, output a report at termination
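To show the shape of such a listener, here is a hedged C++ sketch of a profiling-style event listener that timestamps start events and queues completed measurements for an asynchronous consumer. The class, member, and method names are illustrative assumptions, not APEX's actual internal types.

```cpp
// Illustrative profiling listener in the spirit of the design above.
// All type and member names here are assumptions for explanation only.
#include <chrono>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>

using clock_type = std::chrono::steady_clock;

struct profiler {                        // handle returned on "start"
    std::string name;
    clock_type::time_point start_time;
    clock_type::time_point stop_time;
};

class profiling_listener {
public:
    // Start event: record the name and a timestamp, hand back a handle.
    profiler* on_start(const std::string& name) {
        return new profiler{name, clock_type::now(), {}};
    }

    // Stop event: timestamp and enqueue for the consumer thread; no
    // aggregation on the critical path of the calling task.
    void on_stop(profiler* p) {
        p->stop_time = clock_type::now();
        std::lock_guard<std::mutex> lock(queue_mutex_);
        completed_.push(p);
    }

    // Asynchronous consumer: drain the queue, fold durations into a profile.
    void process_completed() {
        std::lock_guard<std::mutex> lock(queue_mutex_);
        while (!completed_.empty()) {
            profiler* p = completed_.front(); completed_.pop();
            total_seconds_ += std::chrono::duration<double>(
                p->stop_time - p->start_time).count();
            ++call_count_;
            delete p;
        }
    }

private:
    std::mutex queue_mutex_;
    std::queue<profiler*> completed_;
    double total_seconds_ = 0.0;
    std::size_t call_count_ = 0;
};
```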

Page 18: An Autonomic Performance Environment for Exascale

APEX Policy Listener
• Policies are rules that decide on outcomes based on observed state
  – Triggered policies are invoked by introspection API events
  – Periodic policies run periodically on the asynchronous thread
• Policies are registered with the Policy Engine (see the sketch below)
  – Applications, runtimes, and/or the OS register callback functions
• Callback functions define the policy rules
  – "If x < y then …"
• Enables runtime adaptation using introspection data
  – Engages actuators across stack layers
  – Could also be used to involve online auto-tuning support*
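Here is a hedged sketch of registering a periodic policy. The register_periodic_policy and get_profile names follow the APEX policy API described in this deck, but the exact signatures, context type, and profile fields may differ by APEX version; the counter name, thresholds, and actuator comments are illustrative assumptions.

```cpp
// Hedged sketch of an APEX policy callback; signatures vary across versions.
#include <apex_api.hpp>

// Hypothetical low/high thresholds for the rule "if x < y then ...".
static const double LOW_QUEUE  = 8.0;
static const double HIGH_QUEUE = 64.0;

int queue_length_policy(apex_context const& /*context*/)
{
    // Read an APEX-maintained value (illustrative counter name).
    apex_profile* p = apex::get_profile("thread queue length");
    if (p != nullptr && p->calls > 0) {
        double mean = p->accumulated / p->calls;
        if (mean < LOW_QUEUE)  { /* actuator: raise the thread cap */ }
        if (mean > HIGH_QUEUE) { /* actuator: lower the thread cap */ }
    }
    return APEX_NOERROR;
}

void register_policies()
{
    // Periodic policy: evaluated on APEX's asynchronous thread every 100 ms.
    apex::register_periodic_policy(100000 /* microseconds */, queue_length_policy);
}
```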

Page 19: An Autonomic Performance Environment for Exascale

APEX Global View
• All APEX introspection is collected locally
  – However, it is not limited to a single-node view
• Global view of introspection data and interactions
  – Take advantage of the distributed runtime support
    • HPX-3, HPX-5, MPI, …
• API provided for back-end implementations (see the sketch below)
  – apex_global_get_value(): each node gets the data to be reduced; optional AGAS/RDMA put (push model)
  – apex_global_reduce(): optional AGAS/RDMA get (pull model); node data is aggregated at the root node, with an optional broadcast back out
• Extends the global view for policies
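The reduction pattern those two calls describe can be illustrated with plain MPI (one of the supported back-ends listed above). This is not the apex_global_* implementation, only a sketch of the aggregate-at-root-and-broadcast flow, under the assumption that each node contributes one locally collected value.

```cpp
// MPI illustration of the global-view reduction pattern (not APEX's own code).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // "get_value" step: each node produces its locally observed value
    // (a stand-in for, e.g., its current thread queue length).
    double local_value = 10.0 + rank;

    // "reduce" step: aggregate at the root node...
    double sum = 0.0;
    MPI_Reduce(&local_value, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // ...and optionally broadcast the aggregate back out so every node's
    // policy engine can act on the global view.
    double global_mean = (rank == 0) ? sum / size : 0.0;
    MPI_Bcast(&global_mean, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global mean = %f\n", global_mean);
    MPI_Finalize();
    return 0;
}
```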

Page 20: An Autonomic Performance Environment for Exascale

APEX Examples
• HPX-3 1-D stencil code
• HPX-5 LULESH kernel
• Experiments conducted on Edison
  – Cray XC30 @ NERSC.gov
  – 5576 nodes, each with two 12-core Intel "Ivy Bridge" processors at 2.4 GHz
  – 48 threads per node (24 physical cores w/ hyperthreading)
  – Cray Aries interconnect with Dragonfly topology and 23.7 TB/s global bandwidth

Page 21: An Autonomic Performance Environment for Exascale

Concurrency Throttling for Performance

• Heat diffusion
• 1-D stencil code (a sketch of the stencil update follows the plot below)
• Data array partitioned into chunks
• 1 node with no hyperthreading
• Performance increases to a point with increasing worker threads, then decreases

[Plot: 1d_stencil (HPX-3) runtime in seconds vs. number of worker threads, 1 to 24.]
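For readers unfamiliar with the benchmark, this is a minimal sketch of the kind of update a 1-D heat-diffusion stencil performs on each chunk; the coefficient names and loop structure are generic illustrations, not the HPX 1d_stencil source.

```cpp
// Generic 1-D heat-diffusion stencil update (illustrative; not the HPX example's code).
#include <cstddef>
#include <vector>

// One explicit time step on a chunk [begin, end) of the temperature array,
// assuming 1 <= begin and end <= u.size() - 1 so neighbors exist.
// k = heat transfer coefficient, dt = time step, dx = grid spacing.
void heat_step(const std::vector<double>& u, std::vector<double>& u_next,
               std::size_t begin, std::size_t end,
               double k, double dt, double dx)
{
    const double c = k * dt / (dx * dx);
    for (std::size_t i = begin; i < end; ++i) {
        const double left   = u[i - 1];   // neighbor values; chunk boundaries
        const double middle = u[i];       // would come from adjacent partitions
        const double right  = u[i + 1];   // (futures/LCOs in the HPX version)
        u_next[i] = middle + c * (left - 2.0 * middle + right);
    }
}
```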

Page 22: An Autonomic Performance Environment for Exascale

Concurrency Throttling for Performance

• The region of maximum performance correlates with the thread queue length runtime performance counter
  – Represents the number of tasks currently waiting to execute
• Could do introspection on this counter to control a concurrency throttling policy (*work in progress)

[Plot: 1d_stencil runtime in seconds and thread queue length vs. number of worker threads, 1 to 24.]

Page 23: An Autonomic Performance Environment for Exascale

1d_stencil Baseline

• 48 worker threads (with hyperthreading)
• Actual concurrency much lower
  – The implementation is memory bound
• Large variation in concurrency over time
  – Tasks waiting on prior tasks to complete

[Plot: concurrency over time, stacked by event-generated task type (partition_data, do_work_action, hpx::lcos::local::dataflow::execute, primary_namespace_bulk_service_action, primary_namespace_service_action, other), with thread cap and power; annotations mark the event-generated metrics and where the calculation takes place. 100,000,000 elements, 1000 partitions; 138 seconds total.]

Page 24: An Autonomic Performance Environment for Exascale

1d_stencil w/optimal # of Threads

• 12 worker threads
• Greater proportion of threads kept busy
  – Less interference between active threads and threads waiting for memory
• Much faster: 61 sec. vs. 138 sec.

[Plot: concurrency by task type (same legend as the baseline plot), thread cap, and power over time for the same problem (100,000,000 elements, 1000 partitions); 61 seconds total.]

Page 25: An Autonomic Performance Environment for Exascale

1d_stencil Adaptation with APEX

• Initially 48 worker threads
• Discrete hill-climbing search to minimize the average number of pending tasks (a sketch follows the figure below)
• Converges on 13 threads (vs. the optimal 12)
• Nearly as fast as optimal: 64 seconds vs. 61 seconds

[Plot: concurrency by task type (same legend as the baseline plot), thread cap, and power over time as APEX lowers the thread cap (100,000,000 elements, 1000 partitions); 64 seconds total.]
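A discrete hill-climbing throttling policy of this kind can be sketched as follows. The structure mirrors a periodic policy callback as described earlier, but set_thread_cap() and current_pending_tasks() are hypothetical stubs standing in for the HPX scheduler hook and the APEX measurement, and the step logic is one plausible reading of "discrete hill climbing," not the implementation used in the experiment. Note that a search like this oscillates around the optimum, which is consistent with converging on 13 rather than the optimal 12 threads.

```cpp
// Illustrative discrete hill-climbing policy for a worker-thread cap.
// The two stub functions below are hypothetical hooks, not APEX or HPX APIs.
#include <algorithm>

static int cached_cap = 48;
static void set_thread_cap(int cap) { cached_cap = cap; /* real: HPX scheduler hook */ }
static double current_pending_tasks() { return 0.0;     /* real: averaged queue length */ }

namespace {
    const int min_cap = 1, max_cap = 48;
    int    thread_cap   = 48;     // start with all hardware threads
    int    direction    = -1;     // current search direction (shrink first)
    double best_pending = 1e300;  // best (lowest) average pending-task count seen
}

// Called periodically with a fresh averaged observation.
void hill_climb_step()
{
    double pending = current_pending_tasks();

    if (pending <= best_pending) {
        best_pending = pending;   // the last move helped: keep this direction
    } else {
        direction = -direction;   // the last move hurt: reverse direction
    }

    thread_cap = std::max(min_cap, std::min(max_cap, thread_cap + direction));
    set_thread_cap(thread_cap);
}
```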

Page 26: An Autonomic Performance Environment for Exascale

1D stencil – adapting grain size

[Plot: concurrency (hpx::lcos::local::dataflow::execute, hpx_main, other) and the grain size parameter value over time as APEX adapts the grain size.]

Page 27: An Autonomic Performance Environment for Exascale

OpenMP version: adapting concurrency and block size with an ActiveHarmony policy

[Plot: concurrency (solve_iteration outlined OpenMP regions [.omp_fn.2, .omp_fn.3], idle), thread cap, power, and block size over time as the ActiveHarmony policy adapts both parameters.]

Page 28: An Autonomic Performance Environment for Exascale

Throttling for Power
• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)
  – Proxy application in DOE co-design efforts for exascale
  – CPU bound in most implementations (we use HPX-5)
• Develop an APEX policy for power (a sketch follows this list)
  – Threads are idled in HPX to keep the node under a power cap
  – Use hysteresis over the last 3 observations
  – If power < low cap, increase the thread cap
  – If power > high cap, decrease the thread cap
• The HPX thread scheduler is modified to idle/activate threads per the cap
• Test example:
  – 343 domains, nx = 48, 100 iterations
  – 16 nodes of Edison, Cray XC30
  – Baseline vs. throttled (200 W per node high power cap)
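The power policy just described can be sketched as a periodic callback. Here power_watts(), set_thread_cap(), and the cap thresholds are hypothetical stand-ins for the RCR power reading and the HPX scheduler hook, and the hysteresis is implemented as the mean of the last three observations, which is one plausible reading of the bullet above rather than the exact rule used in the experiment.

```cpp
// Illustrative power-capping policy with 3-observation hysteresis.
// The two stub functions are hypothetical hooks; the real policy reads power
// from the RCR blackboard and adjusts the HPX scheduler's thread cap.
#include <algorithm>
#include <deque>
#include <numeric>

static double power_watts() { return 0.0; /* real: latest node power from RCR */ }
static void   set_thread_cap(int /*cap*/) { /* real: idle/activate HPX workers */ }

namespace {
    const double LOW_CAP_WATTS  = 180.0;   // illustrative thresholds around a
    const double HIGH_CAP_WATTS = 200.0;   // 200 W per-node target
    const int    MIN_THREADS = 12, MAX_THREADS = 48;
    int thread_cap = MAX_THREADS;
    std::deque<double> history;            // last 3 power observations
}

// Invoked periodically by the policy engine.
void power_cap_policy()
{
    history.push_back(power_watts());
    if (history.size() > 3) history.pop_front();
    if (history.size() < 3) return;        // wait for 3 samples (hysteresis)

    double mean = std::accumulate(history.begin(), history.end(), 0.0) / 3.0;

    if (mean > HIGH_CAP_WATTS)      thread_cap = std::max(MIN_THREADS, thread_cap - 1);
    else if (mean < LOW_CAP_WATTS)  thread_cap = std::min(MAX_THREADS, thread_cap + 1);

    set_thread_cap(thread_cap);
}
```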

LULESH image source: Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Technical Report LLNL-TR-490254.

Page 29: An Autonomic Performance Environment for Exascale

LULESH Baseline

[Plot: concurrency by HPX-5 task type (advanceDomainaction, SBN1resultaction, SBN3resultaction, SBN3sendsaction, PosVelresultaction, PosVelsendsaction, MonoQresultaction, MonoQsendsaction, lcoget PINNED, hpxlcosetaction PINNED, probe, hpx143fix, other), thread cap, and power over time; an aprun region is annotated.]

No power cap, no thread cap, 768 threads, avg. 247 Watts/node, ~360 kilojoules total, ~91 seconds total, ~74 seconds of HPX tasks.

Page 30: An Autonomic Performance Environment for Exascale

LULESH Throttled by Power Cap

[Plot: concurrency by HPX-5 task type (same legend as the baseline plot, plus mainaction), thread cap, and power over time under the 200 W power cap; an aprun region is annotated.]

200 W power cap, 768 threads throttled to 220, avg. 186 Watts/node, ~280 kilojoules total, ~94 seconds total, ~77 seconds of HPX tasks.

Page 31: An Autonomic Performance Environment for Exascale

LULESH Performance Explanation
• 768 vs. 220 threads (after throttling)
• 360 kJ vs. 280 kJ total energy consumed (~22% decrease)
• 75.24 vs. 77.96 seconds in HPX (~3.6% increase)
• Big reduction in yielded action stalls (in the thread scheduler)
  – Less contention for network access
• Hypothesis: the LULESH implementation shows signs of being network bound; spatial locality of subdomains is not maintained during decomposition

Metric           Baseline      Throttled     Throttled as % of Baseline
Cycles           1.11341E+13   3.88187E+12   34.865%
Instructions     7.33378E+12   5.37177E+12   73.247%
L2 Cache Misses  8422448397    3894908172    46.244%
IPC              0.658677      1.38381       210.089%
INS/L2CM         870.742       1379.18       158.391%
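As a check on the derived rows, IPC is instructions divided by cycles, and the last column is the throttled value divided by the baseline value:

```latex
\[
\mathrm{IPC}_{\text{baseline}} = \frac{7.33378\times 10^{12}}{1.11341\times 10^{13}} \approx 0.6587,
\qquad
\mathrm{IPC}_{\text{throttled}} = \frac{5.37177\times 10^{12}}{3.88187\times 10^{12}} \approx 1.3838,
\]
\[
\frac{\mathrm{IPC}_{\text{throttled}}}{\mathrm{IPC}_{\text{baseline}}} \approx 2.101 = 210.1\%,
\qquad
\frac{3.88187\times 10^{12}\ \text{cycles}}{1.11341\times 10^{13}\ \text{cycles}} \approx 0.3487 = 34.9\%.
\]
```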

Page 32: An Autonomic Performance Environment for Exascale

LULESH latest results

[Two plots: concurrency by HPX-5 task type (advanceDomainaction, SBN1resultaction, SBN3resultaction, SBN3sendsaction, PosVelresultaction, PosVelsendsaction, MonoQresultaction, MonoQsendsaction, initDomainaction, probehandler, pwclcogetrequesthandler, other), thread cap, and power over time for the unmodified and power-capped runs.]

Figures: unmodified and power-capped (220 W) executions of LULESH (HPX-5), with equal execution times (8000 subdomains, 64^3 elements per subdomain on 8016 cores of Edison, 343 nodes, 24 cores per node). 12.3% energy savings with no performance change.

Page 33: An Autonomic Performance Environment for Exascale

Future Work, Discussion
• Conduct more robust experiments, at larger scales, and on different platforms
• More, better policy rules
  – Runtime and operating system
  – Application- and device-specific (*in progress)
  – Global policies (*in progress)
• Multi-objective optimization (ActiveHarmony)
• Integration into the HPX-5 and HPX-3 code releases
• API refinements for general-purpose usage
• Global data exchange: who does it, and when?
• "MPMD" processing: how to (whether to?) sandbox APEX
• Other runtimes: OpenMP/OMPT, OmpSs, Legion, OCR…
• Source code: https://github.com/khuck/xpress-apex

Page 34: An Autonomic Performance Environment for Exascale

Acknowledgements
• Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences / Biological and Environmental Research / High Energy Physics / Fusion Energy Sciences / Nuclear Physics) under award numbers DE-SC0008638, DE-SC0008704, DE-FG02-11ER26050, and DE-SC0006925.

• Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. DOE's National Nuclear Security Administration under contract DE-AC04-94AL85000.

•  This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.