Through the Looking-Glass: From Performance Observation
to Dynamic Adaptation
Allen D. Malony
ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’15)
In dedication to: the Parallel Performance Tools community
Talk Outline ❑ Sequel ❑ Performance Retrospective
❍ Observability Era ❍ Diagnosis Era ❍ Complexity Era
❑ Performance Evolution or Revolution ❑ Exascale Era and Productivity
❍ Invalid Uniformity Assumptions ❍ A New Performance Observability ❍ Dynamic Adaptability ❍ Projects
❑ Future
“Sometimes I've believed as many as six impossible things before breakfast!”
❑ Willing suspension of disbelief ❑ Suppose that full knowledge of the state of a high-performance parallel / distributed system were available ❍ There is no cost to produce it, access it, or analyze it ❍ There is no delay in providing it to wherever it might be used ❍ It can be observed by anything at any time
❑ What is the state? What state is valuable? ❑ Would it fundamentally change how future HPC systems are programmed? How? ❑ Could the combination of holistic observation (self-awareness), dynamic adaptivity, and collective control be the basis for a different execution paradigm?
“I'm not strange, weird, off, nor crazy, my reality is just different from yours.”
❑ Our reality is shaped by our experience ❑ Started work in parallel computing in 1986 (29 years ago)
❍ Research engineer at HP Labs (1981 – 1986) ❍ Ph.D. graduate student at UIUC (1986 – 1991) ◆ Dan Reed, Ph.D. advisor
❍ Senior software engineer at Center for Supercomputing Research and Development (CSRD) (1987 – 1991) ◆ built parallel performance tools for Cedar supercomputer ◆ worked with gods: Kuck, Lawrie, E. Davidson, A. Sameh, …
❍ Professor, University of Oregon (1992 – present) ❍ CEO and Director, ParaTools, Inc. (2004 – present)
Parallel Performance, Engineering, and Tools ❑ Generally, I am interested in the nature of parallel performance ❑ Performance has been the fundamental driving concern for parallel
computing and high-performance computing ❑ Achieving performance in real systems is an engineering process ❍ Observation: measure and characterize performance behavior ❍ Diagnosis: identify and understand performance problems ❍ Tuning: modify to run optimally on high-end machines
❑ How to make the process more effective and productive? ❍ What is the nature of the performance problem solving? ❍ What is the performance technology to be applied?
❑ Compelling reasons to build and integrate performance tools ❍ To obtain, learn, and carry forward performance knowledge ❍ To address increasing scalability and complexity issues ❍ To bridge between computation model and execution model operation
❑ Parallel systems evolution will change process, technology, use
Performance Technology Eras ❑ Performance technology has evolved to serve the
dominant architectures and programming models ❍ Observability era (1991 – 1998) ◆ instrumentation, measurement, analysis
❍ Diagnosis era (1998 – 2007) ◆ identifying performance inefficiencies
❍ Complexity era (2008 – 2012) ◆ scale, memory hierarchy, network, multicore, GPU
❑ 20+ years of stable parallel execution models ❍ “1st person” application focus for tools support ❍ Productivity focused on optimizing performance
❑ Exascale era (2013 – future)
❑ Performance evaluation problems define the requirements for performance measurement and analysis methods
❑ Performance observability is the ability to “accurately” capture, analyze, present, and understand (collectively observe) information about parallel software and systems
❑ Tools for performance observability must balance the need for performance data against the cost of obtaining it (environment complexity, performance intrusion) ❍ Too little performance data makes analysis difficult ❍ Too much data perturbs the measured system
❑ Important to understand performance observability complexity and develop technology to address it
Performance Observability (1991-1998)
A. Malony, “Performance Observability,” Ph.D. Thesis, University of Illinois, Urbana-Champaign, 1991.
❑ All performance measurements generate overhead ❑ Overhead is the cost of performance measurement
❍ Overhead causes (generates) performance intrusion ❑ Intrusion is the dynamic performance effect of overhead
❍ Intrusion causes (potentially) performance perturbation ❑ Perturbation is the change in performance (program) behavior
❍ Measurement does not change “possible” executions !!! ❍ Perturbation can lead to erroneous performance data (Why?) ❍ Alteration of “probable” performance behavior (How?)
❑ Instrumentation uncertainty principle ❍ Any instrumentation perturbs the system state ❍ Execution phenomena and instrumentation are coupled ❍ Volume and accuracy are antithetical
❑ Perturbation analysis is possible with deterministic execution ❑ Otherwise, measurement must be considered part of the system
Instrumentation Uncertainty Principle
❑ Use time-based perturbation analysis to remove overhead ❑ There is more to the problem (Why?)
Perturbation Analysis – Livermore Loops
Full: full instrumentation; Raw: uninstrumented; Model: time-based perturbation analysis
[Figure: full, raw, and modeled execution times for serial and parallel execution of Livermore loops, including inner product, banded linear equations, and implicit/conditional computation.]
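To make the time-based approach concrete, a minimal sketch is shown below; the event structure and probe-cost constant are illustrative, not any particular tool's trace format:

```cpp
#include <cstdio>
#include <vector>

// Illustrative event record: a timestamp and the number of instrumentation
// probes executed before this event on the same thread.
struct Event {
    double measured_time_us;   // timestamp taken from the instrumented run
    long   probes_so_far;      // cumulative count of earlier probes
};

// Time-based perturbation analysis (sequential case): approximate the
// uninstrumented timestamp by subtracting the accumulated probe overhead.
std::vector<double> approximate_times(const std::vector<Event>& events,
                                      double probe_cost_us) {
    std::vector<double> approx;
    approx.reserve(events.size());
    for (const Event& e : events)
        approx.push_back(e.measured_time_us - e.probes_so_far * probe_cost_us);
    return approx;
}

int main() {
    // Two events measured at 10 us and 110 us, after 1 and 3 probes.
    std::vector<Event> trace = {{10.0, 1}, {110.0, 3}};
    for (double t : approximate_times(trace, 2.5))
        std::printf("approximated time: %.2f us\n", t);
}
```

As the next slide argues, this simple subtraction is only valid for sequential (or deterministic) execution; concurrent waiting and contention require event-based modeling.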
Event-based Perturbation Analysis ❑ Time-based perturbation
analysis does not take into account concurrent behavior (dependencies) ❍ Loops 3, 4, 17 are
DOACROSS loops ❍ Measurement affects
probability of contention!
❑ Must use event-based perturbation analysis to “model” contention ❍ Requires more events!
Livermore Loop    Measured/Raw    Approximated/Raw
3                 2.48            0.96
4                 2.64            1.06
17                9.97            0.97
instrumentation in the critical section results in a higher probability of contention and thus MORE waiting
relatively greater instrumentation outside the critical section results in lower probability of contention and thus LESS waiting
❑ LLL 3, 4, and 17 are DOACROSS loops that were statically scheduled
❑ Consider a dynamically scheduled DOALL loop
❑ What is the problem? ❍ Non-deterministic execution is
affected by observation ❑ Event-based perturbation analysis
must include knowledge of the scheduling system
❑ In the worst case, you might have to model (simulate) the whole runtime software and possibly the machine!
“Curiouser and Curiouser”
[Figure legend: iteration begin instrumentation, iteration end instrumentation, waiting]
This is not possible!
❑ What is the “true” parallel computation performance? ❑ Some technique must be used to observe performance ❑ Any performance measurement will be intrusive ❑ Any performance intrusion might result in perturbation ❑ Performance analysis is based on performance measures
❍ How is “accuracy” of performance analysis evaluated? ❍ How is this done when “true” performance is unknown?
❑ Uncertainty applies to all experimental methods ❍ “Truth” lies just beyond the reach of observation
❑ Develop performance observation systems that can deliver robust performance data efficiently with low overhead
❑ Reason about performance measurement effects
Performance Uncertainty
❑ Performance problem solving framework for HPC ❍ Integrated, scalable, flexible, portable ❍ Target all parallel programming / execution paradigms
❑ Integrated performance toolkit (open source) ❍ Multi-level performance instrumentation ❍ Flexible and configurable performance measurement ❍ Widely-ported performance profiling / tracing system ❍ Performance data management and data mining
TAU Performance System®
[Figure: TAU's parallel system model — nodes containing contexts and threads within an SMP sharing a VM space and node memory (model view), connected by an interconnection network carrying inter-node message communication (physical view).]
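As a concrete, minimal illustration of TAU's measurement interface, the sketch below uses TAU's documented C/C++ profiling macros; it assumes a TAU installation and compilation with TAU's instrumentation support, and the routine itself is purely illustrative:

```cpp
#include <TAU.h>   // TAU measurement API (requires a TAU installation)

void compute_kernel(int n) {
    // Explicit timer around a region of interest.
    TAU_PROFILE_TIMER(timer, "compute_kernel loop", "", TAU_USER);
    TAU_PROFILE_START(timer);
    volatile double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += i * 0.5;
    TAU_PROFILE_STOP(timer);
}

int main(int argc, char** argv) {
    TAU_PROFILE_INIT(argc, argv);          // initialize measurement
    TAU_PROFILE_SET_NODE(0);               // single-process example
    TAU_PROFILE("main", "", TAU_DEFAULT);  // scope-based timer for main
    compute_kernel(1000000);
    return 0;
}
```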
❑ Follow an empirically-based approach: observation, experimentation, diagnosis, tuning
❑ Performance technology developed for each level
Parallel Performance Engineering Process
[Figure: cycle of Performance Observation, Performance Experimentation, Performance Diagnosis, and Performance Tuning, connected by characterization, hypotheses, and properties; supporting performance technology includes instrumentation, measurement, analysis, and visualization; experiment management and performance storage; data mining, models, and expert systems.]
1992-1995: DARPA pC++ (Gannon, Malony, Mohr). TAU (Tools Are Us) is born. [parallel profiling, tracing, performance extrapolation] 1995-1998: Shende Ph.D. (performance mapping, instrumentation). TAU v1.0. [multiple languages, source analysis, automatic instrumentation]
1998-2001: Significant effort in Fortran analysis and instrumentation, work with Mohr on OpenMP, Kojak tracing integration, focus on automated performance analysis. [performance diagnosis, source analysis, instrumentation] 2002-2005: Focus on profiling analysis, measurement scalability, and perturbation compensation. [analysis, scalability, perturbation analysis, applications] 2005-2007: More emphasis on tool integration, usability, and data presentation. TAU v2.0 released. [performance visualization, binary instrumentation, integration, performance diagnosis and modeling]
2008-2011: Add performance database support, data mining, and rule-based analysis. Develop measurement/analysis for heterogeneous systems. Core measurement infrastructure integration (Score-P). [database, data mining, expert system, heterogeneous measurement, infrastructure integration]
2012-present: Focus on exascale systems. Improve scalability, heterogeneous support, runtime system integration, dynamic adaptation. Apply to petascale / exascale applications. [scale, autotuning, user-level]
TAU History
Observability
Diagnosis
Complexity
Exascale
❑ Semantic entities, attributes, associations (SEAA) ❍ Entities: represent semantics at any level ❍ Attributes: encode entity semantics ❍ Associations: link entities across levels to map performance data
❑ The SEAA ability to map low-level data to high levels of abstraction reduces the semantic gap for the user
Semantic Gap in Performance Mapping
S. Shende, “The Role of Instrumentation and Mapping in Performance Measurement,” Ph.D. Thesis, University of Oregon, 2001.
❑ POOMA (Parallel OO Methods and Applications) ❍ Parallel Expression Templates (PETE) ❍ Shared memory asynchronous runtime system (SMARTS)
POOMA, PETE, SMARTS, and TAU
Performance Diagnosis (1998-2007) Performance diagnosis is a process to detect and explain performance problems
❑ APART – Automatic Performance Analysis - Real Tools ❍ Problem specification and identification
❑ Poirot – theory of performance diagnosis processes ❍ Compare and analyze performance diagnosis systems ❍ Use theory to create system that is
automated / adaptable ◆ Heuristic classification:
match to characteristics ◆ Heuristic search:
look up solutions with problem knowledge ❍ Problems: ◆ low-level feedback ◆ lack of explanation power
❑ Hercule – knowledge-based (model-based) diagnosis ❍ Capture knowledge about performance problems ❍ Capture knowledge about how to detect and explain them ❍ Knowledge comes from parallel computational models ◆ associate computational models with performance models
Performance Diagnosis Projects
❑ Goals of automation, adaptability, validation ❑ Use of models benefits performance diagnosis
❍ Helps to identify execution events of interest ❍ Provide meaningful diagnostic explanations ❍ Enable reuse of performance analysis expertise
❑ Application models studied ❍ Master-worker, wavefront, compositional
Hercule Project
EBBA
CLIPS
Advances in TAU Technologies
TAU + Scalasca Score-P
TAUdb
PerfExplorer
ParaProf
Integrated Profiling Madness FLASH
Intrepid Ranger NAMD
❑ Phase-based profiling ❑ Integrated hybrid profiling ❍ Merge sampling-based and
probe-based measurement ❍ Unified context
❑ Charm++ integration ❍ Projections (tracing) ❍ TAU (profiling)
❑ Should not just redescribe the performance results ❑ Should explain performance phenomena ❍ What are the causes for performance observed? ❍ What are the factors and how do they interrelate? ❍ Performance analytics, forensics, and decision support
❑ Need to add knowledge to do more intelligent things ❍ Automated analysis needs good informed feedback ◆ iterative tuning, performance regression testing
❍ Performance model generation requires interpretation ❑ We need better methods and tools for ❍ Integrating meta-information ❍ Knowledge-based performance problem solving
How to Explain and Understand Performance
K. Huck, “Knowledge Support for Parallel Performance Data Mining,” Ph.D. Thesis, University of Oregon, 2008.
❑ PerfExplorer ❍ Multi-experiment data mining ❍ Programmable, extensible framework
to support workflow automation ❍ Rule-based inferences for expert system analysis
❑ S3D scalability analysis (Jaguar, Intrepid, 12K cores)
Automation and Knowledge Engineering
r = 1 implies direct correlation
❑ OpenUH and PerfExplorer ❑ Some OpenUH optimizations
are guided by a cost model ❍ Loop nest optimizer ❍ Use profile analysis feedback to improve model accuracy
❑ PerfExplorer data analysis and meta-analysis ❍ Provides inference-based optimization suggestions
❑ Multiple String Alignment (MS) and OpenMP load imbalance
Rule-based Inference for Expert Analysis
Analysis workflow and inference rules Automated rule processing
❑ Trace-based analysis of message passing communication
❑ Measurement intrusion can affect event timing and ordering
❑ Effects can propagate across process boundaries
❑ Analyze communication perturbation ❍ Remove measurement overhead ❍ Model communication effects ❍ Execution replay
❑ Same methods can be used for “what if” prediction of application performance ❍ Different communication speeds ❍ Faster processing
❑ More measurement gives more fidelity in modeling the perturbation effects!
❑ Have to model the system!
Communication Perturbation Analysis
[Figure: measured versus modeled message-passing timelines, with and without ordering effects.]
❑ Support low-overhead OS performance measurement ❍ Multiple levels of function and
detail through instrumentation ❍ Support both profiling and tracing
❑ Provide kernel-wide and process-centric OS perspectives
❑ Merge user-level and kernel-level performance data ❍ Across all program-OS interactions ❍ Provide fast KTAU data access
❑ Implemented on BG with ZeptoOS and on Linux clusters
❑ Applied to systems noise analysis
KTAU: Kernel Performance Measurement
❑ Program OS Interactions ❍ Direct ◆ applications invoke the OS for
certain services ◆ syscalls and internal OS routines
called ❍ Indirect ◆ OS operations without explicit
invocation by application ◆ preemptive scheduling (other
processes) ◆ (HW) interrupt handling ◆ OS-background activity
– keeping track of time and timers, bottom-half handling, …
◆ Can occur at any OS entry
Effects of Noise on Parallel Performance
OS noise effects can accumulate
❑ General noise-effect estimation by direct measurement of application and OS
❑ Isolate OS noise in application performance data
❑ Trace information ❍ Application event trace with OS
noise counters ❍ Presence of OS noise has affected
the timing of events ❑ Timeline Approximation
❍ Reason about timing of events that may have occurred in the absence of noise
❍ Remove time-duration of noise and adjust event timestamps
❑ Must be careful to maintain constraints in event ordering
❑ Quantify global effects of noise on application ❍ Noise propagation / absorption
❑ Attribute the effects to specific noise sources
OS Noise Analysis
Sweep3D
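A minimal sketch of the timeline approximation step is given below; it assumes per-event noise durations have already been isolated (for example, from KTAU counters) and only enforces per-process monotonicity, whereas the full analysis must also preserve cross-process ordering (a receive may not precede its send) and account for noise propagation and absorption:

```cpp
#include <algorithm>
#include <vector>

// Illustrative trace event: timestamp plus the OS noise time accumulated
// since the previous event on the same process.
struct TraceEvent {
    double time;         // measured timestamp (seconds)
    double noise_since;  // OS noise charged to the preceding interval
};

// Shift each event earlier by the noise accumulated so far, keeping the
// per-process timestamps monotonic. Cross-process constraints would be
// enforced in a second pass over matched send/receive events.
void remove_noise(std::vector<TraceEvent>& events) {
    double removed = 0.0;
    double last = 0.0;
    for (TraceEvent& e : events) {
        removed += e.noise_since;
        e.time = std::max(last, e.time - removed);
        last = e.time;
    }
}

int main() {
    std::vector<TraceEvent> trace = {{1.0, 0.1}, {2.0, 0.2}, {3.0, 0.0}};
    remove_noise(trace);
    return trace.size() == 3 ? 0 : 1;
}
```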
❑ Parallel profiling is lightweight, but lacks temporal measurement
❑ Tracing can generate large trace files ❑ Capture performance dynamics with
profile snapshots
Observing Performance Dynamics
[Figure: information versus overhead tradeoff for traces, profile snapshots, and profiles; Flash 3.0 INTRFC profile snapshots across initialization, checkpointing, and finalization.]
❑ Parallel performance state is globally distributed ❍ Logically part of application’s global data space ❍ Offline: outputs data at execution end for post-mortem analysis ❍ Online: access to performance state for analysis
❑ Definition: Monitoring ❍ Online access to parallel performance (data) state ❍ May or may not involve runtime analysis
❑ Couple with monitoring infrastructures ❑ TAUoverSupermon
❍ Supermon monitor from Los Alamos National Laboratory
❑ TAUoverMRNET ❍ Multicast Reduction Network (MRNet)
infrastructure from Wisconsin ❑ TAUg
❍ MPI-based infrastructure to provide global view of TAU profile data
❑ TAUmon ❍ Transport-neutral (SuperMon, MRNet, MPI)
❑ Develop online analysis methods ❍ Aggregation, statistics, …
Parallel Performance Monitoring
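The sketch below illustrates, with plain MPI reductions rather than the actual TAUg/TAUmon interfaces, the kind of transport-neutral aggregation an online monitor can perform over a per-process profile value during the run:

```cpp
#include <mpi.h>
#include <cstdio>

// Aggregate one profile metric (e.g., time spent in a routine) across ranks:
// the kind of online statistic a monitoring transport can compute instead of
// waiting for post-mortem, per-process profile files.
void report_global_stats(double local_value, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double sum = 0.0, mx = 0.0, mn = 0.0;
    MPI_Reduce(&local_value, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Reduce(&local_value, &mx,  1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&local_value, &mn,  1, MPI_DOUBLE, MPI_MIN, 0, comm);

    if (rank == 0)
        std::printf("mean=%g max=%g min=%g\n", sum / size, mx, mn);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    report_global_stats(1.0 + rank, MPI_COMM_WORLD);  // stand-in metric
    MPI_Finalize();
}
```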
TAUMon Online Analysis
Uintah
Flash
❑ Performance tools have evolved incrementally to serve the dominant architectures and programming models ❍ Reasonably stable, static parallel execution models ❍ Allowed application-level observation focus
❑ Observation requirements for 1st-person measurement: ❍ Performance measurement can be made locally (per thread) ❍ Performance data collected at the end of the execution ❍ Post-mortem analysis and presentation of performance results ❍ Offline performance engineering
❑ Architecture factors increase performance complexity ❍ Greater core counts and hierarchical memory system ❍ Heterogeneous computing ❍ Significantly larger scale
❑ Focus on performance technology integration
Performance Complexity (2008-2012)
❑ Heterogeneous parallel systems are highly relevant today ❍ Multi-CPU, multicore shared memory nodes (latency optimized) ❍ Manycore accelerators with high-BW I/O (throughput optimized) ❍ Cluster interconnection network
❑ Performance is the main driving concern ❍ Heterogeneity is an important path to extreme scale
❑ Heterogeneous software technology to get performance ❍ More sophisticated parallel programming environments ❍ Integrated parallel performance tools ◆ support heterogeneous performance models and perspectives
Heterogeneous Systems and Performance
❑ Current status quo is somewhat comfortable ❍ Mostly homogeneous parallel systems and software ❍ Shared-memory multithreading – OpenMP ❍ Distributed-memory message passing – MPI
❑ Parallel computational models are relatively stable ❍ Corresponding performance models are tractable ❍ Parallel performance tools can keep up and evolve
❑ Heterogeneity creates richer computational potential ❍ Greater performance diversity and complexity
❑ Heterogeneous system environments ❑ Performance tools have to support richer computation
models and more versatile performance perspectives ❑ Will utilize more sophisticated programming and runtime systems
Implications for Performance Tools
❑ Integrated GPU support ❍ Enable host-GPU measurement ◆ CUDA, OpenCL, OpenACC ◆ accelerator compiler integration ◆ utilize PAPI CUDA and CUPTI
❍ Provide both heterogeneous profiling and tracing support ◆ contextualization of asynchronous
kernel invocation ❍ Static analysis of GPU kernel ❍ GPU kernel sampling
❑ Full support for Intel Xeon Phi
Integrated Heterogeneous Support in TAU
GTC
❑ How do you support a common performance analysis methodology for OpenMP across systems and compilers?
❑ Evaluate four methods for OpenMP measurement: ❍ POMP ❍ ORA w/OpenUH ❍ ORA w/GOMP ❍ OMPT w/Intel … all integrated into TAU
❑ Compare performance visibility, semantic context, measurement overhead, and integration in analysis
Cross-Platform OpenMP Performance Analysis
ORA: OpenMP Runtime API; GOMP: GCC OpenMP implementation; OMPT: OpenMP Tools interface
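For readers unfamiliar with OMPT, a minimal first-party tool written against the interface as later standardized in OpenMP 5.0 (omp-tools.h) might look like the sketch below; the draft OMPT interface evaluated in this work differed in detail, so treat this as an approximation of the mechanism rather than the exact API used here:

```cpp
// Minimal OMPT tool: the OpenMP runtime looks up ompt_start_tool at startup.
#include <omp-tools.h>
#include <cstdio>

static void on_thread_begin(ompt_thread_t thread_type, ompt_data_t* thread_data) {
    (void)thread_data;
    std::printf("OMPT: runtime thread of type %d started\n", (int)thread_type);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t* tool_data) {
    (void)initial_device_num; (void)tool_data;
    auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_thread_begin, (ompt_callback_t)&on_thread_begin);
    return 1;  // non-zero keeps the tool active
}

static void tool_finalize(ompt_data_t* tool_data) { (void)tool_data; }

extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int omp_version,
                                                     const char* runtime_version) {
    (void)omp_version; (void)runtime_version;
    static ompt_start_tool_result_t result = {&tool_initialize, &tool_finalize, {0}};
    return &result;
}
```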
Tools/Technology Integration and Synthesis
Tools / Technologies: TAU, mpiP, PBound, Active Harmony, CHiLL, Roofline, PAPI, PEBIL, PSiNtracer, GPTL, RCRToolkit, ROSE, Orio, code analysis
[Figure: SUPER tool and technology landscape spanning performance modeling, autotuning, performance optimization, resilience, energy, and reliability — a center of mass for performance engineering, building on prior research funding, with end-to-end integration into SciDAC applications.]
SUPER is a 5-year SciDAC Institute (2011-2016)
❑ Autotuning is a performance engineering process ❍ Automated experimentation and performance testing ❍ Guided optimization by (intelligent) search space exploration ❍ Model-based (domain-specific) computational semantics
❑ Autotuning components ❍ Active Harmony autotuning system (Hollingsworth, UMD) ◆ software architecture for optimization and adaptation
❍ CHiLL compiler framework (Hall, Utah) ◆ CPU and GPU code transformations for optimization
❍ Orio annotation-based autotuning (Norris, ANL) ◆ code transformation (C, Fortran, CUDA, OpenCL) with optimization
❑ Goal is to integrate TAU with existing autotuning frameworks ❍ Use TAU to gather performance data for autotuning/specialization ❍ Store performance data and metadata for each experiment variant in the performance database (TAUdb) ❍ Use machine learning and data mining to increase the level of automation of autotuning and specialization
TAU Integration with Empirical Autotuning
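The sketch below shows the shape of such an empirical autotuning loop; run_variant and record_result are hypothetical stand-ins for the generated code variants and the TAU/TAUdb recording step, and the parameter space is illustrative:

```cpp
#include <chrono>
#include <cstdio>
#include <initializer_list>

// Hypothetical stand-ins: in the real workflow CHiLL/Orio generate the
// variants and TAU/TAUdb record the time and metadata of each experiment.
double run_variant(int unroll, int tile) {
    volatile double s = 0.0;                   // stand-in kernel
    for (int i = 0; i < 1000000; i += unroll) s += i % tile;
    return s;
}
void record_result(int unroll, int tile, double seconds) {
    std::printf("variant unroll=%d tile=%d: %.4f s\n", unroll, tile, seconds);
}

// Brute-force search over a small tuning space: run each variant, record
// the measurement together with its parameters, and keep the best one.
int main() {
    double best = 1e30; int best_u = 1, best_t = 16;
    for (int unroll : {1, 2, 4, 8}) {
        for (int tile : {16, 32, 64}) {
            auto t0 = std::chrono::steady_clock::now();
            run_variant(unroll, tile);
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            record_result(unroll, tile, s);
            if (s < best) { best = s; best_u = unroll; best_t = tile; }
        }
    }
    std::printf("best: unroll=%d tile=%d (%.4f s)\n", best_u, best_t, best);
}
```

An intelligent search (e.g., Active Harmony) replaces the exhaustive loops with guided exploration of the same space.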
❑ Extended Orio with OpenCL ❍ OpenCL code generation with
annotations based on tunable OpenCL parameters
❍ TAU wraps the OpenCL runtime ❑ Tuning experiments to be conducted
on a broader set of accelerator platforms ❍ Radeon 6970, Xeon Phi,
GTX 480, Tesla C2075, Tesla K20c
❑ Autotuning experiments ❍ BLAS kernels ❍ Radiation transport code
Orio Multi-target Autotuning with TAU
[Figure: the Orio code generator produces experiment variants that link at runtime with the TAU OpenCL runtime and library wrappers; OpenCL profiling support collects metrics, and TAU writes profiles, metadata entries, transformations, and execution times that are uploaded to TAUdb.]
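For context, the kernel timing that an interposed OpenCL measurement wrapper relies on uses the standard OpenCL event profiling queries, as in the fragment below; TAU's actual wrapper differs (it defers collection to limit intrusion), and the command queue is assumed to be created with CL_QUEUE_PROFILING_ENABLE:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Wrapper around clEnqueueNDRangeKernel in the spirit of an interposed
// measurement library; the command queue must have profiling enabled.
cl_int enqueue_and_time(cl_command_queue q, cl_kernel k,
                        size_t global, size_t local) {
    cl_event ev;
    cl_int err = clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local,
                                        0, nullptr, &ev);
    if (err != CL_SUCCESS) return err;

    clWaitForEvents(1, &ev);   // a production wrapper would defer this
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);
    std::printf("kernel time: %.3f us\n", (end - start) / 1e3);
    clReleaseEvent(ev);
    return CL_SUCCESS;
}
```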
TAU and Autotuning in SUPER
[Figure: SUPER autotuning workflow — CHiLL recipes with a search driver (brute force or Active Harmony) and a ROSE-based code generation tool produce code variants, wrapper functions, and outlined functions; tau_instrumentor, driven by a selective instrumentation file specifying the parameters to capture, produces instrumented variants; execution yields parameterized performance profiles stored in TAUdb (PerfDMF); profile data and metadata feed WEKA's decision tree induction algorithm; the Orio code generator with a CUPTI callback measurement library covers CUDA and OpenCL; applied with PerfExplorer to Geant4, MPAS-O, CESM, and XGC1.]
❑ Stencil codes introduce load imbalances due to halo/ghost cells
❑ Performance measurement in isolation is insufficient to explain source of imbalance and attribute to performance factors
❑ Need to combine: ❍ Performance measurements ❍ Application metadata ❍ Correlations between
measurements and metadata ❍ Visualization of performance data in
context of application domain
MPAS Ocean Load Imbalance Tuning
❑ Capture stencil properties in metadata ❍ nCellsSolve = cells owned by the partition ❍ nCells = nCellsSolve + nCellsHalo
❑ Correlate computation balance with metadata fields
❑ Original partitioning based on “balancing” nCellsSolve on MPI processes
❑ Really need to balance with respect to nCells ❍ This depends on the partitioner, but it is not known in advance!
❑ Apply “knowledge” to the Hindsight partitioner ❍ Create initial Metis partition ❍ Assign nCells weights on the graph and iterate to optimize
Stencil Properties and Metadata Collection
[Figure: metadata fields (nCellsSolve, nCells) correlated with the advection timer across processes.]
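The correlation step above can be illustrated with a plain Pearson correlation between a per-process metadata field and a per-process timer; the numbers below are illustrative, not MPAS-O measurements:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation between a per-process metadata field (e.g., nCells)
// and a per-process timer value (e.g., advection time). r near 1 indicates
// the timer is explained by that field, guiding which quantity to balance.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const size_t n = x.size();
    double mx = 0, my = 0;
    for (size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

int main() {
    std::vector<double> nCells  = {1000, 1200, 1500, 900};   // illustrative
    std::vector<double> advTime = {2.0e7, 2.5e7, 3.1e7, 1.8e7};
    std::printf("r = %.3f\n", pearson(nCells, advTime));
}
```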
Partitioning Refinement
Original vs. refined partition (240 processors on NERSC Edison):
Total cells (explicit + halo) per block: tighter range (min 473 → 535, max 846 → 771)
Number of neighbors per block: greater (min 1 → 2, max 7 → 9)
Time spent computing: lower total (min 8.3e7 → 9.8e7, max 2.5e8 → 2.4e8)
Time spent in MPI_Wait: reduced, also in halo routines (min 2.7e7 → 9.0e6, max 1.9e8 → 1.5e8)
❑ OpenMP scheduling algorithm can affect load imbalance ❍ Threads arrive at boundary cell exchange at different times ❍ Master completes boundary cell exchange at different rates
❑ Vectorization can change communication waiting
MPAS-O Load Imbalance in MPI + OpenMP
faster execution, less wait variation
❑ Community Earth System Model (CESM) ❑ Observed performance variability on ORNL Jaguar
❍ Significant increase in execution time led to failed runs ❑ End-to-end analysis methodology
❍ Collect information from all production jobs ◆ modify jobs scripts ◆ problem/code provenance,
system topology, system workload, process node/core mapping, job progress, time spent in queue, pre/post, total
❍ Load in TAUdb, quantify nature of variability and impact
Performance Variability in CESM
❑ 4096 processor cores ❍ 6 hour request ❍ Target: 150 simulation days
❑ 35 jobs ❍ May 15 – Jun 29, 2012 ❍ Two of which failed
Example Experiment Case Analysis
CESM Performance Time Series (Run Loop)
[Figure: runtime for the previous 5 simulation days (seconds) versus simulation day, components stacked; Cray XK6 (one sixteen-core processor per node), T341f02.F1850r, 4096 processor cores (512 processes, 8 OpenMP threads); curves for two failed runs (140 and 145 simulation days), the slowest and second-slowest successful runs (150 simulation days), and the maximum and minimum execution times; high intra-run variability.]
❑ Node placement can cause poor resource matching ❍ Fastest ◆ 0 unmatched
❍ Slowest ◆ 272 unmatched
❍ Can cause more risk of contention with other jobs executing
❑ MPI rank placement also matters ❍ Greater distances
between processes can result in larger overheads for communication
Node Placement and MPI Placement Matter
4:14:08 (hh:mm:ss) 5:35:50 (hh:mm:ss)
Evolution ❑ Increased performance complexity and scale forces the
engineering process to be more intelligent and automated ❍ Automate performance data analysis / mining / learning ❍ Automated performance problem identification
❑ Even with intelligent and application autotuning, the decisions of what to analyze are difficult ❍ Performance engineering tools and practice must incorporate
a performance knowledge discovery process ❑ Extreme scale performance is an optimized orchestration
❍ Application, processor, memory, network, I/O ❑ Application-level only performance view is myopic ❑ Reductionist approaches will be unsuccessful ❑ Need for whole performance evaluation
Exascale Era (2012 – ???) ❑ Challenges of performance growth and power will cause
exascale systems to depart from conventional MPP designs ❍ Greater core counts and hardware thread concurrency ❍ Heterogeneous hardware and deeper memory hierarchy ❍ Hardware-assisted global addressing space support ❍ Power limits and reliability concerns built in
❑ Emerging exascale programming models emphasize message-driven computation and finer-grained parallelism semantics ❍ More asynchronous and lower-level thread management ❍ More exposure of concurrency through task-level parallelism ❍ Global address space models versus conventional message passing ❍ Heterogeneity in parallel execution and locality optimization
❑ Productivity and performance are coupled at exascale ❍ Applications are more complex and must be mapped to systems ❍ Growing crisis for performant and maintainable scientific software
Exascale End-to-End Productivity
Courtesy of Thomas Ndousse-Fetter, DOE
• Dynamic performance adaptation
Application Software Productivity Execution-time Productivity
Increased complexity in exascale applications
Uniformity Assumptions are No Longer Valid ❑ Exascale design directions raise issues of uniformity
❍ Components and behaviors are not the same or regular ❍ Heterogeneous compute engines behave differently ❍ Fine-grained power management affects homogeneity ❍ Process technology results in non-uniform execution behavior ❍ Fault resilience introduces inhomogeneity
❑ Bulk synchronous model is increasingly impractical ❍ Removing sources of performance variation (jitter) is unrealistic ❍ Huge costs in power/complexity/performance to extend the life
❑ Embrace performance heterogeneity!!! ❍ Variation, variation, variation ❍ Can not assume a stable “state” of the system a priori ❍ Post-mortem performance analysis fails for lack of repeatability
A New Performance “Observability” ❑ Key exascale parallel “performance” abstraction
❍ Inherent state of exascale execution is dynamic ❍ Embodies non-stationarity of “performance” ❍ Constantly shaped by the adaptation of resources to meet
computational needs and optimize objectives ❑ Fundamentally different performance “observability”
❍ “1st person” + “3rd person” performance introspection ❍ Designed to support introspective adaptation ❍ In-situ analysis of performance state, objectives, and progress ❍ Aware of multiple performance and productivity objectives ❍ Policy-driven dynamic feedback and adaptation ❍ Reflects computation to execution model mapping ❍ Integration in exascale productivity environment
Introspection, In Situ Analysis, and Feedback
Courtesy Martin Schulz, LLNL – Piper Project
Argo DOE ExaOSR Project ❑ Exascale OS and runtime research project
❍ Labs: ANL, LLNL, PNL ❍ Universities: BU, UC, UIUC, UO, UTK
❑ Philosophy ❍ Whole-system view ◆ dynamic user environment (functionality, dynamism, flexibility) ◆ first-class managed resources (performance, power, …) ◆ hierarchical response to faults
❍ Massive concurrency support ❑ Key ideas
❍ Hierarchical (control, communication, goals, data resolution) ❍ Embedded (performance, power, …) feedback and response ❍ Global system support
Argo Global Information Backplane ❑ BEACON: event/action/control notification ❑ Exposé: performance introspection and adaptation system
Argo Global Optimization View
u – generalized control signals v – settings for next level m – power/resilience models & mechanisms
g – power, performance goals f – feedback from BEACON and EXPOSÉ e – error between goals and feedback
Courtesy Hank Hoffman, University of Chicago
XPRESS ❑ DOE X-Stack project to develop the OpenX software
stack, implementing the ParalleX execution model ❍ Labs: SNL, LBL, ORNL ❍ Universities: LSU, IU, UH, UNC/Renci, UO
❑ ParalleX is an experimental execution model ❍ Theoretical foundation of task-based parallelism ❍ Runs in the context of distributed computing ❍ Supports integration of the entire system stack
❑ HPX is the runtime system that implements ParalleX ❍ HPX-3 (LSU, C++ language/Boost) ❍ HPX-5 (IU, C language)
❑ OpenX is an exascale software stack based on ParalleX
HPX ❑ Governing principles (components)
❍ Active global address space (AGAS) instead of PGAS ◆ Distributed load balancing and adaptive data placement
❍ Message-driven via Parcels instead of message passing
❍ Lightweight control objects (LCO) instead of global barriers
❍ Moving work to data instead of moving data to work ◆ Asynchronous parallelism and dataflow style computation
❍ Fine-grained parallelism of lightweight Threads
[Figure: HPX runtime architecture — process manager, local memory management, AGAS translation, parcel port and parcel handler over the interconnect, action manager, thread manager and thread pool, local control objects, performance counters, and performance monitor.]
OpenX Software Stack and APEX
[Figure: OpenX software stack — legacy applications (MPI, OpenMP) and new-model applications, domain-specific languages and active libraries, a metaprogramming framework, compiler, and XPI sit above HPX runtime system instances (AGAS name space and processor, LCOs for dataflow, futures, and synchronization, lightweight threads with a context manager, parcels for message-driven computation); LXK operating system instances (task recognition, address space control, memory bank control, OS threads, network drivers) form a distributed framework operating system over a hardware architecture of ~10^6 nodes x 10^3 cores per node plus an integration network; instrumentation spans the stack through a prime medium interface/control layer.]
APEX Design
❑ APEX real time introspection ❍ Node utilization, energy consumption, health information ❍ HPX threads, queues, concurrency, remote operations,
parcels, memory management
❍ OS (LXK) track system resource assignment, utilization, job contention, overhead
❑ APEX policy control based on performance state
1D Stencil (Heat Diffusion) ❑ Partitioned data array ❑ Optimum # of threads ❑ Region of maximum performance correlates with the thread queue length runtime performance counter ❍ Represents # of tasks currently waiting to execute ❑ What would happen if introspection on this counter were used to control a concurrency throttling policy?
[Figure: 1d_stencil (HPX-3) runtime in seconds versus number of worker threads (1-24), and the same runtime curve overlaid with the thread queue length counter.]
1d_stencil Baseline
❑ 48 worker threads (with hyperthreading)
❑ Actual concurrency much lower ❍ Many worker threads
are spinning waiting for work to become available
❑ Large variation in concurrency over time ❍ Tasks waiting on
prior tasks to complete
[Figure: concurrency and power over time for the 1d_stencil baseline, with thread cap and power overlaid; tasks include partition_data, do_work_action, hpx::lcos::local::dataflow::execute, primary_namespace_bulk_service_action, primary_namespace_service_action, and other; 100,000,000 elements, 1000 partitions; 138 secs; event-generated metrics show where the calculation takes place.]
1d_stencil Optimal # of Threads
❑ 12 worker threads ❑ Greater proportion of
threads working ❍ Less interference
between active threads and threads waiting for work
❍ Edison has 24 physical cores
❍ Idle hardware threads ❍ >24 are hyperthreads
which may interfere with active threads
❑ Much faster ❍ 61 seconds vs.
138 seconds
[Figure: concurrency and power over time with 12 worker threads, same task legend; 100,000,000 elements, 1000 partitions; 61 secs.]
1d_stencil Optimized with APEX
❑ Initially 48 worker threads
❑ Discrete hill-climbing search to minimize the average # of pending tasks
❑ Converges on 13 (vs. optimal of 12)
❑ Nearly as fast as optimal ❍ 64 seconds vs.
61 seconds
[Figure: concurrency and power over time with APEX concurrency throttling, same task legend, thread cap overlaid; 100,000,000 elements, 1000 partitions; 64 secs.]
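A sketch of the discrete hill-climbing step that such a throttling policy performs is shown below; the state layout, bounds, and sample values are illustrative, not the APEX implementation:

```cpp
#include <initializer_list>

// One step of a discrete hill-climbing policy over the thread cap, driven by
// an observed counter such as the average number of pending tasks.
struct ThrottleState {
    int    cap;             // current thread cap
    int    direction;       // +1 or -1: current search direction
    double last_objective;  // previous observation of the counter
};

int next_thread_cap(ThrottleState& s, double objective, int min_cap, int max_cap) {
    if (objective > s.last_objective)
        s.direction = -s.direction;   // got worse: reverse the search
    s.last_objective = objective;
    s.cap += s.direction;             // take one discrete step
    if (s.cap < min_cap) s.cap = min_cap;
    if (s.cap > max_cap) s.cap = max_cap;
    return s.cap;  // the runtime idles or activates worker threads to match
}

int main() {
    ThrottleState s{48, -1, 1e9};     // start at 48 threads, searching down
    for (double pending : {120.0, 90.0, 70.0, 80.0})
        next_thread_cap(s, pending, 1, 48);
    return 0;
}
```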
Throttling for Power ❑ Livermore Unstructured Lagrangian Explicit Shock
Hydrodynamics (LULESH) ❍ Proxy application in DOE co-design efforts for exascale ❍ CPU bound in most implementations (use HPX-5)
❑ Develop an APEX policy for power ❍ Threads are idled in HPX to keep node under power cap ❍ Use hysteresis of last 3 observations ❍ If Power < low cap increase thread cap ❍ If Power > high cap decrease thread cap
❑ HPX thread scheduler modified to idle/activate threads per cap ❑ Test example:
❍ 343 domains, nx = 48, 100 iterations ❍ 16 nodes of Edison, Cray XC30 ❍ Baseline vs. Throttled (200W per node high power cap)
LULESH image, Technical Report, LLNL-‐TR-‐490254
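The power rule above can be sketched as a small control loop; the class, window length, and thresholds below are illustrative, assuming the runtime exposes a per-period node power reading and honors the returned thread cap:

```cpp
#include <deque>
#include <initializer_list>
#include <numeric>

// Hysteresis-based power throttling: keep a short history of node power
// samples and nudge the thread cap when the smoothed reading crosses the
// low or high power cap.
class PowerPolicy {
public:
    PowerPolicy(double low_w, double high_w, int max_threads)
        : low_(low_w), high_(high_w), cap_(max_threads), max_(max_threads) {}

    // Called once per observation period with the latest node power (watts).
    int observe(double watts) {
        history_.push_back(watts);
        if (history_.size() > 3) history_.pop_front();   // last 3 observations
        double avg = std::accumulate(history_.begin(), history_.end(), 0.0)
                     / history_.size();
        if (avg < low_  && cap_ < max_) ++cap_;   // headroom: activate a thread
        if (avg > high_ && cap_ > 1)    --cap_;   // over cap: idle a thread
        return cap_;   // the scheduler idles/activates workers to match
    }

private:
    std::deque<double> history_;
    double low_, high_;
    int cap_, max_;
};

int main() {
    PowerPolicy policy(180.0, 200.0, 768);    // caps in watts per node
    int cap = 768;
    for (double w : {210.0, 207.0, 204.0, 201.0}) cap = policy.observe(w);
    return cap == 768 ? 1 : 0;
}
```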
LULESH Baseline
No power cap; no thread cap; 768 threads; avg. 247 Watts/node; ~360 kilojoules total; ~91 seconds total; ~74 seconds in HPX tasks
[Figure: concurrency and power over time (1 sample per second of per-node task performance), with thread cap and power overlaid; tasks include advanceDomain, SBN1result, SBN3result/sends, PosVelresult/sends, MonoQresult/sends, probe, lco get/set, and other; 91 secs; launched with aprun.]
LULESH Throttled by Power Cap
200W power cap; 768 threads throttled to 220; avg. 186 Watts/node; ~280 kilojoules total; ~94 seconds total; ~77 seconds in HPX tasks
[Figure: concurrency and power over time with thread cap and power overlaid; same task legend as the baseline; 94 secs; launched with aprun.]
LULESH Performance Explanation ❑ 768 vs. 220 threads (after throttling) ❑ 360 kJ vs. 280 kJ total energy consumed (~22% less!) ❑ 75.24 vs. 77.96 seconds in HPX (~3.6% increase) ❑ Big reduction in yielded action stalls (in thread
scheduler) – Less contention for network access ❑ Hypothesis: LULESH implementation is network bound
because spatial locality of subdomains is not maintained ❍ Problem is not with HPX
Metric             Baseline       Throttled      % Difference
Cycles             1.11341E+13    3.88187E+12    34.865%
Instructions       7.33378E+12    5.37177E+12    73.247%
L2 Cache Misses    8422448397     3894908172     46.244%
IPC                0.658677       1.38381        210.089%
INS/L2CM           870.742        1379.18        158.391%
Global Thread Throttling for Load Balance ❑ Livermore Unstructured Lagrangian Explicit Shock
Hydrodynamics (LULESH) ❍ Proxy application in DOE co-design efforts for exascale ❍ CPU bound in most implementations (use HPX-5)
❑ Develop a global APEX policy for load balance ❍ Performance data is periodically reduced and analyzed ❍ Thread control policy ◆ If node has the max work, increase thread cap by 1 ◆ If node has significantly less work, reduce thread cap by 1 ◆ Otherwise, do not change thread cap
❍ New thread cap limits are distributed to remote localities ❑ Test example:
❍ 512 domains, nx = 64, 100 iterations ❍ 16 nodes of Edison, Cray XC30, 384 total cores ❍ Baseline vs. Throttled
LULESH image, Technical Report, LLNL-‐TR-‐490254
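A sketch of the reduction-and-adjust step for such a global policy is given below, using an MPI allreduce as a stand-in for the actual APEX data exchange; the "significantly less work" threshold and the work measure are illustrative:

```cpp
#include <mpi.h>
#include <cstdio>

// Periodically reduce a per-node work measure, then raise the thread cap on
// the most loaded node and lower it on nodes with significantly less work.
int adjust_thread_cap(double my_work, int my_cap, int max_cap, MPI_Comm comm) {
    double max_work = 0.0;
    MPI_Allreduce(&my_work, &max_work, 1, MPI_DOUBLE, MPI_MAX, comm);

    if (my_work >= max_work && my_cap < max_cap)
        ++my_cap;                            // this node is the bottleneck
    else if (my_work < 0.9 * max_work && my_cap > 1)
        --my_cap;                            // significantly under-loaded
    return my_cap;   // new caps take effect at each locality's scheduler
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double my_work = 100.0 + 10.0 * rank;    // stand-in work measure
    int cap = adjust_thread_cap(my_work, 24, 24, MPI_COMM_WORLD);
    std::printf("rank %d: new thread cap %d\n", rank, cap);
    MPI_Finalize();
}
```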
LULESH Baseline
No power cap; no thread cap; 384 total threads; avg. 242 Watts/node; peak power 4576 W; 602.32 kilojoules total; 155.353 seconds total
[Figure: concurrency and power over time with thread cap and power overlaid; tasks include parfor async handler, call continuation handler, probe handler, advanceDomain, SBN3result/sends, PosVelresult/sends, MonoQresult/sends, attach handler, lco delete action handler, and other; launched with aprun.]
LULESH Throttled by Global Policy
De-activate threads on underutilized nodes; 384 threads throttled to 375; avg. 226 Watts/node; peak power 4344 W; 565 kilojoules total; 155.996 seconds total
[Figure: concurrency and power over time with thread cap and power overlaid; same task legend as the baseline.]
LULESH Performance Explanation ❑ 384 vs. 375 (379 final) threads ❑ 602 kJ vs. 565 kJ total energy consumed
❍ ~6.17% decrease ❑ 155.353 vs. 155.996 seconds in HPX
❍ 0.4% increase ❑ Static decomposition issues:
❍ Full domain is broken into subdomains ❍ More subdomains (512) than cores (384) ❍ Not perfectly distributed, minor load imbalance
❑ Demonstrates effectiveness of global policies
Scalable Observation System (SOS) ❑ HPC platform service
❍ Dedicated resources ❍ Single instance, system wide ❍ Serves applications, RTS, OS ❍ Shares metrics and analysis ❍ Framework-like interaction
❑ SOS monitors coordinate node to cloud interactions ❍ Any process on node can expose data via SOS
❑ SOS cloud self organizes to efficiently process data ❑ Control can go directly to interested processes ❑ Develop SOS MPMD MPI prototype
❍ Running on BG/Q, Cray, and Linux platforms ❍ Use for testing and experimentation
Global Power Scheduling under Bound ❑ Hardware overprovisioning can result in the inability to
run all hardware under full power ❑ Global power bounds must be enforced
❍ Requires global introspection ❍ In situ power analysis and dynamic power reallocation
❑ See HPDC “POW” short-paper and poster (D. Ellsworth) ❍ Minimal impact on running system ❍ Potential for significant
power saving versus naïve power allocation
❍ Potential performance improvement with power redistribution in constrained scenarios
“Beware the Jabberwock!” ❑ Willing suspension of disbelief ? ❑ Suppose that full knowledge of the state of a high-
performance parallel / distributed system were available ❍ There is no cost to produce it, access it, or analyze it ❍ There is no delay in providing it to wherever it might be used ❍ It can be observed by anything at any time
❑ How can systems be built to make this possible? ❑ Devote “resources” to support it
❍ Not just using the hardware purposed for computing ❍ Likely will require more special purpose components
❑ Want it to be configurable and programmable throughout the application and system software stack
❑ New (state-based?) parallel programming methodology
Performance as Collective Behavior ❑ New performance observability paradigm can be thought
of as providing collective awareness for exascale ❑ It allows the exascale system
to be performance-aware and performance-reactive, able to observe performance state holistically and to couple it with application knowledge for self-adaptive, runtime global control
❑ Exascale systems might then be considered to be functioning as a self-organizing performance collective whose behavior is a natural consequence of seeking highly efficient operational forms optimizing exascale objectives
O1) A. Malony, D. Reed, “A Hardware-based Performance Monitor for the Intel iPSC/2 Multiprocessor,” ICS, pp. 213-226, 1990.
O2) A. Malony, J. Larson, D. Reed, “Tracing Application Program Execution on the Cray X-MP and Cray 2,” SC, pp. 60-73, 1990.
O3) A. Malony, “Performance Observability,” Ph.D. Thesis, University of Illinois, Urbana-Champaign, 1991. O4) A. Malony, “Event-based Performance Perturbation Analysis: A Case Study,” PPoPP, pp. 201-212, 1991. O5) A. Malony, D. Reed, “Models for Performance Perturbation Analysis,” ACM/ONR Workshop on Parallel
and Distributed Debugging, pp. 1-12, 1991. O6) A. Malony, D. Reed, H. Wijshoff, “Performance Measurement Intrusion and Perturbation Analysis,” IEEE
Transactions on Parallel and Distributed Systems, 3(4), pp. 433-450, 1992. O7) S. Sarukkai, A. Malony, “Perturbation Analysis of High Level Instrumentation for SPMD Programs,”
PPoPP, pp. 44-53, 1993. O8) A. Malony, B. Mohr, P. Beckman, D. Gannon, S. Yang, F. Bodin, “Performance Analysis of pC++: A
Portable Data-Parallel Programming System for Scalable Parallel Computers,” IPPS, pp. 75-84, 1994. O9) B. Mohr, D. Brown, A. Malony, “TAU: A Portable Parallel Program Analysis Environment of pC++,”
CONPAR, pp. 29-40, 1994. O10) M. Heath, A. Malony, D. Rover, “The Visual Display of Parallel Performance Data,” IEEE Computer,
11:21-28, 1995. O11) M. Heath, A. Malony, D. Rover, “Parallel Performance Visualization: From Practice to Theory,” IEEE
PDT, 3(4):44-60, 1995. O12) K. Shanmugam, A. Malony, “Performance Extrapolation of Parallel Programs,” ICPP, pp. II:117-120,
1995. O13) B. Mohr, A. Malony, K. Shanmugam, “Speedy: An Integrated Performance Extrapolation Tool for pC++
Programs,” TOOLS Workshop, pp. 254-268, 1995.
Observation Papers
D1) A. Malony, R. Helm, “A Theory and Architecture for Automating Performance Diagnosis,” FGCS, 18(1):189-200, 2001.
D2) J. de St. Germain, A. Morris, S. Parker, A. Malony, S. Shende, “Performance Analysis Integration in the Uintah Software Development Cycle,” IJPP, 31(1):35-53, 2003.
D3) A. Malony, S. Shende, “Overhead Compensation in Performance Profiling,” EuroPar, pp. 119-132, 2004. D4) A. Malony, S. Shende, “Models for On-the-Fly Compensation of Measurement Overhead in Parallel
Performance Profiling,” EuroPar, pp. 72-82, 2005. D5) A. Malony, S. Shende, N. Trebon, J. Ray, R. Armstrong, C. Rasmussen, M. Sottile, “Performance
Technology for Parallel and Distributed Component Software,” CCPE, 17(2-4):117-141, 2005. D6) K. Huck, A. Malony, A. Morris, “TAUg: Runtime Global Performance Data Access using MPI,”
EuroPVM-MPI, pp. 313-321, 2006. D7) A. Nataraj, A. Malony, S. Shende, A. Morris, “Kernel-level Measurement for Integrated Parallel
Performance Views, the KTAU Project,” Cluster, pp. 1-12, 2006. D8) L. Li, A. Malony, “Model-based Performance Diagnosis of Master-Worker Parallel Computations,” pp.
35-46, EuroPar, 2006. D9) L. Li, A. Malony, “Model-based Relative Performance Diagnosis of Wavefront Parallel Computations,”
HPCC, pp. 200-209, 2006. D10) S. Shende, A. Malony, “The TAU Parallel Performance System,” IJHPCA, 20(2):287-311, 2006. D11) A. Nataraj, M. Sottile, A. Morris, A. Malony, S. Shende, “TAU over Supermon: Low-overhead Parallel
Application Performance Monitoring,” EuroPar, pp. 85-96, 2007. D12) A. Nataraj, A. Morris, A. Malony, M. Sottile, P. Beckman, “The Ghost in the Machine: Observing the
Effects of Kernel Operation on Parallel Application Performance,” SC, pp. 1-12, 2007. D13) L. Li, A. Malony, “Knowledge Engineering for Automatic Parallel Performance Diagnosis,” CCPE,
19(11):1497-1515, 2007. D14) A. Malony, S. Shende, A. Morris, F. Wolf, “Compensation of Measurement Overhead in Parallel
Performance Profiling,” IJHPCA, 21(2):174-194, 2007.
Diagnosis Papers
D15) K. Huck, A. Malony, S. Shende, A. Morris, “Knowledge Support and Automation for Performance Analysis with PerfExplorer 2.0,” JSP, 16(2-3):123-134, 2008.
D16) A. Morris, W. Spear, A. Malony, S. Shende, “Observing Performance Dynamics using Parallel Profile Snapshots,” EuroPar, pp. 162-171, 2008.
D17) K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris, “Capturing Performance Knowledge for Automated Analysis,” SC, pp. 1-10, 2008.
D18) A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller, “In Search of Sweet-spots in Parallel Performance Monitoring,” Cluster, pp. 69-78, 2008.
D19) A. Nataraj, A. Malony, S. Shende, A. Morris, “Integrated Parallel Performance Views,” IEEE Cluster, 11(1):57-73, 2008.
D20) A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller, “A Framework for Scalable Parallel Performance Monitoring,” CCPE, 22(6):720-735, 2009.
D21) S. Biersdorff, C.W. Lee, A. Malony, L. Kale, “Integrated Performance Views in Charm++: Projections Meets TAU,” ICPP, pp. 140-147, 2009.
D22) A. Morris, A. Malony, S. Shende, K. Huck, “Design and Implementation of a Hybrid Parallel Performance Measurement System,” ICPP, pp. 492-501, 2010.
D23) C.W. Lee, A. Malony, A. Morris, “TAUmon: Scalable Online Performance Data Analysis in TAU,” PROPER Workshop, pp. 493-499, 2010.
D24) W. Spear, A. Malony, C.W. Lee, S. Biersdorff, S. Shende, “An Approach to Creating Performance Visualizations in a Parallel Profile Analysis Tool,” PROPER Workshop, pp. 156-165, 2011.
D25) A. Knüpfer, et al., “Score-P – A Joint Performance Measurement Runtime Infrastructure for Periscope, TAU, and Vampir,” TOOLS Workshop, pp. 79-91, 2011.
Diagnosis Papers
C1) G. Fursin, et al., “Collective Mind: Towards Practical and Collaborative Autotuning,” Scientific Programming, 22(4):309-329, 2014.
C2) N. Chaimov, S. Biersdorff, A. Malony, “Tools for Machine-Learning-based Empirical Autotuning and Specialization,” IJHPCS, 27(4):403-411, 2013.
C3) K. Huck, A. Malony, S. Shende, D. Jacobsen, “Integrated Measurement for Cross-Platform OpenMP Performance Analysis,” IWOMP, pp. 146-160, 2014.
C4) N. Chaimov, B. Norris, A. Malony, “Toward Multi-Target Autotuning for Accelerators,” ICPADS, pp. 534-541, 2014.
C5) N. Chaimov, B. Norris, A. Malony, “Integration and Synthesis for Automated Performance Tuning: the SYNAPT Project,” IWAPT, 2014.
C6) A. Malony, et al., “Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,” ICPP, pp. 176-185, 2011.
C7) A. Sarje, S. Song, D. Jacobsen, K. Huck, J. Hollingsworth, A. Malony, S. Williams and L. Oliker, “Parallel Performance Optimizations on Unstructured Mesh-Based Simulations,” ICCS MSESM Workshop, 2015.
Complexity Era Papers