Through the Looking-Glass: From Performance Observation
to Dynamic Adaptation
Allen D. Malony
ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’15)
In dedication to: the Parallel Performance Tools community
Talk Outline ❑ Sequel ❑ Performance Retrospective
❍ Observability Era ❍ Diagnosis Era ❍ Complexity Era
❑ Performance Evolution or Revolution ❑ Exascale Era and Productivity
❍ Invalid Uniformity Assumptions ❍ A New Performance Observability ❍ Dynamic Adaptability ❍ Projects
❑ Future
“Sometimes I've believed as many as six impossible things before breakfast!”
❑ Willing suspension of disbelief ❑ Suppose that full knowledge of the state of a high-performance parallel / distributed system were available ❍ There is no cost to produce it, access it, or analyze it ❍ There is no delay in providing it to wherever it might be used ❍ It can be observed by anything at any time
❑ What is the state? What state is valuable? ❑ Would it fundamentally change how future HPC systems are programmed? How? ❑ Could the combination of holistic observation (self-awareness), dynamic adaptivity, and collective control be the basis for a different execution paradigm?
“I'm not strange, weird, off, nor crazy, my reality is just different from yours.”
❑ Our reality is shaped by our experience ❑ Started work in parallel computing in 1986 (29 years ago)
❍ Research engineer at HP Labs (1981 – 1986) ❍ Ph.D. graduate student at UIUC (1986 – 1991) ◆ Dan Reed, Ph.D. advisor
❍ Senior software engineer at Center for Supercomputing Research and Development (CSRD) (1987 – 1991) ◆ built parallel performance tools for Cedar supercomputer ◆ worked with gods: Kuck, Lawrie, E. Davidson, A. Sameh, …
❍ Professor, University of Oregon (1992 – present) ❍ CEO and Director, ParaTools, Inc. (2004 – present)
Parallel Performance, Engineering, and Tools ❑ Generally, I am interested in the nature of parallel performance ❑ Performance has been the fundamental driving concern for parallel
computing and high-performance computing ❑ Achieving performance in real systems is an engineering process ❍ Observation: measure and characterize performance behavior ❍ Diagnosis: identify and understand performance problems ❍ Tuning: modify to run optimally on high-end machines
❑ How to make the process more effective and productive? ❍ What is the nature of the performance problem solving? ❍ What is the performance technology to be applied?
❑ Compelling reasons to build and integrate performance tools ❍ To obtain, learn, and carry forward performance knowledge ❍ To address increasing scalability and complexity issues ❍ To bridge between computation model and execution model operation
❑ Parallel systems evolution will change process, technology, use
Performance Technology Eras ❑ Performance technology has evolved to serve the
dominant architectures and programming models ❍ Observability era (1991 – 1998) ◆ instrumentation, measurement, analysis
❍ Diagnosis era (1998 – 2007) ◆ identifying performance inefficiencies
❍ Complexity era (2008 – 2012) ◆ scale, memory hierarchy, network, multicore, GPU
❑ 20+ years of stable parallel execution models ❍ “1st person” application focus for tools support ❍ Productivity focused on optimizing performance
❑ Exascale era (2013 – future)
❑ Performance evaluation problems define the requirements for performance measurement and analysis methods
❑ Performance observability is the ability to “accurately” capture, analyze, present, and understand (collectively observe) information about parallel software and systems
❑ Tools for performance observability must balance the need for performance data against the cost of obtaining it (environment complexity, performance intrusion) ❍ Too little performance data makes analysis difficult ❍ Too much data perturbs the measured system
❑ Important to understand performance observability complexity and develop technology to address it
Performance Observability (1991-1998)
A. Malony, “Performance Observability,” Ph.D. Thesis, University of Illinois, Urbana-Champaign, 1991.
❑ All performance measurements generate overhead ❑ Overhead is the cost of performance measurement
❍ Overhead causes (generates) performance intrusion ❑ Intrusion is the dynamic performance effect of overhead
❍ Intrusion causes (potentially) performance perturbation ❑ Perturbation is the change in performance (program) behavior
❍ Measurement does not change “possible” executions !!! ❍ Perturbation can lead to erroneous performance data (Why?) ❍ Alteration of “probable” performance behavior (How?)
❑ Instrumentation uncertainty principle ❍ Any instrumentation perturbs the system state ❍ Execution phenomena and instrumentation are coupled ❍ Volume and accuracy are antithetical
❑ Perturbation analysis is possible with deterministic execution ❑ Otherwise, measurement must be considered part of the system
Instrumentation Uncertainty Principle
❑ Use time-based perturbation analysis to remove overhead ❑ There is more to the problem (Why?)
Perturbation Analysis – Livermore Loops
Full: full instrumentation; Raw: uninstrumented; Model: time-based perturbation analysis
[Figure: full, raw, and modeled execution times for serial and parallel execution of Livermore loops, including inner product, banded linear equations, and implicit/conditional computation.]
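To make the time-based approach concrete, a minimal sketch is shown below; the event structure and probe-cost constant are illustrative, not any particular tool's trace format:

```cpp
#include <cstdio>
#include <vector>

// Illustrative event record: a timestamp and the number of instrumentation
// probes executed before this event on the same thread.
struct Event {
    double measured_time_us;   // timestamp taken from the instrumented run
    long   probes_so_far;      // cumulative count of earlier probes
};

// Time-based perturbation analysis (sequential case): approximate the
// uninstrumented timestamp by subtracting the accumulated probe overhead.
std::vector<double> approximate_times(const std::vector<Event>& events,
                                      double probe_cost_us) {
    std::vector<double> approx;
    approx.reserve(events.size());
    for (const Event& e : events)
        approx.push_back(e.measured_time_us - e.probes_so_far * probe_cost_us);
    return approx;
}

int main() {
    // Two events measured at 10 us and 110 us, after 1 and 3 probes.
    std::vector<Event> trace = {{10.0, 1}, {110.0, 3}};
    for (double t : approximate_times(trace, 2.5))
        std::printf("approximated time: %.2f us\n", t);
}
```

As the next slide argues, this simple subtraction is only valid for sequential (or deterministic) execution; concurrent waiting and contention require event-based modeling.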
Event-based Perturbation Analysis ❑ Time-based perturbation
analysis does not take into account concurrent behavior (dependencies) ❍ Loops 3, 4, 17 are
DOACROSS loops ❍ Measurement affects
probability of contention!
❑ Must use event-based perturbation analysis to “model” contention ❍ Requires more events!
Livermore Loop    Measured/Raw    Approximated/Raw
3                 2.48            0.96
4                 2.64            1.06
17                9.97            0.97
instrumentation in the critical section results in a higher probability of contention and thus MORE waiting
relatively greater instrumentation outside the critical section results in lower probability of contention and thus LESS waiting
❑ LLL 3, 4, and 17 are DOACROSS loops that were statically scheduled
❑ Consider a dynamically scheduled DOALL loop
❑ What is the problem? ❍ Non-deterministic execution is
affected by observation ❑ Event-based perturbation analysis
must include knowledge of the scheduling system
❑ In the worst case, you might have to model (simulate) the whole runtime software and possibly the machine!
“Curiouser and Curiouser”
[Figure legend: iteration begin instrumentation, iteration end instrumentation, waiting]
This is not possible!
❑ What is the “true” parallel computation performance? ❑ Some technique must be used to observe performance ❑ Any performance measurement will be intrusive ❑ Any performance intrusion might result in perturbation ❑ Performance analysis is based on performance measures
❍ How is “accuracy” of performance analysis evaluated? ❍ How is this done when “true” performance is unknown?
❑ Uncertainty applies to all experimental methods ❍ “Truth” lies just beyond the reach of observation
❑ Develop performance observation systems that can deliver robust performance data efficiently with low overhead
❑ Reason about performance measurement effects
Performance Uncertainty
❑ Performance problem solving framework for HPC ❍ Integrated, scalable, flexible, portable ❍ Target all parallel programming / execution paradigms
❑ Integrated performance toolkit (open source) ❍ Multi-level performance instrumentation ❍ Flexible and configurable performance measurement ❍ Widely-ported performance profiling / tracing system ❍ Performance data management and data mining
TAU Performance System®
[Figure: TAU's parallel system model — nodes containing contexts and threads within an SMP sharing a VM space and node memory (model view), connected by an interconnection network carrying inter-node message communication (physical view).]
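As a concrete, minimal illustration of TAU's measurement interface, the sketch below uses TAU's documented C/C++ profiling macros; it assumes a TAU installation and compilation with TAU's instrumentation support, and the routine itself is purely illustrative:

```cpp
#include <TAU.h>   // TAU measurement API (requires a TAU installation)

void compute_kernel(int n) {
    // Explicit timer around a region of interest.
    TAU_PROFILE_TIMER(timer, "compute_kernel loop", "", TAU_USER);
    TAU_PROFILE_START(timer);
    volatile double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += i * 0.5;
    TAU_PROFILE_STOP(timer);
}

int main(int argc, char** argv) {
    TAU_PROFILE_INIT(argc, argv);          // initialize measurement
    TAU_PROFILE_SET_NODE(0);               // single-process example
    TAU_PROFILE("main", "", TAU_DEFAULT);  // scope-based timer for main
    compute_kernel(1000000);
    return 0;
}
```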
❑ Follow an empirically-based approach: observation, experimentation, diagnosis, tuning
❑ Performance technology developed for each level
Parallel Performance Engineering Process
[Figure: cycle of Performance Observation, Performance Experimentation, Performance Diagnosis, and Performance Tuning, connected by characterization, hypotheses, and properties; supporting performance technology includes instrumentation, measurement, analysis, and visualization; experiment management and performance storage; data mining, models, and expert systems.]
1992-1995: DARPA pC++ (Gannon, Malony, Mohr). TAU (Tools Are Us) is born. [parallel profiling, tracing, performance extrapolation] 1995-1998: Shende Ph.D. (performance mapping, instrumentation). TAU v1.0. [multiple languages, source analysis, automatic instrumentation]
1998-2001: Significant effort in Fortran analysis and instrumentation, work with Mohr on OpenMP, Kojak tracing integration, focus on automated performance analysis. [performance diagnosis, source analysis, instrumentation] 2002-2005: Focus on profiling analysis, measurement scalability, and perturbation compensation. [analysis, scalability, perturbation analysis, applications] 2005-2007: More emphasis on tool integration, usability, and data presentation. TAU v2.0 released. [performance visualization, binary instrumentation, integration, performance diagnosis and modeling]
2008-2011: Add performance database support, data mining, and rule-based analysis. Develop measurement/analysis for heterogeneous systems. Core measurement infrastructure integration (Score-P). [database, data mining, expert system, heterogeneous measurement, infrastructure integration]
2012-present: Focus on exascale systems. Improve scalability, heterogeneous support, runtime system integration, dynamic adaptation. Apply to petascale / exascale applications. [scale, autotuning, user-level]
TAU History
Observability
Diagnosis
Complexity
Exascale
❑ Semantic entities, attributes, associations (SEAA) ❍ Entities: represent semantics at any level ❍ Attributes: encode entity semantics ❍ Associations: link entities across levels to map performance data
❑ The SEAA ability to map low-level data to high levels of abstraction reduces the semantic gap for the user
Semantic Gap in Performance Mapping
S. Shende, “The Role of Instrumentation and Mapping in Performance Measurement,” Ph.D. Thesis, University of Oregon, 2001.
❑ POOMA (Parallel OO Methods and Applications) ❍ Parallel Expression Templates (PETE) ❍ Shared memory asynchronous runtime system (SMARTS)
POOMA, PETE, SMARTS, and TAU
Performance Diagnosis (1998-2007) Performance diagnosis is a process to detect and explain performance problems
❑ APART – Automatic Performance Analysis - Real Tools ❍ Problem specification and identification
❑ Poirot – theory of performance diagnosis processes ❍ Compare and analyze performance diagnosis systems ❍ Use theory to create system that is
automated / adaptable ◆ Heuristic classification:
match to characteristics ◆ Heuristic search:
look up solutions with problem knowledge ❍ Problems: ◆ low-level feedback ◆ lack of explanation power
❑ Hercule – knowledge-based (model-based) diagnosis ❍ Capture knowledge about performance problems ❍ Capture knowledge about how to detect and explain them ❍ Knowledge comes from parallel computational models ◆ associate computational models with performance models
Performance Diagnosis Projects
❑ Goals of automation, adaptability, validation ❑ Use of models benefits performance diagnosis
❍ Helps to identify execution events of interest ❍ Provide meaningful diagnostic explanations ❍ Enable reuse of performance analysis expertise
❑ Application models studied ❍ Master-worker, wavefront, compositional
Hercule Project
EBBA
CLIPS
Advances in TAU Technologies
TAU + Scalasca Score-P
TAUdb
PerfExplorer
ParaProf
Integrated Profiling Madness FLASH
Intrepid Ranger NAMD
❑ Phase-based profiling ❑ Integrated hybrid profiling ❍ Merge sampling-based and
probe-based measurement ❍ Unified context
❑ Charm++ integration ❍ Projections (tracing) ❍ TAU (profiling)
❑ Should not just redescribe the performance results ❑ Should explain performance phenomena ❍ What are the causes for performance observed? ❍ What are the factors and how do they interrelate? ❍ Performance analytics, forensics, and decision support
❑ Need to add knowledge to do more intelligent things ❍ Automated analysis needs good informed feedback ◆ iterative tuning, performance regression testing
❍ Performance model generation requires interpretation ❑ We need better methods and tools for ❍ Integrating meta-information ❍ Knowledge-based performance problem solving
How to Explain and Understand Performance
K. Huck, “Knowledge Support for Parallel Performance Data Mining,” Ph.D. Thesis, University of Oregon, 2008.
❑ PerfExplorer ❍ Multi-experiment data mining ❍ Programmable, extensible framework
to support workflow automation ❍ Rule-based inferences for expert system analysis
❑ S3D scalability analysis (Jaguar, Intrepid, 12K cores)
Automation and Knowledge Engineering
r = 1 implies direct correlation
❑ OpenUH and PerfExplorer ❑ Some OpenUH optimizations
are guided by a cost model ❍ Loop nest optimizer ❍ Use profile analysis feedback to improve model accuracy
❑ PerfExplorer data analysis and meta-analysis ❍ Provides inference-based optimization suggestions
❑ Multiple String Alignment (MS) and OpenMP load imbalance
Rule-based Inference for Expert Analysis
Analysis workflow and inference rules Automated rule processing
❑ Trace-based analysis of message passing communication
❑ Measurement intrusion can affect event timing and ordering
❑ Effects can propagate across process boundaries
❑ Analyze communication perturbation ❍ Remove measurement overhead ❍ Model communication effects ❍ Execution replay
❑ Same methods can be used for “what if” prediction of application performance ❍ Different communication speeds ❍ Faster processing
❑ More measurement gives more fidelity in modeling the perturbation effects!
❑ Have to model the system!
Communication Perturbation Analysis
[Figure: measured versus modeled message-passing timelines, with and without ordering effects.]
❑ Support low-overhead OS performance measurement ❍ Multiple levels of function and
detail through instrumentation ❍ Support both profiling and tracing
❑ Provide kernel-wide and process-centric OS perspectives
❑ Merge user-level and kernel-level performance data ❍ Across all program-OS interactions ❍ Provide fast KTAU data access
❑ Implemented on BG with ZeptoOS and on Linux clusters
❑ Applied to systems noise analysis
KTAU: Kernel Performance Measurement
❑ Program OS Interactions ❍ Direct ◆ applications invoke the OS for
certain services ◆ syscalls and internal OS routines
called ❍ Indirect ◆ OS operations without explicit
invocation by application ◆ preemptive scheduling (other
processes) ◆ (HW) interrupt handling ◆ OS-background activity
– keeping track of time and timers, bottom-half handling, …
◆ Can occur at any OS entry
Effects of Noise on Parallel Performance
OS noise effects can accumulate
❑ General noise-effect estimation by direct measurement of application and OS
❑ Isolate OS noise in application performance data
❑ Trace information ❍ Application event trace with OS
noise counters ❍ Presence of OS noise has affected
the timing of events ❑ Timeline Approximation
❍ Reason about timing of events that may have occurred in the absence of noise
❍ Remove time-duration of noise and adjust event timestamps
❑ Must be careful to maintain constraints in event ordering
❑ Quantify global effects of noise on application ❍ Noise propagation / absorption
❑ Attribute the effects to specific noise sources
OS Noise Analysis
Sweep3D
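A minimal sketch of the timeline approximation step is given below; it assumes per-event noise durations have already been isolated (for example, from KTAU counters) and only enforces per-process monotonicity, whereas the full analysis must also preserve cross-process ordering (a receive may not precede its send) and account for noise propagation and absorption:

```cpp
#include <algorithm>
#include <vector>

// Illustrative trace event: timestamp plus the OS noise time accumulated
// since the previous event on the same process.
struct TraceEvent {
    double time;         // measured timestamp (seconds)
    double noise_since;  // OS noise charged to the preceding interval
};

// Shift each event earlier by the noise accumulated so far, keeping the
// per-process timestamps monotonic. Cross-process constraints would be
// enforced in a second pass over matched send/receive events.
void remove_noise(std::vector<TraceEvent>& events) {
    double removed = 0.0;
    double last = 0.0;
    for (TraceEvent& e : events) {
        removed += e.noise_since;
        e.time = std::max(last, e.time - removed);
        last = e.time;
    }
}

int main() {
    std::vector<TraceEvent> trace = {{1.0, 0.1}, {2.0, 0.2}, {3.0, 0.0}};
    remove_noise(trace);
    return trace.size() == 3 ? 0 : 1;
}
```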
❑ Parallel profiling is lightweight, but lacks temporal measurement
❑ Tracing can generate large trace files ❑ Capture performance dynamics with
profile snapshots
Observing Performance Dynamics
[Figure: information versus overhead tradeoff for traces, profile snapshots, and profiles; Flash 3.0 INTRFC profile snapshots across initialization, checkpointing, and finalization.]
❑ Parallel performance state is globally distributed ❍ Logically part of application’s global data space ❍ Offline: outputs data at execution end for post-mortem analysis ❍ Online: access to performance state for analysis
❑ Definition: Monitoring ❍ Online access to parallel performance (data) state ❍ May or may not involve runtime analysis
❑ Couple with monitoring infrastructures ❑ TAUoverSupermon
❍ Supermon monitor from Los Alamos National Laboratory
❑ TAUoverMRNET ❍ Multicast Reduction Network (MRNet)
infrastructure from Wisconsin ❑ TAUg
❍ MPI-based infrastructure to provide global view of TAU profile data
❑ TAUmon ❍ Transport-neutral (SuperMon, MRNet, MPI)
❑ Develop online analysis methods ❍ Aggregation, statistics, …
Parallel Performance Monitoring
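The sketch below illustrates, with plain MPI reductions rather than the actual TAUg/TAUmon interfaces, the kind of transport-neutral aggregation an online monitor can perform over a per-process profile value during the run:

```cpp
#include <mpi.h>
#include <cstdio>

// Aggregate one profile metric (e.g., time spent in a routine) across ranks:
// the kind of online statistic a monitoring transport can compute instead of
// waiting for post-mortem, per-process profile files.
void report_global_stats(double local_value, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double sum = 0.0, mx = 0.0, mn = 0.0;
    MPI_Reduce(&local_value, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Reduce(&local_value, &mx,  1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&local_value, &mn,  1, MPI_DOUBLE, MPI_MIN, 0, comm);

    if (rank == 0)
        std::printf("mean=%g max=%g min=%g\n", sum / size, mx, mn);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    report_global_stats(1.0 + rank, MPI_COMM_WORLD);  // stand-in metric
    MPI_Finalize();
}
```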
TAUMon Online Analysis
Uintah
Flash
❑ Performance tools have evolved incrementally to serve the dominant architectures and programming models ❍ Reasonably stable, static parallel execution models ❍ Allowed application-level observation focus
❑ Observation requirements for 1st-person measurement: ❍ Performance measurement can be made locally (per thread) ❍ Performance data collected at the end of the execution ❍ Post-mortem analysis and presentation of performance results ❍ Offline performance engineering
❑ Architecture factors increase performance complexity ❍ Greater core counts and hierarchical memory system ❍ Heterogeneous computing ❍ Significantly larger scale
❑ Focus on performance technology integration
Performance Complexity (2008-2012)
❑ Heterogeneous parallel systems are highly relevant today ❍ Multi-CPU, multicore shared memory nodes (latency optimized) ❍ Manycore accelerators with high-BW I/O (throughput optimized) ❍ Cluster interconnection network
❑ Performance is the main driving concern ❍ Heterogeneity is an important path to extreme scale
❑ Heterogeneous software technology to get performance ❍ More sophisticated parallel programming environments ❍ Integrated parallel performance tools ◆ support heterogeneous performance models and perspectives
Heterogeneous Systems and Performance
❑ Current status quo is somewhat comfortable ❍ Mostly homogeneous parallel systems and software ❍ Shared-memory multithreading – OpenMP ❍ Distributed-memory message passing – MPI
❑ Parallel computational models are relatively stable ❍ Corresponding performance models are tractable ❍ Parallel performance tools can keep up and evolve
❑ Heterogeneity creates richer computational potential ❍ Greater performance diversity and complexity
❑ Heterogeneous system environments ❑ Performance tools have to support richer computation
models and more versatile performance perspectives ❑ Will utilize more sophisticated programming and runtime systems
Implications for Performance Tools
❑ Integrated GPU support ❍ Enable host-GPU measurement ◆ CUDA, OpenCL, OpenACC ◆ accelerator compiler integration ◆ utilize PAPI CUDA and CUPTI
❍ Provide both heterogeneous profiling and tracing support ◆ contextualization of asynchronous
kernel invocation ❍ Static analysis of GPU kernel ❍ GPU kernel sampling
❑ Full support for Intel Xeon Phi
Integrated Heterogeneous Support in TAU
GTC
❑ How do you support a common performance analysis methodology for OpenMP across systems and compilers?
❑ Evaluate four methods for OpenMP measurement: ❍ POMP ❍ ORA w/OpenUH ❍ ORA w/GOMP ❍ OMPT w/Intel … all integrated into TAU
❑ Compare performance visibility, semantic context, measurement overhead, and integration in analysis
Cross-Platform OpenMP Performance Analysis
ORA: OpenMP Runtime API; GOMP: GCC OpenMP implementation; OMPT: OpenMP Tools interface
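For readers unfamiliar with OMPT, a minimal first-party tool written against the interface as later standardized in OpenMP 5.0 (omp-tools.h) might look like the sketch below; the draft OMPT interface evaluated in this work differed in detail, so treat this as an approximation of the mechanism rather than the exact API used here:

```cpp
// Minimal OMPT tool: the OpenMP runtime looks up ompt_start_tool at startup.
#include <omp-tools.h>
#include <cstdio>

static void on_thread_begin(ompt_thread_t thread_type, ompt_data_t* thread_data) {
    (void)thread_data;
    std::printf("OMPT: runtime thread of type %d started\n", (int)thread_type);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t* tool_data) {
    (void)initial_device_num; (void)tool_data;
    auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_thread_begin, (ompt_callback_t)&on_thread_begin);
    return 1;  // non-zero keeps the tool active
}

static void tool_finalize(ompt_data_t* tool_data) { (void)tool_data; }

extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int omp_version,
                                                     const char* runtime_version) {
    (void)omp_version; (void)runtime_version;
    static ompt_start_tool_result_t result = {&tool_initialize, &tool_finalize, {0}};
    return &result;
}
```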
Tools/Technology Integration and Synthesis
Tools / Technologies: TAU, mpiP, PBound, Active Harmony, CHiLL, Roofline, PAPI, PEBIL, PSiNtracer, GPTL, RCRToolkit, ROSE, Orio, code analysis
[Figure: SUPER tool and technology landscape spanning performance modeling, autotuning, performance optimization, resilience, energy, and reliability — a center of mass for performance engineering, building on prior research funding, with end-to-end integration into SciDAC applications.]
SUPER is a 5-year SciDAC Institute (2011-2016)
❑ Autotuning is a performance engineering process ❍ Automated experimentation and performance testing ❍ Guided optimization by (intelligent) search space exploration ❍ Model-based (domain-specific) computational semantics
❑ Autotuning components ❍ Active Harmony autotuning system (Hollingsworth, UMD) ◆ software architecture for optimization and adaptation
❍ CHiLL compiler framework (Hall, Utah) ◆ CPU and GPU code transformations for optimization
❍ Orio annotation-based autotuning (Norris, ANL) ◆ code transformation (C, Fortran, CUDA, OpenCL) with optimization
❑ Goal is to integrate TAU with existing autotuning frameworks ❍ Use TAU to gather performance data for autotuning/specialization ❍ Store performance data and metadata for each experiment variant in the performance database (TAUdb) ❍ Use machine learning and data mining to increase the level of automation of autotuning and specialization
TAU Integration with Empirical Autotuning
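The sketch below shows the shape of such an empirical autotuning loop; run_variant and record_result are hypothetical stand-ins for the generated code variants and the TAU/TAUdb recording step, and the parameter space is illustrative:

```cpp
#include <chrono>
#include <cstdio>
#include <initializer_list>

// Hypothetical stand-ins: in the real workflow CHiLL/Orio generate the
// variants and TAU/TAUdb record the time and metadata of each experiment.
double run_variant(int unroll, int tile) {
    volatile double s = 0.0;                   // stand-in kernel
    for (int i = 0; i < 1000000; i += unroll) s += i % tile;
    return s;
}
void record_result(int unroll, int tile, double seconds) {
    std::printf("variant unroll=%d tile=%d: %.4f s\n", unroll, tile, seconds);
}

// Brute-force search over a small tuning space: run each variant, record
// the measurement together with its parameters, and keep the best one.
int main() {
    double best = 1e30; int best_u = 1, best_t = 16;
    for (int unroll : {1, 2, 4, 8}) {
        for (int tile : {16, 32, 64}) {
            auto t0 = std::chrono::steady_clock::now();
            run_variant(unroll, tile);
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            record_result(unroll, tile, s);
            if (s < best) { best = s; best_u = unroll; best_t = tile; }
        }
    }
    std::printf("best: unroll=%d tile=%d (%.4f s)\n", best_u, best_t, best);
}
```

An intelligent search (e.g., Active Harmony) replaces the exhaustive loops with guided exploration of the same space.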
❑ Extended Orio with OpenCL ❍ OpenCL code generation with
annotations based on tunable OpenCL parameters
❍ TAU wraps the OpenCL runtime ❑ Tuning experiments to be conducted
on a broader set of accelerator platforms ❍ Radeon 6970, Xeon Phi,
GTX 480, Tesla C2075, Tesla K20c
❑ Autotuning experiments ❍ BLAS kernels ❍ Radiation transport code
Orio Multi-target Autotuning with TAU
[Figure: the Orio code generator produces experiment variants that link at runtime with the TAU OpenCL runtime and library wrappers; OpenCL profiling support collects metrics, and TAU writes profiles, metadata entries, transformations, and execution times that are uploaded to TAUdb.]
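For context, the kernel timing that an interposed OpenCL measurement wrapper relies on uses the standard OpenCL event profiling queries, as in the fragment below; TAU's actual wrapper differs (it defers collection to limit intrusion), and the command queue is assumed to be created with CL_QUEUE_PROFILING_ENABLE:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Wrapper around clEnqueueNDRangeKernel in the spirit of an interposed
// measurement library; the command queue must have profiling enabled.
cl_int enqueue_and_time(cl_command_queue q, cl_kernel k,
                        size_t global, size_t local) {
    cl_event ev;
    cl_int err = clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local,
                                        0, nullptr, &ev);
    if (err != CL_SUCCESS) return err;

    clWaitForEvents(1, &ev);   // a production wrapper would defer this
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);
    std::printf("kernel time: %.3f us\n", (end - start) / 1e3);
    clReleaseEvent(ev);
    return CL_SUCCESS;
}
```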
TAU and Autotuning in SUPER
[Figure: SUPER autotuning workflow — CHiLL recipes with a search driver (brute force or Active Harmony) and a ROSE-based code generation tool produce code variants, wrapper functions, and outlined functions; tau_instrumentor, driven by a selective instrumentation file specifying the parameters to capture, produces instrumented variants; execution yields parameterized performance profiles stored in TAUdb (PerfDMF); profile data and metadata feed WEKA's decision tree induction algorithm; the Orio code generator with a CUPTI callback measurement library covers CUDA and OpenCL; applied with PerfExplorer to Geant4, MPAS-O, CESM, and XGC1.]
❑ Stencil codes introduce load imbalances due to halo/ghost cells
❑ Performance measurement in isolation is insufficient to explain source of imbalance and attribute to performance factors
❑ Need to combine: ❍ Performance measurements ❍ Application metadata ❍ Correlations between
measurements and metadata ❍ Visualization of performance data in
context of application domain
MPAS Ocean Load Imbalance Tuning
❑ Capture stencil properties in metadata ❍ nCellsSolve = cells owned by the partition ❍ nCells = nCellsSolve + nCellsHalo
❑ Correlate computation balance with metadata fields
❑ Original partitioning based on “balancing” nCellsSolve on MPI processes
❑ Really need to balance with respect to nCells ❍ This depends on the partitioner, but it is not known in advance!
❑ Apply “knowledge” to the Hindsight partitioner ❍ Create initial Metis partition ❍ Assign nCells weights on the graph and iterate to optimize
Stencil Properties and Metadata Collection
[Figure: metadata fields (nCellsSolve, nCells) correlated with the advection timer across processes.]
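The correlation step above can be illustrated with a plain Pearson correlation between a per-process metadata field and a per-process timer; the numbers below are illustrative, not MPAS-O measurements:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation between a per-process metadata field (e.g., nCells)
// and a per-process timer value (e.g., advection time). r near 1 indicates
// the timer is explained by that field, guiding which quantity to balance.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const size_t n = x.size();
    double mx = 0, my = 0;
    for (size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

int main() {
    std::vector<double> nCells  = {1000, 1200, 1500, 900};   // illustrative
    std::vector<double> advTime = {2.0e7, 2.5e7, 3.1e7, 1.8e7};
    std::printf("r = %.3f\n", pearson(nCells, advTime));
}
```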
Partitioning Refinement
Original vs. refined partition (240 processors on NERSC Edison):
Total cells (explicit + halo) per block: tighter range (min 473 → 535, max 846 → 771)
Number of neighbors per block: greater (min 1 → 2, max 7 → 9)
Time spent computing: lower total (min 8.3e7 → 9.8e7, max 2.5e8 → 2.4e8)
Time spent in MPI_Wait: reduced, also in halo routines (min 2.7e7 → 9.0e6, max 1.9e8 → 1.5e8)
❑ OpenMP scheduling algorithm can affect load imbalance ❍ Threads arrive at boundary cell exchange at different times ❍ Master completes boundary cell exchange at different rates
❑ Vectorization can change communication waiting
MPAS-O Load Imbalance in MPI + OpenMP
faster execution, less wait variation
❑ Community Earth System Model (CESM) ❑ Observed performance variability on ORNL Jaguar
❍ Significant increase in execution time led to failed runs ❑ End-to-end analysis methodology
❍ Collect information from all production jobs ◆ modify jobs scripts ◆ problem/code provenance,
system topology, system workload, process node/core mapping, job progress, time spent in queue, pre/post, total
❍ Load in TAUdb, quantify nature of variability and impact
Performance Variability in CESM
❑ 4096 processor cores ❍ 6 hour request ❍ Target: 150 simulation days
❑ 35 jobs ❍ May 15 – Jun 29, 2012 ❍ Two of which failed
Example Experiment Case Analysis
CESM Performance Time Series (Run Loop)
[Figure: runtime for the previous 5 simulation days (seconds) versus simulation day, components stacked; Cray XK6 (one sixteen-core processor per node), T341f02.F1850r, 4096 processor cores (512 processes, 8 OpenMP threads); curves for two failed runs (140 and 145 simulation days), the slowest and second-slowest successful runs (150 simulation days), and the maximum and minimum execution times; high intra-run variability.]
❑ Node placement can cause poor resource matching ❍ Fastest ◆ 0 unmatched
❍ Slowest ◆ 272 unmatched
❍ Can cause more risk of contention with other jobs executing
❑ MPI rank placement also matters ❍ Greater distances
between processes can result in larger overheads for communication
Node Placement and MPI Placement Matter
4:14:08 (hh:mm:ss) 5:35:50 (hh:mm:ss)
Evolution ❑ Increased performance complexity and scale forces the
engineering process to be more intelligent and automated ❍ Automate performance data analysis / mining / learning ❍ Automated performance problem identification
❑ Even with intelligent and application autotuning, the decisions of what to analyze are difficult ❍ Performance engineering tools and practice must incorporate
a performance knowledge discovery process ❑ Extreme scale performance is an optimized orchestration
❍ Application, processor, memory, network, I/O ❑ Application-level only performance view is myopic ❑ Reductionist approaches will be unsuccessful ❑ Need for whole performance evaluation
Exascale Era (2012 – ???) ❑ Challenges of performance growth and power will cause
exascale systems to depart from conventional MPP designs ❍ Greater core counts and hardware thread concurrency ❍ Heterogeneous hardware and deeper memory hierarchy ❍ Hardware-assisted global addressing space support ❍ Power limits and reliability concerns built in
❑ Emerging exascale programming models emphasize message-driven computation and finer-grained parallelism semantics ❍ More asynchronous and lower-level thread management ❍ More exposure of concurrency through task-level parallelism ❍ Global address space models versus conventional message passing ❍ Heterogeneity in parallel execution and locality optimization
❑ Productivity and performance are coupled at exascale ❍ Applications are more complex and must be mapped to systems ❍ Growing crisis for performant and maintainable scientific software
Exascale End-to-End Productivity
Courtesy of Thomas Ndousse-Fetter, DOE
• Dynamic performance adaptation
Application Software Productivity Execution-time Productivity
Increased complexity in exascale applications
Uniformity Assumptions are No Longer Valid ❑ Exascale design directions raise issues of uniformity
❍ Components and behaviors are not the same or regular ❍ Heterogeneous compute engines behave differently ❍ Fine-grained power management affects homogeneity ❍ Process technology results in non-uniform execution behavior ❍ Fault resilience introduces inhomogeneity
❑ Bulk synchronous model is increasingly impractical ❍ Removing sources of performance variation (jitter) is unrealistic ❍ Huge costs in power/complexity/performance to extend the life
❑ Embrace performance heterogeneity!!! ❍ Variation, variation, variation ❍ Can not assume a stable “state” of the system a priori ❍ Post-mortem performance analysis fails for lack of repeatability
A New Performance “Observability” ❑ Key exascale parallel “performance” abstraction
❍ Inherent state of exascale execution is dynamic ❍ Embodies non-stationarity of “performance” ❍ Constantly shaped by the adaptation of resources to meet
computational needs and optimize objectives ❑ Fundamentally different performance “observability”
❍ “1st person” + “3rd person” performance introspection ❍ Designed to support introspective adaptation ❍ In-situ analysis of performance state, objectives, and progress ❍ Aware of multiple performance and productivity objectives ❍ Policy-driven dynamic feedback and adaptation ❍ Reflects computation to execution model mapping ❍ Integration in exascale productivity environment
Introspection, In Situ Analysis, and Feedback
Courtesy Martin Schulz, LLNL – Piper Project
Argo DOE ExaOSR Project ❑ Exascale OS and runtime research project
❍ Labs: ANL, LLNL, PNL ❍ Universities: BU, UC, UIUC, UO, UTK
❑ Philosophy ❍ Whole-system view ◆ dynamic user environment (functionality, dynamism, flexibility) ◆ first-class managed resources (performance, power, …) ◆ hierarchical response to faults
❍ Massive concurrency support ❑ Key ideas
❍ Hierarchical (control, communication, goals, data resolution) ❍ Embedded (performance, power, …) feedback and response ❍ Global system support
Argo Global Information Backplane ❑ BEACON: event/action/control notification ❑ Exposé: performance introspection and adaptation system
Argo Global Optimization View
u – generalized control signals v – settings for next level m – power/resilience models & mechanisms
g – power, performance goals f – feedback from BEACON and EXPOSÉ e – error between goals and feedback
Courtesy Hank Hoffman, University of Chicago
XPRESS ❑ DOE X-Stack project to develop the OpenX software
stack, implementing the ParalleX execution model ❍ Labs: SNL, LBL, ORNL ❍ Universities: LSU, IU, UH, UNC/Renci, UO
❑ ParalleX is an experimental execution model ❍ Theoretical foundation of task-based parallelism ❍ Runs in the context of distributed computing ❍ Supports integration of the entire system stack
❑ HPX is the runtime system that implements ParalleX ❍ HPX-3 (LSU, C++ language/Boost) ❍ HPX-5 (IU, C language)
❑ OpenX is an exascale software stack based on ParalleX
HPX ❑ Governing principles (components)
❍ Active global address space (AGAS) instead of PGAS ◆ Distributed load balancing and adaptive data placement
❍ Message-driven via Parcels instead of message passing
❍ Lightweight control objects (LCO) instead of global barriers
❍ Moving work to data instead of moving data to work ◆ Asynchronous parallelism and dataflow style computation
❍ Fine-grained parallelism of lightweight Threads
[Figure: HPX runtime architecture — process manager, local memory management, AGAS translation, parcel port and parcel handler over the interconnect, action manager, thread manager and thread pool, local control objects, performance counters, and performance monitor.]
OpenX Software Stack and APEX
[Figure: OpenX software stack — legacy applications (MPI, OpenMP) and new-model applications, domain-specific languages and active libraries, a metaprogramming framework, compiler, and XPI sit above HPX runtime system instances (AGAS name space and processor, LCOs for dataflow, futures, and synchronization, lightweight threads with a context manager, parcels for message-driven computation); LXK operating system instances (task recognition, address space control, memory bank control, OS threads, network drivers) form a distributed framework operating system over a hardware architecture of ~10^6 nodes x 10^3 cores per node plus an integration network; instrumentation spans the stack through a prime medium interface/control layer.]
APEX Design
❑ APEX real time introspection ❍ Node utilization, energy consumption, health information ❍ HPX threads, queues, concurrency, remote operations,
parcels, memory management
❍ OS (LXK) track system resource assignment, utilization, job contention, overhead
❑ APEX policy control based on performance state
1D Stencil (Heat Diffusion) ❑ Partitioned data array ❑ Optimum # of threads ❑ Region of maximum performance correlates with the thread queue length runtime performance counter ❍ Represents # of tasks currently waiting to execute ❑ What would happen if introspection on this counter were used to control a concurrency throttling policy?
[Figure: 1d_stencil (HPX-3) runtime in seconds versus number of worker threads (1-24), and the same runtime curve overlaid with the thread queue length counter.]
1d_stencil Baseline
❑ 48 worker threads (with hyperthreading)
❑ Actual concurrency much lower ❍ Many worker threads
are spinning waiting for work to become available
❑ Large variation in concurrency over time ❍ Tasks waiting on
prior tasks to complete
[Figure: concurrency and power over time for the 1d_stencil baseline, with thread cap and power overlaid; tasks include partition_data, do_work_action, hpx::lcos::local::dataflow::execute, primary_namespace_bulk_service_action, primary_namespace_service_action, and other; 100,000,000 elements, 1000 partitions; 138 secs; event-generated metrics show where the calculation takes place.]
1d_stencil Optimal # of Threads
❑ 12 worker threads ❑ Greater proportion of
threads working ❍ Less interference
between active threads and threads waiting for work
❍ Edison has 24 physical cores
❍ Idle hardware threads ❍ >24 are hyperthreads
which may interfere with active threads
❑ Much faster ❍ 61 seconds vs.
138 seconds
[Figure: concurrency and power over time with 12 worker threads, same task legend; 100,000,000 elements, 1000 partitions; 61 secs.]
1d_stencil Optimized with APEX
❑ Initially 48 worker threads
❑ Discrete hill-climbing search to minimize the average # of pending tasks
❑ Converges on 13 (vs. optimal of 12)
❑ Nearly as fast as optimal ❍ 64 seconds vs.
61 seconds
[Figure: concurrency and power over time with APEX concurrency throttling, same task legend, thread cap overlaid; 100,000,000 elements, 1000 partitions; 64 secs.]
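A sketch of the discrete hill-climbing step that such a throttling policy performs is shown below; the state layout, bounds, and sample values are illustrative, not the APEX implementation:

```cpp
#include <initializer_list>

// One step of a discrete hill-climbing policy over the thread cap, driven by
// an observed counter such as the average number of pending tasks.
struct ThrottleState {
    int    cap;             // current thread cap
    int    direction;       // +1 or -1: current search direction
    double last_objective;  // previous observation of the counter
};

int next_thread_cap(ThrottleState& s, double objective, int min_cap, int max_cap) {
    if (objective > s.last_objective)
        s.direction = -s.direction;   // got worse: reverse the search
    s.last_objective = objective;
    s.cap += s.direction;             // take one discrete step
    if (s.cap < min_cap) s.cap = min_cap;
    if (s.cap > max_cap) s.cap = max_cap;
    return s.cap;  // the runtime idles or activates worker threads to match
}

int main() {
    ThrottleState s{48, -1, 1e9};     // start at 48 threads, searching down
    for (double pending : {120.0, 90.0, 70.0, 80.0})
        next_thread_cap(s, pending, 1, 48);
    return 0;
}
```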
Throttling for Power ❑ Livermore Unstructured Lagrangian Explicit Shock
Hydrodynamics (LULESH) ❍ Proxy application in DOE co-design efforts for exascale ❍ CPU bound in most implementations (use HPX-5)
❑ Develop an APEX policy for power ❍ Threads are idled in HPX to keep node under power cap ❍ Use hysteresis of last 3 observations ❍ If Power < low cap increase thread cap ❍ If Power > high cap decrease thread cap
❑ HPX thread scheduler modified to idle/activate threads per cap ❑ Test example:
❍ 343 domains, nx = 48, 100 iterations ❍ 16 nodes of Edison, Cray XC30 ❍ Baseline vs. Throttled (200W per node high power cap)
LULESH image, Technical Report, LLNL-‐TR-‐490254
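The power rule above can be sketched as a small control loop; the class, window length, and thresholds below are illustrative, assuming the runtime exposes a per-period node power reading and honors the returned thread cap:

```cpp
#include <deque>
#include <initializer_list>
#include <numeric>

// Hysteresis-based power throttling: keep a short history of node power
// samples and nudge the thread cap when the smoothed reading crosses the
// low or high power cap.
class PowerPolicy {
public:
    PowerPolicy(double low_w, double high_w, int max_threads)
        : low_(low_w), high_(high_w), cap_(max_threads), max_(max_threads) {}

    // Called once per observation period with the latest node power (watts).
    int observe(double watts) {
        history_.push_back(watts);
        if (history_.size() > 3) history_.pop_front();   // last 3 observations
        double avg = std::accumulate(history_.begin(), history_.end(), 0.0)
                     / history_.size();
        if (avg < low_  && cap_ < max_) ++cap_;   // headroom: activate a thread
        if (avg > high_ && cap_ > 1)    --cap_;   // over cap: idle a thread
        return cap_;   // the scheduler idles/activates workers to match
    }

private:
    std::deque<double> history_;
    double low_, high_;
    int cap_, max_;
};

int main() {
    PowerPolicy policy(180.0, 200.0, 768);    // caps in watts per node
    int cap = 768;
    for (double w : {210.0, 207.0, 204.0, 201.0}) cap = policy.observe(w);
    return cap == 768 ? 1 : 0;
}
```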
LULESH Baseline
No power cap; no thread cap; 768 threads; avg. 247 Watts/node; ~360 kilojoules total; ~91 seconds total; ~74 seconds in HPX tasks
[Figure: concurrency and power over time (1 sample per second of per-node task performance), with thread cap and power overlaid; tasks include advanceDomain, SBN1result, SBN3result/sends, PosVelresult/sends, MonoQresult/sends, probe, lco get/set, and other; 91 secs; launched with aprun.]
LULESH Throttled by Power Cap
200W power cap; 768 threads throttled to 220; avg. 186 Watts/node; ~280 kilojoules total; ~94 seconds total; ~77 seconds in HPX tasks
[Figure: concurrency and power over time with thread cap and power overlaid; same task legend as the baseline; 94 secs; launched with aprun.]
LULESH Performance Explanation ❑ 768 vs. 220 threads (after throttling) ❑ 360 kJ vs. 280 kJ total energy consumed (~22% less!) ❑ 75.24 vs. 77.96 seconds in HPX (~3.6% increase) ❑ Big reduction in yielded action stalls (in thread
scheduler) – Less contention for network access ❑ Hypothesis: LULESH implementation is network bound
because spatial locality of subdomains is not maintained ❍ Problem is not with HPX
Metric             Baseline       Throttled      % Difference
Cycles             1.11341E+13    3.88187E+12    34.865%
Instructions       7.33378E+12    5.37177E+12    73.247%
L2 Cache Misses    8422448397     3894908172     46.244%
IPC                0.658677       1.38381        210.089%
INS/L2CM           870.742        1379.18        158.391%
Global Thread Throttling for Load Balance ❑ Livermore Unstructured Lagrangian Explicit Shock
Hydrodynamics (LULESH) ❍ Proxy application in DOE co-design efforts for exascale ❍ CPU bound in most implementations (use HPX-5)
❑ Develop a global APEX policy for load balance ❍ Performance data is periodically reduced and analyzed ❍ Thread control policy ◆ If node has the max work, increase thread cap by 1 ◆ If node has significantly less work, reduce thread cap by 1 ◆ Otherwise, do not change thread cap
❍ New thread cap limits are distributed to remote localities ❑ Test example:
❍ 512 domains, nx = 64, 100 iterations ❍ 16 nodes of Edison, Cray XC30, 384 total cores ❍ Baseline vs. Throttled
LULESH image, Technical Report, LLNL-‐TR-‐490254
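A sketch of the reduction-and-adjust step for such a global policy is given below, using an MPI allreduce as a stand-in for the actual APEX data exchange; the "significantly less work" threshold and the work measure are illustrative:

```cpp
#include <mpi.h>
#include <cstdio>

// Periodically reduce a per-node work measure, then raise the thread cap on
// the most loaded node and lower it on nodes with significantly less work.
int adjust_thread_cap(double my_work, int my_cap, int max_cap, MPI_Comm comm) {
    double max_work = 0.0;
    MPI_Allreduce(&my_work, &max_work, 1, MPI_DOUBLE, MPI_MAX, comm);

    if (my_work >= max_work && my_cap < max_cap)
        ++my_cap;                            // this node is the bottleneck
    else if (my_work < 0.9 * max_work && my_cap > 1)
        --my_cap;                            // significantly under-loaded
    return my_cap;   // new caps take effect at each locality's scheduler
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double my_work = 100.0 + 10.0 * rank;    // stand-in work measure
    int cap = adjust_thread_cap(my_work, 24, 24, MPI_COMM_WORLD);
    std::printf("rank %d: new thread cap %d\n", rank, cap);
    MPI_Finalize();
}
```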
LULESH Baseline
No power cap; no thread cap; 384 total threads; avg. 242 Watts/node; peak power 4576 W; 602.32 kilojoules total; 155.353 seconds total
[Figure: concurrency and power over time with thread cap and power overlaid; tasks include parfor async handler, call continuation handler, probe handler, advanceDomain, SBN3result/sends, PosVelresult/sends, MonoQresult/sends, attach handler, lco delete action handler, and other; launched with aprun.]
LULESH Throttled by Global Policy
De-activate threads on underutilized nodes; 384 threads throttled to 375; avg. 226 Watts/node; peak power 4344 W; 565 kilojoules total; 155.996 seconds total
[Figure: concurrency and power over time with thread cap and power overlaid; same task legend as the baseline.]
LULESH Performance Explanation ❑ 384 vs. 375 (379 final) threads ❑ 602 kJ vs. 565 kJ total energy consumed
❍ ~6.17% decrease ❑ 155.353 vs. 155.996 seconds in HPX
❍ 0.4% increase ❑ Static decomposition issues:
❍ Full domain is broken into subdomains ❍ More subdomains (512) than cores (384) ❍ Not perfectly distributed, minor load imbalance
❑ Demonstrates effectiveness of global policies
Scalable Observation System (SOS) ❑ HPC platform service
❍ Dedicated resources ❍ Single instance, system wide ❍ Serves applications, RTS, OS ❍ Shares metrics and analysis ❍ Framework-like interaction
❑ SOS monitors coordinate node to cloud interactions ❍ Any process on node can expose data via SOS
❑ SOS cloud self organizes to efficiently process data ❑ Control can go directly to interested processes ❑ Develop SOS MPMD MPI prototype
❍ Running on BG/Q, Cray, and Linux platforms ❍ Use for testing and experimentation
Global Power Scheduling under Bound ❑ Hardware overprovisioning can result in the inability to
run all hardware under full power ❑ Global power bounds must be enforced
❍ Requires global introspection ❍ In situ power analysis and dynamic power reallocation
❑ See HPDC “POW” short-paper and poster (D. Ellsworth) ❍ Minimal impact on running system ❍ Potential for significant
power saving versus naïve power allocation
❍ Potential performance improvement with power redistribution in constrained scenarios
“Beware the Jabberwock!” ❑ Willing suspension of disbelief ? ❑ Suppose that full knowledge of the state of a high-
performance parallel / distributed system were available ❍ There is no cost to produce it, access it, or analyze it ❍ There is no delay in providing it to wherever it might be used ❍ It can be observed by anything at any time
❑ How can systems be built to make this possible? ❑ Devote “resources” to support it
❍ Not just using the hardware purposed for computing ❍ Likely will require more special purpose components
❑ Want it to be configurable and programmable throughout the application and system software stack
❑ New (state-based?) parallel programming methodology
Performance as Collective Behavior ❑ New performance observability paradigm can be thought
of as providing collective awareness for exascale ❑ It allows the exascale system
to be performance-aware and performance-reactive, able to observe performance state holistically and to couple it with application knowledge for self-adaptive, runtime global control
❑ Exascale systems might then be considered to be functioning as a self-organizing performance collective whose behavior is a natural consequence of seeking highly efficient operational forms optimizing exascale objectives
O1) A. Malony, D. Reed, “A Hardware-based Performance Monitor for the Intel iPSC/2 Multiprocessor,” ICS, pp. 213-226, 1990.
O2) A. Malony, J. Larson, D. Reed, “Tracing Application Program Execution on the Cray X-MP and Cray 2,” SC, pp. 60-73, 1990.
O3) A. Malony, “Performance Observability,” Ph.D. Thesis, University of Illinois, Urbana-Champaign, 1991. O4) A. Malony, “Event-based Performance Perturbation Analysis: A Case Study,” PPoPP, pp. 201-212, 1991. O5) A. Malony, D. Reed, “Models for Performance Perturbation Analysis,” ACM/ONR Workshop on Parallel
and Distributed Debugging, pp. 1-12, 1991. O6) A. Malony, D. Reed, H. Wijshoff, “Performance Measurement Intrusion and Perturbation Analysis,” IEEE
Transactions on Parallel and Distributed Systems, 3(4), pp. 433-450, 1992. O7) S. Sarukkai, A. Malony, “Perturbation Analysis of High Level Instrumentation for SPMD Programs,”
PPoPP, pp. 44-53, 1993. O8) A. Malony, B. Mohr, P. Beckman, D. Gannon, S. Yang, F. Bodin, “Performance Analysis of pC++: A
Portable Data-Parallel Programming System for Scalable Parallel Computers,” IPPS, pp. 75-84, 1994. O9) B. Mohr, D. Brown, A. Malony, “TAU: A Portable Parallel Program Analysis Environment of pC++,”
CONPAR, pp. 29-40, 1994. O10) M. Heath, A. Malony, D. Rover, “The Visual Display of Parallel Performance Data,” IEEE Computer,
11:21-28, 1995. O11) M. Heath, A. Malony, D. Rover, “Parallel Performance Visualization: From Practice to Theory,” IEEE
PDT, 3(4):44-60, 1995. O12) K. Shanmugam, A. Malony, “Performance Extrapolation of Parallel Programs,” ICPP, pp. II:117-120,
1995. O13) B. Mohr, A. Malony, K. Shanmugam, “Speedy: An Integrated Performance Extrapolation Tool for pC++
Programs,” TOOLS Workshop, pp. 254-268, 1995.
Observation Papers
D1) A. Malony, R. Helm, “A Theory and Architecture for Automating Performance Diagnosis,” FGCS, 18(1):189-200, 2001.
D2) J. de St. Germain, A. Morris, S. Parker, A. Malony, S. Shende, “Performance Analysis Integration in the Uintah Software Development Cycle,” IJPP, 31(1):35-53, 2003.
D3) A. Malony, S. Shende, “Overhead Compensation in Performance Profiling,” EuroPar, pp. 119-132, 2004. D4) A. Malony, S. Shende, “Models for On-the-Fly Compensation of Measurement Overhead in Parallel
Performance Profiling,” EuroPar, pp. 72-82, 2005. D5) A. Malony, S. Shende, N. Trebon, J. Ray, R. Armstrong, C. Rasmussen, M. Sottile, “Performance
Technology for Parallel and Distributed Component Software,” CCPE, 17(2-4):117-141, 2005. D6) K. Huck, A. Malony, A. Morris, “TAUg: Runtime Global Performance Data Access using MPI,”
EuroPVM-MPI, pp. 313-321, 2006. D7) A. Nataraj, A. Malony, S. Shende, A. Morris, “Kernel-level Measurement for Integrated Parallel
Performance Views, the KTAU Project,” Cluster, pp. 1-12, 2006. D8) L. Li, A. Malony, “Model-based Performance Diagnosis of Master-Worker Parallel Computations,” pp.
35-46, EuroPar, 2006. D9) L. Li, A. Malony, “Model-based Relative Performance Diagnosis of Wavefront Parallel Computations,”
HPCC, pp. 200-209, 2006. D10) S. Shende, A. Malony, “The TAU Parallel Performance System,” IJHPCA, 20(2):287-311, 2006. D11) A. Nataraj, M. Sottile, A. Morris, A. Malony, S. Shende, “TAU over Supermon: Low-overhead Parallel
Application Performance Monitoring,” EuroPar, pp. 85-96, 2007. D12) A. Nataraj, A. Morris, A. Malony, M. Sottile, P. Beckman, “The Ghost in the Machine: Observing the
Effects of Kernel Operation on Parallel Application Performance,” SC, pp. 1-12, 2007. D13) L. Li, A. Malony, “Knowledge Engineering for Automatic Parallel Performance Diagnosis,” CCPE,
19(11):1497-1515, 2007. D14) A. Malony, S. Shende, A. Morris, F. Wolf, “Compensation of Measurement Overhead in Parallel
Performance Profiling,” IJHPCA, 21(2):174-194, 2007.
Diagnosis Papers
D15) K. Huck, A. Malony, S. Shende, A. Morris, “Knowledge Support and Automation for Performance Analysis with PerfExplorer 2.0,” JSP, 16(2-3):123-134, 2008.
D16) A. Morris, W. Spear, A. Malony, S. Shende, “Observing Performance Dynamics using Parallel Profile Snapshots,” EuroPar, pp. 162-171, 2008.
D17) K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris, “Capturing Performance Knowledge for Automated Analysis,” SC, pp. 1-10, 2008.
D18) A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller, “In Search of Sweet-spots in Parallel Performance Monitoring,” Cluster, pp. 69-78, 2008.
D19) A. Nataraj, A. Malony, S. Shende, A. Morris, “Integrated Parallel Performance Views,” IEEE Cluster, 11(1):57-73, 2008.
D20) A. Nataraj, A. Malony, A. Morris, D. Arnold, B. Miller, “A Framework for Scalable Parallel Performance Monitoring,” CCPE, 22(6):720-735, 2009.
D21) S. Biersdorff, C.W. Lee, A. Malony, L. Kale, “Integrated Performance Views in Charm++: Projections Meets TAU,” ICPP, pp. 140-147, 2009.
D22) A. Morris, A. Malony, S. Shende, K. Huck, “Design and Implementation of a Hybrid Parallel Performance Measurement System,” ICPP, pp. 492-501, 2010.
D23) C.W. Lee, A. Malony, A. Morris, “TAUmon: Scalable Online Performance Data Analysis in TAU,” PROPER Workshop, pp. 493-499, 2010.
D24) W. Spear, A. Malony, C.W. Lee, S. Biersdorff, S. Shende, “An Approach to Creating Performance Visualizations in a Parallel Profile Analysis Tool,” PROPER Workshop, pp. 156-165, 2011.
D25) A. Knüpfer, et al., “Score-P – A Joint Performance Measurement Runtime Infrastructure for Periscope, TAU, and Vampir,” TOOLS Workshop, pp. 79-91, 2011.
Diagnosis Papers
C1) G. Fursin, et al., “Collective Mind: Towards Practical and Collaborative Autotuning,” Scientific Programming, 22(4):309-329, 2014.
C2) N. Chaimov, S. Biersdorff, A. Malony, “Tools for Machine-Learning-based Empirical Autotuning and Specialization,” IJHPCS, 27(4):403-411, 2013.
C3) K. Huck, A. Malony, S. Shende, D. Jacobsen, “Integrated Measurement for Cross-Platform OpenMP Performance Analysis,” IWOMP, pp. 146-160, 2014.
C4) N. Chaimov, B. Norris, A. Malony, “Toward Multi-Target Autotuning for Accelerators,” ICPADS, pp. 534-541, 2014.
C5) N. Chaimov, B. Norris, A. Malony, “Integration and Synthesis for Automated Performance Tuning: the SYNAPT Project,” IWAPT, 2014.
C6) A. Malony, et al., “Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,” ICPP, pp. 176-185, 2011.
C7) A. Sarje, S. Song, D. Jacobsen, K. Huck, J. Hollingsworth, A. Malony, S. Williams and L. Oliker, “Parallel Performance Optimizations on Unstructured Mesh-Based Simulations,” ICCS MSESM Workshop, 2015.
Complexity Era Papers