Post on 19-Dec-2015
![Page 1: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/1.jpg)
A Survey about Performance Counters, Libraries and Tools
Joseph Bryant Manzano Franco
![Page 2: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/2.jpg)
Agenda
• Introduction
  • W3H: The Why, The What, The When, and The How
• Hardware Performance Libraries
  • Performance Application Programming Interface (PAPI)
  • Performance Counters Library (PCL)
• Visualization Tools
  • TAU: An example of a data collector
  • KOJAK: Semi-automatic instrumentation tool
  • VAMPIR: An example of a script language
  • PE: The all-levels approach
![Page 3: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/3.jpg)
Introduction
Program Optimization
• Algorithm Optimization: search for the most effective algorithms and data structures
• Other ubiquitous optimizations: consider common architecture features such as cache structures
• Architecture Optimizations: apply architecture-specific characteristics (PIM instructions, atomic loads and stores, massive memory allocations, etc.)
Data Collection and Data Analysis
• Identify and solve unexpected problems in the interaction between hardware and software (memory and network bottlenecks, false sharing, poor cache management, etc.)
![Page 4: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/4.jpg)
Introduction: The Why
Data Collection
• High-Level Library Functions: easy to use and available in almost all libraries, but restricted and intrusive. Composed of timing functions and clever data manipulation.
• Performance Counters: easy to use (especially with high-level wrappers). Provide a range of measurements and are less intrusive.
• Simulation Environments: complete control over the environment, including hardware, memory hierarchies and application code. Development is long for new architectures, and the learning curve is steep.
Data Analysis
• Manual Analysis: simple, but limited in its use and prone to human error.
• Automatic Statistical Analysis: organizes the data in a suitable format, but the user still has to deal with numbers.
• Visualization Tools: graphical representation of the data or its properties. Trends are easy to identify even in large data sets.
![Page 5: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/5.jpg)
Introduction: The What
Performance Counters
• Special registers that are present in a specific architecture
• Designed to count architectural events
  • An event is defined as an action that the hardware takes
  • Events are predefined
  • Examples: cache misses / hits, TLB misses / hits, context switches, cache invalidations, total instructions, etc.
• Sun UltraSPARC: two 32-bit registers called PIC (Performance Instrumentation Counters). User control is restricted.
• Pentium Pro: two 40-bit registers called PerfCtr0/1. User control is available.
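Because these registers are only 32 or 40 bits wide, a long measurement can wrap around between two reads. The sketch below is an illustration only (the name `counter_delta` is hypothetical and not part of any library discussed here). It shows one common way to recover the number of events from two raw readings, assuming the counter wrapped at most once in between.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of events between two reads of a
 * free-running counter that is `width` bits wide (1..64).
 * Correct as long as the counter wrapped at most once between the reads.
 */
uint64_t counter_delta(uint64_t before, uint64_t after, unsigned width)
{
    assert(width >= 1 && width <= 64);
    /* Mask that keeps only the bits the hardware register really has. */
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    /* Unsigned subtraction is modular, so a wrap is handled for free. */
    return (after - before) & mask;
}
```

For a 32-bit counter, `counter_delta(0xFFFFFFF0, 0x10, 32)` yields 32 events even though the second reading is numerically smaller than the first.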
![Page 6: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/6.jpg)
Introduction: The When

| Date | Machine / Author | Method of reading / Document |
|---|---|---|
| 1966 | Don Widring | Initial Metering Design |
| ~1970 | GE 645 | Multics |
| 1979 | Honeywell 6180 | Yellow Submarine |
| 1983 | Cray X-MP | User Accessible Registers |
| Late 80s / early 90s | IBM 3090 mainframes, first-generation IBM RS/6000 | Restricted and Confidential |
| 1992 | First Alpha chip (DEC) | uprofile, kprofile or IPROBE |
| 1993 | Pentium | Not documented and embedded in the MSR |
![Page 7: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/7.jpg)
Introduction: The How
Example: UltraSPARC Architecture
• Two counters, 32 bits each
• Events being counted: number of instructions (pic0) and cache invalidations (pic1)
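[Figure: two CPUs, each with its own cache ($) and its own pic0 / pic1 counters, connected by a bus. Instructions shown in the animation: load 0,s1; load 1,s2; inc s2; load 0,s1; load 0,s1; load 1,s2; add s1, s2, s1; store 0,s1. The counter values step upward from 0 as the instructions execute.]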
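The counting behavior in this example can be sketched with a toy model. Everything below is an assumption made for illustration (the `Cpu` struct, the `step` function and the simplified invalidate-on-store rule are hypothetical and do not describe the real UltraSPARC coherence protocol). Every executed instruction bumps pic0 on the CPU that runs it, and a store bumps pic1 on the other CPU if that CPU holds the address in its cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ADDRS 2   /* the toy memory has addresses 0 and 1 */

typedef struct {
    uint32_t pic0;            /* instructions executed by this CPU       */
    uint32_t pic1;            /* cache invalidations seen by this CPU    */
    bool cached[NUM_ADDRS];   /* addresses currently held in its cache   */
} Cpu;

typedef enum { OP_LOAD, OP_STORE, OP_ALU } OpKind;

/* Execute one instruction on `self`; `other` shares the bus with it. */
void step(Cpu *self, Cpu *other, OpKind kind, int addr)
{
    assert(addr >= 0 && addr < NUM_ADDRS);
    self->pic0++;                          /* every instruction counts */
    if (kind == OP_LOAD) {
        self->cached[addr] = true;         /* line is brought into the cache */
    } else if (kind == OP_STORE) {
        self->cached[addr] = true;
        if (other->cached[addr]) {         /* the other copy becomes stale */
            other->cached[addr] = false;
            other->pic1++;                 /* counted as an invalidation */
        }
    }
    /* OP_ALU (inc, add) touches registers only; `addr` is ignored. */
}
```

Running `load 0` on one CPU and then `load 0; load 1; add; store 0` on the other leaves the first CPU with pic0 = 1 and pic1 = 1, and the second with pic0 = 4 and pic1 = 0.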
![Page 8: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/8.jpg)
![Page 9: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/9.jpg)
Hardware Performance Libraries
• Performance counters: a good idea, but accessible only to hardware experts.
• Solution: high-level wrappers
  • Usually written in C and Fortran
  • Easy to make thread safe and to integrate into existing code
• Examples:
  • Performance Application Programming Interface (PAPI)
  • Performance Counters Library (PCL)
![Page 10: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/10.jpg)
Performance Application Programming Interface
• A set of high-level wrapper functions that covers a wide range of architectures and events
• Available for Power3, Power4, UltraSPARC II and III, all flavors of Pentium, Itanium, AMD Athlon, etc.
• Well-documented, stable and reliable programming interface
• Goals of the PAPI project (excerpt obtained from the PAPI user guide):
  • To provide a solid foundation for cross-platform performance analysis tools
  • To present a set of standard definitions for performance metrics on all platforms
  • To provide a standardized API among users, vendors, and academics
  • To be easy to use, well documented, and freely available
• PAPI is an effort of the Innovative Computing Laboratory (ICL), which is part of the Department of Computer Science at the University of Tennessee
![Page 11: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/11.jpg)
PAPI
Block diagram (top to bottom)
• Portable layer: High Level API, Low Level API
• Machine-dependent layer: Substrate, Kernel Extensions, Operating System, Hardware Performance Counters

Overhead

| Platform | PAPI_read() in PAPI 3.0 (cycles/call) |
|---|---|
| Altix (Itanium 2, Madison chip) | 1357 |
| IBM Power 4 | 4034 |
| Itanium 2 (libpfm 2.0) | 1606 |
| Pentium 3 (perfctr 2.4.5) | 324 |
| Pentium 4 (perfctr 2.4.5) | 401 |
| SGI R12k | 3681 |
| UltraSPARC II | 2150 |
![Page 12: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/12.jpg)
PAPI: Terminology
• Native Events:
  • Events countable by a specific CPU
  • Machine dependent
  • Identified by a hexadecimal value and a mask provided by the PAPI libraries
• Preset Events:
  • Predefined events
  • Events (or groups of events) that are considered useful and relatively ubiquitous across architectures
  • A PAPI identifier is provided
• Event List:
  • An array of events (usually consisting of PAPI identifiers)
![Page 13: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/13.jpg)
PAPI: Terminology
• High Level API:
  • A group of functions
  • A single list of events
  • Access to native events is prohibited
  • Some flexibility and performance are traded for ease of use
• Low Level API:
  • Another group of functions
  • Multiple event list definitions and a native events interface
  • Only one event list can be running at any point in time
![Page 14: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/14.jpg)
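PAPI: Steps
1. Initialize the PAPI library
2. Start the counters
3. Operate on the counters
4. Stop the counters
5. De-allocate any resource that has been allocated

    #include <papi.h>
    #include <stdio.h>

    #define NUM_EVENTS 2

    int main(int argc, char **argv)
    {
        int Events[NUM_EVENTS] = { PAPI_TOT_INS, PAPI_TOT_CYC };
        long_long values[NUM_EVENTS], val2[NUM_EVENTS];
        int a = 0;
        int retval;

        /* 1. Initialize the PAPI library */
        retval = PAPI_library_init(PAPI_VER_CURRENT);
        if (retval != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI library initialization failed\n");
            return 1;
        }

        /* 2. Start the counters */
        if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK) {
            fprintf(stderr, "Could not start the counters\n");
            return 1;
        }

        /* 3. Operate on the counters; each read also resets them */
        PAPI_read_counters(values, NUM_EVENTS);  /* discard start-up counts */
        a++;                                     /* code being measured */
        PAPI_read_counters(values, NUM_EVENTS);  /* a++ plus one read call */
        PAPI_read_counters(val2, NUM_EVENTS);    /* one read call alone */

        printf("The value of a is: %i \n", a);
        printf("The Coarse Instructions are: %10lld\n", values[0]);
        printf("The Coarse Cycles are: %10lld\n", values[1]);
        printf("The Overhead Instructions are: %10lld\n", val2[0]);
        printf("The Overhead Cycles are: %10lld\n", val2[1]);
        printf("The Total Instructions are: %10lld\n", values[0] - val2[0]);
        printf("The Total Cycles are: %10lld\n", values[1] - val2[1]);

        /* 4. Stop the counters; this example allocates nothing else to free */
        PAPI_stop_counters(values, NUM_EVENTS);
        return 0;
    }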
![Page 15: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/15.jpg)
PAPI: Output

    The value of a is: 1
    The Coarse Instructions are: 179
    The Coarse Cycles are: 641
    The Overhead Instructions are: 175
    The Overhead Cycles are: 395
    The Total Instructions are: 4
    The Total Cycles are: 246

Assembly Output of a++

    ld [%fp-52],%l0
    add %l0,1,%l0
    st %l0,[%fp-52]
    add %fp,-32,%o0

The first access produces an (L2) cache miss.
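The "Total" lines in this output are the coarse reading minus the overhead of the read call itself. As a small worked sketch of that arithmetic (the function name `net_count` is hypothetical and not part of PAPI):

```c
#include <assert.h>

/*
 * Hypothetical helper: events attributable to the measured code.
 * `coarse` is the reading that covers the measured code plus one
 * counter read; `overhead` is the reading of one counter read alone.
 */
long long net_count(long long coarse, long long overhead)
{
    assert(coarse >= 0 && overhead >= 0);
    return coarse - overhead;
}
```

With the values printed above, `net_count(179, 175)` gives the 4 instructions seen in the assembly listing, and `net_count(641, 395)` gives 246 cycles.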
![Page 16: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/16.jpg)
PAPI: Extra Features
• Multithread safety and support
• Multiplexing where available
• Overflow control with thresholds
• Statistical profiling and related functions
• Error detection and control features
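Multiplexing lets more events be measured than there are physical counters by time-slicing the counters among the events. A common way to turn such a partial reading into an estimate is to scale it by the fraction of time the event was actually being counted. The sketch below assumes that simple linear scaling; it is not taken from the PAPI sources, and `scale_multiplexed` is a hypothetical name.

```c
#include <assert.h>

/*
 * Hypothetical helper: estimate the full-run count of a multiplexed event.
 * raw_count     : events seen while the event owned a hardware counter
 * time_counting : time during which the event was being counted
 * time_total    : total measured time (same unit as time_counting)
 * Assumes the event rate is roughly uniform over the run.
 */
double scale_multiplexed(long long raw_count,
                         double time_counting, double time_total)
{
    assert(time_counting >= 0.0 && time_counting <= time_total);
    if (time_counting == 0.0)
        return 0.0;                 /* never scheduled: nothing to scale */
    return (double)raw_count * (time_total / time_counting);
}
```

An event that was counted for a quarter of the run and saw 1000 occurrences is estimated at 4000 for the whole run. The estimate gets worse the less uniform the event rate is.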
![Page 17: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/17.jpg)
Performance Counters Library (PCL)
• Another example of a high-level performance counter library
• Events are classified (as in PAPI) as memory hierarchy events (caches, TLB, memory, etc.), instructions (instruction types, instructions completed, etc.), status of functional units, and rates and ratios
• It supports the Pentium architectures up to Pentium 4, the AMD Athlon / Duron, the IBM Power series up to Power 3-II, Alpha's 21164 and 21264, SGI's R10000 and R12000 and the UltraSPARC family of processors
• PCL is available for C, C++ and Java
• PCL is an effort of Forschungszentrum Juelich GmbH and the University of Applied Sciences Bonn-Rhein-Sieg in Germany, and it is currently in its second version
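Rates and ratios are values derived from two raw counts. The sketch below shows the arithmetic behind two typical ones; the function names are hypothetical and are not PCL or PAPI calls.

```c
#include <assert.h>

/* Hypothetical helper: numerator / denominator, or 0.0 if nothing was counted. */
double safe_ratio(long long numerator, long long denominator)
{
    assert(numerator >= 0 && denominator >= 0);
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}

/* Instructions completed per cycle (IPC). */
double instructions_per_cycle(long long instructions, long long cycles)
{
    return safe_ratio(instructions, cycles);
}

/* Fraction of cache accesses that missed. */
double miss_ratio(long long misses, long long accesses)
{
    return safe_ratio(misses, accesses);
}
```

For instance, 200 instructions in 100 cycles is an IPC of 2.0, and 25 misses out of 100 accesses is a miss ratio of 0.25.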
![Page 18: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/18.jpg)
PCL
• High Level API:
  • Similar to the PAPI High Level API, but the functions are different
  • Event lists can be created in this API
  • Access to predefined events only
  • Recommended
• Low Level API:
  • Allows direct access to the performance counters
  • Not recommended
• Handle:
  • A single datum (usually an integer) that is used to uniquely identify a set of resources
  • Used to provide a thread-specific link to the resources (the list of events)
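The handle idea can be sketched as an integer index into a table of event lists. This is only an illustration of the concept under assumed names (`EventList`, `handle_open`, `handle_lookup`, `handle_close` are hypothetical); it is not how PCL implements its descriptors, and the sketch itself is not thread safe.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 8
#define MAX_EVENTS  4

typedef struct {
    int in_use;
    int num_events;
    int events[MAX_EVENTS];   /* event identifiers owned by this handle */
} EventList;

static EventList table[MAX_HANDLES];

/* Reserve a slot for the given events; returns the handle, or -1 on failure. */
int handle_open(const int *events, int num_events)
{
    assert(events != NULL);
    if (num_events < 1 || num_events > MAX_EVENTS)
        return -1;
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (!table[h].in_use) {
            table[h].in_use = 1;
            table[h].num_events = num_events;
            for (int i = 0; i < num_events; i++)
                table[h].events[i] = events[i];
            return h;
        }
    }
    return -1;                /* table is full */
}

/* Translate a handle back into its resources; NULL if the handle is invalid. */
EventList *handle_lookup(int h)
{
    if (h < 0 || h >= MAX_HANDLES || !table[h].in_use)
        return NULL;
    return &table[h];
}

/* Release the resources behind a handle. */
void handle_close(int h)
{
    EventList *list = handle_lookup(h);
    if (list != NULL)
        list->in_use = 0;
}
```

Each thread would call `handle_open` once and pass the returned integer to every later call, so that its event list is never confused with another thread's.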
![Page 19: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/19.jpg)
PCL: Steps
1. Initialize the PCL library
2. Start the counters
3. Operate on the counters
4. Stop the counters
5. De-allocate any resource that has been allocated

    #include <pcl.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int counter_list[2], a = 0;
        int ncounter;
        unsigned int mode;
        PCL_CNT_TYPE i_result_list[2];
        PCL_FP_CNT_TYPE fp_result_list[2];
        PCL_DESCR_TYPE descr;

        /* 1. Initialize the PCL library and obtain a handle */
        PCLinit(&descr);

        /* Events to count, in user mode only */
        ncounter = 2;
        counter_list[0] = PCL_CYCLES;
        counter_list[1] = PCL_INSTR;
        mode = PCL_MODE_USER;

        /* 2. Start the counters */
        PCLstart(descr, counter_list, ncounter, mode);

        /* 3. Operate on the counters: run the code being measured */
        a++;

        /* 4. Stop the counters and fetch the results */
        PCLstop(descr, i_result_list, fp_result_list, ncounter);
        printf("%f instructions in %f cycles\n",
               (double)i_result_list[1], (double)i_result_list[0]);

        /* 5. Release the handle and its resources */
        PCLexit(descr);
        return 0;
    }
![Page 20: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/20.jpg)
PCL: Differences with PAPI
• Nested function calls are enabled
• Rates and ratios are events in PCL, whereas in the PAPI libraries they are obtained through function calls
• The Low Level API deals with native code as PAPI's Low Level API does, but its use is not recommended in PCL
![Page 21: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/21.jpg)
![Page 22: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/22.jpg)
Visualization Tools
• After gathering the information from the tools, how can it be presented to the user in the most efficient manner?
• Visualization tools provide a good way to present trends in data across extensive data sets
• Examples of visualization tools:
  • Tuning and Analysis Utilities (TAU)
  • Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks (KOJAK)
  • VAMPIR / VAMPIRTrace
  • Performance Evaluator (PE)
![Page 23: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/23.jpg)
Tuning and Analysis Utilities (TAU)
• Program and performance analysis tool framework for high-performance parallel and distributed computing
• A suite of tools for static and dynamic analysis of programs written in C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java
• Instrumentation by functions
• The concept of inclusive and exclusive measurements
  • With time
    • Exclusive time: the time spent in the function minus all the time spent in instrumented functions called by this function
    • Inclusive time: the total time of the function
  • With a performance counter
    • The same as with time, applied to the quantity that the performance counter measures
• Supported extensions in C and FORTRAN: MPI and OpenMP
• Hardware counter libraries supported: PAPI and PCL
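The inclusive / exclusive relationship can be written down directly. The sketch below is an illustration under assumed names (`ProfileNode` and `exclusive_value` are hypothetical and are not TAU data structures). It works the same way whether the measured quantity is time or a performance counter value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile entry for one instrumented function. */
typedef struct ProfileNode {
    const char *name;
    double inclusive;                     /* total measured in this function */
    const struct ProfileNode *children;   /* instrumented functions it calls */
    size_t num_children;
} ProfileNode;

/* Exclusive value = inclusive value minus the inclusive values of the callees. */
double exclusive_value(const ProfileNode *node)
{
    assert(node != NULL);
    double in_children = 0.0;
    for (size_t i = 0; i < node->num_children; i++)
        in_children += node->children[i].inclusive;
    return node->inclusive - in_children;
}
```

If `main` has an inclusive time of 100 and calls `f` (30) and `g` (50), its exclusive time is 20. A leaf function's exclusive value equals its inclusive value.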
![Page 24: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/24.jpg)
TAU Infrastructure
![Page 25: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/25.jpg)
KOJAK
• Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks
• A complete infrastructure dedicated to finding performance bottlenecks and application properties
• Consists of the following components:
  • OpenMP Pragma And Region Instrumentor (OPARI), which redirects OpenMP function calls and directives toward wrappers that contain instrumentation information (POMP), and PMPI
  • TAU (function instrumentation)
  • Event Processing, Investigating and Logging (EPILOG) runtime library (event-oriented trace creation utility)
![Page 26: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/26.jpg)
KOJAK
• Extensible Performance Tool (EXPERT): a trace file analyzer that searches the traces for low-performing sections and classifies them according to severity. It uses the Event Analysis and Recognition Library (EARL)
• CUBE (KOJAK's trace visualization tool)
• Trace transformations to different formats (to the VAMPIR trace format)
![Page 27: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/27.jpg)
KOJAK Infrastructure
![Page 28: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/28.jpg)
KOJAK Snapshots
![Page 29: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/29.jpg)
KOJAK Snapshots
![Page 30: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/30.jpg)
VAMPIR
• A configurable trace visualization tool
• Converts trace information into a variety of graphical views:
  • Process State Display
  • Statistics Display
  • Timeline Display
  • Communications Statistics
• Configured by using:
  • Pull-down menus
  • Configuration file
• The displays can be related to the source code
• Zoom in and zoom out
• Advanced features
• Defined trace format: VAMPIR-Trace (runtime library enhanced with trace creation calls)
![Page 31: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/31.jpg)
VAMPIR Infrastructure
[Figure: Source Code → Guide Compiler → Object Files → Linker (together with the Guide Libraries and the VAMPIRTrace Libraries) → Executable (with a Config File) → Trace File → VAMPIR]
![Page 32: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/32.jpg)
VAMPIR Snapshot
![Page 33: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/33.jpg)
Performance Evaluator
• Java-based tool
• All-level analysis of a program's behavior:
  • Application software level analysis
    • Data / algorithm analysis
  • Operating system level analysis
    • Thread context switching
    • Thread scheduling
  • Hardware level analysis
    • Memory hierarchy
• Uses PMAPI performance counters (IBM proprietary)
![Page 34: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/34.jpg)
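Performance Evaluator Infrastructure
[Figure: the K42 infrastructure, the AIX OS and other sources each provide three files (1: Trace Format File, 2: Map File, 3: Meta File). A Parser / Modifier sits between the PE Trace Format and the PE2 Trace Format, which feeds the PE2 Visualization Tool.]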
![Page 35: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/35.jpg)
Performance Evaluator: A Run
• Get hardware information from the infrastructures (the source has been instrumented, and the OS is also collecting information)
• Create:
  • Trace file(s): trace records of a program, with shorthand versions of events
  • Map file: static information about functions, threads and other structures
  • Meta file(s): properties of a trace, record type definitions and map type definitions
![Page 36: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/36.jpg)
Performance Evaluator: A Run
• Feed the files to the tool
• Visualize the information with graphs
• View the whole application behavior from beginning to end
• Complete GUI built on the Eclipse Workbench
• Designed to work with several multithreaded packages in C and Java
• OpenMP is not supported
![Page 37: A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649d3e5503460f94a173bd/html5/thumbnails/37.jpg)
Thanks so much for your time
Questions? Comments?