© 2002 IBM Corporation
June 2005
News from ACTC
Present and Future of the IBM High Performance Computing Toolkit
Simone Sbaraglia
Advanced Computing Technology
IBM T.J. Watson Research Center
Advanced Computing Technology Center
The IBM High Performance Computing Toolkit © 2005 IBM Corporation
• IBM goal:
  • A common application performance analysis environment across all IBM HPC servers
  • Common framework for performance analysis of communication, memory, CPU, shared memory, and I/O
  • Operate on the binary, yet provide reports in terms of source-level symbols
  • Full source code traceback capability
  • Dynamically activate/deactivate data collection and change what information to collect
• Where we are:
  • One consolidated package (AIX)
  • Tools for MPI, OMP, processor, memory, etc.
  • One common visualization GUI
IBM HPC Toolkit Software Components
• Hardware (CPU) Performance: Xprofiler, HPM Toolkit
  - hpmcount
  - libhpm
  - hpmstat
• Shared Memory Performance: Dpomp, PompProf
• Optimized Linear Algebra: WSMP
• Message-Passing Performance: MP_Profiler, MP_Tracer
• Memory Performance: Sigma
• Performance Visualization: PeekPerf
• I/O Performance: MIO (modular I/O)
AGENDA
• Xprofiler: call-graph profiling
• HPM: hardware counter data
• MP-Profiler/Tracer: MPI profiling
• PompProf: OpenMP profiling
• SIGMA: memory profiling
• The New HPC Toolkit
• Questions/Comments
Xprofiler
• Compile with -g -pg
• Width of a bar: time including called routines
• Height of a bar: time excluding called routines
• Call arrows labeled with number of calls
• Overview window for easy navigation (View Overview)
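The difference between the two bar dimensions can be sketched in a few lines of Python. This is not Xprofiler code: the call tree, routine names, and timings below are hypothetical, and the sketch assumes a simple tree rather than a general call graph.

```python
# Illustrative only: a hypothetical call tree, not Xprofiler output.
# Each node holds the time spent in the routine's own code (exclusive time)
# and the names of the routines it calls.
call_tree = {
    "main":   {"self": 1.0,  "calls": ["solve", "output"]},
    "solve":  {"self": 6.0,  "calls": ["kernel"]},
    "kernel": {"self": 12.0, "calls": []},
    "output": {"self": 1.0,  "calls": []},
}


def inclusive_time(name):
    """Time in a routine including everything it calls (the bar width)."""
    node = call_tree[name]
    return node["self"] + sum(inclusive_time(callee) for callee in node["calls"])


def exclusive_time(name):
    """Time in a routine excluding called routines (the bar height)."""
    return call_tree[name]["self"]


for routine in call_tree:
    print(f"{routine:8s} width={inclusive_time(routine):5.1f} "
          f"height={exclusive_time(routine):5.1f}")
```

In this made-up tree, a wide but short bar such as main marks a routine that mostly delegates, while a bar that is both wide and tall such as kernel marks a routine that does the work itself.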
Xprofiler: Zoom In
Xprofiler:
• Xprofiler provides the usual gprof reports plus some extras:
  - Flat Profile
  - Call Graph Profile
  - Function Index
  - Function Call Summary
  - Library Statistics
  - Source Code with Line Profile (in ticks)
  - Disassembled Code
HPM: What Are Performance Counters
• Extra logic inserted in the processor to count specific events
• Updated at every cycle
• Strengths:
  - Non-intrusive
  - Accurate
  - Low overhead
• Weaknesses:
  - Specific to each processor
  - Access is not well documented
  - Lack of standards and documentation on what is counted
HPM: Hardware Counters Examples
• Cycles
• Instructions
• Floating point instructions
• Integer instructions
• Load/stores
• Cache misses
• TLB misses
• Branch taken / not taken
• Branch mispredictions
• Useful derived metrics
  - IPC: instructions per cycle
  - Floating point rate (Mflip/s)
  - Computation intensity
  - Instructions per load/store
  - Load/stores per cache miss
  - Cache hit rate
  - Loads per load miss
  - Stores per store miss
  - Loads per TLB miss
  - Branches mispredicted %
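As a rough illustration of how derived metrics follow from raw counts, the Python sketch below computes a few of them. The counter values and elapsed time are hypothetical, not measured data, and the definition used for computation intensity (floating point instructions per load/store) is an assumption rather than something stated on the slide.

```python
# Hypothetical raw counter values for one run; not measured data.
counters = {
    "cycles": 2_000_000_000,
    "instructions": 3_000_000_000,
    "fp_instructions": 600_000_000,
    "loads": 800_000_000,
    "stores": 200_000_000,
    "load_misses": 40_000_000,
    "branches": 300_000_000,
    "branch_mispredictions": 6_000_000,
}
wall_time_sec = 1.5  # assumed elapsed time of the run


def derived_metrics(c, seconds):
    """Turn raw event counts into ratios that are easier to interpret."""
    load_stores = c["loads"] + c["stores"]
    return {
        "ipc": c["instructions"] / c["cycles"],
        "mflips": c["fp_instructions"] / seconds / 1e6,
        # Assumed definition: floating point instructions per load/store.
        "computation_intensity": c["fp_instructions"] / load_stores,
        "instructions_per_load_store": c["instructions"] / load_stores,
        "loads_per_load_miss": c["loads"] / c["load_misses"],
        "branches_mispredicted_pct":
            100.0 * c["branch_mispredictions"] / c["branches"],
    }


metrics = derived_metrics(counters, wall_time_sec)
for name, value in metrics.items():
    print(f"{name:28s} {value:10.3f}")
```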
HPM Toolkit
• hpmcount
  - Starts a program and, at the end of execution, provides a summary with hardware counter information and derived metrics
  - Simple to use, no source code modification
• libhpm
  - Instrumentation library for performance measurement of Fortran, C, and C++ applications
• hpmstat
  - Hardware monitoring at system level
HPMCOUNT Output - Group 5
PM_DATA_FROM_L3 (Data loaded from L3)                   : 64164220
PM_DATA_FROM_MEM (Data loaded from memory)              : 5623627
PM_DATA_FROM_L35 (Data loaded from L3.5)                : 281896
PM_DATA_FROM_L2 (Data loaded from L2)                   : 947542929
PM_DATA_FROM_L25_SHR (Data loaded from L2.5 shared)     : 739
PM_DATA_FROM_L275_SHR (Data loaded from L2.75 shared)   : 567
PM_DATA_FROM_L275_MOD (Data loaded from L2.75 modified) : 903
PM_DATA_FROM_L25_MOD (Data loaded from L2.5 modified)   : 120

Memory traffic                           : 2879.297 MBytes
Memory bandwidth                         : 96.317 MBytes/sec
Estimated latency from loads from memory : 1.730 sec
Total loads from L3                      : 64.446 M
L3 traffic                               : 8249.103 MBytes
L3 bandwidth                             : 275.945 MBytes/sec
Estimated latency from loads from L3     : 5.067 sec
L3 Load miss rate                        : 8.026 %
Total loads from L2                      : 947.545 M
L2 traffic                               : 121285.793 MBytes
L2 bandwidth                             : 4057.200 MBytes/sec
Estimated latency from loads from L2     : 8.747 sec
L2 Load miss rate                        : 6.886 %
Estimation based on user input
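The derived lines in this output can be reproduced from the raw Group 5 counts. The Python sketch below is a reconstruction, not hpmcount's code: the grouping of counters into L2 and L3 totals and the transfer sizes (128 bytes per load from L2 or L3, 512 bytes per load from memory) are assumptions inferred from the printed totals, not values stated in the source.

```python
# Counter values copied from the Group 5 output above.
data_from = {
    "L2": 947542929, "L25_SHR": 739, "L275_SHR": 567,
    "L275_MOD": 903, "L25_MOD": 120,
    "L3": 64164220, "L35": 281896,
    "MEM": 5623627,
}

# Assumptions inferred from the printed totals, not stated in the source.
L2_L3_BYTES_PER_LOAD = 128
MEM_BYTES_PER_LOAD = 512

# Assumed grouping: all L2.x counters count as loads from L2,
# L3 and L3.5 count as loads from L3.
loads_l2 = sum(data_from[k] for k in
               ("L2", "L25_SHR", "L275_SHR", "L275_MOD", "L25_MOD"))
loads_l3 = data_from["L3"] + data_from["L35"]
loads_mem = data_from["MEM"]

# A load served by a lower level is a miss in every level above it.
l3_miss_rate_pct = 100.0 * loads_mem / (loads_l3 + loads_mem)
l2_miss_rate_pct = (100.0 * (loads_l3 + loads_mem)
                    / (loads_l2 + loads_l3 + loads_mem))

mem_traffic_mb = loads_mem * MEM_BYTES_PER_LOAD / 1e6
l3_traffic_mb = loads_l3 * L2_L3_BYTES_PER_LOAD / 1e6
l2_traffic_mb = loads_l2 * L2_L3_BYTES_PER_LOAD / 1e6

print(f"Total loads from L3 : {loads_l3 / 1e6:.3f} M")
print(f"Total loads from L2 : {loads_l2 / 1e6:.3f} M")
print(f"L3 Load miss rate   : {l3_miss_rate_pct:.3f} %")
print(f"L2 Load miss rate   : {l2_miss_rate_pct:.3f} %")
print(f"Memory traffic      : {mem_traffic_mb:.3f} MBytes")
print(f"L3 traffic          : {l3_traffic_mb:.3f} MBytes")
print(f"L2 traffic          : {l2_traffic_mb:.3f} MBytes")
```

Under these assumptions the computed totals, miss rates, and traffic figures agree with the printed output to three decimal places. The latency estimates are not reproduced because they depend on user-supplied input that the slide does not show.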
LIBHPM
Lets you instrument different sections of the source code independently
• Declaration:
    #include "f_hpm.h"
• Use:
    ! initialize the library for this task
    call f_hpminit( 0, "prog" )
    ! outer section, id 1
    call f_hpmstart( 1, "work" )
    do
      call do_work()
      ! nested section, id 22
      call f_hpmstart( 22, "more work" )
      call compute_meaning_of_life()
      call f_hpmstop( 22 )
    end do
    call f_hpmstop( 1 )
    ! finalize and write the results
    call f_hpmterminate( 0 )
• Supports MPI, OpenMP, Pthreads
• Multiple instrumentation points
• Nested sections
• Supports Fortran, C, C++
PeekPerf: Unified GUI with Source Code Traceback
[Diagram: HPM, MP_Profiler/MP_Tracer, PompProf, SiGMA, and MIO (work in progress) feed into PeekPerf]
Available on AIX, Linux, Windows, Mac OS, BG/L, …
HPM Visualization Using PeekPerf
Message-Passing Performance: MP_Profiler/MP_Tracer
– Implements wrappers around MPI calls using the PMPI interface:
    start timer
    call pmpi equivalent function
    stop timer
– Captures both "summary" and trace data for MPI calls, with source code traceback
– No changes to source code, but MUST compile with -g
– ~1.7 microsecond overhead per MPI call
– Does not synchronize MPI calls
– Compile with -g and link with libmpitrace.a
– Generates XML files for PeekPerf
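The start timer / call / stop timer pattern can be sketched in Python as a decorator. This is only an analogy for the wrapper idea: it does not use MPI or the PMPI interface, and fake_send is a hypothetical stand-in for a communication call.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function summary: number of calls and accumulated wall time.
summary = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def profiled(func):
    """Wrap func in the same shape as a profiling wrapper:
    start timer, call the real function, stop timer."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if the wrapped function raises.
            entry = summary[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@profiled
def fake_send(payload):
    """Hypothetical stand-in for a communication call; not an MPI binding."""
    return len(payload)


for _ in range(3):
    fake_send(b"hello")

for name, entry in summary.items():
    print(f"{name}: {entry['calls']} calls, {entry['seconds']:.6f} s")
```

The caller's code is unchanged; only the name it calls is bound to the wrapper, which is the property the PMPI approach relies on when the wrappers are supplied at link time.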
MP_Profiler Visualization Using PeekPerf
MP_Tracer Visualization Using PeekPerf
OMP Profiler (PompProf)
• POMP proposal motivated by the MPI profiling interface (PMPI)
• PompProf is a profiler for OpenMP applications implemented on top of DPCL, using the POMP specification
• Generates a detailed profile describing overheads and time spent by each thread in three key regions of the parallel application:
  - Parallel regions
  - OpenMP loops inside a parallel region
  - User defined functions
• Profile data is presented in the form of an XML file that can be visualized with PeekPerf
PompProfiler and PeekPerf
SIGMA Infrastructure
Infrastructure for symbolic, dynamic, performance-oriented binary instrumentation
– Inject arbitrary user-supplied probes into a binary application
Provides a platform to:
– Develop performance tools
– Experiment with memory performance models
– Ask "what-if" questions regarding data and code rearrangements
– Provide feedback on design of new memory architectures
– Identify performance bottlenecks due to the memory hierarchy and data layout
SIGMA Memory Profiler is integrated in the HPC Toolkit
[Diagram: SIGMA instrumentation workflow. Labels: application binary; script of desired events, handlers and machine configuration; psigmaInst; Sigma software (catalogue of events, library of event handlers); instrumented binary; instrumented program execution; events; standardized event description interface; Sigma event handler; user event handler; memory simulation; user activate/deactivate during execution]
Memory Profile:
– Execute functional cache simulation and provide memory profile
– Power3/Power4 architectures prefetching implemented
– Write-Back/Write-Through caches, replacement policies, etc.
Provide counters such as hits, misses, cold misses for
– each cache level
– each function
– each data structure
– each data structure within each function
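To make the idea of a functional cache simulation concrete, here is a minimal Python sketch of one set-associative cache level with LRU replacement that counts hits, misses, and cold misses. It is not the SIGMA simulator: the cache geometry and the address trace are hypothetical, and prefetching, write policies, and multiple levels are left out.

```python
from collections import OrderedDict


class LRUCache:
    """Functional model of a set-associative cache with LRU replacement."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.seen_lines = set()  # lines referenced at least once
        self.hits = 0
        self.misses = 0
        self.cold_misses = 0

    def access(self, address):
        """Simulate one load and return True on a hit."""
        line = address // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if line not in self.seen_lines:  # first touch ever: cold miss
            self.cold_misses += 1
            self.seen_lines.add(line)
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used line
        cache_set[line] = True
        return False


# Hypothetical geometry: 2 sets x 2 ways x 64-byte lines.
cache = LRUCache(num_sets=2, ways=2, line_bytes=64)

# Hypothetical address trace.
trace = [0, 8, 128, 256, 0, 64, 8]
for addr in trace:
    cache.access(addr)

print(f"hits={cache.hits} misses={cache.misses} cold={cache.cold_misses}")
```

Per-function or per-data-structure counters would follow by keeping one set of counters per label and passing the label along with each address.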
Output sorted by the SIGMA memtime:
  memtime = SUM( LoadHits(i)*LoadLat(i) + StoreHits(i)*StoreLat(i) ) + #TLBmisses * Lat(TLBmiss)
memtime should track wall time for memory-bound applications
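The memtime formula translates directly into code. In the Python sketch below, the index i is assumed to range over the levels of the memory hierarchy, and all hit counts and latencies are hypothetical values chosen for illustration, not figures from the slides.

```python
def memtime(levels, tlb_misses, tlb_miss_latency):
    """SUM over levels of (load hits * load latency + store hits * store latency)
    plus TLB misses * TLB miss latency. All latencies are in seconds."""
    hit_time = sum(
        level["load_hits"] * level["load_lat"]
        + level["store_hits"] * level["store_lat"]
        for level in levels.values()
    )
    return hit_time + tlb_misses * tlb_miss_latency


# Hypothetical profiles for two functions; none of these numbers are measured.
profiles = {
    "func_a": {
        "levels": {
            "L1":  {"load_hits": 500_000, "load_lat": 1e-9,
                    "store_hits": 200_000, "store_lat": 1e-9},
            "L2":  {"load_hits": 2_000, "load_lat": 1e-8,
                    "store_hits": 500, "store_lat": 1e-8},
            "MEM": {"load_hits": 300, "load_lat": 1e-7,
                    "store_hits": 100, "store_lat": 1e-7},
        },
        "tlb_misses": 100,
    },
    "func_b": {
        "levels": {
            "L1":  {"load_hits": 100_000, "load_lat": 1e-9,
                    "store_hits": 50_000, "store_lat": 1e-9},
            "L2":  {"load_hits": 1_000, "load_lat": 1e-8,
                    "store_hits": 0, "store_lat": 1e-8},
            "MEM": {"load_hits": 0, "load_lat": 1e-7,
                    "store_hits": 0, "store_lat": 1e-7},
        },
        "tlb_misses": 10,
    },
}
TLB_MISS_LATENCY = 5e-8  # assumed, in seconds

times = {
    name: memtime(p["levels"], p["tlb_misses"], TLB_MISS_LATENCY)
    for name, p in profiles.items()
}

# Report functions in decreasing memtime order, as the profile output does.
ranking = sorted(times, key=times.get, reverse=True)
for name in ranking:
    print(f"{name}: memtime = {times[name]:.6f} sec")
```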
Memory Profile Output

FUNCTION: calc1 (memtime = 0.0050)
                      L1              L2            L3           TLB            MEM
Load Acc/Miss/Cold    522819/2252/1   2252/345/0    345/0/0      784419/126/0   0/-/-
Load Acc/Miss Ratio   232             6             -            6225           -
Load Hit Ratio        99.57%          84.68%        100.00%      99.98%         -
Est. Load Latency     0.0008 sec      0.0000 sec    0.0000 sec   0.0001 sec     0.0000 sec
Load Traffic          -               238.38 Kb     43.12 Kb     -              0.00 Kb
………

FUNCTION: calc2 (memtime = 0.0042)
                      L1              L2            L3           TLB            MEM
Load Acc/Miss/Cold    622230/2631/0   2631/1661/0   1661/0/0     814269/94/0    0/-/-
Load Acc/Miss Ratio   236             1             -            8662           -
Load Hit Ratio        99.58%          36.87%        100.00%      99.99%         -
Est. Load Latency     0.0010 sec      0.0000 sec    0.0001 sec   0.0001 sec     0.0000 sec
Load Traffic          -               121.25 Kb     207.62 Kb    -              0.00 Kb
………

DATA: u (memtime = 0.0012)
                      L1              L2            L3           TLB            MEM
Load Acc/Miss/Cold    167710/708/0    708/317/0     317/0/0      216097/31/0    0/-/-
Load Acc/Miss Ratio   236             2             -            6970           -
Load Hit Ratio        99.58%          55.23%        100.00%      99.99%         -
Est. Load Latency     0.0003 sec      0.0000 sec    0.0000 sec   0.0000 sec     0.0000 sec
Load Traffic          -               48.88 Kb      39.62 Kb     -              0.00 Kb
……….

DATA: v (memtime = 0.0012)
                      L1              L2            L3           TLB            MEM
Load Acc/Miss/Cold    167710/721/0    721/316/0     316/0/0      216097/31/0    0/-/-
Load Acc/Miss Ratio   232             2             -            6970           -
Load Hit Ratio        99.57%          56.17%        100.00%      99.99%         -
Est. Load Latency     0.0003 sec      0.0000 sec    0.0000 sec   0.0000 sec     0.0000 sec
Load Traffic          -               50.62 Kb      39.50 Kb     -              0.00 Kb
……….
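The ratio rows in this output follow from the access and miss counts. The Python sketch below recomputes them for calc1; that the access/miss ratio is truncated to an integer is an assumption inferred from the printed values, not something the source states.

```python
# (accesses, misses) for loads, copied from the calc1 profile above.
calc1_loads = {
    "L1":  (522819, 2252),
    "L2":  (2252, 345),
    "TLB": (784419, 126),
}


def acc_miss_ratio(accesses, misses):
    """Accesses per miss, truncated to an integer (assumed from the output)."""
    return accesses // misses


def hit_ratio_pct(accesses, misses):
    """Share of accesses that hit, in percent."""
    return 100.0 * (accesses - misses) / accesses


for level, (acc, miss) in calc1_loads.items():
    print(f"{level:4s} acc/miss = {acc_miss_ratio(acc, miss):5d}   "
          f"hit ratio = {hit_ratio_pct(acc, miss):6.2f}%")
```

Note also that in the profile each level's access count equals the miss count of the level above it (L1 misses 2252 = L2 accesses, L2 misses 345 = L3 accesses), which is what a functional simulation of a cache hierarchy would produce.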
Memory Profile Viewer – Data Structure Focus
The New HPC Toolkit (4Q2005)
New HPC Toolkit
• Common framework for CPU, MPI, OpenMP, Memory, I/O
• Instrumentation at the binary level
• Centralized GUI for instrumentation and analysis
• Dynamic instrumentation capabilities
• Enhanced with graphics capabilities (bar charts, plots, etc.)
• Simultaneous instrumentation for all tools (one run!)
• Query capabilities: compute derived metrics and plot them
• Selective instrumentation of MPI functions
IBM HPC Toolkit Availability
• IBM pSeries Servers
  - Linux on Power
  - AIX on Power
• All IBM Blue Gene systems
  - Included as part of the system software stack
  - Customers do not need to acquire additional licensing
• IBM Cell Server
  - Work in Progress
Support Matrix
Tool                           AIX Power                Linux Power               Linux JS20                Linux BG/L
HPMCount & HPMlib              today, AIX 5L 5.1, 5.3   Aug/05, Linux 2.4 & 2.6   Aug/05, Linux 2.4 & 2.6   Aug/05
MP-profiler & MP-tracer        today, AIX 4.3.3+        May/05, Linux 2.6         May/05, Linux 2.6         today
Xprofiler                      today, AIX 5L 5.1        Aug-Sep/05, Linux 2.6     Aug-Sep/05, Linux 2.6     today
SHMEM & SHMEM-profiler         today, AIX 5L 5.1        N/A                       N/A                       N/A
MIO                            today, AIX 5L 5.1        TBT, Linux 2.6            TBT, Linux 2.6            TBT
PompProfiler                   today, AIX 5L 5.1        N/A                       N/A                       N/A
SiGMA                          today, AIX 4.3.3+        Aug-Sep/05, Linux 2.6     Aug-Sep/05, Linux 2.6     N/A
PeekPerf                       today, AIX 4.3.3+        TBT                       TBT                       today
Watson Sparse Matrix Package   today, AIX 5L 5.1        TBT, Linux 2.6            TBT, Linux 2.6            N/A
Summary
• The IBM HPC Toolkit is the IBM environment for HPC application performance analysis
  • Three years of field testing and experience
  • Visual source code traceback for all metrics
  • Common framework for simultaneous performance analysis of CPU, MPI, OpenMP, Memory, and I/O
  • Available across IBM HPC servers (pSeries, Linux, Blue Gene)
• Future infrastructure characteristics
  • Instrumentation of the binary
  • Common instrumentation and analysis GUI
  • Dynamic instrumentation capabilities
Questions / Comments