
Profiling Tools on the NERSC Crays and IBM/SP

NERSC User Services


Outline

• Profiling Tools on NERSC platforms
  – Cray PVP (killeen, seymour)
  – Cray T3E (mcurie)
  – IBM/SP (gseaborg)
• UNIX profiling/performance analysis tools
• References


Why Profile?

• Characterise the application:
  – Is the code CPU bound?
  – Is the code I/O bound?
  – Is the code memory bound?
  – Analyse communication patterns (distributed-memory codes)
• Focus the optimisation effort ... and ultimately ...
• Improve performance and resource utilisation


Cray PVP/T3E - Application Characterization

• Job accounting (ja):
  – ja
  – ./a.out
  – ja -st -n a.out     (see the next slide for sample output; a command sketch follows below)
• Look out for:
  – Maximum Memory Used > available memory
  – Total I/O wait time (locked + unlocked) > 50% of User CPU time
  – Multitasking breakdown for parallel codes
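As a sketch, the complete job-accounting session from the bullets above looks like this at the shell prompt (a.out stands for your application):

  ja                   # start job accounting for this session
  ./a.out              # run the application under accounting
  ja -st -n a.out      # print the summary report for a.out (as on the next slide)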


Job accounting: summary report

  Elapsed Time             :  8 Seconds
  User CPU Time            : 35.5939 Seconds

  Multitasking/Multistreaming Breakdown
  (Concurrent CPUs * Connect seconds = CPU seconds)
       1 * 0.0100 =  0.0100
       2 * 0.0100 =  0.0200
       3 * 0.0600 =  0.1800
       4 * 8.8500 = 35.4000
   (Avg.)  (total)   (total)
    3.99 * 8.9300 = 35.6100

  System CPU Time          : 0.1226 Seconds
  I/O Wait Time (Locked)   : 0.0000
  I/O Wait Time (Unlocked) : 0.0000
  CPU Time Memory Integral : 5.3854 Mword-seconds
  Data Transferred         : 0.0001 MWords
  Maximum memory used      : 0.4746 MWords


HPM - Hardware Performance Monitor

• Helps locate CPU-related code bottlenecks
• Reports use of vector registers, instruction buffers, memory ports
• hpm {options} ./a.out {prog_arguments}
  – options = -g2 -> memory access information
  – options = -g3 -> vector register information
• Look for:
  – The ratio of floating ops/CPU second to CPU memory references per second should reflect the floating-point operations in the code
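A sketch of the hpm invocations described above (hpm -g0 produces the summary shown on the next slide):

  hpm -g0 ./a.out      # summary: instructions, floating-point ops, memory references
  hpm -g2 ./a.out      # memory access information
  hpm -g3 ./a.out      # vector register information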


Sample hpm output (hpm -g0 ./a.out):

  Million inst/sec (MIPS) :  7.67     Instructions      :  274017290
  Avg. clock periods/inst : 26.06
  % CP holding issue      : 94.02     CP holding issue  : 6714667737
  Inst.buffer fetches/sec :  0.04M    Inst.buf. fetches :    1420802
  Floating adds/sec       : 15.40M    F.P. adds         :  550002417
  Floating multiplies/sec : 24.36M    F.P. multiplies   :  870004996
  Floating reciprocal/sec :  0.28M    F.P. reciprocals  :   10000042
  Cache hits/sec          :  0.00M    Cache hits        :      45893
  CPU mem. references/sec : 34.64M    CPU references    : 1236978495
  Floating ops/CPU second : 40.5M


Cray PVP - CPU Bound Codes: prof/profview

• Instruments code to report the % of CPU time spent in function calls:
  – f90 -lprof prog.f90
  – ./a.out                         -> generates prof.data
  – prof -st ./a.out > prof.report
• Chart (over) indicates the relative distribution of CPU execution time by function call (full sequence sketched below):
  – prof -x a.out > pgm.prof
  – profview pgm.prof
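Putting the two bullet groups together, a sketch of a full prof/profview session (report file names are illustrative):

  f90 -lprof prog.f90              # compile and link with the profiling library
  ./a.out                          # run; generates prof.data
  prof -st ./a.out > prof.report   # text profile report
  prof -x a.out > pgm.prof         # profile data for the viewer
  profview pgm.prof                # graphical browsing (sample output over)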


Profview - Sample Output


I/O and Memory Bound Codes : procstat/procview

• procstat -m -i -R a.raw a.out
• procview a.raw
  – I/O analysis:
    • Reports -> Files -> All User Files (Long Report)
    • Bytes Processed or I/O Wait Time
  – Memory analysis:
    • Reports -> Processes -> Maximum Memory Used (Long Format)
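A minimal sketch of the sequence (a.raw is the raw statistics file named on the procstat command line):

  procstat -m -i -R a.raw a.out    # collect statistics into a.raw (options as above)
  procview a.raw                   # browse the I/O and memory reports listed above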


I/O Bound Codes: procview

• procview indicates which files consume the most real time for I/O processing

Memory Bound Codes: procview

• "High" (> 10% of elapsed time) time to complete memory requests may indicate a memory-bound code
• Use the Graphs option to produce a plot of memory use over the elapsed time of the application


ATExpert - Autotasking Prediction

• Analysis of source code to predict autotasking performance on a dedicated Cray PVP
• f90 -eX -O3 -r4 -o {prog_name} prog.f90
  – ./a.out
  – atexpert          -> shows the predicted speed-up
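A sketch of the whole ATExpert cycle (here the executable is named prog rather than a.out; flags as given above):

  f90 -eX -O3 -r4 -o prog prog.f90   # compile for atexpert (flags as given above)
  ./prog                             # run; collects autotasking statistics
  atexpert                           # display the predicted speed-up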


ATExpert Sample Output

Indicates a predicted speed-up of 4.3 on a dedicated 8-processor PVP when the source code is autotasked


Also available on Cray PVP

• flowtrace/flowview
  – times subroutines and functions during program execution (using operating system timers)
• jumptrace/jumpview
  – provides exact timings of functions/subroutines by analysis of the machine instructions in the program
• perftrace/perfview
  – times subroutines/functions based on statistics gathered by the HPM tool


Cray T3E - Apprentice

• Locates performance problems/inefficiencies
• MPI and shared memory performance, load balance and communication, memory use
• Provides hardware performance information and tuning recommendations (Displays -> Observations)
• Compile/link:
  – f90 -o {prog} -eA {prog_name.f90} -lapp
  – cc -o {prog} -happrentice {prog_name.c} -lapp
• Run the code to generate app.rif (sketch below)
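A sketch for a Fortran code (the 8-PE mpprun launch is illustrative; any normal run generates app.rif):

  f90 -o prog -eA prog.f90 -lapp   # compile/link with Apprentice instrumentation
  mpprun -n 8 ./prog               # run the code; writes app.rif
  apprentice app.rif               # browse results, including Observations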


Output from: apprentice app.rif


Cray T3E - PAT

• Generates a profile of CPU time in functions; load balance across PEs; hardware counter information
• Compile and link with the PAT library:
  – f90 -o exe -lpat {source.f} pat.cld
• Run the program as normal:
  – mpprun -n {procs} {exe}     -> generates exe.pif
• pat {exe} exe.pif (full sequence sketched below)
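A sketch of the PAT sequence for a 16-PE run (file names are illustrative):

  f90 -o exe -lpat source.f pat.cld   # compile and link with the PAT library
  mpprun -n 16 exe                    # run as normal; generates exe.pif
  pat exe exe.pif                     # view profile, load balance, h/w counters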


Profile based on relative CPU time in function calls

Load Balance Histogram for routine “COLL”


Cray T3E - ACTS/TAU

• Performance analysis of distributed/shared memory applications (C++ in particular)
• module load tau
• Instrument programs with TAU macros
• Add $(TAU_DEFS), $(TAULIBS) to the compile/link
• Run the application; view the tracefile with pprof or VAMPIR (see the sketch below)
• References:
  – http://acts.nersc.gov/tau
  – http://hpcf.nersc.gov/training/classes/Teleconf/1999july/Wu
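A sketch of the TAU build-and-run cycle (the Makefile edit and the 8-PE launch are illustrative; TAU_DEFS and TAULIBS come from the tau module):

  module load tau
  # add $(TAU_DEFS) to the compile flags and $(TAULIBS) to the link in the Makefile
  make
  mpprun -n 8 ./prog      # run; TAU writes its profile/trace files
  pprof                   # text view of the profile (or load the trace into VAMPIR)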


Cray T3E - Vampir

• Analysis of message passing characteristics - generates a display of MPI activity over the instrumented time period (e.g. sender, receiver, message size, elapsed time)
• module load VAMPIR; module load vampirtrace
• Facility to instrument with VAMPIRtrace calls
• Generate the trace file using TAU or VAMPIRtrace
• Reference:
  – http://hpcf.nersc.gov/software/tools/vampir.html


IBM/SP - Xprofiler

• Graphical interface for gprof profiles of parallel applications
  – Compile and link the code with "-g -pg"
  – poe ./a.out -procs {n}
    • generates a gmon.out.{n} file for each process
    • may introduce significant (up to a factor of 2) overhead
  – (In $TMPDIR) xprofiler ./a.out gmon.out.* (full session sketched below)
• The Report menu provides a (gprof) text profile
• Source statement profiling is shown (see over)
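A sketch of the Xprofiler session for a 4-task run (the mpxlf90 compiler invocation and task count are illustrative):

  cd $TMPDIR                         # work in the scratch directory
  mpxlf90 -g -pg -o a.out prog.f90   # build with profiling support
  poe ./a.out -procs 4               # run; writes gmon.out.0 .. gmon.out.3
  xprofiler ./a.out gmon.out.*       # browse the combined profile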


A statement-level profile is available by clicking on the relevant function in the graphical output - use the Show Source Code option


IBM/SP - Visualization Tool (VT)

• Message passing trace visualization
• Realtime system activity monitor (limited)
• MPI load balance overview (sketched below):
  – poe ./a.out -procs {n} -tlevel=3
  – copy a.out.trc to $TMPDIR
  – (in $TMPDIR) invoke vt
  – in trace visualization mode, "Play" a.out.trc to view interprocessor communication during program execution
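A sketch of the VT trace session for a 4-task run (task count is illustrative):

  poe ./a.out -procs 4 -tlevel=3   # run with tracing; writes a.out.trc
  cp a.out.trc $TMPDIR             # move the trace to scratch space
  cd $TMPDIR
  vt                               # then "Play" a.out.trc in trace visualization mode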


IBM/SP: system_stats

• IBM internal tool
• module load sptools
• Instrument the code with a system_stats() call
• Link with $(SPTOOLS); run the code as normal
• Sample output - summary of the utilization of system resources:

  node hostname  wall(s) user(s) sys(s) size(KB) pswitches
    0  gs01015    16.80   13.18   0.04    2748     2138
    1  gs01015    16.80   16.07   0.04    2744     1868
    2  gs01003    16.80   16.62   0.04    2740     1870
    3  gs01003    16.80   16.56   0.03    2732     1841


IBM/SP - trace-mpi

• IBM internal tool - quantitative information on MPI calls (session sketched below):
  – module load USG ; module load trace-mpi
  – Fortran: add $(TRACE_MPIF) to the build
  – C: add $(TRACE_MPI) to the build
  – poe ./a.out -procs {n} - generates an mpi.trace_file for each process (the executable must call MPI_Finalize)
  – summary mpi.trace_file.{n} (see over)
• Useful check for load balance:
  – grep "Total Communication" mpi.trace_file.*


MPI message-passing summary for mpi.trace_file.3

  MPI Function      #calls   Avg Bytes   Time (sec)
  -------------------------------------------------
  MPI_Allreduce:      9355         8.0        3.596
  MPI_Barrier:           3         0.0        0.017
  MPI_Bcast:            66         5.8        0.013
  MPI_Scatter:          31      1008.0        0.088
  MPI_Comm_rank:         1         0.0        0.000
  MPI_Comm_size:         1         0.0        0.000
  MPI_Isend:         43023      2003.7        0.893
  MPI_Recv:          43023      2003.7        7.481
  MPI_Wait:          43023      2003.7        3.739

  Total Communication Information: WALL = 15.8277, CPU = 15.53, MBYTES = 258.72
  The total amount of wall time = 26.229613


Upcoming on the SP

• ACTS/TAU (C/C++)
  – currently being ported to the IBM/SP
• VAMPIR
  – has been ordered, awaiting delivery
• Performance Monitor Toolkit (HPM)
  – should be available with the Phase II system (requires AIX 4.3.4)
• Also, see the Performance API (PAPI) project:
  – http://icl.cs.utk.edu/projects/papi


General/UNIX Profiling Tools

• Command line profilers and system analysis:
  – prof/gprof (enabled for MPI on the IBM/SP)
  – csh time command: time ./a.out
  – vmstat -> look for high paging over an extended time period (the application may require more memory)
• Fortran/C function timers (a usage sketch follows):
  – getrusage
  – rtc, irtc
  – etime, dtime, mclock
  – MPI_Wtime
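As a minimal Fortran sketch, wall-clock timing of a code section with MPI_Wtime (the other timers listed above are used the same way around a section of interest):

        program timedemo
        implicit none
        include 'mpif.h'
        integer ierr
        double precision t0, t1
        call MPI_Init(ierr)
        t0 = MPI_Wtime()
  *     ... code section to be timed ...
        t1 = MPI_Wtime()
        print *, 'elapsed wall-clock time (s):', t1 - t0
        call MPI_Finalize(ierr)
        end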


Reference Material

• NERSC web pages
  – http://hpcf.nersc.gov/software/tools
• Cray PVP/Cray T3E
  – http://www.cray.com/swpubs
  – Optimizing Code on Cray PVP Systems
  – Cray T3E C and Fortran Optimization Guides
• IBM/SP
  – LLNL Workshop on Performance Tools
