MPICL/ParaGraph Evaluation Report
Adam Leko, Hans Sherburne
UPC Group
HCS Research Laboratory, University of Florida

Color encoding key:
- Blue: Information
- Red: Negative note
- Green: Positive note
Basic Information
- Name: MPICL/ParaGraph
- Developer:
  - ParaGraph: University of Illinois, University of Tennessee
  - MPICL: ORNL
- Current versions: ParaGraph (no version number, but last available update 1999), MPICL 2.0
- Websites:
  - http://www.csar.uiuc.edu/software/paragraph/
  - http://www.csm.ornl.gov/picl/
- Contacts:
  - ParaGraph: Michael Heath ([email protected]), Jennifer Finger
  - MPICL: Patrick Worley ([email protected])
- Note: ParaGraph last updated 1999, MPICL last updated 2001 (both projects appear dead)
MPICL/ParaGraph Overview
MPICL:
- Trace file creation library
- Uses the MPI profiling interface
- Only records MPI commands
- Supports "custom" events via manual instrumentation
- Writes traces in the documented ASCII PICL format

ParaGraph:
- PICL trace visualization tool
- Very old tool (first written during 1989-1991)
- Offers a lot of visualizations

Analogy: MPICL corresponds to MPE, and ParaGraph to Jumpshot.
MPICL Overview
Installation is a nightmare:
- Requires knowledge of the F2C symbol naming convention (!)
- Had to edit and remove some code to work with a new version of MPICH
- Hardcoded values for certain field sizes had to be updated
- One statement in the Fortran environment setup caused instrumented programs to dump core on startup

Automatic instrumentation of MPI programs is offered via the profiling interface:
- Once installed, very easy to use
- Have to add 3 lines of code to enable creation of trace files: calls to tracefiles(), tracelevel(), and tracenode() (see ParaGraph documentation); a minor annoyance that could be done automatically
- Manual instrumentation routines are also available: calls to tracedata() and traceevent() (see ParaGraph documentation); a notion of program "phases" allows a crude form of source code correlation

Clock synchronization:
- Extra code ensures accurate clock synchronization and consistent ordering of events
- Helps prevent "tachyons" (messages shown as received before they are sent)
- Delays startup by several seconds (but is not mandatory)

After the trace file is collected, it must be sorted using tracesort.
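The tracesort step and the tachyon problem can be illustrated in miniature. This is a hedged sketch assuming a simplified record layout (timestamp, node, event kind, message tag) that stands in for the real PICL format:

```python
# Illustrative per-node trace records: (timestamp, node, kind, message tag).
# This is a simplified stand-in for MPICL's PICL record format.
events = [
    (0.50, 1, "recv", 7),
    (0.10, 0, "send", 7),
    (0.25, 0, "send", 9),
    (0.20, 1, "recv", 9),   # clock skew: appears received before it was sent
]

# tracesort-style pass: merge all records into global time order
events.sort(key=lambda e: e[0])

def find_tachyons(events):
    """Report message tags whose recv timestamp precedes the matching send."""
    send_time = {tag: t for t, _, kind, tag in events if kind == "send"}
    return [tag for t, _, kind, tag in events
            if kind == "recv" and t < send_time.get(tag, float("-inf"))]

print(find_tachyons(events))  # [9] -- tag 9 is a tachyon caused by clock skew
```

Sorting alone cannot repair skewed timestamps, which is why MPICL spends extra effort adjusting clocks at startup rather than just reordering records afterward.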
MPICL Overhead
- Instrumentation performed using the MPI profiling interface
- Used a 5 MB buffer for trace files
- On average, instrumentation is relatively intrusive, but within 20%
- Does not include overhead for synchronizing clocks
- Note: benchmarks marked with * have high variability in runtimes
[Chart: MPICL instrumentation overhead, measured as instrumented/uninstrumented runtime. All benchmarks fall between 0% and 20%: CAMEL*, NAS LU (8p, W), PP: Big message, PP: Diffuse procedure*, PP: Hot procedure*, PP: Intensive server, PP: Ping pong, PP: Random barrier, PP: Small messages*, PP: System time, PP: Wrong way*]
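The overhead metric in the chart is the ratio of instrumented to uninstrumented runtime, expressed as a percent slowdown. A minimal sketch of that arithmetic (the runtimes are illustrative, chosen to reproduce the ~18% CAMEL figure reported in the evaluation section):

```python
def overhead_pct(instrumented, uninstrumented):
    """Percent slowdown: (instrumented / uninstrumented - 1) * 100."""
    return (instrumented / uninstrumented - 1.0) * 100.0

# Illustrative runtimes in seconds, not measured values.
print(round(overhead_pct(11.8, 10.0), 1))  # 18.0
```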
ParaGraph Overview
Uses its own widget set:
- Probably necessary when it was first written in 1989
- Widgets look extremely crude by today's standards (a button is a square with text in the middle)
- Uses its own conventions; takes a bit of getting used to
- Once you adjust to the interface it becomes less of an issue, but at times the conventions become cumbersome (example: closing any child window shuts down the entire application)

ParaGraph philosophy:
- Provide as many different types of visualizations as possible
- 4 categories: utilization, communication, tasks, other
- Uses a tape-player abstraction for viewing trace data; similar to Paraver, and cumbersome when trying to maneuver to specific times
- All visualizations use a form of animation; trace data is drawn as fast as possible, which creates problems on modern machines
- A "slow motion" option is available, but doesn't work that well

Supports application-specific visualizations:
- Have to write custom code and link against it during ParaGraph compilation
ParaGraph Visualizations
Utilization visualizations:
- Display a rough estimate of processor utilization, broken down into 3 states:
  - Idle: the program is blocked waiting for a communication operation (or has stopped execution)
  - Overhead: the program is performing communication but is not blocked (time spent within the MPI library)
  - Busy: the program is executing anything other than communication
- "Busy" doesn't necessarily mean useful work is being done, since the tool assumes (not communication) := busy
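The three-state classification can be sketched as follows; the interval format and state names here are illustrative assumptions, not MPICL's actual trace representation:

```python
# Classify wall-clock time into ParaGraph's three utilization states.
def utilization(intervals):
    totals = {"idle": 0.0, "overhead": 0.0, "busy": 0.0}
    state_map = {"in_mpi_blocked": "idle",   # blocked in a communication call
                 "in_mpi": "overhead",       # inside MPI library, not blocked
                 "user": "busy"}             # everything else counts as busy
    for start, end, kind in intervals:
        totals[state_map[kind]] += end - start
    return totals

trace = [(0.0, 2.0, "user"), (2.0, 2.5, "in_mpi"), (2.5, 4.0, "in_mpi_blocked")]
print(utilization(trace))  # {'idle': 1.5, 'overhead': 0.5, 'busy': 2.0}
```

Note how the "user" bucket absorbs everything that is not communication, which is exactly why "busy" can overstate useful work.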
Communication visualizations:
- Display different aspects of communication: frequency, volume, overall pattern, etc.
- "Distance" is computed from the topology set in the options menu

Task visualizations:
- Display information about when processors start and stop tasks
- Require manually instrumented code to identify when processors start/stop tasks

Other visualizations:
- Miscellaneous things
- Can load/save a visualization window set (does not work)
Utilization Visualizations – Utilization Count
- Displays the number of processors in each state at a given moment in time
- Busy shown on bottom, overhead in middle, idle on top
Utilization Visualizations – Gantt Chart
Displays utilization state of each processor as a function of time
Utilization Visualizations – Kiviat Diagram
- Shows our friend, the Kiviat diagram
- Each spoke is a single processor
- Dark green shows the moving average, light green shows the current high watermark
- Timing parameters for each can be adjusted
- Metric shown can be "busy" or "busy + overhead"
Utilization Visualizations – Streak
- Shows "streak" of state, similar to winning/losing streaks of baseball teams
- Win = overhead or busy; loss = idle
- Not sure how useful this is
Utilization Visualizations – Utilization Summary
Shows percentage of time spent in each utilization state up to current time
Utilization Visualizations – Utilization Meter
Shows percentage of processors in each utilization state at current time
Utilization Visualizations – Concurrency Profile
- Shows histograms of the number of processors in a particular utilization state
- Ex: the diagram shows that only 1 processor was busy ~5% of the time, and all 8 processors were busy ~90% of the time
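A concurrency profile is just a normalized histogram of how many processors are in a given state at each sample. A sketch with illustrative sampled data:

```python
from collections import Counter

# Illustrative samples: number of processors busy at each point in time.
busy_counts = [8, 8, 8, 8, 8, 8, 8, 8, 8, 1]

def concurrency_profile(counts):
    """Fraction of samples during which exactly k processors were busy."""
    hist = Counter(counts)
    n = len(counts)
    return {k: hist[k] / n for k in sorted(hist)}

print(concurrency_profile(busy_counts))  # {1: 0.1, 8: 0.9}
```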
Communication Visualizations – Color Code
- The color code controls the colors used on most communication visualizations
- Color can indicate message size, message distance, or message tag
- Distance is computed from the topology set in the options menu
Communication Visualizations – Communication Traffic
- Shows overall traffic at a given time: bandwidth used, or number of messages in flight
- Can show a single node or an aggregate of all nodes
Communication Visualizations – Spacetime Diagram
- Shows a standard space-time diagram for communication: which messages were sent from node to node at which times
Communication Visualizations – Message Queues
- Shows data about message queue lengths: incoming/outgoing, number of bytes queued / number of messages queued
- Colors mean different things: the dark color shows the current moving average, the light color shows the high watermark
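The moving average and high watermark shown in this view can be sketched as follows (the window size and queue-length samples are illustrative):

```python
from collections import deque

def queue_stats(lengths, window=4):
    """Moving average (over `window` samples) and running high watermark
    of message-queue lengths.  Window size and data are illustrative."""
    recent = deque(maxlen=window)   # deque drops the oldest sample itself
    high = 0
    out = []
    for n in lengths:
        recent.append(n)
        high = max(high, n)
        out.append((sum(recent) / len(recent), high))
    return out

stats = queue_stats([0, 2, 4, 2, 1, 0])
print(stats[-1])  # (1.75, 4): moving average 1.75, high watermark 4
```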
Communication Visualizations – Communication Matrix
- Shows which processors sent data to which other processors
Communication Visualizations – Communication Meter
- Shows the percentage of communication used at the current time (message count or bandwidth)
- 100% = the maximum number of messages / maximum bandwidth used by the application at a specific time
Communication Visualizations – Animation
- Animates messages as they occur in the trace file
- Can overlay messages over a topology
- Available topologies: mesh, ring, hypercube, user-specified
- User-specified: can lay out each node as you want, and can store the layout to a file and load it later
Communication Visualizations – Node Data
- Shows detailed communication data
- Can display metrics broken down by node, message tag, message distance, or message length
- For a single node, or an aggregate of all nodes
Task Visualizations – Task Count
- Shows the number of processors executing a task at the current time
- At the end of the run, changes to show a summary of all tasks
Task Visualizations – Task Gantt
Shows Gantt chart of which task each processor was working on at a given time
Task Visualizations – Task Speed
- Similar to the Gantt chart, but displays the "speed" of each task
- Work done by a task must be recorded in the instrumentation call (not done for the example shown above)
Task Visualizations – Task Status
- Shows which tasks have started and finished at the current time
Task Visualizations – Task Summary
- Shows the percentage of time spent on each task
- Also shows any overlap between tasks
Task Visualizations – Task Surface
Shows time spent on each task by each processor
Useful for seeing load imbalance on a task-by-task basis
Task Visualizations – Task Work
- Displays work done by each processor
- Shows the rate and volume of work being done
- The example doesn't show anything because no work amounts were recorded in the trace being visualized
Other Visualizations – Clock, Coordinates
- Clock: shows the current time
- Coordinate information: shows coordinates when you click on any visualization
Other Visualizations – Critical Path
- Highlights the critical path (longest serial path) in the space-time diagram in red
- Depends on point-to-point communication (collectives can screw it up)
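Conceptually, the critical path is the longest-duration chain through a DAG of computation segments connected by messages. A sketch over an illustrative event graph (not ParaGraph's internal representation):

```python
from functools import lru_cache

# DAG of trace events; edges carry elapsed time.  Illustrative data only.
# graph[node] = [(successor, cost), ...]
graph = {
    "start": [("p0_compute", 4.0), ("p1_compute", 1.0)],
    "p0_compute": [("p1_recv", 0.5)],   # message from p0 to p1
    "p1_compute": [("p1_recv", 0.0)],
    "p1_recv": [("end", 2.0)],
}

@lru_cache(maxsize=None)
def longest(node):
    """Longest path length from `node` to a sink, plus the path itself."""
    succs = graph.get(node, [])
    if not succs:
        return 0.0, (node,)
    return max((cost + longest(nxt)[0], (node,) + longest(nxt)[1])
               for nxt, cost in succs)

length, path = longest("start")
print(length, path)  # 6.5 ('start', 'p0_compute', 'p1_recv', 'end')
```

This also shows why collectives cause trouble: they create many-to-many dependences that do not decompose cleanly into the point-to-point edges this computation relies on.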
Other Visualizations – Phase Portrait
- Shows the relationship between processor utilization and communication usage
Other Visualizations – Statistics
- Gives overall statistics for the run:
  - % busy, overhead, and idle time
  - Total count and bandwidth of messages
  - Max, min, and average of message size, distance, and transit time
- Shows a max of 16 processors at a time
Other Visualizations – Processor Status
- Shows processor status: which task each processor is executing, plus communication (sends & receives)
- Each processor is a square in the grid (8-processor example shown)
Other Visualizations – Trace Events
Shows text output of all trace file events
Bottleneck Identification Test Suite
Testing metric: what did the visualizations tell us (with no manual instrumentation)? Program correctness was not affected by instrumentation.

CAMEL: PASSED
- Space-time diagram & bandwidth utilization visualizations showed a large number of small messages at the beginning
- Utilization graphs showed low overhead and few idle states

LU: PASSED
- Space-time diagram showed a large number of small messages
- Kiviat diagram showed a low moving average of processor utilization
- Phase portrait showed a large correlation between communication and low processor utilization

Big messages: PASSED
- Utilization Gantt and space-time diagrams showed a large amount of overhead at the time of each send

Diffuse procedure: PASSED
- Utilization Gantt showed one processor busy & the rest idle
- Manual instrumentation would still be needed to determine that one routine takes too long
Bottleneck Identification Test Suite (2)
Hot procedure: FAILED
- Purely sequential code, so ParaGraph could not distinguish between idle and busy states

Intensive server: PASSED
- Utilization Gantt chart showed all processors except the first idle
- Space-time chart showed processor 0 being inundated with messages

Ping-pong: PASSED
- Space-time chart showed a large number of small messages dependent on each other

Random barrier: TOSS-UP
- Utilization count showed one processor busy throughout execution
- Utilization Gantt chart showed the busy processor randomly dispersed
- However, the "waiting for barrier" state is shown as idle, so it is difficult to track the problem down to the barrier without extra manual instrumentation
Bottleneck Identification Test Suite (3)
Small messages: PASSED
- Utilization Gantt chart showed lots of time spent in MPI code (overhead)
- Space-time diagram showed large numbers of small messages

System time: FAILED
- All processes show as busy; no distinction between user and system time
- With no communication, classification of processor states is not really done at all; everything just gets attributed to busy time

Wrong order: PASSED
- Space-time diagram showed messages being received in the reverse of the order they were sent
- But you have to pay close attention to how the diagram is drawn
How to Best Use ParaGraph/MPICL
Don't use MPICL:
- Better trace file formats and libraries are available now
- We probably should look over the clock synchronization code, but this probably isn't useful if high-resolution timers are available, especially on shared-memory machines

Don't use ParaGraph's code directly:
- But it has a lot of neat visualizations we could copy
- At most we should scan the code to see how a visualization is calculated

In summary: just take the best ideas & visualizations.
Evaluation (1)
Available metrics: 2/5
- Only records communication and task entrance/exit
- Approximates processor state by equating "not communicating" with busy

Cost: 5/5
- Free!

Documentation quality: 2/5
- ParaGraph has an excellent manual
- Very hard to find information on MPICL; the MPICL installation instructions are woefully inadequate

Extensibility: 2/5
- Can add custom visualizations, but must write code and recompile ParaGraph
- Open source, but uses an old X Windows API & its own widget set
- Dead project (no updates since 1999)

Filtering and aggregation: 1/5
- Not really performed
- A few visualizations can be restricted to a certain processor
- Can output summary statistics (Other visualizations -> Statistics)
Evaluation (2)
Hardware support: 5/5
- Cray X1, AlphaServer (Tru64), IBM SP (AIX), SGI Altix, 64-bit Linux clusters (Opteron & Itanium)
- Support for a large number of vendor-specific MPI libraries
- Would probably need a lot of effort to port to more modern architectures, though

Heterogeneity support: 0/5 (not supported)

Installation: 1.5/5
- ParaGraph is relatively easy to compile and install
- MPICL installation is extremely difficult, especially with modern versions of MPICH/LAM

Interoperability: 0/5
- Does not interoperate with other tools

Learning curve: 2.5/5
- The MPICL library is easy to use
- The ParaGraph interface is unintuitive and can get in the way
Evaluation (3)
Manual overhead: 1/5
- Can record all MPI calls by linking, but this requires adding trace control instructions to the source code
- Task visualizations depend on manual instrumentation

Measurement accuracy: 2/5
- CAMEL: ~18% overhead
- Instrumentation adds a bit of runtime overhead, especially when many messages are sent

Multiple executions: 0/5 (not supported)

Multiple analyses & views: 5/5
- Many, many ways of looking at trace data

Performance bottleneck identification: 4/5
- Bottleneck identification must be performed manually
- Many visualizations help with bottleneck detection, but no guidance is provided on which one to examine first
Evaluation (4)
Profiling/tracing support: 3/5
- Only tracing supported
- Profiling data can be shown in ParaGraph after processing the trace file

Response time: 2/5
- Nothing reported until after the program runs
- Also requires the (computationally expensive) trace sort before you can view the trace file
- Large trace files take a while to load (ParaGraph must pass over the entire trace before displaying anything)

Searching: 0/5 (not supported)

Software support: 3/5
- Can link against any library using the MPI profiling interface, but it will not be instrumented
- Only MPI and some (very old, obsolete) vendor-specific message-passing libraries are supported
Evaluation (5)
Source code correlation: 0/5
- Not supported
- Can be done indirectly via manual instrumentation of tasks, but it is still hard to figure out exactly where things occur in the source code

System stability: 3.5/5
- MPICL relatively stable after bugs were fixed during compilation
- ParaGraph stable as long as you don't try to do weird things (load the wrong file); not very robust with error handling
- ParaGraph's load/save window set doesn't work

Technical support: 0/5
- Dead project
- Project email addresses still seem valid, but not sure how much help we could get from the developers now