TRANSCRIPT
IPDPS 2005, slide 1
Automatic Construction and Evaluation of “Performance Skeletons”
(Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi
Microsoft
Jaspal Subhlok, University of Houston
IPDPS 2005
IPDPS 2005, slide 2
What is a Performance Skeleton anyway ?
A short running program that mimics execution behavior of a given application
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time – say 1:1000. Then…
Sounds vaguely interesting but… Who cares ? How to do it ? Is it even possible to build one ?
If the Application runtime is
10K seconds on a dedicated compute cluster
15K seconds on a shared compute cluster
20K seconds on a shared heterogeneous grid
1 million seconds under simulation
1K seconds on a supercomputer
…..,
Skeleton runs in
10 secs
15 secs
20 secs
1000 secs
1 second
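The intended usage can be sketched in a few lines: once a skeleton has been built for a scaling factor K, measuring its runtime in a new environment immediately yields an application-time estimate. A minimal sketch (function and environment names are illustrative, not from the tool):

```python
# Predicting application runtime from a measured skeleton runtime.
# K is the scaling factor chosen when the skeleton was built (e.g. 1000).

def predict_app_time(skeleton_time_s: float, k: int) -> float:
    """Estimated application runtime = skeleton runtime * K."""
    return skeleton_time_s * k

K = 1000
# Skeleton runtimes measured in different environments (seconds).
for env, t in [("dedicated cluster", 10), ("shared cluster", 15),
               ("shared grid", 20), ("simulation", 1000), ("supercomputer", 1)]:
    print(f"{env}: predicted application time = {predict_app_time(t, K):.0f} s")
```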
IPDPS 2005, slide 3
Who Cares ? Anyone who needs a performance estimate when it cannot be modeled well
[Figure: a distributed application (components Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) to be mapped onto a network – which nodes offer the best performance?]
• Performance testing of a future architecture under simulation: Large applications cannot be tested as simulation is 1000X slower
• Applications Distributed on Networks: Resource selection, Mapping, Adapting
IPDPS 2005, slide 4
Mapping Distributed Applications on Networks: “state of the art”
[Figure: distributed application components (Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) mapped onto network nodes]
Mapping for Best Performance
1. Measure and model network and application characteristics (NWS is popular)
2. Find “best” match of nodes for execution
But the approach has significant limitations…
• Knowing network status is not the same as knowing how an application will perform
• Frequent measurements are expensive, less frequent measurements mean stale data
IPDPS 2005, slide 5
[Figure: distributed application components (Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) and the network – which mapping?]
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes
Mapping Distributed Applications on Networks: “our approach”
IPDPS 2005, slide 6
How to Construct a Performance Skeleton ?
Central challenge in this research
Common sense dictates that an application and its skeleton must be similar in:
– Computation behavior
– Communication behavior
– Memory behavior
– I/O Behavior
All execution behavior is to be captured in a short program
[Figure: application → skeleton – but how?]
IPDPS 2005, slide 7
How to Construct a Performance Skeleton ?
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton
Execution trace: a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
Execution signature: a compressed, summarized record of the execution.
Performance skeleton: a program constructed from the execution signature.
IPDPS 2005, slide 8
Limitations of the Work Presented Today
Only the coarse computation and communication patterns of the application are modeled to build the performance skeleton:
– memory and I/O behavior are ignored
– specific instructions are ignored – we only consider whether the CPU is computing, communicating, or idle
– somewhat intrusive – the application must be linked with a profiling library
– limited to MPI programs
But these are not limitations of the approach.
Most are being addressed in the project.
IPDPS 2005, slide 9
Constructing a Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 10
Recording the Execution Trace
• Link the MPI application with a PMPI-based profiling library
– no source code modification / analysis required
• Execute on a dedicated testbed
• Record all MPI function calls
– call name, start time, stop time, parameters
– timing done at microsecond granularity
• CPU busy = time between consecutive MPI calls
Result: a (long) execution sequence of computation and communication events with their durations/parameters
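The rule "CPU busy = time between consecutive MPI calls" can be sketched as follows; this is a simplified Python stand-in for the PMPI-based C library, and the record format is an assumption for illustration:

```python
# Turn a timestamped log of MPI calls into an alternating sequence of
# compute and communication events. Times are in seconds; each MPI
# record is (call_name, start, stop). Gaps between calls count as CPU-busy.

def build_trace(mpi_calls, run_start, run_end):
    events, cursor = [], run_start
    for name, start, stop in mpi_calls:
        if start > cursor:                       # gap before this MPI call
            events.append(("compute", start - cursor))
        events.append((name, stop - start))      # the communication event
        cursor = stop
    if run_end > cursor:                         # trailing computation
        events.append(("compute", run_end - cursor))
    return events

calls = [("MPI_Send", 0.10, 0.12), ("MPI_Recv", 0.30, 0.35)]
print(build_trace(calls, 0.0, 0.50))
# Events alternate: compute, MPI_Send, compute, MPI_Recv, compute.
```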
IPDPS 2005, slide 11
Constructing a Simple Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 12
Compress Execution Trace → Execution Signature
Application execution typically follows cyclic patterns
• Goal: form a loop structure by identifying repeating execution behavior
Step 1: Execution trace to symbol strings
• Identify "similar" (may not be identical) execution events
• Each event in such a cluster of similar events is replaced by a representative and assigned a symbol
• The execution trace is replaced by a symbol string, e.g. ABAB…, where, say, A = compute for ~100 ms and B = MPI call to send ~800 bytes to a neighbor node
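Step 1 can be sketched as below. The similarity test used here (same event type, durations within a relative tolerance) is one plausible reading of "similar, may not be identical"; the function name and tolerance value are illustrative:

```python
# Map each trace event to a symbol; events of the same type whose
# durations fall within a relative tolerance share a symbol.

def to_symbols(events, tol=0.2):
    reps, out = [], []          # reps: (symbol, type, representative duration)
    for kind, dur in events:
        for sym, k, d in reps:
            if k == kind and abs(dur - d) <= tol * d:
                out.append(sym)            # similar to an existing cluster
                break
        else:                              # no similar cluster: new symbol
            sym = chr(ord("A") + len(reps))
            reps.append((sym, kind, dur))
            out.append(sym)
    return "".join(out)

trace = [("compute", 100), ("MPI_Send", 5), ("compute", 105), ("MPI_Send", 5)]
print(to_symbols(trace))   # -> ABAB
```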
IPDPS 2005, slide 13
Compress Execution Trace → Execution Signature
Step 2: Compress the string by identifying cycles
– Build a loop structure recursively from symbol strings, e.g. ABABAB becomes [AB]3, and loops may nest, as in [A[BC]2]2
– Similar to the longest repeated-substring matching problem
A typical execution signature is multiple orders of magnitude smaller than the trace
Step 3: Adaptively increase the degree of compression (by adjusting a "similarity parameter") until the signature is compact enough
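Step 2 can be sketched with a small recursive routine that greedily folds the longest, leftmost repeat into a loop; this is a simplification for illustration, not the tool's actual algorithm:

```python
def compress(s):
    """Greedily fold the longest, leftmost repeat into a loop: ABABAB -> [AB]3."""
    n = len(s)
    for size in range(n // 2, 0, -1):          # try longer loop bodies first
        for i in range(0, n - 2 * size + 1):
            body, count = s[i:i + size], 1
            while s[i + count * size: i + (count + 1) * size] == body:
                count += 1
            if count > 1:                      # fold, then compress the rest
                return (compress(s[:i]) + f"[{compress(body)}]{count}"
                        + compress(s[i + count * size:]))
    return s

print(compress("ABABAB"))        # -> [AB]3
print(compress("AABCBCBCAA"))
```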
IPDPS 2005, slide 14
Constructing a Simple Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 15
Generate the Performance Skeleton Program
Goal: the execution time of the performance skeleton is 1/K of the application execution time (K given by the user)
• Reduce the iterations of each loop in the application signature by a factor of K
• Heuristically process remaining iterations and events outside loops
• Replace symbols by C language statements
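A minimal sketch of this code-generation step, assuming a signature given as (symbol, repeat-count) pairs: loop counts are divided by K and each symbol becomes a C statement. The emitted calls (a spin loop for compute symbols, an MPI_Send for communication symbols) are illustrative templates, not the tool's actual output:

```python
# Emit C statements from a signature, scaled down by K.
# SYMBOLS maps each symbol to its event kind and parameter.
SYMBOLS = {"A": ("compute", 100.0),       # compute for ~100 ms
           "B": ("send", 800)}            # send ~800 bytes to a neighbor

def emit_skeleton(signature, k):
    lines = []
    for sym, count in signature:
        iters = max(1, count // k)        # keep at least one iteration
        kind, param = SYMBOLS[sym]
        if kind == "compute":
            body = f"spin_for_ms({param});"
        else:
            body = f"MPI_Send(buf, {param}, MPI_BYTE, peer, 0, MPI_COMM_WORLD);"
        lines.append(f"for (i = 0; i < {iters}; i++) {body}")
    return "\n".join(lines)

print(emit_skeleton([("A", 5000), ("B", 5000)], k=1000))
```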
IPDPS 2005, slide 16
Experimental Validation
Skeletons constructed for the Class B NAS MPI benchmarks, executed on 4 cluster nodes in the following sharing scenarios:
• Dedicated nodes (defines the reference execution-time ratio between skeleton and application)
• Competing processes on: one node / all nodes
• Competing traffic on: one link / all links
• Competition as above on one node and one link
Skeleton execution time is used to predict application execution time in the different scenarios
Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux 2.4.7; gigabit crossbar switch; simple CPU-intensive competing processes; iproute used to simulate link sharing
IPDPS 2005, slide 17
Prediction Accuracy of Skeletons (average across all sharing scenarios)
[Chart: prediction error (%), 0–20% scale, for 10, 5, 2, 1, and 0.5 second skeletons on BT, CG, IS, LU, MG, SP, and their average]
Average prediction error is ~6%, max ~18% – acceptable
Longer skeletons are better, but even 0.5 second skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small)
IPDPS 2005, slide 18
Prediction for Different Sharing Scenarios (10 second skeletons)
[Chart: prediction error (%), 0–25% scale, for BT, CG, IS, LU, MG, SP, and the average, under each sharing scenario:]
Competing process on one node
Competing process on all nodes
Competing traffic on one link
Competing traffic on all links
Competing process and traffic on one node and link
Error is higher with network contention
• communication is harder to scale down and affects synchronization more directly
IPDPS 2005, slide 19
Comparison with Simple Prediction Methods
[Chart: MIN / average / MAX prediction error (%), 0–120% scale, for 10, 5, 2, 1, and 0.5 second skeletons, and for Class S and Average prediction]
Average prediction: the average slowdown over the entire benchmark suite is used to predict the execution time of each program.
Class S prediction: Class S benchmark programs (~1 s) are used as skeletons for the Class B (30–900 s) benchmarks.
Even the smallest skeletons are far superior!
IPDPS 2005, slide 20
Conclusions
• Promising approach to performance estimation for
– unpredictable environments (grids)
– architectures that do not yet exist (under simulation)
– …
• It is work in progress – a lot more remains, such as:
– accurately reproducing memory behavior (some results in the LCR 2004 workshop)
– integrating memory with communication/computation modeling
– validation on larger grid environments
– accurate reproduction of CPU behavior (such as instruction types)
– skeletons that scale to different numbers of nodes
IPDPS 2005, slide 21
End of Talk! Or is It ?
Questions ?
FOR MORE INFORMATION:
www.cs.uh.edu/~jaspal [email protected]
Thanks to NSF and DOE!
IPDPS 2005, slide 22
Discovered Communication Structure of NAS Benchmarks
[Diagrams: communication topology discovered among the four nodes (0–3) for each benchmark: BT, CG, IS, EP, LU, MG, SP]
IPDPS 2005, slide 23
CPU Behavior of NAS Benchmarks
[Chart: percentage of execution time (0–100%) spent in computation, communication, and idle for CG, IS, MG, SP, LU, BT, EP]