TRANSCRIPT
IPDPS 2005, slide 1
Automatic Construction and Evaluation of “Performance Skeletons”
(Predicting Performance in an Unpredictable World)
Sukhdeep Sodhi
Microsoft
Jaspal Subhlok, University of Houston
IPDPS 2005
IPDPS 2005, slide 2
What is a Performance Skeleton anyway ?
A short running program that mimics execution behavior of a given application
GOAL: the execution time of a performance skeleton is a fixed fraction of the application execution time – say 1:1000. Then…
Sounds vaguely interesting but… Who cares ? How to do it ? Is it even possible to build one ?
If the Application runtime is
10K seconds on a dedicated compute cluster
15K seconds on a shared compute cluster
20K seconds on a shared heterogeneous grid
1 million seconds under simulation
1K seconds on a supercomputer
…..,
Skeleton runs in
10 secs
15 secs
20 secs
1000 secs
1 second
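The intended usage can be sketched in a few lines: once a skeleton has been built for a scaling factor K, measuring its runtime in a new environment immediately yields an application-time estimate. A minimal sketch (function and environment names are illustrative, not from the tool):

```python
# Predicting application runtime from a measured skeleton runtime.
# K is the scaling factor chosen when the skeleton was built (e.g. 1000).

def predict_app_time(skeleton_time_s: float, k: int) -> float:
    """Estimated application runtime = skeleton runtime * K."""
    return skeleton_time_s * k

K = 1000
# Skeleton runtimes measured in different environments (seconds).
for env, t in [("dedicated cluster", 10), ("shared cluster", 15),
               ("shared grid", 20), ("simulation", 1000), ("supercomputer", 1)]:
    print(f"{env}: predicted application time = {predict_app_time(t, K):.0f} s")
```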
IPDPS 2005, slide 3
Who Cares ? Anyone who needs a performance estimate when it cannot be modeled well
[Figure: a distributed application (components Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) to be mapped onto a network – which nodes offer the best performance?]
• Performance testing of a future architecture under simulation: Large applications cannot be tested as simulation is 1000X slower
• Applications Distributed on Networks: Resource selection, Mapping, Adapting
IPDPS 2005, slide 4
Mapping Distributed Applications on Networks: “state of the art”
[Figure: distributed application components (Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) mapped onto network nodes]
Mapping for Best Performance
1. Measure and model network and application characteristics (NWS is popular)
2. Find “best” match of nodes for execution
But the approach has significant limitations…
• Knowing network status is not the same as knowing how an application will perform
• Frequent measurements are expensive, less frequent measurements mean stale data
IPDPS 2005, slide 5
[Figure: distributed application components (Data, Stream, Model, Pre, Vis, Sim 1, Sim 2) and the network – which mapping?]
Predict performance and select nodes by actual execution of performance skeletons on groups of nodes
Mapping Distributed Applications on Networks: “our approach”
IPDPS 2005, slide 6
How to Construct a Performance Skeleton ?
Central challenge in this research
Common sense dictates that an application and its skeleton must be similar in:
– Computation behavior
– Communication behavior
– Memory behavior
– I/O Behavior
All execution behavior is to be captured in a short program
[Figure: application → skeleton – but how?]
IPDPS 2005, slide 7
How to Construct a Performance Skeleton ?
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton
Execution trace: a record of all system activity during execution, such as memory accesses, communication messages, and CPU events.
Execution signature: a compressed, summarized record of the execution.
Performance skeleton: a program constructed from the execution signature.
IPDPS 2005, slide 8
Limitations of the Work Presented Today
Only the coarse computation and communication patterns of the application are modeled to build the performance skeleton:
– memory and I/O behavior are ignored
– specific instructions are ignored – we only consider whether the CPU is computing, communicating, or idle
– somewhat intrusive – the application must be linked with a profiling library
– limited to MPI programs
But these are not limitations of the approach.
Most are being addressed in the project.
IPDPS 2005, slide 9
Constructing a Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 10
Recording the Execution Trace
• Link the MPI application with a PMPI-based profiling library
– no source code modification / analysis required
• Execute on a dedicated testbed
• Record all MPI function calls
– call name, start time, stop time, parameters
– timing done at microsecond granularity
• CPU busy = time between consecutive MPI calls
Result: a (long) execution sequence of computation and communication events with their durations/parameters
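The rule "CPU busy = time between consecutive MPI calls" can be sketched as follows; this is a simplified Python stand-in for the PMPI-based C library, and the record format is an assumption for illustration:

```python
# Turn a timestamped log of MPI calls into an alternating sequence of
# compute and communication events. Times are in seconds; each MPI
# record is (call_name, start, stop). Gaps between calls count as CPU-busy.

def build_trace(mpi_calls, run_start, run_end):
    events, cursor = [], run_start
    for name, start, stop in mpi_calls:
        if start > cursor:                       # gap before this MPI call
            events.append(("compute", start - cursor))
        events.append((name, stop - start))      # the communication event
        cursor = stop
    if run_end > cursor:                         # trailing computation
        events.append(("compute", run_end - cursor))
    return events

calls = [("MPI_Send", 0.10, 0.12), ("MPI_Recv", 0.30, 0.35)]
print(build_trace(calls, 0.0, 0.50))
# Events alternate: compute, MPI_Send, compute, MPI_Recv, compute.
```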
IPDPS 2005, slide 11
Constructing a Simple Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 12
Compress Execution Trace → Execution Signature
Application execution typically follows cyclic patterns
• Goal: form a loop structure by identifying repeating execution behavior
Step 1: Execution trace to symbol strings
• Identify "similar" (may not be identical) execution events
• Each event in such a cluster of similar events is replaced by a representative and assigned a symbol
• The execution trace is replaced by a symbol string, e.g. ABAB…, where, say, A = compute for ~100 ms and B = MPI call to send ~800 bytes to a neighbor node
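Step 1 can be sketched as below. The similarity test used here (same event type, durations within a relative tolerance) is one plausible reading of "similar, may not be identical"; the function name and tolerance value are illustrative:

```python
# Map each trace event to a symbol; events of the same type whose
# durations fall within a relative tolerance share a symbol.

def to_symbols(events, tol=0.2):
    reps, out = [], []          # reps: (symbol, type, representative duration)
    for kind, dur in events:
        for sym, k, d in reps:
            if k == kind and abs(dur - d) <= tol * d:
                out.append(sym)            # similar to an existing cluster
                break
        else:                              # no similar cluster: new symbol
            sym = chr(ord("A") + len(reps))
            reps.append((sym, kind, dur))
            out.append(sym)
    return "".join(out)

trace = [("compute", 100), ("MPI_Send", 5), ("compute", 105), ("MPI_Send", 5)]
print(to_symbols(trace))   # -> ABAB
```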
IPDPS 2005, slide 13
Compress Execution Trace → Execution Signature
Step 2: Compress the string by identifying cycles
– Build a loop structure recursively from symbol strings, e.g. ABABAB becomes [AB]3, and loops may nest, as in [A[BC]2]2
– Similar to the longest repeated-substring matching problem
A typical execution signature is multiple orders of magnitude smaller than the trace
Step 3: Adaptively increase the degree of compression (by adjusting a "similarity parameter") until the signature is compact enough
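Step 2 can be sketched with a small recursive routine that greedily folds the longest, leftmost repeat into a loop; this is a simplification for illustration, not the tool's actual algorithm:

```python
def compress(s):
    """Greedily fold the longest, leftmost repeat into a loop: ABABAB -> [AB]3."""
    n = len(s)
    for size in range(n // 2, 0, -1):          # try longer loop bodies first
        for i in range(0, n - 2 * size + 1):
            body, count = s[i:i + size], 1
            while s[i + count * size: i + (count + 1) * size] == body:
                count += 1
            if count > 1:                      # fold, then compress the rest
                return (compress(s[:i]) + f"[{compress(body)}]{count}"
                        + compress(s[i + count * size:]))
    return s

print(compress("ABABAB"))        # -> [AB]3
print(compress("AABCBCBCAA"))
```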
IPDPS 2005, slide 14
Constructing a Simple Performance Skeleton
Run application
Record Execution Trace
Compress execution trace into Execution Signature
Construct Performance Skeleton program from execution signature
IPDPS 2005, slide 15
Generate the Performance Skeleton Program
Goal: the execution time of the performance skeleton is 1/K of the application execution time (K given by the user)
• Reduce the iterations of each loop in the application signature by a factor of K
• Heuristically process remaining iterations and events outside loops
• Replace symbols by C language statements
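A minimal sketch of this code-generation step, assuming a signature given as (symbol, repeat-count) pairs: loop counts are divided by K and each symbol becomes a C statement. The emitted calls (a spin loop for compute symbols, an MPI_Send for communication symbols) are illustrative templates, not the tool's actual output:

```python
# Emit C statements from a signature, scaled down by K.
# SYMBOLS maps each symbol to its event kind and parameter.
SYMBOLS = {"A": ("compute", 100.0),       # compute for ~100 ms
           "B": ("send", 800)}            # send ~800 bytes to a neighbor

def emit_skeleton(signature, k):
    lines = []
    for sym, count in signature:
        iters = max(1, count // k)        # keep at least one iteration
        kind, param = SYMBOLS[sym]
        if kind == "compute":
            body = f"spin_for_ms({param});"
        else:
            body = f"MPI_Send(buf, {param}, MPI_BYTE, peer, 0, MPI_COMM_WORLD);"
        lines.append(f"for (i = 0; i < {iters}; i++) {body}")
    return "\n".join(lines)

print(emit_skeleton([("A", 5000), ("B", 5000)], k=1000))
```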
IPDPS 2005, slide 16
Experimental Validation
Skeletons constructed for the Class B NAS MPI benchmarks, executed on 4 cluster nodes in the following sharing scenarios:
• Dedicated nodes (defines the reference execution-time ratio between skeleton and application)
• Competing processes on: one node / all nodes
• Competing traffic on: one link / all links
• Competition as above on one node and one link
Skeleton execution time is used to predict application execution time in the different scenarios
Setup: Intel Xeon dual-CPU 1.7 GHz nodes running Linux 2.4.7; gigabit crossbar switch; simple CPU-intensive competing processes; iproute used to simulate link sharing
IPDPS 2005, slide 17
Prediction Accuracy of Skeletons (average across all sharing scenarios)
[Chart: prediction error (%), 0–20% scale, for 10, 5, 2, 1, and 0.5 second skeletons on BT, CG, IS, LU, MG, SP, and their average]
Average prediction error is ~6%, max ~18% – acceptable
Longer skeletons are better, but even 0.5 second skeletons are meaningful (the tool issues a warning if the requested skeleton size is too small)
IPDPS 2005, slide 18
Prediction for Different Sharing Scenarios (10 second skeletons)
[Chart: prediction error (%), 0–25% scale, for BT, CG, IS, LU, MG, SP, and the average, under each sharing scenario:]
Competing process on one node
Competing process on all nodes
Competing traffic on one link
Competing traffic on all links
Competing process and traffic on one node and link
Error is higher with network contention
• communication is harder to scale down and affects synchronization more directly
IPDPS 2005, slide 19
Comparison with Simple Prediction Methods
[Chart: MIN / average / MAX prediction error (%), 0–120% scale, for 10, 5, 2, 1, and 0.5 second skeletons, and for Class S and Average prediction]
Average prediction: the average slowdown over the entire benchmark suite is used to predict the execution time of each program.
Class S prediction: Class S benchmark programs (~1 s) are used as skeletons for the Class B (30–900 s) benchmarks.
Even the smallest skeletons are far superior!
IPDPS 2005, slide 20
Conclusions
• Promising approach to performance estimation for
– unpredictable environments (grids)
– architectures that do not yet exist (under simulation)
– …
• It is work in progress – a lot more remains, such as:
– accurately reproducing memory behavior (some results in the LCR 2004 workshop)
– integrating memory with communication/computation modeling
– validation on larger grid environments
– accurate reproduction of CPU behavior (such as instruction types)
– skeletons that scale to different numbers of nodes
IPDPS 2005, slide 21
End of Talk! Or is It ?
Questions ?
FOR MORE INFORMATION:
www.cs.uh.edu/~jaspal [email protected]
Thanks to NSF and DOE!
IPDPS 2005, slide 22
Discovered Communication Structure of NAS Benchmarks
[Diagrams: communication topology discovered among the four nodes (0–3) for each benchmark: BT, CG, IS, EP, LU, MG, SP]
IPDPS 2005, slide 23
CPU Behavior of NAS Benchmarks
[Chart: percentage of execution time (0–100%) spent in computation, communication, and idle for CG, IS, MG, SP, LU, BT, EP]