SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS
Sukhdeep Sodhi
Microsoft Corp
Jaspal Subhlok
University of Houston
2
Resource Selection for Network/Grid Applications
[Figure: an Application (components Data, Sim 1, GUI, Model, Pre, Stream) to be placed on a Network]
Where is the best performance?
3
Current approaches to Node Selection
1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS)
2. Find “best” nodes for execution based on network status
But expected application performance based on measured resource status may not be accurate
• depends on application characteristics – hard to model
• translation, e.g., unused bandwidth vs. expected throughput
• data may be stale as frequent measurements are expensive
4
Our Approach
PREDICT APPLICATION PERFORMANCE BY RUNNING A SMALL PROGRAM REPRESENTATIVE OF THE ACTUAL DISTRIBUTED APPLICATION
5
Performance Skeleton
A performance skeleton is a synthetic, short-running program whose execution characteristics mirror those of the application it represents
An application and its skeleton have similar
• communication pattern
• CPU usage
• memory usage
• synchronization pattern
Goal: Performance of a skeleton is directly related to the performance of the application under any condition
• e.g., a skeleton executes in 0.1% of the time the application takes to execute on any part of a shared network
6
Central Contribution of This Paper
Framework for Automatic Construction of Performance Skeletons
[Figure: CREATE SKELETON step producing a Skeleton from an Application]
7
Automatic Construction of Skeletons
[Figure: CREATE SKELETON step producing a Skeleton from an Application]
• Record execution trace
• Compress execution trace into execution signature
• Construct skeleton program from execution signature
9
Recording of Execution Trace
• Implemented for MPI applications
• Link MPI application with PMPI-based profiling library
– no source code modification / analysis required
• Execute on a dedicated testbed
• Records all MPI function calls
– Call name, start time, stop time, parameters passed
– Timing done to microsecond granularity
• CPU busy = time between two consecutive MPI calls
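The last bullet can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the authors' PMPI library: the `MpiCall` record, its field names, and the sample trace are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass
class MpiCall:
    # Hypothetical trace record; field names are illustrative only.
    name: str        # MPI function name, e.g. "MPI_Send"
    start_us: int    # call entry time in microseconds
    stop_us: int     # call exit time in microseconds
    nbytes: int = 0  # message size, if the call moves data

def with_compute_gaps(trace):
    """Interleave MPI calls with the CPU-busy gaps between them.

    Returns a list of ("mpi", MpiCall) and ("compute", duration_us) events.
    A gap is the time from one call's exit to the next call's entry.
    """
    events = []
    prev_stop = None
    for call in trace:
        if prev_stop is not None and call.start_us > prev_stop:
            events.append(("compute", call.start_us - prev_stop))
        events.append(("mpi", call))
        prev_stop = call.stop_us
    return events

# Made-up trace: 250 us of computation between the send and the receive.
trace = [
    MpiCall("MPI_Send", 100, 150, 4096),
    MpiCall("MPI_Recv", 400, 480, 4096),
    MpiCall("MPI_Barrier", 480, 500),
]
events = with_compute_gaps(trace)
```

With this trace, a single compute event of 250 µs appears between the send and the receive; the barrier follows the receive with no gap.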
11
Generation of Execution Signature …1
Application execution typically follows cyclic patterns
Goal: Determine cyclic patterns and form loop structure by identifying repeating execution behavior.
– Repeating patterns should be broadly similar
Step 1: Execution trace to symbol strings
– Cluster similar execution events
• Replace all events in cluster by average event
– Each cluster is then assigned a unique symbol
– Execution trace is replaced by string of symbols:
,,,,,,,,,,, , ,,, , ,,, …
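Step 1 could look roughly like the Python sketch below. The slides do not say which clustering method is used, so this swaps in a simple greedy rule: an event joins the first cluster with the same name whose mean size is within a relative `tolerance`. The `(name, size)` event tuples, the `tolerance` parameter, and the use of lowercase letters as symbols are all assumptions for illustration.

```python
import string

def to_symbols(events, tolerance=0.1):
    """Map (name, size) events to a symbol string plus per-symbol averages.

    Greedy clustering (an assumption, not the authors' method): an event
    joins the first cluster of the same name whose current mean size is
    within `tolerance` (relative); otherwise it starts a new cluster.
    Supports at most 26 clusters because symbols are lowercase letters.
    """
    clusters = []  # each cluster: {"name": str, "sizes": [numbers]}
    labels = []    # cluster index chosen for each event, in trace order
    for name, size in events:
        for idx, c in enumerate(clusters):
            mean = sum(c["sizes"]) / len(c["sizes"])
            if c["name"] == name and abs(size - mean) <= tolerance * max(mean, 1):
                c["sizes"].append(size)
                labels.append(idx)
                break
        else:
            clusters.append({"name": name, "sizes": [size]})
            labels.append(len(clusters) - 1)
    symbols = "".join(string.ascii_lowercase[i] for i in labels)
    # Every event in a cluster is represented by the cluster's average event.
    averages = {
        string.ascii_lowercase[i]: (c["name"], sum(c["sizes"]) / len(c["sizes"]))
        for i, c in enumerate(clusters)
    }
    return symbols, averages

# Made-up events: sizes are bytes for messages, microseconds for compute.
events = [("send", 1000), ("compute", 500), ("send", 1040),
          ("compute", 520), ("recv", 8000)]
symbols, averages = to_symbols(events)
```

Here the two sends and the two compute phases each collapse into one symbol. A larger `tolerance` merges more events into fewer clusters and so yields a smaller alphabet.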
12
Generation of Execution Signature …2
Step 2: Compress string by identifying cycles
– Similar to longest substring matching problem
– Algorithm builds loop structure recursively from symbol strings
e.g. ,,,,,,,,,,, , ,,, , ,,, is replaced by
[,,]4, [,[]2,]2
– Typically signature is multiple orders of magnitude smaller than trace
Step 3: Adaptively increase degree of clustering
– until signature is compact enough
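A minimal Python sketch of Step 2, assuming exact repeats only: at each position it looks for the back-to-back repetition that covers the most symbols, folds it into a loop, and recurses into the loop body. This is a simple greedy stand-in, not the algorithm from the paper, and the input string is made up.

```python
def fold(s):
    """Fold a symbol string into nested loops.

    Returns a list whose items are either a symbol (str) or a
    (body, count) pair, where body is itself a folded list.
    Greedy: at each position pick the exact tandem repeat that covers
    the most symbols; ties go to the shorter repeating unit.
    """
    n = len(s)
    i = 0
    out = []
    while i < n:
        best = None  # (covered_length, period, count)
        for period in range(1, (n - i) // 2 + 1):
            unit = s[i:i + period]
            count = 1
            while s[i + count * period:i + (count + 1) * period] == unit:
                count += 1
            if count > 1 and (best is None or count * period > best[0]):
                best = (count * period, period, count)
        if best is None:
            out.append(s[i])  # no repeat starts here: keep the symbol as-is
            i += 1
        else:
            covered, period, count = best
            out.append((fold(s[i:i + period]), count))  # recurse into the body
            i += covered
    return out

def render(items):
    """Print loops as [body]count, in the bracket notation used on the slide."""
    return "".join(
        x if isinstance(x, str) else f"[{render(x[0])}]{x[1]}" for x in items
    )

signature = render(fold("abcabcabcabcdeffgeffg"))
```

For this input the 21-symbol string folds to `[abc]4d[e[f]2g]2`. On long traces the search above is quadratic or worse per position, so a production version would need a smarter repeat finder.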
14
Generate Performance Skeleton Program
Goal: Execution time of performance skeleton should be a fixed factor K less than application execution time
Reduce iterations of each loop by a factor K
– Add remainder iterations to events outside of all loops
Process events outside loop as follows:
– Reduce execution time of compute operations by a factor K
– Reduce execution time of message exchanges by reducing bytes exchanged by a factor K
• Communication operations not scaled linearly due to latency.
• Considering latency would make approach architecture-specific
Replace symbols by C language statements
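The scaling rules above can be sketched in Python as follows. The tuple encoding of the signature (`"compute"`, `"msg"`, `"loop"`) is an assumption for illustration, and so is the choice to leave the body of a kept loop unscaled; the final step of emitting C statements is not shown.

```python
def scale(signature, k):
    """Shrink a signature by an integer factor k.

    Items (hypothetical encoding): ("compute", seconds), ("msg", nbytes),
    or ("loop", body, count) with body a list of items.
    """
    skeleton = []
    for item in signature:
        if item[0] == "loop":
            _, body, count = item
            kept, remainder = divmod(count, k)
            if kept:
                # Kept iterations run the original, unscaled body.
                skeleton.append(("loop", body, kept))
            # Remainder iterations are unrolled and scaled like events
            # outside of all loops.
            skeleton.extend(scale(list(body) * remainder, k))
        elif item[0] == "compute":
            skeleton.append(("compute", item[1] / k))
        else:  # "msg": shrink bytes exchanged, but never below one byte
            skeleton.append(("msg", max(1, item[1] // k)))
    return skeleton

def compute_time(items):
    """Total compute seconds in a signature, expanding loops."""
    total = 0.0
    for item in items:
        if item[0] == "loop":
            total += item[2] * compute_time(item[1])
        elif item[0] == "compute":
            total += item[1]
    return total

# Made-up signature: 45 iterations of (2 s compute + 1000-byte message),
# followed by 10 s of compute outside the loop.
signature = [("loop", [("compute", 2.0), ("msg", 1000)], 45), ("compute", 10.0)]
skeleton = scale(signature, 10)
```

With K = 10, the loop keeps 4 iterations and the 5 remainder iterations become scaled events outside the loop, so total compute time drops from 100 s to about 10 s. Message time would shrink by less than a factor of 10 because latency is not scaled, as the slide notes.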
15
Experimental Validation
Skeletons constructed for Class B NAS MPI benchmarks are executed in the following sharing scenarios
• Competing processes on one node
• Competing processes on all nodes
• Competing traffic on one link
• Competing traffic on all links
• Competing process and traffic on one node and link
Skeleton execution time is used to predict application execution time.
Setup: Intel Xeon dual CPU 1.7 GHz nodes running Linux 2.4.7. Gigabit crossbar switch. iproute to simulate link sharing
16
Prediction Accuracy
Graph shows error between predicted and measured application execution time
Skeleton execution is 1/10th of Application execution
Average error: 6%; max error: 18%
Error is higher for scenarios with competing traffic
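One way to turn a skeleton run into a prediction and an error figure is sketched below in Python. It assumes the scale factor is taken as the measured ratio of application time to skeleton time on the dedicated testbed; the slides only state that skeleton execution time is used for prediction, so that choice, the function names, and the timings are illustrative.

```python
def predict_app_time(skeleton_shared_s, app_dedicated_s, skeleton_dedicated_s):
    """Predict application time under sharing from a skeleton run.

    Assumption: the application/skeleton time ratio measured on a
    dedicated testbed also holds under sharing.
    """
    ratio = app_dedicated_s / skeleton_dedicated_s
    return skeleton_shared_s * ratio

def percent_error(predicted_s, measured_s):
    """Absolute prediction error as a percentage of the measured time."""
    return abs(predicted_s - measured_s) / measured_s * 100.0

# Made-up timings in seconds, not results from the paper.
predicted = predict_app_time(skeleton_shared_s=31.0,
                             app_dedicated_s=200.0,
                             skeleton_dedicated_s=20.0)
error = percent_error(predicted, measured_s=300.0)
```

In this made-up case the skeleton predicts 310 s against a measured 300 s, an error of about 3.3%.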
17
Comparison with other methods
[Figure: bar chart of percentage error (axis 0 to 180) for benchmarks BT, CG, IS, LU, MG, SP, and Avg, comparing Performance Skeleton, Average Prediction, and Class S]
Average Prediction: Average slowdown of the entire benchmark suite is used to predict execution time for each program.
Class S Prediction: Class S benchmark programs (~1 s) are used as skeletons for Class B benchmarks (30-900 s)
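The Average Prediction baseline can be sketched in Python under one reading of the description: take the arithmetic mean of the per-program slowdowns in a sharing scenario and apply it to every program's dedicated time. The function name and the timings are made up for illustration.

```python
def average_prediction(dedicated_s, shared_s):
    """Predict each program's shared time from the suite-wide mean slowdown.

    dedicated_s, shared_s: dicts mapping benchmark name -> seconds.
    Assumption: "average slowdown" is the arithmetic mean of the
    per-program ratios shared / dedicated.
    """
    slowdowns = [shared_s[b] / dedicated_s[b] for b in dedicated_s]
    mean_slowdown = sum(slowdowns) / len(slowdowns)
    return {b: dedicated_s[b] * mean_slowdown for b in dedicated_s}

# Made-up timings in seconds, not results from the paper.
dedicated = {"BT": 100.0, "CG": 50.0}
shared = {"BT": 150.0, "CG": 125.0}
predictions = average_prediction(dedicated, shared)
```

Here the slowdowns are 1.5 and 2.5, so both programs are predicted with a factor of 2.0. The baseline overestimates BT and underestimates CG, because it ignores how differently the two programs react to sharing.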
18
Preliminary Conclusions
Performance estimation with skeleton has high accuracy
Need to incorporate memory access patterns and fine grain CPU behavior for execution across architectures
Implementation limited to MPI applications
– basic approach should work for other paradigms
Skeletons may have other uses as a fast way of estimating application performance
– e.g. on a slow simulated future system