TRANSCRIPT
Methodologies for Performance Simulation of
Super-scalar OOO processors
Srinivas Neginhal, Anantharaman Kalyanaraman
CprE 585: Survey Project
Introduction
Modeling Simulation
Performance Study
Processor Design
Architectural Simulators: explore the design space
Evaluate existing hardware, or predict performance of proposed hardware
Designer has control
Functional Simulators:
Model the architecture (programmer's focus), e.g., sim-fast, sim-safe
Performance Simulators:
Model the microarchitecture (designer's focus), e.g., cycle-by-cycle (sim-outorder)
Simulation Issues
Real applications take too long for a cycle-by-cycle simulation
Vast design space
Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc.
Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc.
Find design flaws and provide design improvements
Need a "robust" simulation methodology!
Two Methodologies
HLS: hybrid (statistical + symbolic)
REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, pp. 71-82, 2000.
BBDA: basic block distribution analysis
REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.
HLS: An Overview
A hybrid processor simulator
HLS
Statistical Model
Symbolic Execution
Performance Contours spanned by design space parameters
What can be achieved?
Explore design changes in architectures and compilers that would be impractical to simulate using conventional simulators
HLS: Main Idea
Application code
Statistical Profiling
Instruction stream, data stream
Machine-independent characteristics:
- Basic block size
- Dynamic instruction distance
- Instruction mix
Structural Simulation of FU, issue pipeline units
Architecture metrics:
-Cache behavior
-Branch prediction accuracy
Synthetically generated code
Statistical Code Generation
Each "synthetic instruction" carries the following parameters, based on the statistical profile:
Functional unit requirements, dynamic instruction distances, cache behavior
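As a rough illustration, statistical code generation can be sketched as below. The profile fields, probability values, and function names here are hypothetical stand-ins for illustration, not the actual HLS implementation:

```python
import random

# Hypothetical statistical profile; field names and values are
# illustrative, not taken from HLS itself.
PROFILE = {
    "instruction_mix": {"int_alu": 0.5, "load": 0.2, "store": 0.1, "branch": 0.2},
    "mean_basic_block_size": 6,
    "mean_did": 3,          # mean dynamic instruction distance
    "l1_hit_rate": 0.95,
}

def generate_synthetic_block(profile, rng):
    """Generate one synthetic basic block from the statistical profile."""
    size = max(1, round(rng.gauss(profile["mean_basic_block_size"], 1)))
    ops, weights = zip(*profile["instruction_mix"].items())
    block = []
    for _ in range(size):
        block.append({
            "op": rng.choices(ops, weights=weights)[0],  # functional unit requirement
            # Distance back to the instruction producing this one's operand.
            "did": max(1, round(rng.expovariate(1.0 / profile["mean_did"]))),
            # Modeled cache behavior: does the access hit in L1?
            "l1_hit": rng.random() < profile["l1_hit_rate"],
        })
    return block
```

Feeding a long stream of such synthetic blocks into the structural simulator then stands in for the original instruction trace.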
Validation of HLS against SimpleScalar
For varying combinations of design parameters:
Run the original benchmark code on SimpleScalar (use sim-outorder)
Run statistically generated code on HLS
Compare SimpleScalar IPC vs. HLS IPC
Validation: Single- and Multi-value correlations
IPC vs. L1-cache hit rate
For SPECint95:
HLS errors are within 5-7% of the cycle-by-cycle results!
Validation: L1 Instruction Cache Miss Penalty vs. Hit Rate
Correlation suggests that:
Cache hit rate should be at least 88% to dominate
HLS: Code Properties - Basic Block Size vs. L1-Cache Hit Rate
Correlation suggests that:
Increasing block size helps only when L1 cache hit rate is >96% or <82%
HLS: Code Properties - Dynamic Instruction Distance vs. Basic Block Size
Correlation suggests that:
Moderate DID values suffice for IPC, and high values of basic block size (>8) do not help without an increase in DID
HLS: Value Prediction
DID vs. Value predictability
GOAL: break true dependences
Stall penalty for a mispredict vs. value prediction knowledge
HLS: More Multi-value Correlations
L1-cache hit rate vs. Value Predictability DID vs. Superscalar issue width
HLS: Discussion
Low error rate only on the SPECint95 benchmark suite; high error rates on SPECfp95 and STREAM benchmarks (findings by R. H. Bell et al., 2004)
Reason: instruction-level granularity for the workload
Recommended improvement: basic block-level granularity
Goals
Find the end of the initialization phase
Find the period of the program
Find the ideal place to simulate, given a specific number of instructions one has to simulate
Provide an accurate confidence estimation of the simulation point
Program Behavior
Program behavior has ramifications for architectural techniques.
Program behavior differs across different parts of the execution: initialization, then cyclic (periodic) behavior.
Basic Block Distribution Analysis
Each basic block gets executed a certain number of times.
Number of times each basic block executes gives a fingerprint.
Use the fingerprints to find representative areas to simulate.
Cyclic Behavior of Programs
Cyclic behavior is not representative of all programs; it is the common case for compute-bound applications.
The SPEC95 wave program executes 7 billion instructions before it reaches the code that accounts for the bulk of execution.
Basic Block Vectors
Fast profiling determines the number of times each basic block executes.
The behavior of the program is directly related to the code it is executing, so profiling gives a basic block fingerprint for that particular interval of time.
The full execution of the program and the interval we choose spend proportionally the same amount of time in the same code.
Collected in intervals of 100 million instructions.
Basic Block Vector (BBV)
A BBV is a one-dimensional array with an element for each basic block in the program.
Each element counts how many times the given basic block was entered during an interval.
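A minimal sketch of collecting BBVs from a profiled execution trace; the function name and trace format are our assumptions for illustration:

```python
from collections import defaultdict

def collect_bbvs(executed_blocks, interval=100_000_000):
    """Build one Basic Block Vector per interval of executed instructions.

    `executed_blocks` yields (basic_block_id, block_length) pairs in
    execution order, standing in for a fast profiling pass.
    """
    bbvs, current, count = [], defaultdict(int), 0
    for block_id, length in executed_blocks:
        current[block_id] += 1        # count entries into this basic block
        count += length               # instructions executed so far
        if count >= interval:         # interval boundary reached
            bbvs.append(dict(current))
            current, count = defaultdict(int), 0
    if current:                       # flush the final partial interval
        bbvs.append(dict(current))
    return bbvs
```

With a tiny interval for illustration, `collect_bbvs([("A", 10), ("B", 5), ("A", 10), ("B", 5)], interval=20)` yields one vector per 20-instruction interval.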
Varying-size intervals
A BBV collected over an interval of N × 100 million instructions is a BBV of duration N.
Basic Block Vectors
Each BBV is normalized: each element is divided by the sum of all elements.
Target BBV
The BBV for the entire execution of the program.
Objective: find a BBV of small duration similar to the target BBV.
Basic Block Vector Difference
Difference between BBVs: element-wise subtraction, then the sum of absolute values (Manhattan distance); Euclidean distance can also be used.
For normalized BBVs, the Manhattan distance is a number between 0 and 2.
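The normalization and difference computation can be written down directly (a minimal sketch; the function names are ours):

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv.values())
    return {block: count / total for block, count in bbv.items()}

def bbv_distance(a, b):
    """Manhattan distance between two normalized BBVs.

    Element-wise subtraction, then the sum of absolute values; for
    normalized vectors the result always lies in [0, 2].
    """
    blocks = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in blocks)
```

Two identical vectors give distance 0; two vectors with no basic blocks in common give the maximum distance of 2.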
Basic Block Difference Graph
A plot of how well each individual sample in the program compares to the target BBV.
For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV.
Used to find the initialization phase and the period of the program.
Initialization
Initialization is not trivial: it is important to simulate representative sections of code, so detecting the end of the initialization phase matters.
Initialization Difference Graph
Initial Representative Signal (IRS): the first quarter of the BB difference graph.
Slide it across the BB difference graph, calculating the difference at each point over the first half of the BBDG.
When the IRS reaches the end of the initialization stage on the BB difference graph, the difference is maximized.
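The sliding-signal search described above might be sketched as follows (a simplified illustration; the function name is ours, and the BBDG is modeled as a plain list of per-interval differences):

```python
def find_initialization_end(bbdg):
    """Slide the Initial Representative Signal (first quarter of the BB
    difference graph) across the first half of the graph and return the
    offset where the accumulated difference is maximized."""
    n = len(bbdg)
    irs = bbdg[: n // 4]                     # initial representative signal
    best_offset, best_diff = 0, -1.0
    for offset in range(1, n // 2 - len(irs) + 1):
        window = bbdg[offset : offset + len(irs)]
        diff = sum(abs(x - y) for x, y in zip(irs, window))
        if diff > best_diff:                 # difference peaks once the IRS
            best_offset, best_diff = offset, diff  # clears the init phase
    return best_offset
```

On a synthetic BBDG with four high-difference "initialization" intervals followed by a steady state, the maximum lands exactly at the phase boundary.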
Period
Period Difference Graph
Period Representative Signal (PRS): the part of the BBDG from the end of initialization to one quarter of the length of the program's execution.
Slide it across half the BBDG; the distance between the minimum Y-axis points is the period.
Using larger-duration BBVs creates a BBDG that emphasizes the larger periods.
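A simplified sketch of the period search; as an approximation of "distance between the minimum Y-axis points," it takes the spacing between the two offsets with the smallest differences (our simplification, not the paper's exact procedure):

```python
def find_period(bbdg, init_end):
    """Estimate the period (in intervals) from the BB difference graph.

    The Period Representative Signal runs from the end of initialization
    to one quarter of the run; sliding it across the first half of the
    BBDG produces minima that recur once per period.
    """
    n = len(bbdg)
    prs = bbdg[init_end : n // 4]
    diffs = []
    for offset in range(init_end, n // 2 - len(prs) + 1):
        window = bbdg[offset : offset + len(prs)]
        diffs.append(sum(abs(x - y) for x, y in zip(window, prs)))
    # Offsets of the two smallest differences; their spacing is the period.
    order = sorted(range(len(diffs)), key=diffs.__getitem__)
    return abs(order[1] - order[0])
```

On a synthetic BBDG that repeats an 8-interval pattern, the two zero-difference offsets sit exactly one period apart.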
Method
SimpleScalar was modified to output and clear its statistics counters every 100 million committed instructions.
Graphed data: IPC, % RUU occupancy, cache miss rate, etc.
To get the most representative sample of a program, at least one full period must be simulated.
Results
Basic Block Similarity Matrix
A phase of program behavior can be defined as all similar sections of execution, regardless of temporal adjacency.
Similarity matrix (upper triangle): an N × N matrix, where N is the number of intervals in the program's execution.
The entry at (x, y) is the Manhattan distance between the BBV at x and the BBV at y.
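Building the upper-triangular similarity matrix is straightforward (an illustrative sketch; BBVs are modeled as dicts of per-block counts):

```python
def similarity_matrix(bbvs):
    """Upper-triangular N x N matrix of Manhattan distances between
    normalized BBVs; entry (x, y) compares intervals x and y."""
    def normalize(v):
        total = sum(v.values())
        return {k: c / total for k, c in v.items()}

    norm = [normalize(v) for v in bbvs]
    n = len(norm)
    matrix = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):        # upper triangle only
            keys = set(norm[x]) | set(norm[y])
            matrix[x][y] = sum(
                abs(norm[x].get(k, 0.0) - norm[y].get(k, 0.0)) for k in keys
            )
    return matrix
```

Intervals executing the same code show up as near-zero entries, which is what the image visualizes as dark bands.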
Finding Basic Block Similarity
Many intervals of execution are similar to each other, so it makes sense to group them together; this is analogous to clustering.
Clustering
The goal is to divide a set of points into groups such that points within each group are similar to one another by some metric.
This problem arises in other fields, such as computer vision and genomics.
Two types of clustering algorithms exist:
Partitioning: choose an initial solution, then iteratively update it to find a better one; linear time complexity.
Hierarchical: divisive or agglomerative; quadratic time complexity.
Phase Finding Algorithm
Generate BBVs with a duration of 1.
Reduce the dimension of the BBVs to 15.
Apply the clustering algorithm to the BBVs.
Score the clusterings and choose the most suitable one.
Random Projection
Curse of dimensionality: the BBV dimension equals the number of executed basic blocks and could grow to millions.
Two options are dimension selection and dimension reduction; random linear projection (a reduction) is used.
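A minimal random linear projection sketch (our illustration; the exact distribution of the projection matrix entries is an assumption here):

```python
import random

def random_projection(rows, out_dim=15, seed=0):
    """Project each row vector down to `out_dim` dimensions by
    multiplying with a fixed random matrix."""
    rng = random.Random(seed)
    in_dim = len(rows[0])
    # Random projection matrix with entries drawn uniformly from [-1, 1].
    proj = [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)]
            for _ in range(in_dim)]
    return [
        [sum(row[i] * proj[i][j] for i in range(in_dim)) for j in range(out_dim)]
        for row in rows
    ]
```

Because every row is multiplied by the same random matrix, relative distances between BBVs are approximately preserved in the reduced space.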
Clustering Algorithm
K-means: an iterative optimization algorithm with two alternating phases that repeat until convergence.
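The two alternating phases can be sketched as plain k-means (for illustration; not the paper's tuned variant or scoring):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and centroid-update phases."""
    rng = random.Random(seed)

    def nearest(p, centroids):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Phase 1: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Phase 2: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centroids) for p in points]
```

Applied to the dimension-reduced BBVs, the resulting labels group intervals into candidate phases.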
WORK IN PROGRESS