dynamic performance tuning for speculative threads yangchun luo, venkatesan packirisamy, nikhil...
TRANSCRIPT
Dynamic Performance Tuning for Speculative Threads
Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†,
Ankit Tarkas†, Wei-Chung Hsu, and Antonia
ZhaiDept. of Computer Science and Engineering†Dept. of Electronic Computer Engineering
University of Minnesota – Twin Cities
2
Motivation
Multicore processors brought forth large potentials of computing power
Exploiting these potentials demands thread-level parallelism
Intel Core i7 die photo
3
Exploiting Thread-Level Parallelism
Potentially more parallelism with speculation
dependence
Sequential
Store *p
Load *q
Store *p
Time
p != q
Thread-Level Speculation (TLS)
Traditional Parallelization
Load *q
p != q ??
Store 88
Load 20
Parallel execution
Store 88
Load 88
Speculation Failure
Time
Time
Load 88
Compiler gives up
p == q
But Unpredictable
4
Addressing UnpredictabilityState-of-the-art approach: Profiling-directed optimization Compiler analysis
Source codeProfiling run
Profiling info
Region selection
Parallel code generation
Feedback
Machine code
A typical compilation framework
Speculative code optimization
5
Problem #1
01
23
3
Time signal
allowcommit
wait
spawning thread
inter-thread synchronization
speculation failure
……load
imbalance
4
Hard to model TLS overhead statically, even with profiling
6
Problem #2
Profiling Behavior
Training Input
Actual Input
Actual Behavior
Profile-based decision may not be appropriate for actual input
7
Problem #3
Timephase1 phase2
Met
ric
phase3
Behavior of the same code changes over time
8
Problem #4
Load *p=0x88
Prefetching
Extra misses
Pollution
Load 0x22 miss
Load 0x22 hit
Load 0x22 Load 0x22miss miss
Sequential Parallel
P P P
How important are cache impacts?
L1$ L1$ L1$
PL1$
Load *p’=0x88
PL1$
PL1$
PL1$
Load *p=0x80
Load *p’=0x08
9
SEQ TLS-4core0
0.2
0.4
0.6
0.8
1
CacheSquashOther
Nor
mal
ized
Exe
cutio
n Ti
me
Impact on Cache Performance
60%
load *p=0x88
Cache: p == p’Prefetched (+)
load *p’=0x88
Cache impact is difficult to model statically
A major loop from benchmark art (scanner.c:594)
2.5X
10
Our proposalProposed Performance Tuning Framework
Compiler Analysis Runtime Analysis
Source code
Profiling info
Region selection
Parallel code generation
Machine code
Speculative code optimization
Accurate Timing & Cache Impacts
Adaptations for inputs and
phase changes
Performance Evaluation
Tuning PolicyCompiler Annotation
RuntimeInformation
Dynamic Execution
Performance Profile Table
11
Parallelization Decisions at Runtime
…
Evaluate performance
Thread spawning End of a parallel execution
spawn
parallelize
spawnspawn
serialize
…
spawn
Parallel execution Sequential execution
Parallel Execution
New execution model facilitates dynamic tuning
Performance Profile Table
12
Evaluating Performance ImpactRuntime Analysis
Simple metric: cost of speculative failure
Comprehensive evaluation: sequential performance prediction
SEQ TLS-4core0
0.2
0.4
0.6
0.8
1
CacheSquashOther
Nor
mal
ized
Exe
cutio
n Ti
me
Performance Evaluation
Tuning Policy
RuntimeInformation
Dynamic ExecutionPerformance Profile Table
13
Sequential Performance Prediction
PredictP P
Programmable Cycle Classifier
Information from ROB
Hardware Counters
iFetch
tick
CacheBusy
ExeStall
SpeculativeParallel
Predicted sequential
Busy
iFetch
ExeStall
Cache
TLSoverhead
Performance Impact
14
Cache Behavior: Scenario #1
TLS
Count the miss in squashed execution
load p(unmodified)
miss
load p hit
Predicted
load p hit
SEQ
load p miss
T0
T0
T0 T0
Squash
15
Cache Behavior: Scenario #2
TLS
load p miss
load p miss
Predicted
load p miss*2
SEQ
load p missstore p
invalidate p
store p
Not count a miss if tag matched invalid status
store p
T0 T0 T0
T0
16
Tune the performance
Exhaustive search
Prioritized search
Runtime Analysis
Performance Evaluation
Tuning Policy
RuntimeInformation
Dynamic ExecutionPerformance Profile Table
17
How to prioritize the searchSearch order: Inside-Out Compiler suggestion Outside-In
Performance summary metric: Speedup or not? Improvement in cycles
Search order
Perf
orm
ance
Su
mm
ary
met
ric
Spee
dup
or n
ot?
Inside-Out Compilersuggestion
Impr
ove.
in c
ycle
sMore
Aggressive
Quantitative+StaticHintQuantitative
Simple
Suboptimal
Evaluation Infrastructure
Benchmarks· SPEC2000 written in C, -O3 optimization
Underlying architecture· 4-core, chip-multiprocessor (CMP)· speculation supported by coherence
Simulator· Superscalar with detailed memory model· simulates communication latency· models bandwidth and contention
Detailed, cycle-accurate simulation
C
C
P
C
P
Interconnect
C
P
C
P
18
19
Comparing Tuning Policies
ammp artbzip
2cra
fty
equake gapgcc gzip mcf
mesaparse
r
perlbmk
twolf
vorte
xvp
r-p vpr-r
G.M.
0
0.5
1
1.5
2
2.5
3
Simple Quantitative Quantitative+StaticHint
Spee
dup
wrt
. SEQ
1.17x
1.23x
1.37x
Parallel Code Overhead
20
Profile-Based vs. Dynamic
ammp artbzip
2cra
fty
equake gapgcc gzip mcf
mesaparse
r
perlbmk
twolf
vorte
xvp
r-p vpr-r
G.M.
0
0.5
1
1.5
2
2.5
3
ProfileBased Quant+StaticHintProfileBased/ProfileBasedSEQ (Quant+StaticHint)/dynamicSEQ
Spee
dup
wrt
. SEQ
Dynamic outperformed static by ≈10% (1.37x/1.25x)
1.25x
Potentially go up to ≈15% (1.46x/1.27x)
1.27x
1.46x
21
Related works: Region Selection
Runtime Loop Detection
[Tubella HPCA’98]
[Marcuello ICS’98]
Compiler Heuristics
[Vijaykumar Micro’98]
Statically Dynamically
Saving Power from Useless Re-spawning
[Renau ICS’05]
Simple Profiling Pass
[Liu PPoPP’06] (POSH)
Extensive Profiling
[Wang LCPC’05]
Balanced Min-Cut
[Johnson PLDI’04]
Profile-Based Search
[Johnson PPoPP’07]
Lack of adaptability Lack of high-levelinformation
Dynamic Performance Tuning[Luo ISCA’09] (this work)
22
Conclusions
Deciding how to extract speculative threads at runtimeQuantitatively summarize performance profileCompiler hints avoid suboptimal decisions
Compiler and hardware work together. 37% speedup over sequential execution ≈10% speedup over profile-based decisions
Dynamic performance tuning can improve TLS efficiency
Configurable HPMs enable efficient dynamic optimizations