Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan, School of Computer Science, Carnegie Mellon University
Slide 1 (Thread-Level Speculation, Steffan, Carnegie Mellon)
Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan
School of Computer Science
Carnegie Mellon University
Slide 2
Moore’s Law: the Original Version
[Figure: log of transistors on a chip vs. time]
exponentially increasing resources
Slide 3
Moore’s Law: the Popular Interpretation
[Figure: log of performance vs. time]
do increased resources mean increased performance?
Slide 4
A Superposition of Innovations
[Figure: log of performance vs. time, with successive curves for datapath size (8b, 16b, 32b, 64b) and instruction-level parallelism (ILP)]
ILP is running out of steam
Slide 5
Why ILP is Running Out of Steam
• Cross-chip wire latency (in cycles)
• Development cost
• Power density
• Probability of a defect
these problems must be addressed
Slide 6
How Do We Sustain the Performance Curve?
[Figure: log of performance vs. time, with curves for datapath size (8b, 16b, 32b, 64b) and instruction-level parallelism (ILP); a "we are here now" marker, then a "?" where the next curve should be]
what is the next big win for micro-architecture?
Slide 7
A New Path: Thread-Level Parallelism
Tolerate cross-chip wire latency:
– localized wires
Lower development cost:
– stamp out processor cores
Lower power:
– turn off idle processors
Tolerate defects:
– disable any faulty processor
[Figure: Chip Multiprocessor (CMP) built from processors (P) and caches (C)]
many advantages
Slide 8
Multithreading in Every Scale of Machine
[Figure: threads running at every scale of machine]
• Supercomputers
• Desktops
• Chip Multiprocessor (CMP): IBM Power4, SUN MAJC, Sibyte SB-1250
• Simultaneous Multithreading: ALPHA 21464, Intel Xeon
multithreading on a chip!
Slide 9
Improving Performance with a Chip Multiprocessor
Multiprogramming Workload:
[Figure: several applications running on a CMP (processors and caches), with execution time shown]
improves throughput
Slide 10
Improving Performance with a Chip Multiprocessor
Single Application:
[Figure: a single application running on a CMP (processors and caches), with execution time shown]
need parallel threads to reduce execution time
Slide 11
How Do We Parallelize Everything?
1) Programmers write parallel code from now on
– time-consuming and frustrating
– very hard to get right
– not a broad solution
2) System parallelizes automatically
– no burden on the programmer
– parallelize any application
automatic parallelization is preferred
Slide 12
Current Technique: Prove Independence
Independent:
for (i = 0; i < N; i++) A[i] = 0;
A[0] = 0, A[1] = 0, A[2] = 0 (no iteration uses a value produced by another)
Dependent:
for (i = 1; i < N; i++) A[i] = A[i-1];
A[1] = A[0], A[2] = A[1], A[3] = A[2] (each iteration reads what the previous one wrote)
need to fully understand data access pattern
Slide 13
Ubiquitous Parallelization: How Close Are We?
Compiler can parallelize portions of numeric programs (parallelize by proving independence)
– scientific, floating-point, array-based codes
– usually written in Fortran
What about everything else? (proving independence is infeasible)
– general-purpose, integer codes
– written in C, C++, Java, etc.
– little (if any) success so far
Slide 14
The Main Culprit: Indirection
Indirect array references:
for (i = 0; i < N; i++) A[i] = A[B[i]];
A[0] = A[B[0]], A[1] = A[B[1]], A[2] = A[B[2]]: dependent or not? (?)
need to know the values of B[]
Pointers:
while (...) { ... = *q; *p = ...; }
[Figure: successive iterations each read *q and write *p; whether one iteration depends on another is unknown (?)]
need to know the targets of p and q
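Why the values of B[] matter can be made concrete with a small runtime check. The following is a minimal sketch, not part of the talk: the function name `count_flow_dependences` is hypothetical, and it assumes B holds valid indices into A. It counts the iterations of the indirect loop that read an element written by an earlier iteration.

```c
#include <assert.h>
#include <stddef.h>

/* For the loop
 *     for (i = 0; i < n; i++) A[i] = A[B[i]];
 * iteration i writes A[i] and reads A[B[i]].
 * If B[i] < i, the value read was written by the earlier iteration B[i],
 * so iteration i truly depends on it.
 * If B[i] >= i, iteration i reads a value from before the loop started.
 * Returns the number of iterations with B[i] < i. */
size_t count_flow_dependences(const size_t *B, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(B[i] < n);   /* assumption: every index stays inside A */
        if (B[i] < i)
            count++;
    }
    return count;
}
```

With B = {0, 1, 2, 3} no iteration depends on an earlier one; with B = {0, 0, 1, 2} three of the four iterations do. The answer is only available once B is known, which is the difficulty a compile-time proof of independence runs into.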
Slide 15
Summary
We need the next big performance win
– instruction-level parallelism will run out of gas
Multithreading will soon be everywhere
– we need automatically-parallelized programs
The scope of current techniques is extremely limited
– proving independence is infeasible
A solution: Thread-Level Speculation (TLS)
Slide 16
Thread-Level Speculation: the Basic Idea
[Figure: execution time under TLS; iterations of the form "... = *q; *p = ..." run as parallel threads; a violation is detected on a load of *q and the affected thread recovers]
exploit available thread-level parallelism
Slide 17
Outline
The Software/Hardware Sweet Spot
• Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Slide 18
Support for TLS: What Do We Need?
Break programs into speculative threads
– to maximize thread-level parallelism
Track data dependences
– to determine whether speculation was safe
Recover from failed speculation
– to ensure correct execution
three key elements of every TLS system
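The three elements can be emulated sequentially for the indirect loop A[i] = A[B[i]]. This is a sketch under stated assumptions, not the scheme proposed in the talk: each loop iteration stands in for one speculative thread, the "parallel" phase is emulated by letting every iteration read the pre-loop contents of A, and the names `tls_run` and `MAX_N` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_N 64   /* hypothetical upper bound on the loop trip count */

/* Emulates speculative execution of
 *     for (i = 0; i < n; i++) A[i] = A[B[i]];
 * and returns how many iterations had to be re-executed. */
size_t tls_run(int *A, const size_t *B, size_t n)
{
    int    buffered[MAX_N];          /* speculative writes, kept out of A   */
    size_t read_addr[MAX_N];         /* which element each iteration read   */
    bool   written[MAX_N] = {false}; /* elements written by committed work  */
    size_t violations = 0;

    assert(n <= MAX_N);

    /* Element 1: every iteration is a speculative thread. All of them
     * run against the same pre-loop state, as if in parallel. */
    for (size_t i = 0; i < n; i++) {
        assert(B[i] < n);
        read_addr[i] = B[i];
        buffered[i]  = A[B[i]];
    }

    /* Commit in original program order. */
    for (size_t i = 0; i < n; i++) {
        /* Element 2: track dependences. If an earlier iteration wrote
         * the element this one read, the value it read was stale. */
        if (written[read_addr[i]]) {
            /* Element 3: recover by discarding the buffered result and
             * re-executing against the committed state. */
            violations++;
            buffered[i] = A[read_addr[i]];
        }
        A[i] = buffered[i];
        written[i] = true;
    }
    return violations;
}
```

For A = {10, 20, 30, 40} and B = {0, 0, 1, 3} the emulation re-executes two iterations and leaves A = {10, 10, 10, 40}, the same result as running the loop sequentially. When no dependence occurs at run time, nothing is re-executed and all iterations could have run in parallel.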
Slide 19
Compiler Researchers do it in Software
Slide 20
LRPD Test (Illinois at Urbana-Champaign)
+ implemented entirely in software
– applies only to array-based code
– no partial parallelism
[Figure: execution time; parallel execution with software dependence tracking, followed by a test: was parallel execution safe?]
Slide 21
Architects do it in Hardware
Slide 22
Multiscalar (Wisconsin)
• compiler breaks program into threads
• Address Resolution Buffer (ARB)
+/– highly specialized for speculation
[Figure: processors (P) connected to the ARB]
Slide 23
Our Approach: Find the Sweet Spot
Compiler:
+ global view of control flow
– hard/impossible to understand data dependences
Hardware:
– operates on a small window of instructions
+ observes dynamic memory accesses
leverage their respective strengths
Slide 24
The Sweet Spot
• Compiler:
– break programs into speculative threads
  • why: compiler has a global view of control flow
• Hardware:
– track data dependences
  • why: software comparison of all addresses infeasible
– recover from failed speculation
  • why: software buffering of all writes infeasible
important: minimize additional hardware
Slide 25
Outline
The Software/Hardware Sweet Spot
Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Slide 26
Compiler Support for TLS
[Figure: compiler flow from Sequential Source Code through Region Selection (which loops? uses profile information) and Transformation and Optimization (inserts TLS instructions) to a MIPS Executable]
Slide 27
Simple Performance Model
[Figure: four processors (P) with dependence tracking]
• 4 processors
• Each processor issues one instruction per cycle
• No communication latency between processors
shows potential performance benefit
Slide 28
Potential Improvement
[Figure: chart not preserved in the transcript]
significant impact on execution time
Slide 29
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Slide 30
Goals
1) Handle arbitrary memory accesses
– i.e. not just array references
2) Preserve single-thread performance
– keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
– within a chip and beyond
effective, simple, scalable
Slide 31
Requirements
1) Recover from failed speculation
• buffer speculative writes, keeping them away from memory
2) Track data dependences
• detect data dependence violations
each has several implementation options
Slide 32
Recover From Failed Speculation: Option 1
Augment the store buffer:
+ common device in superscalar processors
  • facilitates non-blocking stores
– too small
[Figure: processor with store buffer]
Slide 33
Recover From Failed Speculation: Option 2
Add a new dedicated buffer
+ can design an efficient speculation mechanism
– want to avoid large speculation-specific structures
[Figure: processor with a dedicated buffer]
Slide 34
Recover From Failed Speculation: Option 3
Augment the cache
+ very common structure
+ relatively large
[Figure: processor with cache]
just maintain single-thread performance
Slide 35
Tracking Data Dependences: Option 1
Add a dedicated “3rd-party” entity
– want to avoid large speculation-specific structures
– does not scale
[Figure: two processors (P) with caches (C) perform Load X and Store X; a separate dependence tracker detects the violation]
Slide 36
Tracking Data Dependences: Option 2
Detection at the producer
• producer informed of all addresses consumed
– awkward: producer must notify consumer of any violation
[Figure: the consumer's Load X sends the load address to the producer; the violation is detected at the producer's Store X]
Slide 37
Tracking Data Dependences: Option 3
Detection at the consumer
• consumers informed of all addresses produced
[Figure: the producer's Store X sends the store address to the consumer, which has performed Load X and detects the violation]
similar to invalidation-based cache coherence!
Slide 38
Augmenting the Cache
[Figure: processor (P) and its cache; each cache line has Tag, State, and Data fields]
Slide 39
Augmenting the Cache
[Figure: each cache line gains two bits next to Tag, State, and Data: SL (Speculatively Loaded) and SM (Speculatively Modified)]
modest amount of extra space
Slide 40
Augmenting the Cache
[Figure: four valid cache lines, all with SL = 0; three have SM = 1 and one (tag Y) has SM = 0]
when speculation fails…
Slide 41
Augmenting the Cache
[Figure: the three lines that had SM = 1 are now invalid; the line with tag Y is still valid; all SL and SM bits are 0]
…can quickly discard speculative state
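The discard step shown on these slides can be sketched in a few lines. This is an illustrative model only, assuming one SL bit and one SM bit per line as on the slides; the type `cache_line` and the functions `squash` and `commit` are hypothetical names, and the behaviour of `commit` (keep the data, clear the bits) is an assumption that the slides do not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool     valid;  /* ordinary coherence state, reduced to valid/invalid */
    bool     sl;     /* speculatively loaded   */
    bool     sm;     /* speculatively modified */
    unsigned tag;
    int      data;
} cache_line;

/* Speculation failed: lines holding speculative writes are invalidated,
 * everything else is kept, and all speculative bits are cleared. */
void squash(cache_line *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm)
            lines[i].valid = false;
        lines[i].sl = false;
        lines[i].sm = false;
    }
}

/* Speculation succeeded (assumed behaviour): the buffered data simply
 * becomes ordinary cached data once the speculative bits are cleared. */
void commit(cache_line *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        lines[i].sl = false;
        lines[i].sm = false;
    }
}
```

Calling `squash` on four valid lines whose SM bits are 1, 1, 0, 1 leaves only the third line valid, matching the before and after pictures on the slides. The loop stands in for what hardware could do for all lines at once.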
Slide 42
Extending Cache Coherence
[Figure: two processors run speculative threads 4 and 5; thread 5 has speculatively loaded X; thread 4 performs Store X and sends "invalidate X; from 4"; the receiver detects a violation because 4 < 5]
straightforward extension of cache coherence
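The check performed by the receiving cache can be written as a single predicate. This is a sketch of the rule shown on the slide (violation when the line was speculatively loaded and the sender is logically earlier); the names `spec_line` and `violates` are hypothetical, and the sketch deliberately leaves out what happens to the line's coherence state afterwards.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned tag;
    bool     valid;
    bool     sl;    /* speculatively loaded by the local thread */
} spec_line;

/* Decides whether an incoming "invalidate <tag>; from <sender>" message
 * signals a dependence violation at the thread numbered `receiver`.
 * A violation needs all of:
 *   - the message refers to this line and the line is valid,
 *   - the local thread has speculatively loaded the line,
 *   - the sender is logically earlier than the receiver. */
bool violates(const spec_line *line, unsigned tag,
              unsigned sender, unsigned receiver)
{
    if (!line->valid || line->tag != tag)
        return false;
    return line->sl && sender < receiver;
}
```

For the case on the slide, a line X with SL set in thread 5 and an invalidation from thread 4, the predicate is true (4 < 5). The same message from a logically later thread, or for a line that was never speculatively loaded, is not a violation.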
Slide 43
Detailed Performance Model
Underlying architecture
– single-chip multiprocessor
– implements speculative coherence
Simulator
– superscalar, a modernized MIPS R10K
– models all bandwidth and contention
[Figure: processors (P) and caches (C) connected by a crossbar]
detailed simulation!
Slide 44
Will it Work at All of These Scales?
[Figure: threads at every scale of machine: supercomputers, desktops, Chip Multiprocessor (CMP), Simultaneous Multithreading]
yes: coherence scales up and down
Slide 45
Performance on Multi-Chip Systems
[Figure: chart not preserved in the transcript]
our scheme is scalable
Slide 46
Performance on General-Purpose Applications
[Figure: chart not preserved in the transcript]
significant performance improvements
Slide 47
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
Improving Value Communication
• Conclusions
Slide 48
Speculate
[Figure: Store *p in one thread and Load *q in another, both going to memory]
good when p != q
Slide 49
Synchronize (and forward)
[Figure: the thread performing Load *q waits (stalls) until the thread performing Store *p signals; the Speculate case is shown alongside for comparison]
good when p == q
Slide 50
Reduce the Critical Forwarding Path
[Figure: three panels. Overview: Wait and Load X in one thread, Store X and Signal in the other. Big Critical Path: long stall and longer execution time. Small Critical Path: shorter execution time]
decreases execution time
Slide 51
Predict
[Figure: a value predictor supplies the value for Load *q; the Synchronize (Signal, Wait/stall) and Speculate cases are shown alongside for comparison]
good when p == q and *q is predictable
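The "good when" conditions on the Speculate, Synchronize, and Predict slides amount to a small decision rule. The sketch below only restates that rule; it assumes the two facts are already known for a given store/load pair (for example, estimated from profiling), and the names `comm_mode` and `choose_mode` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SPECULATE, SYNCHRONIZE, PREDICT } comm_mode;

/* p_equals_q:   the store through p and the load through q touch the
 *               same location (a real dependence between threads).
 * q_predictable: the value loaded through q can be predicted.
 *
 *   p != q                          -> speculate
 *   p == q, value predictable       -> predict
 *   p == q, value not predictable   -> synchronize (and forward) */
comm_mode choose_mode(bool p_equals_q, bool q_predictable)
{
    if (!p_equals_q)
        return SPECULATE;
    return q_predictable ? PREDICT : SYNCHRONIZE;
}
```

In practice neither input is known exactly ahead of time, which is why the choice made at compile time may be worth revisiting while the program runs.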
Slide 52
Improving on Compile-Time Decisions
[Figure: compiler and hardware each decide how a value is communicated (predict, speculate, or synchronize), and each can reduce the critical forwarding path]
improve the efficiency of value communication
Slide 53
Techniques
Prediction
– memory value prediction
– forwarded value prediction
– silent stores
Synchronization
– dynamic synchronization
– compiler scheduling to reduce the critical path
– hardware prioritization to reduce the critical path ($$$)
inexpensive, except for hardware prioritization
Slide 54
Execution Time Breakdown
[Figure: chart not preserved in the transcript]
Slide 55
Performance on 4 Processors
[Figure: chart not preserved in the transcript; S=Sequential, B=Baseline]
lots of failed speculation and synchronization
Slide 56
Performance on 4 Processors
[Figure: chart not preserved in the transcript; S=Sequential, B=Baseline, O=Optimizations]
significant improvement
Slide 57
Conclusions
• TLS may be the next big win
• Industry-friendly hardware is possible
• Efficient value communication is key
Ongoing/future work:
– compiler: improving region selection and coverage
– hardware: improve cache locality