Idempotent Processor Architecture
Marc de Kruijf, Karthikeyan Sankaralingam
Vertical Research Group, UW-Madison
MICRO 2011, Porto Alegre
Idempotent Processor Architecture
idempotent regions
applications naturally decompose into idempotent regions
idempotence: re-execution has no side effects (inputs are preserved)
region entry point: implicit checkpoint in the live state of the program
[Figure: idempotent regions and the “live state” at a region entry point]
Idempotent Processor Architecture
idempotent recovery
[Figure: conventional recovery (restore from a CHECKPOINT) vs. idempotent recovery]
recovery without checkpoints – just re-execute
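The contrast between the two recovery styles can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the function names and the dictionary used as "state" are assumptions made for the example. The conventional version must save a copy of the state because its region overwrites its own input; the idempotent version preserves its inputs, so recovery is simply running the region again.

```python
import copy

def region_in_place(state):
    """Non-idempotent region (hypothetical): overwrites its own input x."""
    state["x"] = state["x"] + 1

def region_preserving(state):
    """Idempotent region (hypothetical): reads x, writes only y."""
    state["y"] = state["x"] + 1

def conventional_recovery(state):
    checkpoint = copy.deepcopy(state)  # explicit checkpoint taken up front
    region_in_place(state)             # first attempt; assume a fault is detected
    state.clear()
    state.update(checkpoint)           # restore the checkpoint ...
    region_in_place(state)             # ... then re-execute
    return state

def idempotent_recovery(state):
    region_preserving(state)           # first attempt; assume a fault is detected
    region_preserving(state)           # no checkpoint needed: just re-execute
    return state

print(conventional_recovery({"x": 1}))  # {'x': 2}
print(idempotent_recovery({"x": 1}))    # {'x': 1, 'y': 2}
```

Without the checkpoint, running `region_in_place` twice would leave `x` at 3 rather than 2, which is the kind of error the checkpoint exists to prevent.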
Idempotent Processor Architecture
idempotent processors
[Figure: pipeline diagrams (Fetch, Decode, Issue, RF, Exec, WB) for a conventional processor and an idempotent processor]
simpler decode, issue, execute, and writeback
Presentation Overview
❶ Idempotent Recovery
❷ Idempotent Processors
❸ Evaluation
Idempotent Recovery
example idempotent region
1. R2 = add R0, R1
2. R3 = ld [R2 + 4]
3. R2 = add R2, R4
4. beqz R2, NEXT
“Something went wrong executing instruction 2!” (page fault, corrupted write, etc.)
“No problem – just re-execute from instruction 1!”
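The four-instruction example above can be simulated to see why restarting from instruction 1 is safe. The following is a minimal sketch, not the authors' code: the `Fault` class, the `fault_at` parameter, and the register and memory values are assumptions chosen for illustration. The region never overwrites its live-in registers (R0, R1, R4) or memory, so a faulting run followed by a full re-execution ends in the same state as a clean run.

```python
class Fault(Exception):
    """Stands in for a page fault or other error (hypothetical)."""

def run_region(regs, mem, fault_at=None):
    # 1. R2 = add R0, R1
    regs["R2"] = regs["R0"] + regs["R1"]
    # 2. R3 = ld [R2 + 4]
    if fault_at == 2:
        raise Fault("instruction 2 failed")
    regs["R3"] = mem[regs["R2"] + 4]
    # 3. R2 = add R2, R4
    regs["R2"] = regs["R2"] + regs["R4"]
    # 4. beqz R2, NEXT  (returns True if the branch is taken)
    return regs["R2"] == 0

def run_with_recovery(regs, mem, fault_at=None):
    try:
        return run_region(regs, mem, fault_at)
    except Fault:
        # R0, R1, R4 and memory are untouched, so re-execute from the top.
        return run_region(regs, mem)

mem = {12: 7}
clean = {"R0": 3, "R1": 5, "R4": -8}
faulty = dict(clean)
taken_clean = run_with_recovery(clean, mem)
taken_faulty = run_with_recovery(faulty, mem, fault_at=2)
print(clean == faulty, taken_clean, taken_faulty)  # True True True
```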
Idempotent Recovery
idempotent regions
idempotent “regions”: freely re-executable program regions
[Figure: regions produced by a normal compiler vs. a custom compiler]
region sizes inhibited by clobber antidependences (WAR after no RAW)
special marker instruction adds some runtime overhead (typically 2-10%)
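A clobber antidependence (a write after a read, where the read was not preceded by a write in the same region) is what forces a region boundary. The sketch below cuts a straight-line instruction sequence into regions at each such write. It is a simplified illustration under stated assumptions, not the compiler algorithm from the paper: it models registers only, ignores memory and control flow, and the `(dst, srcs)` tuple encoding and the name `split_idempotent_regions` are invented for the example.

```python
def split_idempotent_regions(instrs):
    """Split straight-line code into regions free of clobber antidependences.

    instrs: list of (dst, srcs) tuples; dst is a register name or None.
    Returns a list of regions, each a list of instruction indices.
    """
    regions, current = [], []
    written = set()        # registers written so far in the current region
    live_in_reads = set()  # registers read before any write in the region
    for i, (dst, srcs) in enumerate(instrs):
        new_reads = {s for s in srcs if s not in written}
        if current and dst in (live_in_reads | new_reads):
            # Writing dst would clobber a region input: cut before this instruction.
            regions.append(current)
            current, written, live_in_reads = [], set(), set()
            new_reads = set(srcs)
        # A region of one instruction is accepted as-is (assumes the
        # instruction itself either completes or has no effect).
        live_in_reads |= new_reads
        if dst is not None:
            written.add(dst)
        current.append(i)
    if current:
        regions.append(current)
    return regions

example = [
    ("R2", ["R0", "R1"]),  # R2 = add R0, R1
    ("R3", ["R2"]),        # R3 = ld [R2 + 4]
    ("R2", ["R2", "R4"]),  # R2 = add R2, R4 (WAR after RAW: not a clobber)
    (None, ["R2"]),        # beqz R2, NEXT
]
print(split_idempotent_regions(example))                     # [[0, 1, 2, 3]]
print(split_idempotent_regions(example + [("R0", ["R3"])]))  # [[0, 1, 2, 3], [4]]
```

In the second call the appended instruction overwrites R0, which the region read as an input, so it has to start a new region.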
Idempotent Recovery
average idempotent region sizes
[Figure: bar chart of average idempotent region size in ARM micro-ops (log scale, 10 to 1000) with the custom compiler, for benchmark suites (geo-mean): SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL; and select benchmarks: 401.bzip2, 456.hmmer. Labelled value: 43]
chart annotations:
frequent clobber antidependences – can be removed, but requires large restructuring effort
limited aliasing information
with func params marked using C/C++ “restrict”
Idempotent Processors
exploring the opportunity
[Figure: three recovery scenarios: hardware faults (Vdd, In, Out); exceptions & out-of-order retirement (issue, execute, retire); branch misprediction]
in-order, out-of-order, multi-core
Idempotent Processors
steps one, two, and three
Step 1: construct a high-performance in-order processor
Step 2: prune out unnecessary parts
Step 3: optimize for energy efficiency
ARM Cortex-A8 (’05), IBM Cell SPE (’05), Intel Atom (’08)
↓ power, area, & complexity
↑ performance at low cost
Step 1: Construction
v1.0
[Figure: pipeline with Fetch, Decode & Issue, RF, and functional units (Integer ×2, Branch, Multiply, Load/Store, FP, …), with an Add and a Ld in flight]
Exception!
Step 1: Construction
v1.0
[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; and functional units (Integer ×2, Multiply, Load/Store, Branch, FP, …)]
Staged instruction completion
Flush?
Step 1: Construction
v1.0, v1.1
[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; and functional units (Integer ×2, Multiply, Load/Store, Branch, FP, …, IEEE FP … ?)]
FP exceptions handled in hardware. Separate FP unit implements full IEEE 754.
Flush?
Step 1: Construction
v1.0, v1.1, v1.2
[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; Replay queue; and functional units (Integer ×2, Multiply, Load/Store, Branch, FP, …, IEEE FP, …)]
Load miss? Have to flush!
Flush? Flush? Replay?
Step 1: Construction
v1.0, v1.1, v1.2, v1.3, …
[Figure: pipeline with Fetch; Decode, Rename, & Issue; RF; Bypass; Replay queue; and functional units (Integer ×2, Multiply, Load/Store, Branch, FP, …, IEEE FP)]
Flush? Replay?
TOO COMPLICATED!
Step 2: Simplification
idempotent edition (simple)
[Figure: pipeline with Fetch, Decode & Issue, RF, and functional units (Integer ×2, Branch, Multiply, Load/Store, FP, …)]
WHAT IS GONE?
• staging register file (6 entries)
• replay queue (8 entries)
• entire rename pipeline stage
• IEEE-compliant floating point unit
• pipeline flush for exceptions and replays
• all associated control logic
Step 3: Optimization
idempotent edition (fast)
[Figure: pipeline with Fetch, Decode & Issue, RF, and functional units (Integer ×2, Branch, Multiply, Load/Store, FP, …), plus an SDB*]
* Slice Data Buffer (SDB) – Continual Flow Pipelines. ASPLOS ‘04
details in paper…
Evaluation
idempotent processor performance
[Figure: bar chart of speed-up over in-order (axis from -25% to 50%) for Simple Idem, Fast Idem, and OoO, across benchmark suites (geo-mean): SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL; and select benchmarks: 401.bzip2, 456.hmmer]
Evaluation
summary

| Processor Type | Performance (vs. In-Order) | Power/Complexity (vs. In-Order) |
| --- | --- | --- |
| simple idempotent | worse by ~5% (compilation & serialization overheads) | better |
| fast idempotent | better by ~5% (modest OoO execution) | same or better |
| out-of-order | better by ~30% | worse |
Future Work
– quantify power/complexity benefits (build real hardware prototype)
– more general error conditions (hardware faults, branch prediction, etc.)
– impact on multithreading/multiprocessors (re-execution currently assumes no interference)
– region overlapping (“region pipelining”) (analogous to overlapping checkpoints)
Conclusions
recovery using idempotence
– recovery without checkpoints
multiple uses and multiple designs
– uses: exception, speculation, fault recovery, and more
– designs: in-order, out-of-order, multi-core, GPU, and more
in this work: exception recovery + in-order design
– simplified out-of-order execution
– better performance at equal or lower power/complexity
Back-up Slides
Optimal Idempotent Region Size?
[Figure: overhead (y-axis) vs. region size (x-axis), with curves for serialization overhead, compiler formation overhead (many factors), and re-execution overhead]
(rough sketch – graph not to scale)
Optimal Processor Design?
[Figure: design spectrum from ~250mW to ~2.5W: single-issue in-order, dual-issue in-order, dual-issue OoO, quad-issue OoO]
Compiler overheads dominate?
Re-execution overheads dominate?
Best potential?
Out-of-Order Issue Processors?
Some additional challenges…
Re-execution overhead high if mis-speculation frequent
– cannot restart from point of mis-speculation, and hence…
– re-execution overhead on average ≈ half the region
Example: branch misprediction
– With in-order issue, simple to flush/drain pipeline
– With out-of-order issue, we can use idempotence but…
– 5 branches/region @ 90% confidence ≈ 41% re-execution rate
1. Speculate only high-confidence branches
2. Hybrid checkpointing/idempotence
3. …?
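The 41% figure follows from treating the five branch predictions as independent: a region re-executes if at least one branch in it is mispredicted, which happens with probability 1 − 0.9⁵ ≈ 0.41. The short Python sketch below reproduces the number and shows how it changes with prediction confidence. The independence assumption and the function names are ours, not the talk's, and the wasted-work estimate is only a rough first-order figure that applies the "half the region" rule of thumb once.

```python
def reexecution_rate(branches_per_region, confidence):
    """Probability that at least one branch in a region is mispredicted,
    assuming each prediction is independently correct with `confidence`."""
    return 1.0 - confidence ** branches_per_region

def expected_wasted_fraction(branches_per_region, confidence):
    """Rough estimate of wasted work per region attempt: on a
    mis-speculation, about half the region is thrown away on average."""
    return 0.5 * reexecution_rate(branches_per_region, confidence)

rate = reexecution_rate(5, 0.90)
print(f"{rate:.1%}")  # 41.0%

for conf in (0.90, 0.95, 0.99):
    print(conf,
          round(reexecution_rate(5, conf), 3),
          round(expected_wasted_fraction(5, conf), 3))
```

Under this model, raising per-branch confidence from 90% to 99% lowers the re-execution rate from about 41% to about 5%, which is the motivation for option 1 (speculating only on high-confidence branches).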