compiler construction of idempotent regions and applications in architecture design
DESCRIPTION
Compiler Construction of Idempotent Regions and Applications in Architecture Design. Marc de Kruijf Advisor: Karthikeyan Sankaralingam. PhD Defense 07/20/2012. Example. source code. int sum( int *array, int len ) { int x = 0; for ( int i = 0; i < len ; ++ i ) - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/1.jpg)
Compiler Construction of Idempotent Regions and Applications in Architecture
DesignMarc de Kruijf
Advisor: Karthikeyan Sankaralingam
PhD Defense 07/20/2012
![Page 2: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/2.jpg)
Example
2
int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x;}
source code
![Page 3: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/3.jpg)
Example
3
R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3
assembly code
F F F F0
faults
exceptions
x
load ?
mis-speculations
![Page 4: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/4.jpg)
Example
4
R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3
assembly code
BAD STUFF HAPPENS!
![Page 5: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/5.jpg)
R0 and R1 are unmodified
R2 = load [R1] R3 = 0LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOPEXIT: return R3
Example
5
assembly code
just re-execute!
convention:use checkpoints/buffers
![Page 6: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/6.jpg)
It’s Idempotent!
6
idempoh… what…?
int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x;}
=
![Page 7: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/7.jpg)
7
Thesis
idempotent regionsALL THE TIME
specifically… – using compiler analysis (intra-procedural) – transparent; no programmer intervention – hardware/software co-design – software analysis hardware execution
the thing that I am defending
![Page 8: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/8.jpg)
8
Thesisprelim.pptx defense.pptx
preliminary exam (11/2010) – idempotence: concept and simple empirical analysis – compiler: preliminary design & partial implementation – architecture: some area and power savings…?
defense (07/2012) – idempotence: formalization and detailed empirical analysis – compiler: complete design, source code release*
– architecture: compelling benefits (various)
* http://research.cs.wisc.edu/vertical/iCompiler
![Page 9: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/9.jpg)
9
Contributions & Findingsa summary
contribution areas – idempotence: models and analysis framework – compiler: design, implementation, and evaluation – architecture: design and evaluation
findings – potentially large idempotent regions exist in applications – for compilation, larger is better – small regions (5-15 instructions), 10-15% overheads – large regions (50+ instructions), 0-2% overheads – enables efficient exception and hardware fault recovery
![Page 10: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/10.jpg)
10
Overview
❶ Idempotence Models in Architecture
❷ Compiler Design & Evaluation
❸ Architecture Design & Evaluation
![Page 11: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/11.jpg)
11
Idempotence Models
MODEL AAn input is a variable that is live-in to the region. A region
preserves an input if the input is not overwritten.
idempotence: what does it mean?
DEFINITION(1) a region is idempotent iff its re-execution has no side-effects
(2) a region is idempotent iff it preserves its inputs
OK, but what does it mean to preserve an input?four models (next slides): A, B, C, & D
![Page 12: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/12.jpg)
12
Idempotence Model A
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
Live-ins:{all registers}
a starting point
{all memory}\{R1}
?? = mem[R4]
…
?
![Page 13: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/13.jpg)
13
Idempotence Model A
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
Live-ins:{all registers}
a starting point
{all memory}
![Page 14: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/14.jpg)
14
?? = mem[R4]
…
?
Idempotence Model A
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
Live-ins:{all registers}
a starting point
TWO REGIONS. SOME STUFF NOT
COVERED.
{all memory} \{mem[R4]}
![Page 15: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/15.jpg)
15
?? = mem[R4]
…
?
Idempotence Model B
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
live-in but dynamically dead at time of write– OK to overwrite if control flow invariable
varying control flow assumptions
ONE REGION. SOME STUFF STILL NOT
COVERED.
![Page 16: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/16.jpg)
16
Idempotence Model C
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
allow final instruction to overwrite input(to include otherwise ineligible instructions)
varying sequencing assumptions
ONE REGION. EVERYTHING COVERED.
![Page 17: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/17.jpg)
17
Idempotence Model D
R1 = R2 + R3if R1 > 0
mem[R4] = R1
SP = SP - 16
false
true
may be concurrently read in another thread– consider as input
varying isolation assumptions
TWO REGIONS. EVERYTHING COVERED.
![Page 18: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/18.jpg)
18
Idempotence Modelsan idempotence taxonomy
sequencing axis
control axis
isolation axis
(Model C)
(Model B)
(Model D)
![Page 19: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/19.jpg)
19
what are the implications?
![Page 20: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/20.jpg)
20
Empirical Analysismethodology
measurement – dynamic region size (path length) – subject to axis constraints – x86 dynamic instruction count (using PIN)
benchmarks – SPEC 2006, PARSEC, and Parboil suites
experimental configurations – unconstrained: ideal upper bound (Model C) – oblivious: actual in normal compiled code (Model C) – X-constrained: ideal upper bound constrained by axis X
![Page 21: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/21.jpg)
21
Empirical Analysis
SPEC INT SPEC FP PARSEC Parboil OVERALL1
10
100
1000
obliviousunconstrained
*geometrically averaged across suites
oblivious vs. unconstrained
5.2
160.2
30X IMPROVEMENT ACHIEVABLE*
aver
age
regi
on si
ze*
![Page 22: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/22.jpg)
22
Empirical Analysis
SPEC INT SPEC FP PARSEC Parboil OVERALL1
10
100
1000
obliviousunconstrainedcontrol-constrained
control axis sensitivity
160.2
5.2
40.1
*geometrically averaged across suites
aver
age
regi
on si
ze*
4X DROP IF VARIABLE CONTROL FLOW
![Page 23: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/23.jpg)
23
Empirical Analysis
SPEC INT SPEC FP PARSEC Parboil OVERALL1
10
100
1000
obliviousunconstrainedcontrol-constrainedisolation-constrained
isolation axis sensitivity
160.2
5.2
40.127.4
*geometrically averaged across suites
aver
age
regi
on si
ze*
1.5X DROP IF NO ISOLATION
![Page 24: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/24.jpg)
24
non-
idem
pote
nt in
stru
ction
s*
Empirical Analysis
SPEC INT SPEC FP PARSEC Parboil OVERALL0%
1%
2%
sequencing axis sensitivity
0.19%
*geometrically averaged across suites
INHERENTLY NON-IDEMPOTENT
INSTRUCTIONS:
RARE AND OF LIMITED TYPE
![Page 25: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/25.jpg)
25
Idempotence Modelsa summary
a spectrum of idempotence models – significant opportunity: 100+ sizes possible – 4x reduction constraining control axis – 1.5x reduction constraining isolation axis
two models going forward – architectural idempotence & contextual idempotence – both are effectively the ideal case (Model C) – architectural idempotence: invariable control always – contextual idempotence: variable control w.r.t. locals
![Page 26: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/26.jpg)
26
Overview
❶ Idempotence Models in Architecture
❷ Compiler Design & Evaluation
❸ Architecture Design & Evaluation
![Page 27: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/27.jpg)
27
Compiler Design
PARTITION: ANALYZE:
choose your own adventure
CODE GEN:
PARTITION
ANALYZE CODE GENX
COMPILER EVALUATION
identify semantic clobber antidependencescut semantic clobber antidependencespreserve semantic idempotence
![Page 28: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/28.jpg)
28
Compiler Evaluationpreamble
NOT FreeFreeFree
WHAT DO YOU MEAN:PERFORMANCE OVERHEADS?
PARTITION: ANALYZE:
CODE GEN:
identify semantic clobber antidependencescut semantic clobber antidependencespreserve semantic idempotence
![Page 29: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/29.jpg)
29
Compiler Evaluationpreamble
region size
over
head
register pressure
- preserve input values in registers- spill other values (if needed)
- spill input values to stack- allocate other values to registers
![Page 30: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/30.jpg)
30
Compiler Evaluation
compiler implementation – LLVM, support for both x86 and ARM
methodology
measurements – performance overhead: dynamic instruction count – for x86, using PIN – for ARM, using gem5 (just for ISA comparison at end)
– region size: instructions between boundaries (path length) – x86 only, using PIN
benchmarks – SPEC 2006, PARSEC, and Parboil suites
![Page 31: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/31.jpg)
31
Results, Take 1/3
SPEC INT SPEC FP PARSEC Parboil OVERALL0%2%4%6%8%
10%12%14%16%
initial results – overheadpe
rform
ance
ove
rhea
d
percentage increase in x86 dynamic instruction countgeometrically averaged across suites
13.1%
NON-TRIVIAL OVERHEAD
![Page 32: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/32.jpg)
0%
10%
20%
30%
40%
50%
Results, Take 1/3
region size
over
head
YOU ARE HERE(typically 10-30 instructions)
?
analysis of trade-offs
32
10+ instructions register pressure
![Page 33: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/33.jpg)
0%
10%
20%
30%
40%
50%
Results, Take 1/3
region size
over
head
register pressure
analysis of trade-offs
33
detectionlatency
?
𝒍= 𝒅𝒓+𝒅
?
![Page 34: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/34.jpg)
34
Results, Take 2/3
SPEC INT SPEC FP PARSEC Parboil OVERALL0%2%4%6%8%
10%12%14%16%
minimizing register pressurepe
rform
ance
ove
rhea
d
13.1%
STILL NON-TRIVIAL OVERHEAD
11.1%
Before After
![Page 35: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/35.jpg)
0%
10%
20%
30%
40%
50%
Results, Take 2/3
region size
over
head
register pressure
analysis of trade-offs
35
detectionlatency
𝒍= 𝒅𝒓+𝒅 𝒆∝( 𝟏
𝟏−𝒑 )𝒓
re-executiontime
?
![Page 36: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/36.jpg)
Big Regions
36
how do we get there?
Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; really hurts loops
Problem #2: loop optimizations – boundaries in loops are bad for everyone – loop blocking, fission/fusion, inter-change, peeling, unrolling, scalarization, etc. can all help
Problem #3: large array structures – awareness of array access patterns can help
Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
![Page 37: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/37.jpg)
Big Regions
37
how do we get there?
solutions can be automated – a lot of work… what would be the gain?
ad hoc for now – consider PARSEC and Parboil suites as a case study – aliasing annotations – manual loop refactoring, scalarization, etc. – partitioning algorithm refinements (application-specific) – inlining annotations
![Page 38: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/38.jpg)
38
Results Take 3/3
PARSEC Parboil OVERALL0%2%4%6%8%
10%12%14%
big regionspe
rform
ance
ove
rhea
d 13.1%
0.06%
Before After
GAIN = BIG; TRIVIAL OVERHEAD
![Page 39: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/39.jpg)
39
1 10 100 1000 10000 100000 1000000100000000%
10%
20%
30%
40%
50%
Results Take 3/350+ instructions is good enough
region size
over
head
registerpressure
50+ instructions
(mis-optimized)
![Page 40: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/40.jpg)
40
ISA Sensitivityyou might be curious
ISA matters? (1) two-address (e.g. x86) vs. three-address (e.g. ARM) (2) register-memory (e.g. x86) vs. register-register (e.g. ARM) (3) number of available registers
the short version ( ) – impact of (1) & (2) not significant (+/- 2% overall) – even less significant as regions grow larger – impact of (3): to get same performance w/ idempotence – increase registers by 0% (large) to ~60% (small regions)
![Page 41: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/41.jpg)
41
Compiler Design & Evaluationa summary
design and implementation – static analysis algorithms, modular and perform well – code-gen algorithms, modular and perform well – LLVM implementation source code available*
findings – pressure-related performance overheads range from: – 0% (large regions) to ~15% (small regions) – greatest opportunity: loop-intensive applications – ISA effects are insignificant* http://research.cs.wisc.edu/vertical/iCompiler
![Page 42: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/42.jpg)
42
Overview
❶ Idempotence Models in Architecture
❷ Compiler Design & Evaluation
❸ Architecture Design & Evaluation
![Page 43: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/43.jpg)
Architecture Recovery: It’s Real
43
safety first speed first (safety second)
![Page 44: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/44.jpg)
44
1 Fetch
4 Write-back
3 Execute
2 Decode
Architecture Recovery: It’s Reallots of sharp turns
1 Fetch 2 Decode 3 Execute 4 Write-back
closer to the truth
![Page 45: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/45.jpg)
45
1 Fetch
4 Write-back
3 Execute
2 Decode
1 Fetch 2 Decode 3 Execute 4 Write-back
!!!
Architecture Recovery: It’s Reallots of interaction
too late!
![Page 46: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/46.jpg)
46
Architecture Recovery: It’s Realbad stuff can happen
mis-speculation
(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,
etc.
hardware faults
(d) wear-out fault,(e) particle strike,(f) voltage spike,
etc.
exceptions
(g) page fault,(h) divide-by-zero,
(i) mis-aligned access,etc.
![Page 47: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/47.jpg)
47
Architecture Recovery: It’s Realbad stuff can happen
mis-speculation
(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,
etc.
hardware faults
(d) wear-out fault,(e) particle strike,(f) voltage spike,
etc.
exceptions
(g) page fault,(h) divide-by-zero,
(i) mis-aligned access,etc.
register pressure
detectionlatency
re-executiontime
![Page 48: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/48.jpg)
48
Architecture Recovery: It’s Realbad stuff can happen
mis-speculation
(a) branch mis-prediction,(b) memory re-ordering,(c) transaction violation,
etc.
hardware faults
(d) wear-out fault,(e) particle strike,(f) voltage spike,
etc.
exceptions
(g) page fault,(h) divide-by-zero,
(i) mis-aligned access,etc.
![Page 49: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/49.jpg)
49
Architecture Recovery: It’s Realbad stuff can happen
hardware faultsexceptions
(d) wear-out fault,(e) particle strike,(f) voltage spike,
etc.
(g) page fault,(h) divide-by-zero,
(i) mis-aligned access,etc.
integrated GPU
low-power CPU high-reliability systems
![Page 50: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/50.jpg)
50
GPU Exception Support
![Page 51: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/51.jpg)
51
GPU Exception Supportwhy would we want it?
GPU/CPU integration– unified address space: support for demand paging– numerous secondary benefits as well…
![Page 52: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/52.jpg)
52
GPU Exception Supportwhy is it hard?
the CPU solution
pipeline registersbuffers
![Page 53: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/53.jpg)
53
GPU Exception Supportwhy is it hard?
CPU: 10s of registers/core
…
GPU: 10s of registers/thread32 threads/warp48 warps per “core”
10,000s of registers/core
![Page 54: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/54.jpg)
54
GPU Exception Supportidempotence on GPUs
GPUs hit the sweet spot (1) extractably large regions (low compiler overheads) (2) detection latencies long or hard to bound (large is good) (3) exceptions are infrequent (low re-execution overheads)
registerpressure
detectionlatency
re-executiontime
![Page 55: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/55.jpg)
55
GPU Exception Supportidempotence on GPUs
GPU design topics – compiler flow – hardware support – exception live-lock – bonus: fast context switching
DETAILS
GPUs hit the sweet spot (1) extractably large regions (low compiler overheads) (2) detection latencies long or hard to bound (large is good) (3) exceptions are infrequent (low re-execution overheads)
![Page 56: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/56.jpg)
56
GPU Exception Supportevaluation methodology
compiler – LLVM targeting ARM
benchmarks – Parboil GPU benchmarks for CPUs, modified
simulation – gem5 for ARM: simple dual-issue in-order (e.g. Fermi) – 10-cycle page fault detection latency
measurement – performance overhead in execution cycles
![Page 57: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/57.jpg)
57
GPU Exception Support
cutcp fft histo mri-q sad tpacf gmean0.0%0.2%0.4%0.6%0.8%1.0%1.2%1.4%1.6%
evaluation resultspe
rform
ance
ove
rhea
d
0.54%NEGLIGIBLE OVERHEADS
![Page 58: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/58.jpg)
58
CPU Exception Support
![Page 59: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/59.jpg)
59
CPU Exception Supportwhy is it a problem?
the CPU solution
pipeline registersbuffers
![Page 60: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/60.jpg)
60
CPU Exception Supportwhy is it a problem?
Decode, Rename,& Issue
Fetch
Integer
Integer
Multiply
Load/Store
RF
Branch
FP…
IEEE FP
Bypass
Replay queue
Flush?Replay?
…
Before
![Page 61: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/61.jpg)
61
CPU Exception Supportwhy is it a problem?
Fetch Decode & Issue
Integer
Integer
Branch
Multiply
Load/Store
FP
RF
…
After
![Page 62: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/62.jpg)
62
CPU Exception Supportidempotence on CPUs
CPU design simplification – in ARM Cortex-A8 (dual-issue in-order) can remove: – bypass / staging register file, replay queue – rename pipeline stage – IEEE-compliant floating point unit – pipeline flush for exceptions and replays – all associated control logic
DETAILS
leaner hardware – bonus: cheap (but modest) OoO issue
![Page 63: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/63.jpg)
63
CPU Exception Supportevaluation methodology
compiler – LLVM targeting ARM, minimize pressure (take 2/3)
benchmarks – SPEC 2006 & PARSEC suites (unmodified)
simulation – gem5 for ARM: aggressive dual-issue in-order (e.g. A8) – stall on potential in-flight exception
measurement – performance overhead in execution cycles
![Page 64: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/64.jpg)
64
CPU Exception Support
SPEC INT SPEC FP PARSEC OVERALL0%2%4%6%8%
10%12%14%
evaluation resultspe
rform
ance
ove
rhea
d
9.1%
LESS THAN 10%... BUT NOT AS GOOD AS
GPU (REGIONS TOO SMALL)
![Page 65: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/65.jpg)
65
Hardware Fault Tolerance
![Page 66: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/66.jpg)
66
Hardware Fault Tolerancewhat is the opportunity?
reliability trends – CMOS reliability is a growing problem – future CMOS alternatives are no better
architecture trends – hardware power and complexity are premium – desire for simple hardware + efficient recovery
application trends – emerging workloads consist of large idempotent regions – increasing levels of software abstraction
![Page 67: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/67.jpg)
67
Hardware Fault Tolerancedesign topics
hardware organizations – homogenous: idempotence everywhere – statically heterogeneous: e.g. accelerators – dynamically heterogeneous: adaptive cores
FAULT MODEL
fault detection capability – fine-grained in hardware (e.g. Argus, MICRO ‘07) or – fine-grained in software (e.g. instruction/region DMR)
fault model (aka ISA semantics) – similar to pipeline-based (e.g. ROB) recovery
![Page 68: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/68.jpg)
68
Hardware Fault Toleranceevaluation methodology
compiler – LLVM targeting ARM (compiled to minimize pressure)
benchmarks – SPEC 2006, PARSEC, and Parboil suites (unmodified)
simulation – gem5 for ARM: simple dual-issue in-order – DMR detection; compare against checkpoint/log and TMR
measurement – performance overhead in execution cycles
![Page 69: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/69.jpg)
69
Hardware Fault Toleranceevaluation results
SPEC INTSPEC FP
PARSECParboil
OVERALL0%5%
10%15%20%25%30%35%
idempotencecheckpoint/logTMR9.1
22.229.3
perfo
rman
ce o
verh
ead
IDEMPOTENCE: BEATS THE
COMPETITION
![Page 70: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/70.jpg)
70
Overview
❶ Idempotence Models in Architecture
❷ Compiler Design & Evaluation
❸ Architecture Design & Evaluation
![Page 71: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/71.jpg)
71
RELATED WORK
CONCLUSIONS
![Page 72: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/72.jpg)
72
Conclusions
idempotence: not good for everything – small regions are expensive – preserving register state is difficult with limited flexibility – large regions are cheap – preserving register state is easy with amortization effect – preserving memory state is mostly “for free”
idempotence: synergistic with modern trends – programmability (for GPUs) – low power (for everyone) – high-level software efficient recovery (for everyone)
![Page 73: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/73.jpg)
73
The End
![Page 74: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/74.jpg)
74
Back-Up: Chronology
MapReduce for CELL
Time
SELSE ’09: Synergy
ISCA ‘10: Relax
MICRO ’11:Idempotent Processors
PLDI ’12:Static Analysis and Compiler Design
ISCA ’12:iGPU
DSN ’10:TS model
CGO ??:Code Gen
TACO ??:Models
prelim defense
![Page 75: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/75.jpg)
75
Choose Your Own Adventure Slides
![Page 76: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/76.jpg)
Idempotence Analysis
76
Yes
2
is this idempotent?
![Page 77: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/77.jpg)
Idempotence Analysis
77
No
2
how about this?
![Page 78: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/78.jpg)
Idempotence Analysis
78
Yes
2
maybe this?
![Page 79: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/79.jpg)
Idempotence Analysis
79
operation sequence dependence chain idempotent?
write
read, write
write, read, write Yes
No
Yes
it’s all about the data dependences
![Page 80: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/80.jpg)
Idempotence Analysis
80
operation sequence dependence chain idempotent?
write, read
read, write
write, read, write Yes
No
Yes
it’s all about the data dependences
CLOBBER ANTIDEPENDENCEantidependence with an exposed read
![Page 81: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/81.jpg)
Semantic Idempotence
81
(1) local (“pseudoregister”) state:can be renamed to remove clobber antidependences*
does not semantically constrain idempotence
two types of program state
(2) non-local (“memory”) state:cannot “rename” to avoid clobber antidependences
semantically constrains idempotence
semantic idempotence = no non-local clobber antidep.
preserve local state by renaming and careful allocation
![Page 82: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/82.jpg)
Region Partitioning Algorithm
82
steps one, two, and three
Step 1: transform functionremove artificial dependences, remove non-clobbers
Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior
Step 2: construct regions around antidependencescut all non-local antidependences in the CFG
![Page 83: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/83.jpg)
Step 1: Transform
83
Transformation 1: SSA for pseudoregister antidependences
not one, but two transformations
clobber antidependences
region boundaries
region identification
But we still have a problem:
depends on
![Page 84: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/84.jpg)
Step 1: Transform
84
Transformation 1: SSA for pseudoregister antidependences
not one, but two transformations
Transformation 2: Scalar replacement of memory variables
[x] = a;b = [x];[x] = c;
[x] = a;b = a;[x] = c;
non-clobber antidependences… GONE!
![Page 85: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/85.jpg)
Step 1: Transform
85
Transformation 1: SSA for pseudoregister antidependences
not one, but two transformations
Transformation 2: Scalar replacement of memory variables
clobber antidependences
region boundaries
region identification depends on
![Page 86: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/86.jpg)
Region Partitioning Algorithm
86
steps one, two, and three
Step 1: transform functionremove artificial dependences, remove non-clobbers
Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior
Step 2: construct regions around antidependencescut all non-local antidependences in the CFG
![Page 87: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/87.jpg)
Step 2: Cut the CFG
87
cut, cut, cut…
construct regions by “cutting” non-local antidependences
antidependence
![Page 88: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/88.jpg)
larger is (generally) better:large regions amortize the cost of input preservation
88
region size
over
head sources of overhead
optimal region size?
Step 2: Cut the CFG
rough sketch
but where to cut…?
![Page 89: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/89.jpg)
Step 2: Cut the CFG
89
but where to cut…?
goal: the minimum set of cuts that cuts allantidependence paths
intuition: minimum cuts fewest regions large regions
approach: a series of reductions:minimum vertex multi-cut (NP-complete)minimum hitting set among pathsminimum hitting set among “dominating nodes”
details omitted
![Page 90: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/90.jpg)
Region Partitioning Algorithm
90
steps one, two, and three
Step 1: transform functionremove artificial dependences, remove non-clobbers
Step 3: refine for correctness & performanceaccount for loops, optimize for dynamic behavior
Step 2: construct regions around antidependencescut all non-local antidependences in the CFG
![Page 91: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/91.jpg)
Step 3: Loop-Related Refinements
91
correctness: Not all local antidependences removed by SSA…
loops affect correctness and performance
loop-carried antidependences may clobberdepends on boundary placement; handled as a post-pass
performance: Loops tend to execute multiple times…
to maximize region size, place cuts outside of loopalgorithm modified to prefer cuts outside of loops
details omitted
![Page 92: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/92.jpg)
Code Generation Algorithms
92
idempotence preservation
background & concepts:live intervals, region intervals, and shadow intervals
compiling for architectural idempotence:invariable control flow upon re-execution
compiling for contextual idempotence:potentially variable control flow upon re-execution
![Page 93: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/93.jpg)
Code Generation Algorithmslive intervals and region intervals
x = ...
... = f(x)y = ...
93
region boundaries
region interval
x’s live interval
![Page 94: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/94.jpg)
Code Generation Algorithmsshadow intervals
94
shadow intervalthe interval over which a variable must not be overwritten
specifically to preserve idempotence
different for architectural and contextual idempotence
X
![Page 95: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/95.jpg)
Code Generation Algorithmsfor contextual idempotence
x = ...
... = f(x)y = ...
95
region boundaries
x’s shadow interval
x’s live interval
![Page 96: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/96.jpg)
Code Generation Algorithmsfor architectural idempotence
x = ...
... = f(x)y = ...
96
region boundaries
x’s shadow interval
x’s live interval
![Page 97: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/97.jpg)
Code Generation Algorithmsfor architectural idempotence
x = ...
... = f(x)y = ...
97
region boundaries
x’s shadow interval
x’s live interval
y’s live interval
![Page 98: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/98.jpg)
Big Regions
98
Re: Problem #2 (cut in loops are bad)
i0 = φ(0, i1)
i1 = i0 + 1if (i1 < X)
for (i = 0; i < X; i++) { ...}
C code CFG + SSA
![Page 99: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/99.jpg)
Big Regions
99
Re: Problem #2 (cut in loops are bad)
R0 = 0
R0 = R0 + 1if (R0 < X)
for (i = 0; i < X; i++) { ...}
C code machine code
NO BOUNDARIES = NO PROBLEM
![Page 100: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/100.jpg)
Big Regions
100
Re: Problem #2 (cut in loops are bad)
R0 = 0
R0 = R0 + 1if (R0 < X)
for (i = 0; i < X; i++) { ...}
C code machine code
![Page 101: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/101.jpg)
Big Regions
101
Re: Problem #2 (cut in loops are bad)
R1 = 0
R0 = R1
R1 = R0 + 1if (R1 < X)
for (i = 0; i < X; i++) { ...}
C code machine code
– “redundant” copy– extra boundary (pressure)
![Page 102: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/102.jpg)
Big Regions
102
Re: Problem #3 (array access patterns)
[x] = a;b = [x];[x] = c;
[x] = a;b = a;[x] = c;
non-clobber antidependences… GONE!
algorithm makes this simplifying assumption:
cheap for scalars, expensive for arrays
![Page 103: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/103.jpg)
Big Regions
103
Re: Problem #3 (array access patterns)
not really practical for large arraysbut if we don’t do it, non-clobber antidependences remain
solution: handle potential non-clobbers in a post-pass(same way we deal with loop clobbers in static analysis)
// initialize:int[100] array;memset(&array, 100*4, 0);// accumulate:for (...) array[i] += foo(i);
![Page 104: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/104.jpg)
Big Regions
104
Benchmark Problems Size Before Size After
blackscholes ALIASING, SCOPE 78.9 >10,000,000
canneal SCOPE 35.3 187.3
fluidanimate ARRAYS, LOOPS, SCOPE
9.4 >10,000,000
streamcluster ALIASING 120.7 4,928
swaptions ALIASING, ARRAYS 10.8 211,000
cutcp LOOPS 21.9 612.4
fft ALIASING 24.7 2,450
histo ARRAYS, SCOPE 4.4 4,640,000
mri-q – 22,100 22,100
sad ALIASING 51.3 90,000
tpacf ARRAYS, SCOPE 30.2 107,000
results: sizes
![Page 105: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/105.jpg)
Big Regions
105
Benchmark Problems Overhead Before Overhead After
blackscholes ALIASING, SCOPE -2.93% -0.05%
canneal SCOPE 5.31% 1.33%
fluidanimate ARRAYS, LOOPS, SCOPE
26.67% -0.62%
streamcluster ALIASING 13.62% 0.00%
swaptions ALIASING, ARRAYS 17.67% 0.00%
cutcp LOOPS 6.344% -0.01%
fft ALIASING 11.12% 0.00%
histo ARRAYS, SCOPE 23.53% 0.00%
mri-q – 0.00% 0.00%
sad ALIASING 4.17% 0.00%
tpacf ARRAYS, SCOPE 12.36% -0.02%
results: overheads
![Page 106: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/106.jpg)
Big Regions
106
problem labels
Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; really hurts loops
Problem #2: loop optimizations – boundaries in loops are bad for everyone – loop blocking, fission/fusion, inter-change, peeling, blocking, unrolling, scalarization, etc. can all help
Problem #3: large array structures – awareness of array access patterns can help
Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
(ALIASING)
(LOOPS)
(ARRAYS)
(SCOPE)
![Page 107: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/107.jpg)
107
ISA Sensitivity
SPEC INT SPEC FP PARSEC Parboil OVERALL0
5
10
15
20
perc
enta
ge o
verh
ead
x86-64 ARMv7
same configuration as take 1/3
x86-64 vs. ARMv7
![Page 108: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/108.jpg)
108
ISA Sensitivitygeneral purpose register (GPR) sensititivity
14-GPR 12-GPR 10-GPR take 2/302468
101214
ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT
perc
enta
ge o
verh
ead
![Page 109: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/109.jpg)
ISA Sensitivity
109
more registers isn’t always enough
x = 0;if (y > 0) x = 1;z = x + y;
C codeR0 = 0
if (R1 > 0)
R0 = 1
R2 = R0 + R1
![Page 110: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/110.jpg)
ISA Sensitivity
110
more registers isn’t always enough
R0 = 0
if (R1 > 0)
R3 = R0
x = 0;if (y > 0) x = 1;z = x + y;
C code
R3 = 1
R2 = R3 + R1
![Page 111: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/111.jpg)
111
GPU Exception Supportcompiler flow & hardware support
kernel source
source code
compilerIR
device code generator
partitioning preservation
idempotent device code
L2 cache
core
L1, TLB
general purpose registers RPCs
fetchFU
…
… decodeFU FU
compiler
hardware
![Page 112: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/112.jpg)
112
GPU Exception Supportexception live-lock and fast context switching
bonus: fast context switching – boundary locations are configurable at compile time – observation 1: save/restore only live state – observation 2: place boundaries to minimize liveness
exception live-lock – multiple recurring exceptions can cause live-lock – detection: save PC and compare – recovery: single-stepped re-execution or re-compilation
![Page 113: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/113.jpg)
113
CPU Exception Supportdesign simplification
idempotence OoO retirement – simplify result bypassing – simplifies exception support for long latency instructions – simplifies scheduling of variable-latency instructions
OoO issue?
![Page 114: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/114.jpg)
114
CPU Exception Supportdesign simplification
what about branch prediction, etc.? high re-execution costs; live-lock issues
registerpressure
detectionlatency
re-executiontime
!!!
region placement to minimize re-execution...?
![Page 115: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/115.jpg)
115
CPU Exception Support
SPEC INT SPEC FP PARSEC OVERALL0
5
10
15
20
25
minimizing branch re-execution costpe
rcen
tage
ove
rhea
d
9.1%
take 2/3 cut at branch
18.1%
![Page 116: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/116.jpg)
116
Hardware Fault Tolerancefault semantics
hardware fault model (fault semantics) – side-effects are temporally contained to region execution – side-effects are spatially contained to target resources – control flow is legal (follows static CFG edges)
![Page 117: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/117.jpg)
117
Related Workon idempotence
Very Related Year Domain
Sentinel Scheduling 1992 Speculative memory re-ordering
Reference Idempotency 2006 Reducing speculative storage
Restart Markers 2006 Virtual memory in vector machines
Encore 2011 Hardware fault recovery
Somewhat Related Year Domain
Multi-Instruction Retry 1995 Branch and hardware fault recovery
Atomic Heap Transactions 1999 Atomic memory allocation
![Page 118: Compiler Construction of Idempotent Regions and Applications in Architecture Design](https://reader035.vdocuments.net/reader035/viewer/2022081420/56816910550346895de02a52/html5/thumbnails/118.jpg)
118
Related Workon idempotence
what’s new? – idempotence model classification and analysis – first work to decompose entire programs – static analysis in terms of clobber (anti-)dependences – static analysis and code generation algorithms – overhead analysis: detection, pressure, re-execution – comprehensive (and general) compiler implementation – comprehensive compiler evaluation – a spectrum of architecture designs & applications