replay : a hardware framework for dynamic optimization * slides adapted from a talk by sanjay patel...
Post on 22-Dec-2015
221 views
TRANSCRIPT
rePLay : A Hardware Framework for Dynamic Optimization
*Slides adapted from a talk by Sanjay Patel
C SA
2
rePLay as an evolution
Trace Caches[high bandwidth fetch]
Branch Characterization[correlation, predictability]
Dynamic Optimization[Dynamo and HW-based]
rePLay
Effective at capturing short dynamic traces
dynamically, few “unpredictable” branches
Good potential, but watch out for overhead
3
The rePLay Framework : supporting dynamic optimization
in the microarchitecture
FetchEngine
Execution Engine
Frame Constructor
OptimizationEngine
Frame Cache Sequencer
Completing instructions
recove
ry sup
port
• Optimization engine provides for concurrent optimization.• HW recovery support enables speculative optimization without requiring recovery code.
• Frame constructor builds atomic optimization regions.
• Frame cache and sequencer provide for quick dispatch.
4
rePLay Key Innovations
• Atomic optimization regions (frames)
• Assertion architecture
• Use of hardware rollback for dynamic, speculative optimization
5
FramesrePLay optimization regionsA
B
C
D
E
A
B
C
D
E
6
Dealing with side exits
C
BEQ R3, D
X D
C
D
ASSERT R3==0
7
Frames become building blocks
A
B
C
D
E
A
A
8
Why are atomic blocks desirable?
AtomicBlock
•Because they contain no intermediate branches, they permit sustained fetch
•Because the code is control-independent, they permit code reordering without requiring recovery code
•Also, they require less effort to optimize and schedule
inst A
inst B
inst B
9
Mechanisms
10
rePLay Microarchitecture
• Frame Construction
• Frame Sequencer
• Frame Cache
• Optimizer
11
Frame Constructor
FetchEngine
Execution Engine
Frame Constructor
OptimizationEngine
Frame Cache Sequencer
Completing instructions
recove
ry sup
port
12
Frame Construction
• Objective 1: Frames should be long• Objective 2: Frames should completely execute• Objective 3: Should cover a significant fraction of
istream
• Approach: Use path-based history-based branch promotion to convert highly-biased branches into assertions.
13
The rePLay Frame Constructor
BranchBiasTable
PendingFrame
FrameConstruction
Buffer
incomingblock
from executionengine
Promotebranch toassertion
Keep enlargingframe if branchis promoted
History
14
Branch History forms Frame Context
X
Y
A
B
C
D
A
B
C
D
XY
15
Frame Construction Strategies[averaged across SPEC2000 integer benchmarks]
0
20
40
60
80
100
120
0 2 4 6 8 10 12 14
Path
Global
Static
History Length
Ave
in
stru
cti
on
s p
er
fram
e
16
Fraction of Dynamic Assertions[averaged across SPEC2000 integer benchmarks]
0%
20%
40%
60%
80%
100%
0 2 4 6 8 10 12 14
Assertions Fired
Assertions
Branches
Path History
18
Frame Construction Summary:With a 9-element path history
Frame Size 102 insts
Completion Rate
Coverage
Assertions/Frame
97%
82%
9
Unique Frames 6191
24
Optimization Engine
FetchEngine
Execution Engine
Frame Constructor
OptimizationEngine
Frame Cache Sequencer
Completing instructions
recove
ry sup
port
25
The rePLay Optimization Engine
• Programmable, with local memory
• Hardware blocks to assist with dead code, etc…
• Possible implementation: optimizer running as a thread on an MT processor
Dead CodeDetector
OptimizationDatapath
OptEngineMemory
UnoptimizedFrame
FrameBuffer
OptimizedFrame
26
rePLay Optimizations
• Removal of Partially Dead Code• Reassociation• Constant Propagation• Common Subexpression Elimination• Subroutine In-lining• Fetch Scheduling
27
Example of Optimization[from gcc]
asst_i r1,NE
asst_i r16,EQ
srl r16,48,r16
sll r29,48,r16
lda r0,0(r29)
sll r0,48,r0
srl r0,48,r0
cmpeq r0,28,r0
asst_i r0,EQ
lda r16,0(r29)
sll r16,48,r16
srl r16,48,r16
cmpeq r16,27,r16
beq r16,0x14
ldq r1,24(r1)
asst_i r1, NE
lda r29,0(r1)
cmpeq r16,29,r16
and r1,255,r30
cmpeq r30,29,r16
asst_i r16,EQ
cmpeq r30,28,r0
asst_i r0,EQ
cmpeq r30,27,r16
beq r16,0x14
beq r1, A
bne r1, B
ldq r1,24(r1)
beq r16, C
srl r16,48,r16
sll r29,48,r16
lda r29, 0(r1)
cmpeq r16,29,r16
lda r0,0(r29)
sll r0,48,r0
srl r0,48,r0
cmpeq r0,28,r0
beq r0, D
lda r16,0(r29)
sll r16,48,r16
srl r16,48,r16
cmpeq r16,27,r16
beq r16, E
FrameConstr
FrameConstr OptimizerOptimizer
asst_i r1, NE
ldq r1,24(r1)
asst_i r1,NE
lda r29,0(r1)
28
asst_i r1,NE
asst_i r16,EQ
srl r16,48,r16
sll r29,48,r16
lda r0,0(r29)
sll r0,48,r0
srl r0,48,r0
cmpeq r0,28,r0
asst_i r0,EQ
lda r16,0(r29)
sll r16,48,r16
srl r16,48,r16
cmpeq r16,27,r16
beq r16,0x14
ldq r1,24(r1)
asst_i r1, NE
lda r29,0(r1)
cmpeq r16,29,r16
asst_i r1,NE
asst_i r16,EQ
srl r16,48,r16
sll r1,48,r16
lda r0,0(r1)
sll r0,48,r0
srl r0,48,r0
cmpeq r0,28,r0
asst_i r0,EQ
lda r16,0(r1)
sll r16,48,r16
srl r16,48,r16
cmpeq r16,27,r16
beq r16,0x14
ldq r1,24(r1)
asst_i r1, NE
lda r29,0(r1)
cmpeq r16,29,r16
Reassociation
asst_i r1,NE
asst_i r16,EQ
srl r16,48,r30
sll r1,48,r16
cmpeq r30,28,r0
asst_i r0,EQ
cmpeq r30,27,r16
beq r16,0x14
ldq r1,24(r1)
asst_i r1, NE
lda r29,0(r1)
cmpeq r16,29,r16
Common sub-expression
eliminationMiscellaneous
and r1,255,r30
cmpeq r30,29,r16
asst_i r16,EQ
cmpeq r30,28,r0
asst_i r0,EQ
cmpeq r30,27,r16
beq r16,0x14
asst_i r1, NE
ldq r1,24(r1)
asst_i r1,NE
lda r29,0(r1)
29
Experimental Results
30
Processor Configurations
• ICache (IC) • Simple Trace Cache (TC)• rePLay without Optimizations (RP)• rePLay with Optimizations (RPO)Fetch: 8-wide, 64KB of cache space for insts.Core: 8-wide OOO, 64KB DCache, 1 exe cluster
7 cycles (min) for branch resolution
Compaq Alpha Compiler used for optimizations (-O4)
31
0
50
100
150
200
250
300
350M
illio
ns
of
Cyc
les
FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles
8.7%
9.5%
15.5%
22.3%
7.1%
6.5%
14.4%
10.2%
34.9%
9.9%
18.3%
6.5%
Performance on statically optimized binaries
32
Performance on unoptimized binaries
0
50
100
150
200
250
300
350
bzp.
IC
bzp.
TC
bzp.
RP
bzp.
RPOcr
a.IC
cra.
TC
cra.
RP
cra.
RPO
eon.
IC
eon.
TC
eon.
RP
eon.
RPO
gap.
IC
gap.
TC
gap.
RP
gap.
RPO
gcc.I
C
gcc.T
C
gcc.R
P
gcc.R
PO
gzp.
IC
gzp.
TC
gzp.
RP
gzp.
RPO
mcf.
IC
mcf.
TC
mcf.
RP
mcf.
RPO
par.I
C
par.T
C
par.R
P
par.R
PO
per.I
C
per.T
C
per.R
P
per.R
POtw
f.IC
twf.T
C
twf.R
P
twf.R
POvo
r.IC
vor.T
C
vor.R
P
vor.R
POvp
r.IC
vpr.T
C
vpr.R
P
vpr.R
PO
Mill
ion
s o
f C
ycle
s
FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles
11.7%
11.8%
15.9%
26.2%
7.8%
10.5%
16.3%
11.9%
28.7%
10.5%
14.0%
15.4%
33
Some observations
• Optimizations boost performance on optimized code by 14% on average, 17% over IC and TC. In terms of effective IPC : 16% and 21%.
• Unoptimized code, over 20% (25% IPC) than IC, TC.
• In some cases, rePLay optimizer does better than the static optimizer (gap, mcf, perlbmk when comparing RP vs. RPO).
34
Effect of Individual Optimizations
-0.2
0
0.2
0.4
0.6
0.8
1
bzp cra eon gap gcc gzp par per tw f vor vpr
Re
lati
ve P
erf
orm
ance
dcr cp ra inlining misc cse fs
Dead code removal, constant prop, reassociation, constant subexp elim, fetch scheduling
35
Effects of Frame Length
0%
20%
40%
60%
80%
100%
120%
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Percent of Maximum Frame Length
Rel
ativ
e P
erfo
rman
ce
bzp cra eon gap gcc gzp mcf par per twf vor vpr gap-RP
36
Note about fragile code
• Often caused by invalid accesses to memory – overwrites, uninitialized memory– barrier to optimizations on production code– our sampling of x86 code reveals little static optimization
• Because rePLay optimizations preserve architectural equivalence at frame boundaries, fragile bugs not exposed
• There is a tradeoff between exploiting semantic info (and exposing bugs) and aggressive optimization
37
Current Work
• Generalized Assertion Model
• x86-based rePLay Machine
• Idioms
38
Generalized Assertion Model
A
B
C
D
E
A
simple optscontrol flow assertions
A
speculative optsgeneral assertionsmemory assertions
40
The potential of rePLay on x86 code may be very significant
• Production code appears not to be highly optimized
• The larger granularity and small register set of x86 ISA provide ample opportunities for uop optimization
41
mov [ebp-1Ch],eax stl rax, -1c(rbp) mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) add ecx,[ebp-0Ch] ldl rt1, -0c(rbp) ldl rt1, -0c(rbp)
add rt1, rcx, rcx add rt1, rax, rcx mov [ebp-1Ch],ecx stl rcx, -1c(rbp) stl rcx, -1c(rbp) cmp [L00993A00],00000320h ldl rt1, M1 ldl rt1, M1
cmp rt1, 320, flags cmp rt1, 320, flags jnz L00406A89 asst nz, flags asst nz, flagsL00406A89: mov eax,[L00993A00] ldl rax, M1 sub eax,0000016Eh sub rax, 16e, rax sub rt1, 16e, rax cmp [ebp-1Ch],eax ldl rt1, -1c(rbp)
cmp rt1, rax, flags cmp rcx, rax, flags jge L00406AA0 asst ge, flags asst ge, flags mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) mov [ebp-24h],ecx stl rcx, -24(rbp) jmp L00406AAF L00406AA0: mov edx,[L00993A00] ldl rdx, M1 asst ne, M1, -24(rbp) sub edx,0000016Eh sub rdx, 16e, rdx mov [ebp-24h],edx stl rdx, -24(rbp) stl rax, -24(rbp)L00406AAF: mov eax,[ebp-24h] ldl rax, -24(rbp) mov [ebp-1Ch],eax stl rax, -1c(rbp) stl rax, -1c(rbp)L00406AB5: mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) push ecx stl rcx, 0(rsp) stl rax, 0(rsp)
sub rsp, 4, rsp mov edx,[ebp-04h] ldl rdx, -4(rbp) ldl rdx, -4(rbp) mov eax,[edx+000021D0h] ldl rax, 21d0(rdx) ldl rax, 21d0(rdx) mov ecx,[eax+08h] ldl rcx, 8(rax) ldl rcx, 8(rax) mov edx,[ebp-04h] ldl rdx, -4(rbp) lea ecx,[edx+ecx+000021D0h] add rcx, rdx, rt1 add rcx, rdx, rt1
add rt1, 21d0, rcx add rt1, 21d0, rcx call L00406FB0L00406FB0: push ebp stl rbp, 0(rsp) stl rbp, 4(rsp)
sub rsp, 4, rsp mov ebp,esp add rsp, 0, rbp add rsp, 8, rbp sub esp,00000044h sub rsp, 44, rsp sub rsp, 4c, rsp push ebx stl rbx, 0(rsp)
sub rsp, 4, rsp push esi stl rsi, 0(rsp)
sub rsp, 4, rsp push edi stl rdi, 0(rsp)
sub rsp, 4, rsp mov [ebp-04h],ecx stl rcx, -4(rbp) stl rcx, -4(rbp) mov eax,[ebp-04h] ldl rax, -4(rbp) add rcx, 0, rax mov ecx,[ebp+08h] ldl rcx, 8(rbp) ldl rcx, 8(rbp) mov [eax+2Ch],ecx stl rcx, 2c(rax) stl rcx, 2c(rax) pop edi ldl rdi, 0(rsp) asst range, 2c(rax), rsp
add rsp, 4, rsp pop esi ldl rsi, 0(rsp)
add rsp, 4, rsp pop ebx ldl rbx, 0(rsp)
add rsp, 4, rsp mov esp,ebp add rbp, 0, rsp add rbp, 0, rsp pop ebp ldl rbp, 0(rsp) ldl rbp, 4(rsp)
add rsp, 4, rsp retn 0004hL00406AD4: mov eax,[ebp-04h] ldl rax, -4(rbp) add rdx, 0, rax cmp [eax+00003168h],00000000h add rax, 3168, rt1 add rax, 3168, rt1
cmp rt1, 0, flags cmp rt1, 0, flags jnz L00406AED jnz flags, X jnz flags, X
42
Idioms : A Motivation
• The Frame Cache has many redundant basic blocks. The effective cache size is reduced.
• Traditional approaches to reducing this redundancy don’t work.
• Our solution: idioms. Extract and encode common segments of dataflow into smaller codewords.
43
What is an idiom?
An idiom is a connected dataflow graph of instructions from the dynamic instruction stream.
45
Example Idioms
46
Results and Analysis
47
Fetch Compression
Idiom Decoding logic expands the idiom’s codeword into the constituent
operations.
48
• Optimizer provides 15% reduction in cycles over basic rePLay, 20% over other configurations.
• Dead Code, Reassociation, Fetch Scheduling most important. Dead Code removes almost 1 in 5 instructions.
• High potential for x86 uop optimization using rePLay
• Small number of code idioms cover over a quarter of the I-stream. Many potential opportunities.