replay : a hardware framework for dynamic optimization * slides adapted from a talk by sanjay patel...

rePLay : A Hardware Framework for Dynamic Optimization

*Slides adapted from a talk by Sanjay Patel

C SA

2

rePLay as an evolution

Trace Caches[high bandwidth fetch]

Branch Characterization[correlation, predictability]

Dynamic Optimization[Dynamo and HW-based]

rePLay

Effective at capturing short dynamic traces

dynamically, few “unpredictable” branches

Good potential, but watch out for overhead

3

The rePLay Framework : supporting dynamic optimization

in the microarchitecture

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine

Frame Cache Sequencer

Completing instructions

recove

ry sup

port

• Optimization engine provides for concurrent optimization.• HW recovery support enables speculative optimization without requiring recovery code.

• Frame constructor builds atomic optimization regions.

• Frame cache and sequencer provide for quick dispatch.

4

rePLay Key Innovations

• Atomic optimization regions (frames)

• Assertion architecture

• Use of hardware rollback for dynamic, speculative optimization

5

FramesrePLay optimization regionsA

B

C

D

E

A

B

C

D

E

6

Dealing with side exits

C

BEQ R3, D

X D

C

D

ASSERT R3==0

7

Frames become building blocks

A

B

C

D

E

A

A

8

Why are atomic blocks desirable?

AtomicBlock

•Because they contain no intermediate branches, they permit sustained fetch

•Because the code is control-independent, they permit code reordering without requiring recovery code

•Also, they require less effort to optimize and schedule

inst A

inst B

inst B

9

Mechanisms

10

rePLay Microarchitecture

• Frame Construction

• Frame Sequencer

• Frame Cache

• Optimizer

11

Frame Constructor

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine



recove

ry sup

port

12

Frame Construction

• Objective 1: Frames should be long• Objective 2: Frames should completely execute• Objective 3: Should cover a significant fraction of

istream

• Approach: Use path-based history-based branch promotion to convert highly-biased branches into assertions.

13

The rePLay Frame Constructor

BranchBiasTable

PendingFrame

FrameConstruction

Buffer

incomingblock

from executionengine

Promotebranch toassertion

Keep enlargingframe if branchis promoted

History

14

Branch History forms Frame Context

X

Y

A

B

C

D

A

B

C

D

XY

15

Frame Construction Strategies[averaged across SPEC2000 integer benchmarks]

0

20

40

60

80

100

120

0 2 4 6 8 10 12 14

Path

Global

Static

History Length

Ave

in

stru

cti

on

s p

er

fram

e

16

Fraction of Dynamic Assertions[averaged across SPEC2000 integer benchmarks]

0%

20%

40%

60%

80%

100%

0 2 4 6 8 10 12 14

Assertions Fired

Assertions

Branches

Path History

18

Frame Construction Summary:With a 9-element path history

Frame Size 102 insts

Completion Rate

Coverage

Assertions/Frame

97%

82%

9

Unique Frames 6191

24

Optimization Engine

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine



recove

ry sup

port

25

The rePLay Optimization Engine

• Programmable, with local memory

• Hardware blocks to assist with dead code, etc…

• Possible implementation: optimizer running as a thread on an MT processor

Dead CodeDetector

OptimizationDatapath

OptEngineMemory

UnoptimizedFrame

FrameBuffer

OptimizedFrame

26

rePLay Optimizations

• Removal of Partially Dead Code• Reassociation• Constant Propagation• Common Subexpression Elimination• Subroutine In-lining• Fetch Scheduling

27

Example of Optimization[from gcc]

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r29,48,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

and r1,255,r30

cmpeq r30,29,r16

asst_i r16,EQ

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

beq r1, A

bne r1, B

ldq r1,24(r1)

beq r16, C

srl r16,48,r16

sll r29,48,r16

lda r29, 0(r1)

cmpeq r16,29,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

beq r0, D

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16, E

FrameConstr

FrameConstr OptimizerOptimizer

asst_i r1, NE

ldq r1,24(r1)

asst_i r1,NE

lda r29,0(r1)

28

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r29,48,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r1,48,r16

lda r0,0(r1)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r1)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

Reassociation

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r30

sll r1,48,r16

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

Common sub-expression

eliminationMiscellaneous

and r1,255,r30

cmpeq r30,29,r16

asst_i r16,EQ

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

asst_i r1, NE

ldq r1,24(r1)

asst_i r1,NE

lda r29,0(r1)

29

Experimental Results

30

Processor Configurations

• ICache (IC) • Simple Trace Cache (TC)• rePLay without Optimizations (RP)• rePLay with Optimizations (RPO)Fetch: 8-wide, 64KB of cache space for insts.Core: 8-wide OOO, 64KB DCache, 1 exe cluster

7 cycles (min) for branch resolution

Compaq Alpha Compiler used for optimizations (-O4)

31

0

50

100

150

200

250

300

350M

illio

ns

of

Cyc

les

FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles

8.7%

9.5%

15.5%

22.3%

7.1%

6.5%

14.4%

10.2%

34.9%

9.9%

18.3%

6.5%

Performance on statically optimized binaries

32

Performance on unoptimized binaries

0

50

100

150

200

250

300

350

bzp.

IC

bzp.

TC

bzp.

RP

bzp.

RPOcr

a.IC

cra.

TC

cra.

RP

cra.

RPO

eon.

IC

eon.

TC

eon.

RP

eon.

RPO

gap.

IC

gap.

TC

gap.

RP

gap.

RPO

gcc.I

C

gcc.T

C

gcc.R

P

gcc.R

PO

gzp.

IC

gzp.

TC

gzp.

RP

gzp.

RPO

mcf.

IC

mcf.

TC

mcf.

RP

mcf.

RPO

par.I

C

par.T

C

par.R

P

par.R

PO

per.I

C

per.T

C

per.R

P

per.R

POtw

f.IC

twf.T

C

twf.R

P

twf.R

POvo

r.IC

vor.T

C

vor.R

P

vor.R

POvp

r.IC

vpr.T

C

vpr.R

P

vpr.R

PO

Mill

ion

s o

f C

ycle

s

FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles

11.7%

11.8%

15.9%

26.2%

7.8%

10.5%

16.3%

11.9%

28.7%

10.5%

14.0%

15.4%

33

Some observations

• Optimizations boost performance on optimized code by 14% on average, 17% over IC and TC. In terms of effective IPC : 16% and 21%.

• Unoptimized code, over 20% (25% IPC) than IC, TC.

• In some cases, rePLay optimizer does better than the static optimizer (gap, mcf, perlbmk when comparing RP vs. RPO).

34

Effect of Individual Optimizations

-0.2

0

0.2

0.4

0.6

0.8

1

bzp cra eon gap gcc gzp par per tw f vor vpr

Re

lati

ve P

erf

orm

ance

dcr cp ra inlining misc cse fs

Dead code removal, constant prop, reassociation, constant subexp elim, fetch scheduling

35

Effects of Frame Length

0%

20%

40%

60%

80%

100%

120%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Percent of Maximum Frame Length

Rel

ativ

e P

erfo

rman

ce

bzp cra eon gap gcc gzp mcf par per twf vor vpr gap-RP

36

Note about fragile code

• Often caused by invalid accesses to memory – overwrites, uninitialized memory– barrier to optimizations on production code– our sampling of x86 code reveals little static optimization

• Because rePLay optimizations preserve architectural equivalence at frame boundaries, fragile bugs not exposed

• There is a tradeoff between exploiting semantic info (and exposing bugs) and aggressive optimization

37

Current Work

• Generalized Assertion Model

• x86-based rePLay Machine

• Idioms

38

Generalized Assertion Model

A

B

C

D

E

A

simple optscontrol flow assertions

A

speculative optsgeneral assertionsmemory assertions

40

The potential of rePLay on x86 code may be very significant

• Production code appears not to be highly optimized

• The larger granularity and small register set of x86 ISA provide ample opportunities for uop optimization

41

mov [ebp-1Ch],eax stl rax, -1c(rbp) mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) add ecx,[ebp-0Ch] ldl rt1, -0c(rbp) ldl rt1, -0c(rbp)

add rt1, rcx, rcx add rt1, rax, rcx mov [ebp-1Ch],ecx stl rcx, -1c(rbp) stl rcx, -1c(rbp) cmp [L00993A00],00000320h ldl rt1, M1 ldl rt1, M1

cmp rt1, 320, flags cmp rt1, 320, flags jnz L00406A89 asst nz, flags asst nz, flagsL00406A89: mov eax,[L00993A00] ldl rax, M1 sub eax,0000016Eh sub rax, 16e, rax sub rt1, 16e, rax cmp [ebp-1Ch],eax ldl rt1, -1c(rbp)

cmp rt1, rax, flags cmp rcx, rax, flags jge L00406AA0 asst ge, flags asst ge, flags mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) mov [ebp-24h],ecx stl rcx, -24(rbp) jmp L00406AAF L00406AA0: mov edx,[L00993A00] ldl rdx, M1 asst ne, M1, -24(rbp) sub edx,0000016Eh sub rdx, 16e, rdx mov [ebp-24h],edx stl rdx, -24(rbp) stl rax, -24(rbp)L00406AAF: mov eax,[ebp-24h] ldl rax, -24(rbp) mov [ebp-1Ch],eax stl rax, -1c(rbp) stl rax, -1c(rbp)L00406AB5: mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) push ecx stl rcx, 0(rsp) stl rax, 0(rsp)

sub rsp, 4, rsp mov edx,[ebp-04h] ldl rdx, -4(rbp) ldl rdx, -4(rbp) mov eax,[edx+000021D0h] ldl rax, 21d0(rdx) ldl rax, 21d0(rdx) mov ecx,[eax+08h] ldl rcx, 8(rax) ldl rcx, 8(rax) mov edx,[ebp-04h] ldl rdx, -4(rbp) lea ecx,[edx+ecx+000021D0h] add rcx, rdx, rt1 add rcx, rdx, rt1

add rt1, 21d0, rcx add rt1, 21d0, rcx call L00406FB0L00406FB0: push ebp stl rbp, 0(rsp) stl rbp, 4(rsp)

sub rsp, 4, rsp mov ebp,esp add rsp, 0, rbp add rsp, 8, rbp sub esp,00000044h sub rsp, 44, rsp sub rsp, 4c, rsp push ebx stl rbx, 0(rsp)

sub rsp, 4, rsp push esi stl rsi, 0(rsp)

sub rsp, 4, rsp push edi stl rdi, 0(rsp)

sub rsp, 4, rsp mov [ebp-04h],ecx stl rcx, -4(rbp) stl rcx, -4(rbp) mov eax,[ebp-04h] ldl rax, -4(rbp) add rcx, 0, rax mov ecx,[ebp+08h] ldl rcx, 8(rbp) ldl rcx, 8(rbp) mov [eax+2Ch],ecx stl rcx, 2c(rax) stl rcx, 2c(rax) pop edi ldl rdi, 0(rsp) asst range, 2c(rax), rsp

add rsp, 4, rsp pop esi ldl rsi, 0(rsp)

add rsp, 4, rsp pop ebx ldl rbx, 0(rsp)

add rsp, 4, rsp mov esp,ebp add rbp, 0, rsp add rbp, 0, rsp pop ebp ldl rbp, 0(rsp) ldl rbp, 4(rsp)

add rsp, 4, rsp retn 0004hL00406AD4: mov eax,[ebp-04h] ldl rax, -4(rbp) add rdx, 0, rax cmp [eax+00003168h],00000000h add rax, 3168, rt1 add rax, 3168, rt1

cmp rt1, 0, flags cmp rt1, 0, flags jnz L00406AED jnz flags, X jnz flags, X

42

Idioms : A Motivation

• The Frame Cache has many redundant basic blocks. The effective cache size is reduced.

• Traditional approaches to reducing this redundancy don’t work.

• Our solution: idioms. Extract and encode common segments of dataflow into smaller codewords.

43

What is an idiom?

An idiom is a connected dataflow graph of instructions from the dynamic instruction stream.

45

Example Idioms

46

Results and Analysis

47

Fetch Compression

Idiom Decoding logic expands the idiom’s codeword into the constituent

operations.

48

• Optimizer provides 15% reduction in cycles over basic rePLay, 20% over other configurations.

• Dead Code, Reassociation, Fetch Scheduling most important. Dead Code removes almost 1 in 5 instructions.

• High potential for x86 uop optimization using rePLay

• Small number of code idioms cover over a quarter of the I-stream. Many potential opportunities.

replay : a hardware framework for dynamic optimization * slides adapted from a talk by sanjay patel...

Documents

frame slide

b slide

speculative optimization

frame construction objective

frame construction strategies

frame construction summary

overhead slide

frame length distribution