replay : a hardware framework for dynamic optimization * slides adapted from a talk by sanjay patel...

40
rePLay : A Hardware Framework for Dynamic Optimization *Slides adapted from a talk by Sanjay Patel C S A

Post on 22-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

rePLay : A Hardware Framework for Dynamic Optimization

*Slides adapted from a talk by Sanjay Patel

C SA

Page 2: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

2

rePLay as an evolution

Trace Caches[high bandwidth fetch]

Branch Characterization[correlation, predictability]

Dynamic Optimization[Dynamo and HW-based]

rePLay

Effective at capturing short dynamic traces

dynamically, few “unpredictable” branches

Good potential, but watch out for overhead

Page 3: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

3

The rePLay Framework : supporting dynamic optimization

in the microarchitecture

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine

Frame Cache Sequencer

Completing instructions

recove

ry sup

port

• Optimization engine provides for concurrent optimization.• HW recovery support enables speculative optimization without requiring recovery code.

• Frame constructor builds atomic optimization regions.

• Frame cache and sequencer provide for quick dispatch.

Page 4: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

4

rePLay Key Innovations

• Atomic optimization regions (frames)

• Assertion architecture

• Use of hardware rollback for dynamic, speculative optimization

Page 5: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

5

FramesrePLay optimization regionsA

B

C

D

E

A

B

C

D

E

Page 6: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

6

Dealing with side exits

C

BEQ R3, D

X D

C

D

ASSERT R3==0

Page 7: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

7

Frames become building blocks

A

B

C

D

E

A

A

Page 8: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

8

Why are atomic blocks desirable?

AtomicBlock

•Because they contain no intermediate branches, they permit sustained fetch

•Because the code is control-independent, they permit code reordering without requiring recovery code

•Also, they require less effort to optimize and schedule

inst A

inst B

inst B

Page 9: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

9

Mechanisms

Page 10: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

10

rePLay Microarchitecture

• Frame Construction

• Frame Sequencer

• Frame Cache

• Optimizer

Page 11: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

11

Frame Constructor

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine

Frame Cache Sequencer

Completing instructions

recove

ry sup

port

Page 12: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

12

Frame Construction

• Objective 1: Frames should be long• Objective 2: Frames should completely execute• Objective 3: Should cover a significant fraction of

istream

• Approach: Use path-based history-based branch promotion to convert highly-biased branches into assertions.

Page 13: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

13

The rePLay Frame Constructor

BranchBiasTable

PendingFrame

FrameConstruction

Buffer

incomingblock

from executionengine

Promotebranch toassertion

Keep enlargingframe if branchis promoted

History

Page 14: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

14

Branch History forms Frame Context

X

Y

A

B

C

D

A

B

C

D

XY

Page 15: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

15

Frame Construction Strategies[averaged across SPEC2000 integer benchmarks]

0

20

40

60

80

100

120

0 2 4 6 8 10 12 14

Path

Global

Static

History Length

Ave

in

stru

cti

on

s p

er

fram

e

Page 16: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

16

Fraction of Dynamic Assertions[averaged across SPEC2000 integer benchmarks]

0%

20%

40%

60%

80%

100%

0 2 4 6 8 10 12 14

Assertions Fired

Assertions

Branches

Path History

Page 17: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

18

Frame Construction Summary:With a 9-element path history

Frame Size 102 insts

Completion Rate

Coverage

Assertions/Frame

97%

82%

9

Unique Frames 6191

Page 18: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

24

Optimization Engine

FetchEngine

Execution Engine

Frame Constructor

OptimizationEngine

Frame Cache Sequencer

Completing instructions

recove

ry sup

port

Page 19: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

25

The rePLay Optimization Engine

• Programmable, with local memory

• Hardware blocks to assist with dead code, etc…

• Possible implementation: optimizer running as a thread on an MT processor

Dead CodeDetector

OptimizationDatapath

OptEngineMemory

UnoptimizedFrame

FrameBuffer

OptimizedFrame

Page 20: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

26

rePLay Optimizations

• Removal of Partially Dead Code• Reassociation• Constant Propagation• Common Subexpression Elimination• Subroutine In-lining• Fetch Scheduling

Page 21: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

27

Example of Optimization[from gcc]

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r29,48,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

and r1,255,r30

cmpeq r30,29,r16

asst_i r16,EQ

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

beq r1, A

bne r1, B

ldq r1,24(r1)

beq r16, C

srl r16,48,r16

sll r29,48,r16

lda r29, 0(r1)

cmpeq r16,29,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

beq r0, D

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16, E

FrameConstr

FrameConstr OptimizerOptimizer

asst_i r1, NE

ldq r1,24(r1)

asst_i r1,NE

lda r29,0(r1)

Page 22: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

28

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r29,48,r16

lda r0,0(r29)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r29)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r16

sll r1,48,r16

lda r0,0(r1)

sll r0,48,r0

srl r0,48,r0

cmpeq r0,28,r0

asst_i r0,EQ

lda r16,0(r1)

sll r16,48,r16

srl r16,48,r16

cmpeq r16,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

Reassociation

asst_i r1,NE

asst_i r16,EQ

srl r16,48,r30

sll r1,48,r16

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

ldq r1,24(r1)

asst_i r1, NE

lda r29,0(r1)

cmpeq r16,29,r16

Common sub-expression

eliminationMiscellaneous

and r1,255,r30

cmpeq r30,29,r16

asst_i r16,EQ

cmpeq r30,28,r0

asst_i r0,EQ

cmpeq r30,27,r16

beq r16,0x14

asst_i r1, NE

ldq r1,24(r1)

asst_i r1,NE

lda r29,0(r1)

Page 23: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

29

Experimental Results

Page 24: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

30

Processor Configurations

• ICache (IC) • Simple Trace Cache (TC)• rePLay without Optimizations (RP)• rePLay with Optimizations (RPO)Fetch: 8-wide, 64KB of cache space for insts.Core: 8-wide OOO, 64KB DCache, 1 exe cluster

7 cycles (min) for branch resolution

Compaq Alpha Compiler used for optimizations (-O4)

Page 25: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

31

0

50

100

150

200

250

300

350M

illio

ns

of

Cyc

les

FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles

8.7%

9.5%

15.5%

22.3%

7.1%

6.5%

14.4%

10.2%

34.9%

9.9%

18.3%

6.5%

Performance on statically optimized binaries

Page 26: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

32

Performance on unoptimized binaries

0

50

100

150

200

250

300

350

bzp.

IC

bzp.

TC

bzp.

RP

bzp.

RPOcr

a.IC

cra.

TC

cra.

RP

cra.

RPO

eon.

IC

eon.

TC

eon.

RP

eon.

RPO

gap.

IC

gap.

TC

gap.

RP

gap.

RPO

gcc.I

C

gcc.T

C

gcc.R

P

gcc.R

PO

gzp.

IC

gzp.

TC

gzp.

RP

gzp.

RPO

mcf.

IC

mcf.

TC

mcf.

RP

mcf.

RPO

par.I

C

par.T

C

par.R

P

par.R

PO

per.I

C

per.T

C

per.R

P

per.R

POtw

f.IC

twf.T

C

twf.R

P

twf.R

POvo

r.IC

vor.T

C

vor.R

P

vor.R

POvp

r.IC

vpr.T

C

vpr.R

P

vpr.R

PO

Mill

ion

s o

f C

ycle

s

FrameCycles NormalCycles WrongPathCycles AssertCycles StalledCycles FetchMissCycles

11.7%

11.8%

15.9%

26.2%

7.8%

10.5%

16.3%

11.9%

28.7%

10.5%

14.0%

15.4%

Page 27: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

33

Some observations

• Optimizations boost performance on optimized code by 14% on average, 17% over IC and TC. In terms of effective IPC : 16% and 21%.

• Unoptimized code, over 20% (25% IPC) than IC, TC.

• In some cases, rePLay optimizer does better than the static optimizer (gap, mcf, perlbmk when comparing RP vs. RPO).

Page 28: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

34

Effect of Individual Optimizations

-0.2

0

0.2

0.4

0.6

0.8

1

bzp cra eon gap gcc gzp par per tw f vor vpr

Re

lati

ve P

erf

orm

ance

dcr cp ra inlining misc cse fs

Dead code removal, constant prop, reassociation, constant subexp elim, fetch scheduling

Page 29: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

35

Effects of Frame Length

0%

20%

40%

60%

80%

100%

120%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Percent of Maximum Frame Length

Rel

ativ

e P

erfo

rman

ce

bzp cra eon gap gcc gzp mcf par per twf vor vpr gap-RP

Page 30: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

36

Note about fragile code

• Often caused by invalid accesses to memory – overwrites, uninitialized memory– barrier to optimizations on production code– our sampling of x86 code reveals little static optimization

• Because rePLay optimizations preserve architectural equivalence at frame boundaries, fragile bugs not exposed

• There is a tradeoff between exploiting semantic info (and exposing bugs) and aggressive optimization

Page 31: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

37

Current Work

• Generalized Assertion Model

• x86-based rePLay Machine

• Idioms

Page 32: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

38

Generalized Assertion Model

A

B

C

D

E

A

simple optscontrol flow assertions

A

speculative optsgeneral assertionsmemory assertions

Page 33: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

40

The potential of rePLay on x86 code may be very significant

• Production code appears not to be highly optimized

• The larger granularity and small register set of x86 ISA provide ample opportunities for uop optimization

Page 34: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

41

mov [ebp-1Ch],eax stl rax, -1c(rbp) mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) add ecx,[ebp-0Ch] ldl rt1, -0c(rbp) ldl rt1, -0c(rbp)

add rt1, rcx, rcx add rt1, rax, rcx mov [ebp-1Ch],ecx stl rcx, -1c(rbp) stl rcx, -1c(rbp) cmp [L00993A00],00000320h ldl rt1, M1 ldl rt1, M1

cmp rt1, 320, flags cmp rt1, 320, flags jnz L00406A89 asst nz, flags asst nz, flagsL00406A89: mov eax,[L00993A00] ldl rax, M1 sub eax,0000016Eh sub rax, 16e, rax sub rt1, 16e, rax cmp [ebp-1Ch],eax ldl rt1, -1c(rbp)

cmp rt1, rax, flags cmp rcx, rax, flags jge L00406AA0 asst ge, flags asst ge, flags mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) mov [ebp-24h],ecx stl rcx, -24(rbp) jmp L00406AAF L00406AA0: mov edx,[L00993A00] ldl rdx, M1 asst ne, M1, -24(rbp) sub edx,0000016Eh sub rdx, 16e, rdx mov [ebp-24h],edx stl rdx, -24(rbp) stl rax, -24(rbp)L00406AAF: mov eax,[ebp-24h] ldl rax, -24(rbp) mov [ebp-1Ch],eax stl rax, -1c(rbp) stl rax, -1c(rbp)L00406AB5: mov ecx,[ebp-1Ch] ldl rcx, -1c(rbp) push ecx stl rcx, 0(rsp) stl rax, 0(rsp)

sub rsp, 4, rsp mov edx,[ebp-04h] ldl rdx, -4(rbp) ldl rdx, -4(rbp) mov eax,[edx+000021D0h] ldl rax, 21d0(rdx) ldl rax, 21d0(rdx) mov ecx,[eax+08h] ldl rcx, 8(rax) ldl rcx, 8(rax) mov edx,[ebp-04h] ldl rdx, -4(rbp) lea ecx,[edx+ecx+000021D0h] add rcx, rdx, rt1 add rcx, rdx, rt1

add rt1, 21d0, rcx add rt1, 21d0, rcx call L00406FB0L00406FB0: push ebp stl rbp, 0(rsp) stl rbp, 4(rsp)

sub rsp, 4, rsp mov ebp,esp add rsp, 0, rbp add rsp, 8, rbp sub esp,00000044h sub rsp, 44, rsp sub rsp, 4c, rsp push ebx stl rbx, 0(rsp)

sub rsp, 4, rsp push esi stl rsi, 0(rsp)

sub rsp, 4, rsp push edi stl rdi, 0(rsp)

sub rsp, 4, rsp mov [ebp-04h],ecx stl rcx, -4(rbp) stl rcx, -4(rbp) mov eax,[ebp-04h] ldl rax, -4(rbp) add rcx, 0, rax mov ecx,[ebp+08h] ldl rcx, 8(rbp) ldl rcx, 8(rbp) mov [eax+2Ch],ecx stl rcx, 2c(rax) stl rcx, 2c(rax) pop edi ldl rdi, 0(rsp) asst range, 2c(rax), rsp

add rsp, 4, rsp pop esi ldl rsi, 0(rsp)

add rsp, 4, rsp pop ebx ldl rbx, 0(rsp)

add rsp, 4, rsp mov esp,ebp add rbp, 0, rsp add rbp, 0, rsp pop ebp ldl rbp, 0(rsp) ldl rbp, 4(rsp)

add rsp, 4, rsp retn 0004hL00406AD4: mov eax,[ebp-04h] ldl rax, -4(rbp) add rdx, 0, rax cmp [eax+00003168h],00000000h add rax, 3168, rt1 add rax, 3168, rt1

cmp rt1, 0, flags cmp rt1, 0, flags jnz L00406AED jnz flags, X jnz flags, X

Page 35: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

42

Idioms : A Motivation

• The Frame Cache has many redundant basic blocks. The effective cache size is reduced.

• Traditional approaches to reducing this redundancy don’t work.

• Our solution: idioms. Extract and encode common segments of dataflow into smaller codewords.

Page 36: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

43

What is an idiom?

An idiom is a connected dataflow graph of instructions from the dynamic instruction stream.

Page 37: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

45

Example Idioms

Page 38: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

46

Results and Analysis

Page 39: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

47

Fetch Compression

Idiom Decoding logic expands the idiom’s codeword into the constituent

operations.

Page 40: RePLay : A Hardware Framework for Dynamic Optimization * Slides adapted from a talk by Sanjay Patel CSA

48

• Optimizer provides 15% reduction in cycles over basic rePLay, 20% over other configurations.

• Dead Code, Reassociation, Fetch Scheduling most important. Dead Code removes almost 1 in 5 instructions.

• High potential for x86 uop optimization using rePLay

• Small number of code idioms cover over a quarter of the I-stream. Many potential opportunities.