TRANSCRIPT
Replication & Consolidation
— a grab bag of transformations —
COMP 512, Rice University
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
The Comp 512 Taxonomy

Machine Independent
• Redundancy: LVN, SVN, DVNT; GCSE; Redundant Store Elimination
• Dead code: Dead code elimination; Clean; Simplification of algebraic identities
• Code motion: Constant Folding
• Create opportunities: Loop Unrolling; Inline Substitution
• Specialization: Loop Unrolling; Loop Fusion (unroll & jam); Inline Substitution
The Comp 512 Taxonomy

Machine Dependent
• Hide latency: Block Placement
• Manage resources: Allocation (registers, TLB slots); Scheduling; Copy coalescing (O-O-SSA)
• Special features: Instruction selection

We have seen many fewer examples of machine-dependent transformations. Most of these were in the undergraduate class.

Analysis techniques:
• DFA: DOM, LIVE, AVAIL, VERYBUSY, REACHES, CONSTANTS
• Plus dominance frontiers & the SSA construction
Today’s lecture: Replication and consolidation
Looking at Specialization in More Depth
Major impediments to specialization are
• Lack of specific compile-time knowledge about values
• Presence of multiple conflicting contexts for the code
Code replication can mitigate both problems
Replication
• Lets the compiler tailor code for a more constrained context
• Creates new points in the code where particular facts are true
• Replication can take many forms
Perkin-Elmer’s Universal Optimizing Compiler, Fortran VIIz
• Inlined every call (no recursion in Fortran)
• Subjected result to global optimization
• Speedups to 4x; compile times to match
Concrete Example
By cloning parts of the program, the compiler can create circumstances where new or important facts hold true
• Known values or known properties (value of x)
• Simplified control flow (eliminate branches & combine blocks)
• Better opportunities for optimization
• However, larger code is a potential problem
[Figure: three predecessor blocks assign x ← 7, x ← -2, and x ← 13; all three flow into the test "if x > 0 then fee( … ) else fie( … )". After cloning the test into each predecessor, the x ← 7 and x ← 13 paths go straight to fee( … ) and the x ← -2 path goes straight to fie( … ); Clean will combine these blocks. A C sketch of the effect follows below.]
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Need to look at DF info. DC: Simple.
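A minimal C sketch of the idea, with a hypothetical caller and hypothetical callees fee and fie (only the values of x come from the figure): once the test is cloned into each predecessor, the sign of x is a known fact on every path, each copy of the test folds away, and Clean combines the surviving blocks.

void fee(int x);                    /* hypothetical callees */
void fie(int x);

/* Before cloning: three paths merge before the test, so the
   compiler cannot fold the branch. */
void before(int which) {
    int x;
    if (which == 0)      x = 7;
    else if (which == 1) x = -2;
    else                 x = 13;
    if (x > 0) fee(x);              /* sign of x unknown here */
    else       fie(x);
}

/* After cloning the test into each predecessor: each cloned test
   sees a constant x and folds; Clean combines the blocks. */
void after(int which) {
    if (which == 0)      fee(7);    /* x = 7:  x > 0 is true  */
    else if (which == 1) fie(-2);   /* x = -2: x > 0 is false */
    else                 fee(13);   /* x = 13: x > 0 is true  */
}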
Superblock Cloning
Superblock cloning creates maximal-length basic blocks.

[Figure: the CFG of Figure 8.6 in EaC2e, with an added back edge (blocks A, B, C, D, E, F, G); after cloning to form superblocks, the graph holds 2 copies of F & 3 copies of G. (Hwu et al., [201] in EaC2e)]
Superblock Cloning
General idea:
• At each merge point, clone the successor to avoid the join point
  - Creates a single-predecessor block
  - Merge the new block with its predecessor
  - Stop at a back edge (w.r.t. the DFST)
• New opportunities
  - Creates long basic blocks for strong local algorithms
  - The control predicate may provide additional facts
  - Eliminates jumps & their overhead (minor effect)

[Figure: a diamond CFG in which A branches on x, with x > 0 leading to B and x ≤ 0 leading to C; B and C merge at D, so the test "x ? 0" tells us nothing in D. After cloning D, each copy has a single predecessor, and we now know sign(x) in D.]
Widely used in the literature (scheduling, Dynamo)
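The general idea can be sketched in C as a tail-duplication pass over a toy CFG; the representation, the two-successor limit, and the names here are mine, not the lecture's. At each join point reached by a forward edge, the pass gives the predecessor its own copy of the block (which a merge pass would then fold into it); edges whose targets do not have larger reverse-postorder numbers are treated as back edges and left alone.

#include <stdio.h>

enum { MAXB = 64, NONE = -1 };

static int succ[MAXB][2];      /* up to two successors per block */
static int npred[MAXB];        /* predecessor counts             */
static int rpo[MAXB];          /* reverse-postorder numbers      */
static int nblocks;

/* Copy block s; the copy inherits s's successors & rpo number. */
static int clone(int s) {
    int c = nblocks++;
    for (int k = 0; k < 2; k++) {
        succ[c][k] = succ[s][k];
        if (succ[c][k] != NONE) npred[succ[c][k]]++;
    }
    rpo[c] = rpo[s];
    return c;
}

/* At each join point, give the predecessor a private copy of the
   successor; stop at back edges (target rpo not larger). */
static void superblock_clone(void) {
    for (int b = 0; b < nblocks; b++)        /* visits clones too */
        for (int k = 0; k < 2; k++) {
            int s = succ[b][k];
            if (s == NONE || rpo[s] <= rpo[b]) continue;  /* back edge */
            if (npred[s] > 1) {                           /* join point */
                succ[b][k] = clone(s);
                npred[s]--;
                npred[succ[b][k]] = 1;
            }
        }
}

int main(void) {
    /* The diamond from the next slide: A->B, A->C, B->D, C->D,
       with A = 0, B = 1, C = 2, D = 3. */
    for (int b = 0; b < MAXB; b++) succ[b][0] = succ[b][1] = NONE;
    nblocks = 4;
    succ[0][0] = 1; succ[0][1] = 2;
    succ[1][0] = 3; succ[2][0] = 3;
    npred[1] = npred[2] = 1; npred[3] = 2;
    for (int b = 0; b < 4; b++) rpo[b] = b;
    superblock_clone();
    printf("%d blocks after cloning\n", nblocks);   /* prints 5 */
    return 0;
}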
Superblock Cloning
[Figure: the diamond CFG before cloning (A branches to B & C, which merge at D) and after cloning (B and C each have a private copy of D).]

Sources of Improvement
• Longer blocks
  - Better local optimization
  - More ops per branch
• Longer EBBs (extended basic blocks)
  - Join points replicated
  - New blocks with new facts

What about code size?
• I-cache locality: if control stays in one EBB, locality may improve
• Virtual memory: worse, unless the compiler finds lots of optimization
Superblock Cloning
[Figure: the same diamond CFG, before and after cloning D.]
Value Numbering
• DVNT on the original graph gets the EBBs (A,B) and (A,C)
• DVNT on the cloned graph gets (A,B,D) and (A,C,D)
• Often leads to improvement
Scheduling
• Local scheduling gets larger region
• EBB scheduling extends to D and tailors it to (A,B) & (A,C)
• Some code growth offset by less compensation code
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Find loops (back edges). DC: Simple.
Fall-through Branch Optimization
while ( … ) {
  if ( expr )
    then block 1
    else block 2
}

[Figure: inside the loop, a branch on expr selects block 1 or block 2; one of the two is the fall-through (FT) case.]
Some branches have inter-iteration locality
• Taken this time makes taken next more likely
• Clone to make FT case more likely
• This version falls through while the condition stays the same, and switches loop copies when expr changes
• Hopkins suggests that it paid off in PL.8
• Predication eliminates it completely
[Figure: two cloned copies of the loop. In one copy expr is the FT case, so the branch falls through to block 1; in the other, not expr is the FT case, so it falls through to block 2. A change in expr jumps control to the other copy.]
Wienskoski carried this idea to its logical conclusion, using replication to improve software pipelining in the presence of control flow. This simple form resembles software branch prediction — and hardware is pretty good at that …
Safety: Code on path is same. Profit: Lower branch cost. Opp: Branches in loops. DC: Simple.
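A C sketch of the cloned loop, where expr, block1, and block2 are hypothetical stand-ins for the slide's condition and blocks: each copy of the loop is laid out so that its own condition is the fall-through case, and control jumps to the other copy only when expr changes value.

static int  expr(int i)   { return (i % 8) < 6; }  /* hypothetical condition with runs */
static void block1(int i) { (void)i; /* work for the true case  */ }
static void block2(int i) { (void)i; /* work for the false case */ }

/* Original loop: the branch direction changes with expr, so one
   of the two paths is a taken branch every iteration. */
void plain(int n) {
    for (int i = 0; i < n; i++) {
        if (expr(i)) block1(i);
        else         block2(i);
    }
}

/* Cloned loop: while expr stays true, copy 1 falls through to
   block1; while it stays false, copy 2 falls through to block2.
   Only a change in expr takes a jump. */
void cloned(int n) {
    int i = 0;
loop_true:                         /* copy 1: expr is the FT case  */
    while (i < n) {
        if (!expr(i)) goto to_false;
        block1(i); i++;
    }
    return;
to_false:                          /* copy 2: !expr is the FT case */
    block2(i); i++;
    while (i < n) {
        if (expr(i)) goto to_true;
        block2(i); i++;
    }
    return;
to_true:
    block1(i); i++;
    goto loop_true;
}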
Loop Unrolling
Replicate loop body & adjust loop header
Unrolling by a factor of four gives
• Reduction in loop-end tests
• Reduction in number of branches (predictable, but executed)
• Reduction in ratio of delay slots to useful work
Other benefits
• Loop-ending copies can be eliminated (Kennedy thesis)
• Common subexpressions in address calculations
for i = 1 to n
  a[i] = a[i] * b[i]

becomes

for i = 1 to n by 4
  a[i]   = a[i]   * b[i]
  a[i+1] = a[i+1] * b[i+1]
  a[i+2] = a[i+2] * b[i+2]
  a[i+3] = a[i+3] * b[i+3]
Assume mod(n,4) = 0
More complex cases in Fig. 10.12
Dasgupta’s thesis has an extreme example …
Safety: Same ops in body. Profit: Lower overhead + indirect. Opp: Any loop. DC: Choosing the factor is hard.
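A C sketch of the same loop that drops the mod(n,4) = 0 assumption (the general cases are the subject of Fig. 10.12): the unrolled loop handles groups of four, and a short cleanup loop handles the leftover iterations.

/* Unrolled by four: one loop-end test and one branch per four
   elements of useful work. */
void scale(int n, double a[], double b[]) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        a[i]   = a[i]   * b[i];
        a[i+1] = a[i+1] * b[i+1];
        a[i+2] = a[i+2] * b[i+2];
        a[i+3] = a[i+3] * b[i+3];
    }
    for (; i < n; i++)             /* cleanup: the n mod 4 leftovers */
        a[i] = a[i] * b[i];
}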
Inline Substitution
Replace a procedure call with the body of the called procedure
• Textual substitution to create the effects of parameter binding
• Private copy of code can be tailored to the call site's context
  - Constants, unambiguous pointers, aliases, …
• Eliminates overhead of the procedure call
  - Register save & restore
  - Disruption of call & return
• Eliminates benefits of the procedure call
  - A call resets the state of the register allocator
  - The procedure abstraction keeps the name space small
• Usually assumed to be profitable, although studies disagree …
The oldest interprocedural optimization [Ershov 1966]
This relates to OOLs as well, with their sky-high ratio of overhead to work and the difficulty of converting virtual calls to concrete calls.
Inlining one call can reveal the relevant class for another …
Safety: Paths execute the same programmer-specified ops. Profit: Lower overhead + indirect. Opp: Any call. DC: Complex problem.
Inline Substitution
Example
[Figure: a call graph in which fee and fie each contain a call to foe.]
Potential for exponential growth
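A C sketch with the slide's call graph (the procedure bodies are hypothetical): fee and fie both call foe, and after inlining each holds a private, tailorable copy of foe's body.

int foe(int x) { return x * x + 1; }      /* hypothetical body */

int fee(int a) { return foe(a) + 2; }     /* before inlining   */
int fie(int a) { return foe(a) - 2; }

/* After inlining foe at both call sites: parameter binding is done
   by textual substitution, and each copy can be specialized to its
   own context. */
int fee_inlined(int a) { int t = a * a + 1; return t + 2; }
int fie_inlined(int a) { int t = a * a + 1; return t - 2; }

If fee were itself inlined into several callers, each of those callers would receive its own copy of foe's body as well; that compounding is where the exponential growth comes from.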
Inline Substitution — After inlining foe
[Figure: after inlining, fee and fie each contain a private copy of foe's body.]
Procedure Cloning
Implement multiple copies of a procedure & tailor them to different calling environments
• Idea is to gain some of inlining’s benefit with limited growth
• Careful assignment of calls to clones can improve DFA results
Why clone procedures?
• Conservative alternative to inlining
  - Split based on deterioration of forward data-flow sets
  - Limit growth to cases where knowledge improves
• Avoids pitfalls of inlining
  - Code growth & compile time (2x in the CHT study)
  - Deterioration in global optimization (CHT study)
  - Recompilation problems

In 1985, this idea generated interest as a technique for implementing Ada generics
Procedure Cloning
The Concept
[Figure: a call graph in which main calls a, b, c, & d; each of those calls solver; and solver calls helper1 & helper2.]

Assume that a, b, & d have unit-stride access to memory, but c does not.
• Maybe solver and helper1 can be specialized for unit stride
• Would like to stop c from preventing that specialization
• Isolate c with its own copy of the routines
• If solver is in a library, we might implement solver/helper1 in several pre-packaged ways
  → Telescoping-languages strategy
  → Connect call sites to appropriate implementations
  → Move the optimization cost to library-preparation time
• Avoid most of the code growth that would accrue with inlining
Procedure Cloning
The Concept
[Figure: the same call graph after cloning. Calls from a, b, & d are connected to solver' and helper1', clones optimized for unit-stride access; c still uses the general versions of solver & helper1.]
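A C sketch of the cloned library routine; the signatures and the explicit stride parameter are hypothetical. The clone solver_unit fixes the stride at 1, so the subscript arithmetic folds and the accesses become contiguous; calls from a, b, & d would be connected to it, while c keeps the general version.

/* General version: strided access, used by c. */
double solver(const double *v, int n, int stride) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i * stride];      /* a multiply in every subscript */
    return sum;
}

/* Clone specialized for unit stride, used by a, b, & d: with
   stride fixed at 1, the multiply folds away and the loop walks
   memory contiguously, which helps later optimization. */
double solver_unit(const double *v, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}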
Procedure Cloning
Practical algorithms
• Clone on forward constants
  - Metzger & Stroud, in the Convex compiler (ACM LOPLAS, 1993)
  - Significant increase in constants found & folded
• Hall et al. gave a complex algorithm for score-based cloning
• Clone to sharpen analysis (useful as an analysis tool)
  - Join points combine & lose information
  - Replicate for analysis & keep the result if profitable
• The idea has been applied in partial evaluation, and in the compilation of APL, Smalltalk-80, and SELF
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Need to look at DF info. DC: Complex (code size).
Consolidation
Consolidation is the opposite of replication
• Find common code sequences & replace with shared code
• Reduce code size
We will see several consolidation transformations
• Procedure abstraction: finding common code sequences & creating procedures to hold them
• Hoisting: replacing multiple instances with one, earlier in the CFG
• Sinking: replacing multiple instances with one, later in the CFG
VERYBUSY expressions
Consolidation
Procedure Abstraction
If replication can enable specialization, abstraction can undo the negative effects of excess replication, whether done by the programmer or the compiler
• Pattern matching to identify common sequences
  - Abstract across register names and branch-target names
• Replace each common sequence with an inexpensive call/return
  - Use the same register names ⇒ only need a return address
  - Common sequence ⇒ the storage map is the same, too

Procedure abstraction was originally proposed as a means of reducing working-set sizes; the goal was to reduce the virtual-memory demands of timesharing.
Safety: Code on path is same. Profit: Smaller code. Opp: Pattern-matching problem. DC: Simple.
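Procedure abstraction works on assembly code, matching across register and branch-target names, but its effect can be sketched at source level; all the names below are hypothetical. One copy of the repeated sequence survives, reached by an inexpensive call.

static int g;                       /* hypothetical shared state */

/* Before: the same three-op sequence appears in two routines. */
void update1(int a, int b) { g = a * b; g += a; g -= b; }
void update2(int a, int b) { g = a * b; g += a; g -= b; }

/* After abstraction: both sites call one shared copy.  Because the
   sites already use the same names for a & b, the "call" needs
   only a return address. */
static void common_seq(int a, int b) { g = a * b; g += a; g -= b; }
void update1_abs(int a, int b) { common_seq(a, b); }
void update2_abs(int a, int b) { common_seq(a, b); }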
Consolidation
Hoisting
• Compute very busy expressions
  - Each block b is annotated with VERYBUSY(b), the set of expressions that are evaluated along every path leaving b, without redefinition of their constituent subexpressions
  - For e ∈ VERYBUSY(b), evaluating e at the end of b makes those subsequent evaluations redundant
• Insert a computation of each e ∈ VERYBUSY(b) at the end of b
• Either replace the subsequent computations with a reference, or run some form of global common-subexpression elimination
Hoisting should reduce code size.
It does not directly reduce running times.
Safety: Code on path is same. Profit: Code size. Opp: Compute VERYBUSY. DC: Simple.
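A C sketch of hoisting (names hypothetical): a * b is in VERYBUSY at the test, since it is evaluated along every path leaving it with no intervening redefinition of a or b, so one evaluation can be inserted ahead of the branch.

int before_hoist(int a, int b, int p) {
    if (p) return a * b + 1;       /* a * b evaluated here ...     */
    else   return a * b - 1;       /* ... and here                 */
}

int after_hoist(int a, int b, int p) {
    int t = a * b;                 /* one evaluation, end of block */
    if (p) return t + 1;           /* later uses become references */
    else   return t - 1;
}

Either version still evaluates a * b exactly once along any execution path, which is why hoisting shrinks the code without directly reducing running time.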
Consolidation
Sinking
• Conceptually, sinking is the inverse of hoisting
• Locate expressions that are computed on every path that reaches b, with no subsequent use or redefinition
• Insert evaluation at b and eliminate earlier evaluations
• Same code space benefits
Common implementation technique: cross jumping, or tail merging
• At each join point, look back across the branch or jump
• If identical ops lie along each path, pull them across the join
  - Can use a window to tolerate minor differences in order
Particularly effective at merging procedure epilog code
A YADFA problem (yet another data-flow analysis)
Safety: Code on path is same. Profit: Code size. Opp: Simple matching. DC: Simple.
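A C sketch of sinking (names hypothetical): the store through r is computed on every path reaching the join, with no later use or redefinition of its operands on either path, so one copy can be sunk below the join. Cross jumping finds the same opportunity by matching identical ops backward across the jump.

void before_sink(int p, int *r, int x, int y) {
    if (p) { x += 1; *r = x + y; }     /* identical tail ...       */
    else   { y -= 1; *r = x + y; }     /* ... on both paths        */
}

void after_sink(int p, int *r, int x, int y) {
    if (p) x += 1;
    else   y -= 1;
    *r = x + y;                        /* one copy, below the join */
}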