TRANSCRIPT
Replication & Consolidation
— a grab bag of transformations —
COMP 512, Rice University
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
The Comp 512 Taxonomy

Machine Independent
• Redundancy: LVN, SVN, DVNT; GCSE; Redundant Store Elimination
• Dead code: Dead code elimination; Clean; Simplification of algebraic identities
• Code motion: Constant Folding
• Create opportunities: Loop Unrolling; Inline Substitution
• Specialization: Loop Unrolling; Loop Fusion (unroll & jam); Inline Substitution
The Comp 512 Taxonomy

Machine Dependent
• Hide latency: Block Placement
• Manage resources: Allocation (registers, TLB slots); Scheduling; Copy coalescing (O-O-SSA)
• Special features: Instruction selection

We have seen many fewer examples of machine-dependent transformations. Most of these were in the undergraduate class.

Analysis techniques:
• DFA: DOM, LIVE, AVAIL, VERYBUSY, REACHES, CONSTANTS
• Plus dominance frontiers & the SSA construction
Today’s lecture: Replication and consolidation
Looking at Specialization in More Depth
Major impediments to specialization are
• Lack of specific compile-time knowledge about values
• Presence of multiple conflicting contexts for the code
Code replication can mitigate both problems
Replication
• Lets the compiler tailor code for a more constrained context
• Creates new points in the code where particular facts are true
• Replication can take many forms
Perkin-Elmer’s Universal Optimizing Compiler, Fortran VIIz
• Inlined every call (no recursion in Fortran)
• Subjected result to global optimization
• Speedups to 4x; compile times to match
Concrete Example
By cloning parts of the program, the compiler can create circumstances where new or important facts hold true
• Known values or known properties (value of x)
• Simplified control flow (eliminate branches & combine blocks)
• Better opportunities for optimization
• However, larger code is a potential problem
[Figure: three predecessor blocks assign x ← 7, x ← -2, and x ← 13; all three flow into the test "if x > 0 then fee( … ) else fie( … )". After cloning the test into each predecessor, the x ← 7 and x ← 13 paths go straight to fee( … ) and the x ← -2 path goes straight to fie( … ); Clean will combine these blocks. A C sketch of the effect follows below.]
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Need to look at DF info. DC: Simple.
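A minimal C sketch of the idea, with a hypothetical caller and hypothetical callees fee and fie (only the values of x come from the figure): once the test is cloned into each predecessor, the sign of x is a known fact on every path, each copy of the test folds away, and Clean combines the surviving blocks.

void fee(int x);                    /* hypothetical callees */
void fie(int x);

/* Before cloning: three paths merge before the test, so the
   compiler cannot fold the branch. */
void before(int which) {
    int x;
    if (which == 0)      x = 7;
    else if (which == 1) x = -2;
    else                 x = 13;
    if (x > 0) fee(x);              /* sign of x unknown here */
    else       fie(x);
}

/* After cloning the test into each predecessor: each cloned test
   sees a constant x and folds; Clean combines the blocks. */
void after(int which) {
    if (which == 0)      fee(7);    /* x = 7:  x > 0 is true  */
    else if (which == 1) fie(-2);   /* x = -2: x > 0 is false */
    else                 fee(13);   /* x = 13: x > 0 is true  */
}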
Superblock Cloning
Superblock cloning creates maximal-length basic blocks.

[Figure: the CFG of Figure 8.6 in EaC2e, with an added back edge (blocks A, B, C, D, E, F, G); after cloning to form superblocks, the graph holds 2 copies of F & 3 copies of G. (Hwu et al., [201] in EaC2e)]
Superblock Cloning
General idea:
• At each merge point, clone the successor to avoid the join point
  - Creates a single-predecessor block
  - Merge the new block with its predecessor
  - Stop at a back edge (w.r.t. the DFST)
• New opportunities
  - Creates long basic blocks for strong local algorithms
  - The control predicate may provide additional facts
  - Eliminates jumps & their overhead (minor effect)

[Figure: a diamond CFG in which A branches on x, with x > 0 leading to B and x ≤ 0 leading to C; B and C merge at D, so the test "x ? 0" tells us nothing in D. After cloning D, each copy has a single predecessor, and we now know sign(x) in D.]
Widely used in the literature (scheduling, Dynamo)
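The general idea can be sketched in C as a tail-duplication pass over a toy CFG; the representation, the two-successor limit, and the names here are mine, not the lecture's. At each join point reached by a forward edge, the pass gives the predecessor its own copy of the block (which a merge pass would then fold into it); edges whose targets do not have larger reverse-postorder numbers are treated as back edges and left alone.

#include <stdio.h>

enum { MAXB = 64, NONE = -1 };

static int succ[MAXB][2];      /* up to two successors per block */
static int npred[MAXB];        /* predecessor counts             */
static int rpo[MAXB];          /* reverse-postorder numbers      */
static int nblocks;

/* Copy block s; the copy inherits s's successors & rpo number. */
static int clone(int s) {
    int c = nblocks++;
    for (int k = 0; k < 2; k++) {
        succ[c][k] = succ[s][k];
        if (succ[c][k] != NONE) npred[succ[c][k]]++;
    }
    rpo[c] = rpo[s];
    return c;
}

/* At each join point, give the predecessor a private copy of the
   successor; stop at back edges (target rpo not larger). */
static void superblock_clone(void) {
    for (int b = 0; b < nblocks; b++)        /* visits clones too */
        for (int k = 0; k < 2; k++) {
            int s = succ[b][k];
            if (s == NONE || rpo[s] <= rpo[b]) continue;  /* back edge */
            if (npred[s] > 1) {                           /* join point */
                succ[b][k] = clone(s);
                npred[s]--;
                npred[succ[b][k]] = 1;
            }
        }
}

int main(void) {
    /* The diamond from the next slide: A->B, A->C, B->D, C->D,
       with A = 0, B = 1, C = 2, D = 3. */
    for (int b = 0; b < MAXB; b++) succ[b][0] = succ[b][1] = NONE;
    nblocks = 4;
    succ[0][0] = 1; succ[0][1] = 2;
    succ[1][0] = 3; succ[2][0] = 3;
    npred[1] = npred[2] = 1; npred[3] = 2;
    for (int b = 0; b < 4; b++) rpo[b] = b;
    superblock_clone();
    printf("%d blocks after cloning\n", nblocks);   /* prints 5 */
    return 0;
}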
Superblock Cloning
[Figure: the diamond CFG before cloning (A branches to B & C, which merge at D) and after cloning (B and C each have a private copy of D).]

Sources of Improvement
• Longer blocks
  - Better local optimization
  - More ops per branch
• Longer EBBs (extended basic blocks)
  - Join points replicated
  - New blocks with new facts

What about code size?
• I-cache locality: if control stays in one EBB, locality may improve
• Virtual memory: worse, unless the compiler finds lots of optimization
Superblock Cloning
[Figure: the same diamond CFG, before and after cloning D.]
Value Numbering
• DVNT on the original graph gets the EBBs (A,B) and (A,C)
• DVNT on the cloned graph gets (A,B,D) and (A,C,D)
• Often leads to improvement
Scheduling
• Local scheduling gets larger region
• EBB scheduling extends to D and tailors it to (A,B) & (A,C)
• Some code growth offset by less compensation code
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Find loops (back edges). DC: Simple.
Fall-through Branch Optimization
while ( … ) {
  if ( expr )
    then block 1
    else block 2
}

[Figure: inside the loop, a branch on expr selects block 1 or block 2; one of the two is the fall-through (FT) case.]
Some branches have inter-iteration locality
• Taken this time makes taken next more likely
• Clone to make FT case more likely
• This version falls through while the condition stays the same, and switches loop copies when expr changes
• Hopkins suggests that it paid off in PL.8
• Predication eliminates it completely
[Figure: two cloned copies of the loop. In one copy expr is the FT case, so the branch falls through to block 1; in the other, not expr is the FT case, so it falls through to block 2. A change in expr jumps control to the other copy.]
Wienskoski carried this idea to its logical conclusion, using replication to improve software pipelining in the presence of control flow. This simple form resembles software branch prediction — and hardware is pretty good at that …
Safety: Code on path is same. Profit: Lower branch cost. Opp: Branches in loops. DC: Simple.
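A C sketch of the cloned loop, where expr, block1, and block2 are hypothetical stand-ins for the slide's condition and blocks: each copy of the loop is laid out so that its own condition is the fall-through case, and control jumps to the other copy only when expr changes value.

static int  expr(int i)   { return (i % 8) < 6; }  /* hypothetical condition with runs */
static void block1(int i) { (void)i; /* work for the true case  */ }
static void block2(int i) { (void)i; /* work for the false case */ }

/* Original loop: the branch direction changes with expr, so one
   of the two paths is a taken branch every iteration. */
void plain(int n) {
    for (int i = 0; i < n; i++) {
        if (expr(i)) block1(i);
        else         block2(i);
    }
}

/* Cloned loop: while expr stays true, copy 1 falls through to
   block1; while it stays false, copy 2 falls through to block2.
   Only a change in expr takes a jump. */
void cloned(int n) {
    int i = 0;
loop_true:                         /* copy 1: expr is the FT case  */
    while (i < n) {
        if (!expr(i)) goto to_false;
        block1(i); i++;
    }
    return;
to_false:                          /* copy 2: !expr is the FT case */
    block2(i); i++;
    while (i < n) {
        if (expr(i)) goto to_true;
        block2(i); i++;
    }
    return;
to_true:
    block1(i); i++;
    goto loop_true;
}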
Loop Unrolling
Replicate loop body & adjust loop header
Unrolling by a factor of four gives
• Reduction in loop-end tests
• Reduction in number of branches (predictable, but executed)
• Reduction in ratio of delay slots to useful work
Other benefits
• Loop-ending copies can be eliminated (Kennedy thesis)
• Common subexpressions in address calculations
for i = 1 to n
  a[i] = a[i] * b[i]

becomes

for i = 1 to n by 4
  a[i]   = a[i]   * b[i]
  a[i+1] = a[i+1] * b[i+1]
  a[i+2] = a[i+2] * b[i+2]
  a[i+3] = a[i+3] * b[i+3]
Assume mod(n,4) = 0
More complex cases in Fig. 10.12
Dasgupta’s thesis has an extreme example …
Safety: Same ops in body. Profit: Lower overhead + indirect. Opp: Any loop. DC: Choosing the factor is hard.
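A C sketch of the same loop that drops the mod(n,4) = 0 assumption (the general cases are the subject of Fig. 10.12): the unrolled loop handles groups of four, and a short cleanup loop handles the leftover iterations.

/* Unrolled by four: one loop-end test and one branch per four
   elements of useful work. */
void scale(int n, double a[], double b[]) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        a[i]   = a[i]   * b[i];
        a[i+1] = a[i+1] * b[i+1];
        a[i+2] = a[i+2] * b[i+2];
        a[i+3] = a[i+3] * b[i+3];
    }
    for (; i < n; i++)             /* cleanup: the n mod 4 leftovers */
        a[i] = a[i] * b[i];
}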
Inline Substitution
Replace a procedure call with the body of the called procedure
• Textual substitution to create the effects of parameter binding
• Private copy of code can be tailored to the call site's context
  - Constants, unambiguous pointers, aliases, …
• Eliminates overhead of the procedure call
  - Register save & restore
  - Disruption of call & return
• Eliminates benefits of the procedure call
  - A call resets the state of the register allocator
  - The procedure abstraction keeps the name space small
• Usually assumed to be profitable, although studies disagree …
The oldest interprocedural optimization [Ershov 1966]
This relates to OOLs as well, with their sky-high ratio of overhead to work and the difficulty of converting virtual calls to concrete calls.
Inlining one call can reveal the relevant class for another …
Safety: Paths execute the same programmer-specified ops. Profit: Lower overhead + indirect. Opp: Any call. DC: Complex problem.
Inline Substitution
Example
[Figure: a call graph in which fee and fie each contain a call to foe.]
Potential for exponential growth
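A C sketch with the slide's call graph (the procedure bodies are hypothetical): fee and fie both call foe, and after inlining each holds a private, tailorable copy of foe's body.

int foe(int x) { return x * x + 1; }      /* hypothetical body */

int fee(int a) { return foe(a) + 2; }     /* before inlining   */
int fie(int a) { return foe(a) - 2; }

/* After inlining foe at both call sites: parameter binding is done
   by textual substitution, and each copy can be specialized to its
   own context. */
int fee_inlined(int a) { int t = a * a + 1; return t + 2; }
int fie_inlined(int a) { int t = a * a + 1; return t - 2; }

If fee were itself inlined into several callers, each of those callers would receive its own copy of foe's body as well; that compounding is where the exponential growth comes from.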
Inline Substitution — After inlining foe
[Figure: after inlining, fee and fie each contain a private copy of foe's body.]
Procedure Cloning
Implement multiple copies of a procedure & tailor them to different calling environments
• Idea is to gain some of inlining’s benefit with limited growth
• Careful assignment of calls to clones can improve DFA results
Why clone procedures?
• Conservative alternative to inlining
  - Split based on deterioration of forward data-flow sets
  - Limit growth to cases where knowledge improves
• Avoids pitfalls of inlining
  - Code growth & compile time (2x in the CHT study)
  - Deterioration in global optimization (CHT study)
  - Recompilation problems

In 1985, this idea generated interest as a technique for implementing Ada generics
Procedure Cloning
The Concept
[Figure: a call graph in which main calls a, b, c, & d; each of those calls solver; and solver calls helper1 & helper2.]

Assume that a, b, & d have unit-stride access to memory, but c does not.
• Maybe solver and helper1 can be specialized for unit stride
• Would like to stop c from preventing that specialization
• Isolate c with its own copy of the routines
• If solver is in a library, we might implement solver/helper1 in several pre-packaged ways
  → Telescoping-languages strategy
  → Connect call sites to appropriate implementations
  → Move the optimization cost to library-preparation time
• Avoid most of the code growth that would accrue with inlining
Procedure Cloning
The Concept
[Figure: the same call graph after cloning. Calls from a, b, & d are connected to solver' and helper1', clones optimized for unit-stride access; c still uses the general versions of solver & helper1.]
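A C sketch of the cloned library routine; the signatures and the explicit stride parameter are hypothetical. The clone solver_unit fixes the stride at 1, so the subscript arithmetic folds and the accesses become contiguous; calls from a, b, & d would be connected to it, while c keeps the general version.

/* General version: strided access, used by c. */
double solver(const double *v, int n, int stride) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i * stride];      /* a multiply in every subscript */
    return sum;
}

/* Clone specialized for unit stride, used by a, b, & d: with
   stride fixed at 1, the multiply folds away and the loop walks
   memory contiguously, which helps later optimization. */
double solver_unit(const double *v, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}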
Procedure Cloning
Practical algorithms
• Clone on forward constants
  - Metzger & Stroud, in the Convex compiler (ACM LOPLAS, 1993)
  - Significant increase in constants found & folded
• Hall et al. gave a complex algorithm for score-based cloning
• Clone to sharpen analysis (useful as an analysis tool)
  - Join points combine & lose information
  - Replicate for analysis & keep the result if profitable
• The idea has been applied in partial evaluation, and in the compilation of APL, Smalltalk-80, and SELF
Safety: Code on path is same. Profit: Control-flow + indirect. Opp: Need to look at DF info. DC: Complex (code size).
Consolidation
Consolidation is the opposite of replication
• Find common code sequences & replace with shared code
• Reduce code size
We will see several consolidation transformations
• Procedure abstraction: finding common code sequences & creating procedures to hold them
• Hoisting: replacing multiple instances with one, earlier in the CFG
• Sinking: replacing multiple instances with one, later in the CFG
VERYBUSY expressions
Consolidation
Procedure Abstraction
If replication can enable specialization, abstraction can undo the negative effects of excess replication, whether done by the programmer or the compiler
• Pattern matching to identify common sequences
  - Abstract across register names and branch-target names
• Replace each common sequence with an inexpensive call/return
  - Use the same register names ⇒ only need a return address
  - Common sequence ⇒ the storage map is the same, too

Procedure abstraction was originally proposed as a means of reducing working-set sizes; the goal was to reduce the virtual-memory demands of timesharing.
Safety: Code on path is same. Profit: Smaller code. Opp: Pattern-matching problem. DC: Simple.
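Procedure abstraction works on assembly code, matching across register and branch-target names, but its effect can be sketched at source level; all the names below are hypothetical. One copy of the repeated sequence survives, reached by an inexpensive call.

static int g;                       /* hypothetical shared state */

/* Before: the same three-op sequence appears in two routines. */
void update1(int a, int b) { g = a * b; g += a; g -= b; }
void update2(int a, int b) { g = a * b; g += a; g -= b; }

/* After abstraction: both sites call one shared copy.  Because the
   sites already use the same names for a & b, the "call" needs
   only a return address. */
static void common_seq(int a, int b) { g = a * b; g += a; g -= b; }
void update1_abs(int a, int b) { common_seq(a, b); }
void update2_abs(int a, int b) { common_seq(a, b); }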
Consolidation
Hoisting
• Compute very busy expressions
  - Each block b is annotated with VERYBUSY(b), the set of expressions that are evaluated along every path leaving b, without redefinition of their constituent subexpressions
  - For e ∈ VERYBUSY(b), evaluating e at the end of b makes those subsequent evaluations redundant
• Insert a computation of each e ∈ VERYBUSY(b) at the end of b
• Either replace the subsequent computations with a reference, or run some form of global common-subexpression elimination
Hoisting should reduce code size.
It does not directly reduce running times.
Safety: Code on path is same. Profit: Code size. Opp: Compute VERYBUSY. DC: Simple.
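A C sketch of hoisting (names hypothetical): a * b is in VERYBUSY at the test, since it is evaluated along every path leaving it with no intervening redefinition of a or b, so one evaluation can be inserted ahead of the branch.

int before_hoist(int a, int b, int p) {
    if (p) return a * b + 1;       /* a * b evaluated here ...     */
    else   return a * b - 1;       /* ... and here                 */
}

int after_hoist(int a, int b, int p) {
    int t = a * b;                 /* one evaluation, end of block */
    if (p) return t + 1;           /* later uses become references */
    else   return t - 1;
}

Either version still evaluates a * b exactly once along any execution path, which is why hoisting shrinks the code without directly reducing running time.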
Consolidation
Sinking
• Conceptually, sinking is the inverse of hoisting
• Locate expressions that are computed on every path that reaches b, with no subsequent use or redefinition
• Insert evaluation at b and eliminate earlier evaluations
• Same code space benefits
Common implementation technique: cross jumping, or tail merging
• At each join point, look back across the branch or jump
• If identical ops lie along each path, pull them across the join
  - Can use a window to tolerate minor differences in order
Particularly effective at merging procedure epilog code
A YADFA problem (yet another data-flow analysis)
Safety: Code on path is same. Profit: Code size. Opp: Simple matching. DC: Simple.
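A C sketch of sinking (names hypothetical): the store through r is computed on every path reaching the join, with no later use or redefinition of its operands on either path, so one copy can be sunk below the join. Cross jumping finds the same opportunity by matching identical ops backward across the jump.

void before_sink(int p, int *r, int x, int y) {
    if (p) { x += 1; *r = x + y; }     /* identical tail ...       */
    else   { y -= 1; *r = x + y; }     /* ... on both paths        */
}

void after_sink(int p, int *r, int x, int y) {
    if (p) x += 1;
    else   y -= 1;
    *r = x + y;                        /* one copy, below the join */
}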