TRANSCRIPT
Terminology, Principles, and Concerns, III
With examples from DOM (Ch 9) and DVNT (Ch 10)
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
Last Lecture
• Extended Basic Blocks
• Superlocal value numbering
  > Treat each path as a single basic block
  > Use a scoped hash table & SSA names to make it efficient
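The scoped hash table can be sketched with Python's ChainMap; the names (ScopedTable, vn0) and the tiny driver are illustrative choices, not the lecture's implementation. It shows why scoping makes superlocal VN cheap: backing out of a block discards that block's entries in one step.

```python
from collections import ChainMap

class ScopedTable:
    """Scoped hash table: one scope per block on the current EBB path.
    Popping a scope discards that block's entries in one operation."""
    def __init__(self):
        self.maps = ChainMap({})
    def push(self):
        self.maps = self.maps.new_child()
    def pop(self):
        self.maps = self.maps.parents
    def lookup(self, key):
        return self.maps.get(key)
    def insert(self, key, vn):
        self.maps[key] = vn

# Walk the path A -> C of the running example:
t = ScopedTable()
t.push()                                   # enter A
t.insert(('+', 'a', 'b'), 'vn0')           # m0 <- a + b defines vn0
t.push()                                   # enter C; A's facts still visible
assert t.lookup(('+', 'a', 'b')) == 'vn0'  # so q0 <- a + b is redundant
t.insert(('+', 'c', 'd'), 'vn1')           # r1 <- c + d, local to C's scope
t.pop()                                    # back out of C to try A -> B
assert t.lookup(('+', 'c', 'd')) is None   # C's entries are gone
assert t.lookup(('+', 'a', 'b')) == 'vn0'  # A's entries survive
```

SSA names matter here: because each name is defined once, entries never need to be invalidated, only scoped.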
COMP 512, Rice University
This Lecture
• Dominator Trees
  > Computing dominator information
  > Global data-flow analysis
• Dominator-based Value Numbering
  > Enhance the Superlocal Value Numbering algorithm so that it can cover more blocks
• Optimizing a loop nest
  > Finding loop nests
  > Loop unrolling as an initial transformation
This is in SSA Form
Superlocal Value Numbering
Block A: m0 ← a + b
         n0 ← a + b
Block B: p0 ← c + d
         r0 ← c + d
Block C: q0 ← a + b
         r1 ← c + d
Block D: e0 ← b + 18
         s0 ← a + b
         u0 ← e + f
Block E: e1 ← a + 17
         t0 ← c + d
         u1 ← e + f
Block F: e3 ← Φ(e0,e1)
         u2 ← Φ(u0,u1)
         v0 ← a + b
         w0 ← c + d
         x0 ← e + f
Block G: r2 ← Φ(r0,r1)
         y0 ← a + b
         z0 ← c + d

(CFG edges: A→B, A→C; C→D, C→E; D→F, E→F; B→G, F→G)
With all the bells & whistles
• Finds more redundancy
• Pays little additional cost
• Still does nothing for F & G
Superlocal techniques
• Some local methods extend cleanly to superlocal scopes
• Value numbering does not back up
• If C’s entries went directly into A’s table, backing out of C to follow another path would be a problem; the scoped table avoids it
What About Larger Scopes?
We have not helped with F or G
• Multiple predecessors
• Must decide what facts hold in F and in G
  > For G, combine B & F? Merging state is expensive
  > Fall back on what’s known
(The example CFG and code from the previous slide, repeated.)
Dominators
Definitions
• x dominates y if and only if every path from the entry of the control-flow graph to the node for y includes x
• By definition, x dominates x
• We associate a DOM set with each node
• |DOM(x )| ≥ 1
Immediate dominators
• For any node x, there must be a y in DOM(x ) closest to x
• We call this y the immediate dominator of x
• As a matter of notation, we write this as IDOM(x )
Dominators
Dominators have many uses in analysis & transformation
• Finding loops
• Building SSA form
• Making code motion decisions
We’ll look at how to compute dominators later
Dominator tree for the example:

        A
      / | \
     B  C  G
       /|\
      D E F

Dominator sets:
  DOM(A) = {A}        DOM(B) = {A,B}      DOM(C) = {A,C}
  DOM(D) = {A,C,D}    DOM(E) = {A,C,E}    DOM(F) = {A,C,F}
  DOM(G) = {A,G}
Back to the discussion of value numbering over larger scopes ...
(The example CFG and code, repeated.)
Original idea: R.T. Prosser. “Applications of Boolean matrices to the analysis of flow diagrams,” Proceedings of the Eastern Joint Computer Conference, Spartan Books, New York, pages 133-138, 1959.
What About Larger Scopes?
We have not helped with F or G
• Multiple predecessors
• Must decide what facts hold in F and in G
  > For G, combine B & F? Merging state is expensive
  > Fall back on what’s known
• Can use table from IDOM(x) to start x
  > Use C for F and A for G
  > Imposes a dominator-based application order
Leads to Dominator VN Technique (DVNT)
(The example CFG and code, repeated.)
Dominator Value Numbering
The DVNT Algorithm
• Use superlocal algorithm on extended basic blocks
  > Retain use of scoped hash tables & SSA name space
• Start each node with table from its IDOM
  > DVNT generalizes the superlocal algorithm
• No values flow along back edges (i.e., around loops)
• Constant folding, algebraic identities as before

Larger scope leads to (potentially) better results
  > LVN + SVN + a good start for EBBs missed by SVN
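The DVNT core can be sketched in a few lines of Python: each block value-numbers against a ChainMap scope inherited from its immediate dominator, then recurses on its dominator-tree children. This is an illustrative sketch, not the lecture's code; φ-functions, constant folding, and algebraic identities are omitted, and the driver uses only the A→C→F chain from the slides' example.

```python
from collections import ChainMap

def dvnt(block, code, dom_children, table, vn, redundant):
    """Value-number `block` starting from its IDOM's table, then
    recurse down the dominator tree (phi handling omitted)."""
    scope = table.new_child()                # start from IDOM's table
    for dst, op, a, b in code[block]:
        key = (op, vn.get(a, a), vn.get(b, b))
        if key in scope:                     # redundant: reuse prior value
            vn[dst] = scope[key]
            redundant.append(dst)
        else:
            vn[dst] = dst                    # first computation names the value
            scope[key] = dst
    for child in dom_children.get(block, []):
        dvnt(child, code, dom_children, scope, vn, redundant)

# Fragment of the running example: the dominator chain A -> C -> F.
code = {
    'A': [('m0', '+', 'a', 'b'), ('n0', '+', 'a', 'b')],
    'C': [('q0', '+', 'a', 'b'), ('r1', '+', 'c', 'd')],
    'F': [('v0', '+', 'a', 'b'), ('w0', '+', 'c', 'd'), ('x0', '+', 'e', 'f')],
}
dom_children = {'A': ['C'], 'C': ['F']}
redundant = []
dvnt('A', code, dom_children, ChainMap({}), {}, redundant)
assert redundant == ['n0', 'q0', 'v0', 'w0']
```

Note how F, which superlocal VN could not help, now finds v0 and w0 redundant because it starts from C's table.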
Dominator Value Numbering
[The example code again, value numbered with DVNT: F starts from C’s table and G from A’s, so some redundancies in F and G are now found.]
DVNT advantages
• Finds more redundancy
• Little additional cost
• Retains online character

DVNT shortcomings
• Misses some opportunities
• No loop-carried CSEs or constants
Computing Dominators
Critical first step in SSA construction and in DVNT
• A node n dominates m iff n is on every path from n0 to m
  > Every node dominates itself
  > n’s immediate dominator is its closest dominator, IDOM(n)†

DOM(n0) = { n0 }
DOM(n) = { n } ∪ ( ∩p ∈ preds(n) DOM(p) )
Computing DOM
• These simultaneous set equations define a simple problem in data-flow analysis
• Equations have a unique fixed point solution
• An iterative fixed-point algorithm will solve them quickly
† IDOM(n) ≠ n, unless n is n0, by convention.
Initially, DOM(n) = N, the set of all nodes, for each n ≠ n0.
Round-robin Iterative Algorithm
Termination
• Makes sweeps over the nodes
• Halts when some sweep produces no change
DOM(b0) ← { b0 }
for i ← 1 to N
    DOM(bi) ← { all nodes in the graph }

change ← true
while (change)
    change ← false
    for i ← 1 to N
        TEMP ← { bi } ∪ ( ∩x ∈ preds(bi) DOM(x) )
        if DOM(bi) ≠ TEMP then
            change ← true
            DOM(bi) ← TEMP
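The equations and the round-robin algorithm translate directly into a short Python sketch. The function name is mine, and the CFG edges are read off the slides' A–G example (A→B, A→C; C→D, C→E; D→F, E→F; B→G, F→G):

```python
def compute_dom(preds, entry):
    """Round-robin iterative solver for the DOM equations."""
    nodes = list(preds)
    dom = {n: set(nodes) for n in nodes}   # initially DOM(n) = N ...
    dom[entry] = {entry}                   # ... except DOM(n0) = { n0 }
    change = True
    while change:
        change = False
        for n in nodes:
            if n == entry:
                continue
            # TEMP <- { n } U (intersection of DOM(p) over p in preds(n))
            temp = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if temp != dom[n]:
                dom[n] = temp
                change = True
    return dom

preds = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['C'],
         'E': ['C'], 'F': ['D', 'E'], 'G': ['B', 'F']}
dom = compute_dom(preds, 'A')

def idom(n):
    # IDOM(n) is the closest strict dominator: the one whose own DOM set
    # is largest, since the dominators of n are totally ordered.
    return max(dom[n] - {n}, key=lambda d: len(dom[d]))

assert dom['F'] == {'A', 'C', 'F'}         # e.g. F is dominated by A and C
assert idom('F') == 'C' and idom('G') == 'A'
```

On this example the solver converges in one productive sweep plus one confirming sweep, which is typical when nodes are visited in an order close to reverse postorder.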
Example
[Figure: a flow graph with entry B0 and nodes B1 through B7, alongside tables showing the progress and results of the iterative solution for DOM.]
Example
[Figure: the dominance tree for the same flow graph (entry B0, nodes B1 through B7), alongside the progress and results of the iterative solution for DOM.]
There are asymptotically faster algorithms.
With the right data structures, the iterative algorithm can be made extremely fast.
See Cooper, Harvey, & Kennedy (on the web site), or the algorithm in Chapter 9 of EaC.
Aside on Data-Flow Analysis
The iterative DOM calculation is an example of data-flow analysis
• Data-flow analysis is a collection of techniques for compile-time reasoning about the run-time flow of values
• Data-flow analysis almost always operates on a graph
  > Problems are trivial in a basic block
  > Global problems use the control-flow graph (or a derivative)
  > Interprocedural problems use the call graph (or a derivative)
• Data-flow problems are formulated as simultaneous equations
  > Sets attached to nodes and edges
  > One solution technique is the iterative algorithm
• Desired result is usually the meet-over-all-paths (MOP) solution
  > “What is true on every path from the entry node?”
  > “Can this event happen on any path from the entry?” (related to safety)
Aside on Data-Flow Analysis
Why did the iterative algorithm work?
Termination
• The DOM sets are initialized to the (finite) set of nodes
• The DOM sets shrink monotonically
• The algorithm reaches a fixed point where they stop changing
Correctness
• We can prove that the fixed point solution is also the MOP
• That proof is beyond today’s lecture, but we’ll revisit it
Efficiency
• The round-robin algorithm is not particularly efficient
• Order in which we visit nodes is important for efficient solutions
Regional Optimization: Improving Loops
Compilers have always focused on loops
• Execution counts are higher inside loops than outside
• Repeated, related operations
• Much of the real work takes place in loops (linear algebra)

Several effects to attack in a loop or loop nest
• Overhead
  > Decrease the control-structure cost per iteration
• Locality
  > Spatial locality ⇒ use of co-resident data
  > Temporal locality ⇒ reuse of the same data item
• Parallelism
  > Move loops with independent iterations to the outer position
  > Inner positions for vector hardware & SSE
Regional Optimization: Improving Loops
Loop unrolling (the oldest trick in the book)
• To reduce overhead, replicate the loop body
Sources of improvement
• Less overhead per useful operation
• Longer basic blocks for local optimization
Doesn’t mess up spatial locality on either y or m (column-major order)
Regional Optimization: Improving Loops
With loop nest, may unroll inner loop
      do 60 j = 1, n2
        do 50 i = 1, n1
          y(i) = y(i) + x(j) * m(i,j)
   50   continue
   60 continue
Critical inner loop from dmxpy in Linpack
Doesn’t mess up reuse on x(j)
      do 60 j = 1, n2
        nextra = mod(n1,4)
        if (nextra .ge. 1) then
          do 49 i = 1, nextra, 1
            y(i) = y(i) + x(j) * m(i,j)
   49     continue
        end if
        do 50 i = nextra+1, n1, 4
          y(i)   = y(i)   + x(j) * m(i,j)
          y(i+1) = y(i+1) + x(j) * m(i+1,j)
          y(i+2) = y(i+2) + x(j) * m(i+2,j)
          y(i+3) = y(i+3) + x(j) * m(i+3,j)
   50   continue
   60 continue
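As a sanity check, here is an assumed Python translation of both versions (indices shifted to 0-based); with the remainder loop absorbing the mod(n1,4) leftover iterations, the unrolled loop computes the same y:

```python
# Straightforward dmxpy loop nest.
def dmxpy(n1, n2, y, x, m):
    for j in range(n2):
        for i in range(n1):
            y[i] += x[j] * m[i][j]

# Inner loop unrolled by 4, with a remainder loop for n1 % 4 iterations.
def dmxpy_unrolled(n1, n2, y, x, m):
    for j in range(n2):
        nextra = n1 % 4
        for i in range(nextra):          # remainder loop
            y[i] += x[j] * m[i][j]
        for i in range(nextra, n1, 4):   # body unrolled by 4
            y[i]     += x[j] * m[i][j]
            y[i + 1] += x[j] * m[i + 1][j]
            y[i + 2] += x[j] * m[i + 2][j]
            y[i + 3] += x[j] * m[i + 3][j]

n1, n2 = 7, 5                            # n1 % 4 != 0 exercises the prelude
x = [j + 1 for j in range(n2)]
m = [[i * n2 + j for j in range(n2)] for i in range(n1)]
y1, y2 = [0] * n1, [0] * n1
dmxpy(n1, n2, y1, x, m)
dmxpy_unrolled(n1, n2, y2, x, m)
assert y1 == y2
```

Each element of y sees the same additions in the same order, so the results match exactly.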
Regional Optimization: Improving Loops
With loop nest, may unroll outer loop
• Trick is to unroll the outer loop and fuse the resulting inner loops
  > Loop fusion combines the bodies of two similar loops
      do 60 j = 1, n2
        do 50 i = 1, n1
          y(i) = y(i) + x(j) * m(i,j)
   50   continue
   60 continue

Critical inner loop from dmxpy in Linpack
      do 60 j = 1, n2
        nextra = mod(n1,4)
        if (nextra .ge. 1) then
          do 49 i = 1, nextra, 1
            y(i) = y(i) + x(j) * m(i,j)
   49     continue
        end if
        do 50 i = nextra+1, n1, 4
          y(i) = y(i) + x(j) * m(i,j)
          y(i) = y(i) + x(j+1) * m(i,j+1)
          y(i) = y(i) + x(j+2) * m(i,j+2)
          y(i) = y(i) + x(j+3) * m(i,j+3)
   50   continue
   60 continue

This is clearly wrong: the j loop still steps by 1 while each iteration consumes x(j) through x(j+3), and the remainder logic still handles n1 rather than n2.
Regional Optimization: Improving Loops
With loop nest, may unroll outer loop
• Trick is to unroll outer loop and fuse resulting inner loops
      do 60 j = 1, n2
        do 50 i = 1, n1
          y(i) = y(i) + x(j) * m(i,j)
   50   continue
   60 continue

Critical inner loop from dmxpy in Linpack
      nextra = mod(n2,4)
      if (nextra .ge. 1) then
        do 49 j = 1, nextra, 1
          do 48 i = 1, n1
            y(i) = y(i) + x(j) * m(i,j)
   48     continue
   49   continue
      end if
      do 60 j = nextra+1, n2, 4
        do 50 i = 1, n1
          y(i) = y(i) + x(j)   * m(i,j)   + x(j+1) * m(i,j+1)
     &                + x(j+2) * m(i,j+2) + x(j+3) * m(i,j+3)
   50   continue
   60 continue
Save on loads & stores of y(i)?
Spatial reuse in x and m
The author of Linpack, after much testing, chose outer loop unrolling.
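The outer-unroll-and-fuse version can be checked the same way in an assumed Python translation: the remainder is now taken on n2 (the j dimension) and the fused body handles four columns of m per pass over y. Integer data keeps the comparison exact despite the reassociated sum:

```python
# Straightforward dmxpy loop nest.
def dmxpy(n1, n2, y, x, m):
    for j in range(n2):
        for i in range(n1):
            y[i] += x[j] * m[i][j]

# Outer loop unrolled by 4, inner loops fused; remainder on n2 % 4.
def dmxpy_outer(n1, n2, y, x, m):
    nextra = n2 % 4
    for j in range(nextra):              # leftover j iterations
        for i in range(n1):
            y[i] += x[j] * m[i][j]
    for j in range(nextra, n2, 4):       # four columns per pass over y
        for i in range(n1):
            y[i] += (x[j] * m[i][j] + x[j + 1] * m[i][j + 1]
                     + x[j + 2] * m[i][j + 2] + x[j + 3] * m[i][j + 3])

n1, n2 = 5, 6                            # n2 % 4 != 0 exercises the prelude
x = [j + 1 for j in range(n2)]
m = [[i * n2 + j for j in range(n2)] for i in range(n1)]
y1, y2 = [0] * n1, [0] * n1
dmxpy(n1, n2, y1, x, m)
dmxpy_outer(n1, n2, y2, x, m)
assert y1 == y2                          # integer data, so equality is exact
```

This form loads and stores each y(i) once per four columns, which is the saving the slide asks about.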
Regional Optimization: Improving Loops
Other effects of loop unrolling
• Increases the number of independent operations inside the loop
  > May be good for scheduling multiple functional units
• Moves consecutive accesses into the same iteration
  > Scheduler may move them together (locality in the big loop)
• May make cross-iteration redundancies obvious
  > Exposes the address expressions in the example to LVN
• May increase demand for registers
  > Spills can overcome any benefits
• Can unroll to eliminate copies at the end of a loop
  > An often-rediscovered result from Ken Kennedy’s thesis
• Can change other optimizations
  > Weights in spill code (Das Gupta’s example)
Regional Optimization: Improving Loops
Many other loop transformations appear in the literature
• We will have a lecture devoted to them later in the course
• See also COMP 515 and the Allen-Kennedy book
Next class
• Examples of Global Optimization