TRANSCRIPT
Spring 2014 Jim Hogg - UW - CSE - P501 N-1
CSE P501 – Compiler Construction
Compiler Backend Organization
Instruction Selection, Instruction Scheduling, Register Allocation
Instruction Selection: Peephole Optimization, Peephole Instruction Selection
A Compiler: Front End, 'Middle End', Back End

chars → Scan → tokens → Parse → AST → Convert → IR → Optimize → IR → Select Instructions → IR → Allocate Registers → IR → Emit → Machine Code

AST = Abstract Syntax Tree
IR  = Intermediate Representation
Instruction Selection: processors don't support IR; need to convert IR into real machine code
The Big Picture
Compiler = lots of fast stuff, followed by hard problems
Scanner: O(n)
Parser: O(n)
Analysis & Optimization: ~ O(n log n)
Instruction selection: fast or NP-Complete
Instruction scheduling: NP-Complete
Register allocation: NP-Complete
• Recall: P501 covers the backend at a 'survey' level
• A deeper dive would require another full 10-week course
Compiler Backend: 3 Parts
Select Instructions - eg: what is the best x86 instruction to implement vr17 = 12(arp)?
Schedule Instructions - eg: given instructions a, b, c, d, e, f, g, would the program run faster if they were issued in a different order, such as d, c, a, b, g, e, f?
Allocate Registers - eg: which variables to store in registers? which to store in memory?
Instruction Selection is . . .
Select Real Instructions
Mid-Level IR:
• TAC - Three-Address Code: t1 = a op b
• Tree or linear form
• Unlimited supply of temps
• Array address calculations explicit
• Optimizations all done
• Storage locations decided

Low-Level IR:
• Specific to one chip/ISA
• Uses the chip's addressing modes
• But registers still not decided
Instruction Scheduling is . . .
Schedule
• Execute in order to get the correct answer: a b c d e f g h
• Issue in a new order, eg: b a d f c g h e
  • eg: memory fetch is slow
  • eg: divide is slow
• Overall faster - and still get the correct answer!

Originally devised for super-computers. Now used everywhere:
• in-order processors - Atom, older ARM
• out-of-order processors - newer x86
Compiler does the 'heavy lifting' - reduces chip power
Register Allocation is . . .
Allocate
• Virtual registers or temps - unlimited supply
• Real machine registers - eg: EAX, EBP - a very finite supply!
• Enregister or spill, eg:
      R1 = ...
      ... = R1
      push R1
      R1 = ...
      ... = R1
      pop R1
After allocating registers, we may do a second pass of scheduling to improve the speed of spill code
Instruction Selection
Instruction selection chooses which target instructions (eg: for x86, for ARM) to use for each IR instruction.
Selecting the best instructions is massively difficult because modern ISAs provide:
• a huge choice of instructions
• a wide choice of addressing modes
Eg: Intel's x64 Instruction Set Reference Manual = 1422 pages
Choices, choices ...
Most chip ISAs provide many ways to do the same thing:
eg: setting eax to 0 on x86 has several alternatives:
    mov  eax, 0
    xor  eax, eax
    sub  eax, eax
    imul eax, 0
Many machine instructions do several things at once – eg: register arithmetic and effective address calculation. Recall:
lea rdst, [rbase + rindex*scale + offset]
Overview: Instruction Selection
• Map IR into near-assembly code
  • For MiniJava, we emitted textual assembly code
  • Commercial compilers emit binary code directly
• Assume known storage layout and code shape - ie: the optimization phases have already done their thing
• Combine low-level IR operations into machine instructions (take advantage of addressing modes, etc)
Criteria for Instruction Selection
Several possibilities:
• Fastest
• Smallest
• Minimal power (eg: don't use a functional unit if leaving it powered-down is a win)

Sometimes not obvious:
• eg: if one of the functional units in the processor is idle and we can select an instruction that uses that unit, it effectively executes for free, even if that instruction wouldn't be chosen normally
(Some interaction with scheduling here…)
Instruction Selection: Approaches
Two main techniques:
1. Tree-based matching (eg: the maximal munch algorithm)
2. Peephole-based generation
• Note that we select instructions from the target ISA
• We do not decide which physical registers to use - that comes later, during Register Allocation
• We have a few generic registers: ARP, StackPointer, ResultReg
Tree-Based Instruction Selection
Example tree: the expression e * f, with leaves ID<e,arp,4> and ID<f,arp,8>

How do we generate target code for this simple tree?

We could use a template approach - similar to converting an AST into IR:

case ID:
    t1  = offset(node)
    t2  = base(node)
    reg = nextReg()
    emit(loadAO, t1, t2, reg)

Resulting codegen is correct, but dumb. It doesn't even cover: call-by-value vs call-by-reference, enregistered variables, different data types, etc.
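As a concrete illustration of the template idea, here is a minimal Java sketch (not the course's actual code generator); the Node shape, nextReg(), and the ILOC-style strings it emits are assumptions made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    class TemplateCodeGen {
        // Tiny IR tree: an identifier leaf <name, base, offset> or a binary op.
        static class Node {
            String op;               // "ID" or "MUL"
            String base; int offset; // for ID leaves, e.g. <e, arp, 4>
            Node left, right;        // for operators
        }

        int nextReg = 5;
        List<String> code = new ArrayList<>();
        String nextReg() { return "r" + (nextReg++); }
        void emit(String s) { code.add(s); }

        // Template approach: switch on node kind, emit the same shape every time.
        String gen(Node n) {
            switch (n.op) {
                case "ID": {                       // load a variable from base + offset
                    String t1 = nextReg();
                    emit("loadI  " + n.offset + " => " + t1);
                    String res = nextReg();
                    emit("loadAO " + n.base + ", " + t1 + " => " + res);
                    return res;
                }
                case "MUL": {
                    String l = gen(n.left), r = gen(n.right), res = nextReg();
                    emit("mult   " + l + ", " + r + " => " + res);
                    return res;
                }
                default: throw new IllegalStateException("unhandled node " + n.op);
            }
        }

        public static void main(String[] args) {
            // e * f, with e at 4(arp) and f at 8(arp) - the tree from the slide
            Node e = new Node(); e.op = "ID"; e.base = "rarp"; e.offset = 4;
            Node f = new Node(); f.op = "ID"; f.base = "rarp"; f.offset = 8;
            Node mul = new Node(); mul.op = "MUL"; mul.left = e; mul.right = f;
            TemplateCodeGen cg = new TemplateCodeGen();
            cg.gen(mul);
            cg.code.forEach(System.out::println); // the naive 5-instruction sequence below
        }
    }

Running this produces exactly the "naive" sequence shown on the next slide, which is the point: the template never notices that a loadAI could fold the offset into the load.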
Template Code Generation

case ID:
    t1  = offset(node)
    t2  = base(node)
    reg = nextReg()
    emit(loadAO, t1, t2, reg)

Naive (template) code for the e * f tree:
    loadI  4        => r5
    loadAO rarp, r5 => r6
    loadI  8        => r7
    loadAO rarp, r7 => r8
    mult   r6, r8   => r9

Ideal code:
    loadAI rarp, 4  => r5
    loadAI rarp, 8  => r6
    mult   r5, r6   => r7
IR (3-address code) tree-lets:

Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...

(◆ marks a memory dereference)
Rules for Tree-to-Target Conversion (prefix notation)
Memory Dereference: eg rule 11 matches ◆ (+ Reg1 Num2) - a load from the address Reg1 + Num2
Example: Tiling a tiny 4-node tree
Load variable <c, @G, 12>
The 4-node tree: ◆ (+ (Lab @G, Num 12))
• The tree above shows code to access the variable c, stored at offset 12 bytes from label @G
• How many ways can we tile this tree into equivalent ILOC code?
Potential Matches
There are eight potential tilings, each written as the list of rules it uses:
<6,11>   <8,14>   <6,8,10>   <8,6,10>   <6,16,9>   <8,19,9>   <6,8,15,9>   <8,6,15,9>

ILOC for the first four tilings:

<6,11>:
    loadI  @G     => ri
    loadAI ri, 12 => rj

<8,14>:
    loadI  12     => ri
    loadAI ri, @G => rj

<6,8,10>:
    loadI  @G     => ri
    loadI  12     => rj
    loadAO ri, rj => rk

<8,6,10>:
    loadI  12     => ri
    loadI  @G     => rj
    loadAO ri, rj => rk
Tree-Pattern Matching
A given tiling "implements" a tree if it:
• covers every node in the tree, and
• overlaps between any two tiles (trees) are limited to a single node
If <node,op> is in the tiling, then node is also covered by a leaf in another operation tree in the tiling – unless it is the root
Where two operation trees meet, they must be compatible (ie: expect the same value in the same location)
IR AST for Simple Expression
a = b - 2 × c
    a: local var, offset 4 from arp
    b: call-by-reference parameter, offset -16 from arp
    c: var, offset 12 from label @G

IR tree (nested prefix form): = ( + (Val arp, Num 4), - ( + (Val arp, Num -16), × (Num 2, + (Lab @G, Num 12)) ) )

Prefix form - same info as the IR tree:
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
IR AST as Prefix Text String
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
No parentheses? We don't need them: evaluate this expression from right to left, using a simple stack machine (a sketch follows below).
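A minimal sketch of that right-to-left stack evaluation, assuming every operator here is binary; the slide's × is written as x in the code, and the output is just a fully parenthesized rendering of the recovered tree:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class PrefixToTree {
        // Read the prefix string right-to-left with a stack:
        // operands push themselves; an operator pops its two children.
        public static void main(String[] args) {
            String[] tokens =
                "= + Val1 Num1 - + Val2 Num2 x Num3 + Lab1 Num4".split(" ");
            Deque<String> stack = new ArrayDeque<>();
            for (int i = tokens.length - 1; i >= 0; i--) {
                String t = tokens[i];
                boolean isOperator =
                    t.equals("=") || t.equals("+") || t.equals("-") || t.equals("x");
                if (isOperator) {
                    // every operator here is binary, so pop exactly two subtrees
                    String left = stack.pop();
                    String right = stack.pop();
                    stack.push("(" + t + " " + left + " " + right + ")");
                } else {
                    stack.push(t);             // leaf: a Val/Num/Lab operand
                }
            }
            System.out.println(stack.pop());   // the whole tree, parenthesized
        }
    }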
Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 7  Reg → Val1
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...
Rewrite Rules - Another BNF, but including ambiguity!
Select Instructions with LR Parser
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 7  Reg → Val1
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...
• Use an LR parser for this grammar to parse the prefix expression
• The grammar is ambiguous, so there are lots of parse-action conflicts; resolve them with tie-breaker rules:
  • Lowest cost (need to augment the grammar productions with costs)
  • Favor large reductions over smaller ones ("maximal munch"):
    • reduce-reduce conflict - choose the longer reduction
    • shift-reduce conflict - choose the shift
  • => the largest number of prefix ops is translated into each machine instruction
(A small sketch of the maximal-munch idea follows below.)
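The LR-parsing machinery itself is more than a sketch can show, but the maximal-munch preference - grab the biggest tile that matches before falling back to smaller ones - can be illustrated with a tiny greedy tree matcher. Everything in this Java sketch (the Node type, the two tile shapes it knows) is a made-up simplification of the rules table above, not the course's implementation:

    import java.util.ArrayList;
    import java.util.List;

    class MaximalMunch {
        // Tiny address tree: a leaf (Lab/Num) or a "+" over two leaves -
        // just enough to tile the <Lab @G, Num 12> example from the slides.
        static class Node {
            String kind, value;        // kind: "+", "Lab", "Num"
            Node left, right;
            Node(String k, String v) { kind = k; value = v; }
            Node(Node l, Node r) { kind = "+"; left = l; right = r; }
        }

        int next = 5;
        List<String> code = new ArrayList<>();
        String reg() { return "r" + next++; }

        // Greedy "maximal munch": try the biggest pattern first; fall back
        // to materializing each operand separately only if it fails.
        String munchAddress(Node n) {
            if (n.kind.equals("+") && n.left.kind.equals("Lab")
                                   && n.right.kind.equals("Num")) {
                // Biggest tile: fold the numeric offset into one loadAI
                String base = reg();
                code.add("loadI  " + n.left.value + " => " + base);
                String res = reg();
                code.add("loadAI " + base + ", " + n.right.value + " => " + res);
                return res;
            }
            if (n.kind.equals("+")) {
                // Smaller tiles: one loadI per operand, then loadAO
                String l = munchAddress(n.left), r = munchAddress(n.right), res = reg();
                code.add("loadAO " + l + ", " + r + " => " + res);
                return res;
            }
            String res = reg();                      // Lab or Num leaf
            code.add("loadI  " + n.value + " => " + res);
            return res;
        }

        public static void main(String[] args) {
            Node tree = new Node(new Node("Lab", "@G"), new Node("Num", "12"));
            MaximalMunch mm = new MaximalMunch();
            mm.munchAddress(tree);
            mm.code.forEach(System.out::println);    // 2 instructions, not 3
        }
    }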
Peephole Optimization

Originally devised as the last optimization pass of a compiler.
Examine a small, sliding window of target code (a few adjacent instructions) and optimize it.

Original (Cooper & Torczon ILOC):          Reminder:
    storeAI r1      => rarp, 8             Memory[rarp + 8] = r1
    loadAI  rarp, 8 => r15                 r15 = Memory[rarp + 8]

Optimized:
    storeAI r1      => rarp, 8             Memory[rarp + 8] = r1
    i2i     r1      => r15                 r15 = r1
Peeps, 2

Original:                                  Reminder:
    addI r2, 0  => r7                      r7  = r2 + 0
    mult r4, r7 => r10                     r10 = r4 * r7
Optimized:
    mult r4, r2 => r10                     r10 = r4 * r2

Original:
     jumpI L10
L10: jumpI L20
Optimized:
     jumpI L20
L10: jumpI L20
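Here is a hand-rolled sketch of a peephole matcher for the addI-of-zero pattern above. The string-based "ILOC" and the two-instruction window are illustrative assumptions only; note that it simply assumes the folded register has no later uses (see the liveness caveat in the Simplifier notes below):

    import java.util.ArrayList;
    import java.util.List;

    class Peephole {
        // One peephole pattern over a 2-instruction window:
        //   addI rX, 0 => rY   followed by an instruction that reads rY
        // becomes the second instruction with rY replaced by rX.
        static List<String> optimize(List<String> code) {
            List<String> out = new ArrayList<>(code);
            for (int i = 0; i + 1 < out.size(); i++) {
                String a = out.get(i), b = out.get(i + 1);
                // match "addI <src>, 0 => <dst>"
                if (a.startsWith("addI ") && a.contains(", 0 =>")) {
                    String src = a.substring(5, a.indexOf(',')).trim();
                    String dst = a.substring(a.indexOf("=>") + 2).trim();
                    if (b.contains(dst)) {           // dst is read by the next instruction
                        out.set(i + 1, b.replace(dst, src));
                        out.remove(i);               // the addI is now dead (assumes no later uses)
                        i--;                         // re-examine the window
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> code = List.of(
                "addI r2, 0 => r7",
                "mult r4, r7 => r10");
            optimize(code).forEach(System.out::println); // prints: mult r4, r2 => r10
        }
    }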
Modern Peephole Optimizer
Expander (IR → LIR)  →  Simplifier (LIR → LIR)  →  Matcher (LIR → target code)

• Modern ISAs are enormous
• Linear search for a match is no longer fast enough - so . . .
• . . . structure the peephole optimizer like a miniature compiler
• Use it for both peeps and instruction selection
Instruction Selection via Peeps
Storage:
    a: 4(arp)    local variable
    b: -16(arp)  call-by-reference parameter
    c: 12(@G)    variable

IR:
    t1 = 2 * c
    a  = b - t1

Expanded LIR (14 instructions):
    r10 = 2
    r11 = @G
    r12 = 12
    r13 = r11 + r12
    r14 = M[r13]
    r15 = r10 * r14
    r16 = -16
    r17 = rarp + r16
    r18 = M[r17]
    r19 = M[r18]
    r20 = r19 - r15
    r21 = 4
    r22 = rarp + r21
    M[r22] = r20

Simplified LIR (8 instructions):
    r10 = 2
    r11 = @G
    r14 = M[r11 + 12]
    r15 = r10 * r14
    r18 = M[rarp - 16]
    r19 = M[r18]
    r20 = r19 - r15
    M[rarp + 4] = r20

Final ILOC (8 instructions):
    loadI   2         => r10
    loadI   @G        => r11
    loadAI  r11, 12   => r14
    mult    r10, r14  => r15
    loadAI  rarp, -16 => r18
    load    r18       => r19
    sub     r19, r15  => r20
    storeAI r20       => rarp, 4
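To make the expander stage concrete, here is a minimal sketch that expands the first IR statement, t1 = 2 * c, into exactly the naive LIR shown above. The operand descriptors (a label base plus a byte offset) and the string-based LIR are assumptions for illustration, not the actual expander:

    import java.util.ArrayList;
    import java.util.List;

    class Expander {
        // Sketch of the "expander": every IR operand becomes the most naive
        // LIR possible - one virtual register per constant, explicit address
        // arithmetic, explicit loads. The simplifier cleans it up later.
        int next = 10;
        List<String> lir = new ArrayList<>();
        String vr() { return "r" + next++; }

        // constant operand: materialize it in its own register
        String expandConst(int value) {
            String r = vr();
            lir.add(r + " = " + value);
            return r;
        }

        // memory operand at base+offset: compute the address, then load
        String expandMem(String base, int offset) {
            String rBase = vr();  lir.add(rBase + " = " + base);
            String rOff  = vr();  lir.add(rOff  + " = " + offset);
            String rAddr = vr();  lir.add(rAddr + " = " + rBase + " + " + rOff);
            String rVal  = vr();  lir.add(rVal  + " = M[" + rAddr + "]");
            return rVal;
        }

        public static void main(String[] args) {
            // Expand "t1 = 2 * c", with c stored at offset 12 from label @G
            Expander ex = new Expander();
            String two = ex.expandConst(2);
            String c   = ex.expandMem("@G", 12);
            String t1  = ex.vr();
            ex.lir.add(t1 + " = " + two + " * " + c);
            ex.lir.forEach(System.out::println);  // first 6 LIR lines of the slide
        }
    }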
The Simplifier
The simplifier rolls a 3-instruction window over the LIR, applying constant propagation and dead-code elimination at each step. Successive window contents:

 1:  r10 = 2            r11 = @G            r12 = 12
 2:  r11 = @G           r12 = 12            r13 = r11 + r12
 3:  r11 = @G           r13 = r11 + 12      r14 = M[r13]
 4:  r11 = @G           r14 = M[r11 + 12]   r15 = r10 * r14
 5:  r14 = M[r11 + 12]  r15 = r10 * r14     r16 = -16
 6:  r15 = r10 * r14    r16 = -16           r17 = rarp + r16
 7:  r15 = r10 * r14    r17 = rarp - 16     r18 = M[r17]
 8:  r15 = r10 * r14    r18 = M[rarp - 16]  r19 = M[r18]
 9:  r18 = M[rarp - 16] r19 = M[r18]        r20 = r19 - r15
10:  r19 = M[r18]       r20 = r19 - r15     r21 = 4
11:  r20 = r19 - r15    r21 = 4             r22 = rarp + r21
12:  r20 = r19 - r15    r22 = rarp + 4      M[r22] = r20
13:  r20 = r19 - r15    M[rarp + 4] = r20
This example is simplified - it never re-uses a virtual register, so we can delete a definition as soon as its constant has been propagated. In general, we would have to compute liveness info for each virtual register to perform safe DCE.
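Under that same single-definition, single-use assumption, the constant-propagation + DCE part of the simplifier can be sketched as below. It works on the string LIR used above, omits the folding of address arithmetic into load/store addressing, and its register matching is naive - all assumptions for illustration only:

    import java.util.ArrayList;
    import java.util.List;

    class Simplifier {
        // Forward-propagate constant definitions of virtual registers and
        // delete the now-dead definition. Safe here only because every
        // virtual register is defined once and used once (no liveness needed).
        static List<String> simplify(List<String> code) {
            List<String> out = new ArrayList<>(code);
            for (int i = 0; i < out.size(); i++) {
                String[] parts = out.get(i).split(" = ", 2);
                if (parts.length == 2 && parts[1].matches("-?\\d+")) {
                    String reg = parts[0].trim(), constant = parts[1].trim();
                    // substitute into the single later use, then drop the def
                    for (int j = i + 1; j < out.size(); j++) {
                        if (out.get(j).contains(reg)) {
                            out.set(j, out.get(j).replace(reg, constant));
                            out.remove(i);
                            i--;                       // the window "rolls" back one slot
                            break;
                        }
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> lir = List.of(
                "r10 = 2",
                "r11 = @G",
                "r12 = 12",
                "r13 = r11 + r12",
                "r16 = -16",
                "r17 = rarp + r16");
            simplify(lir).forEach(System.out::println);
            // r12 and r16 disappear: r13 = r11 + 12, r17 = rarp + -16
        }
    }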
Next
• Instruction Scheduling
• Register Allocation
• And more . . .