TRANSCRIPT
Spring 2014 Jim Hogg - UW - CSE - P501 N-1
CSE P501 – Compiler Construction
Compiler Backend Organization
Instruction Selection, Instruction Scheduling, Register Allocation
Instruction Selection: Peephole Optimization, Peephole Instruction Selection
A Compiler: Front End, 'Middle End', Back End

chars → Scan → tokens → Parse → AST → Convert → IR → Optimize → IR → Select Instructions → IR → Allocate Registers → IR → Emit → Machine Code

AST = Abstract Syntax Tree
IR  = Intermediate Representation
Instruction Selection: processors don't support IR; need to convert IR into real machine code
The Big Picture
Compiler = lots of fast stuff, followed by hard problems
Scanner: O(n)
Parser: O(n)
Analysis & Optimization: ~ O(n log n)
Instruction selection: fast or NP-Complete
Instruction scheduling: NP-Complete
Register allocation: NP-Complete
• Recall: P501 covers the backend at a 'survey' level
• A deeper dive would require another full 10-week course
Compiler Backend: 3 Parts
Select Instructions - eg: what is the best x86 instruction to implement vr17 = 12(arp)?
Schedule Instructions - eg: given instructions a, b, c, d, e, f, g, would the program run faster if they were issued in a different order, such as d, c, a, b, g, e, f?
Allocate Registers - eg: which variables to store in registers? which to store in memory?
Instruction Selection is . . .
Select Real Instructions
Mid-Level IR:
• TAC - Three-Address Code: t1 = a op b
• Tree or linear form
• Unlimited supply of temps
• Array address calculations explicit
• Optimizations all done
• Storage locations decided

Low-Level IR:
• Specific to one chip/ISA
• Uses the chip's addressing modes
• But registers still not decided
Instruction Scheduling is . . .
Schedule
• Execute in order to get the correct answer: a b c d e f g h
• Issue in a new order, eg: b a d f c g h e
  • eg: memory fetch is slow
  • eg: divide is slow
• Overall faster - and still get the correct answer!

Originally devised for super-computers. Now used everywhere:
• in-order processors - Atom, older ARM
• out-of-order processors - newer x86
Compiler does the 'heavy lifting' - reduces chip power
Register Allocation is . . .
Allocate
• Virtual registers or temps - unlimited supply
• Real machine registers - eg: EAX, EBP - a very finite supply!
• Enregister or spill, eg:
      R1 = ...
      ... = R1
      push R1
      R1 = ...
      ... = R1
      pop R1
After allocating registers, we may do a second pass of scheduling to improve the speed of spill code
Instruction Selection
Instruction selection chooses which target instructions (eg: for x86, for ARM) to use for each IR instruction.
Selecting the best instructions is massively difficult because modern ISAs provide:
• a huge choice of instructions
• a wide choice of addressing modes
Eg: Intel's x64 Instruction Set Reference Manual = 1422 pages
Choices, choices ...
Most chip ISAs provide many ways to do the same thing:
eg: setting eax to 0 on x86 has several alternatives:
    mov  eax, 0
    xor  eax, eax
    sub  eax, eax
    imul eax, 0
Many machine instructions do several things at once – eg: register arithmetic and effective address calculation. Recall:
lea rdst, [rbase + rindex*scale + offset]
Overview: Instruction Selection
• Map IR into near-assembly code
  • For MiniJava, we emitted textual assembly code
  • Commercial compilers emit binary code directly
• Assume known storage layout and code shape - ie: the optimization phases have already done their thing
• Combine low-level IR operations into machine instructions (take advantage of addressing modes, etc)
Criteria for Instruction Selection
Several possibilities:
• Fastest
• Smallest
• Minimal power (eg: don't use a functional unit if leaving it powered-down is a win)

Sometimes not obvious:
• eg: if one of the functional units in the processor is idle and we can select an instruction that uses that unit, it effectively executes for free, even if that instruction wouldn't be chosen normally
(Some interaction with scheduling here…)
Instruction Selection: Approaches
Two main techniques:
1. Tree-based matching (eg: the maximal munch algorithm)
2. Peephole-based generation
• Note that we select instructions from the target ISA
• We do not decide which physical registers to use - that comes later, during Register Allocation
• We have a few generic registers: ARP, StackPointer, ResultReg
Tree-Based Instruction Selection
Example tree: the expression e * f, with leaves ID<e,arp,4> and ID<f,arp,8>

How do we generate target code for this simple tree?

We could use a template approach - similar to converting an AST into IR:

case ID:
    t1  = offset(node)
    t2  = base(node)
    reg = nextReg()
    emit(loadAO, t1, t2, reg)

Resulting codegen is correct, but dumb. It doesn't even cover: call-by-value vs call-by-reference, enregistered variables, different data types, etc.
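As a concrete illustration of the template idea, here is a minimal Java sketch (not the course's actual code generator); the Node shape, nextReg(), and the ILOC-style strings it emits are assumptions made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    class TemplateCodeGen {
        // Tiny IR tree: an identifier leaf <name, base, offset> or a binary op.
        static class Node {
            String op;               // "ID" or "MUL"
            String base; int offset; // for ID leaves, e.g. <e, arp, 4>
            Node left, right;        // for operators
        }

        int nextReg = 5;
        List<String> code = new ArrayList<>();
        String nextReg() { return "r" + (nextReg++); }
        void emit(String s) { code.add(s); }

        // Template approach: switch on node kind, emit the same shape every time.
        String gen(Node n) {
            switch (n.op) {
                case "ID": {                       // load a variable from base + offset
                    String t1 = nextReg();
                    emit("loadI  " + n.offset + " => " + t1);
                    String res = nextReg();
                    emit("loadAO " + n.base + ", " + t1 + " => " + res);
                    return res;
                }
                case "MUL": {
                    String l = gen(n.left), r = gen(n.right), res = nextReg();
                    emit("mult   " + l + ", " + r + " => " + res);
                    return res;
                }
                default: throw new IllegalStateException("unhandled node " + n.op);
            }
        }

        public static void main(String[] args) {
            // e * f, with e at 4(arp) and f at 8(arp) - the tree from the slide
            Node e = new Node(); e.op = "ID"; e.base = "rarp"; e.offset = 4;
            Node f = new Node(); f.op = "ID"; f.base = "rarp"; f.offset = 8;
            Node mul = new Node(); mul.op = "MUL"; mul.left = e; mul.right = f;
            TemplateCodeGen cg = new TemplateCodeGen();
            cg.gen(mul);
            cg.code.forEach(System.out::println); // the naive 5-instruction sequence below
        }
    }

Running this produces exactly the "naive" sequence shown on the next slide, which is the point: the template never notices that a loadAI could fold the offset into the load.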
Template Code Generation

case ID:
    t1  = offset(node)
    t2  = base(node)
    reg = nextReg()
    emit(loadAO, t1, t2, reg)

Naive (template) code for the e * f tree:
    loadI  4        => r5
    loadAO rarp, r5 => r6
    loadI  8        => r7
    loadAO rarp, r7 => r8
    mult   r6, r8   => r9

Ideal code:
    loadAI rarp, 4  => r5
    loadAI rarp, 8  => r6
    mult   r5, r6   => r7
IR (3-address code) tree-lets:

Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...

(◆ marks a memory dereference)
Rules for Tree-to-Target Conversion (prefix notation)
Memory Dereference: eg rule 11 matches ◆ (+ Reg1 Num2) - a load from the address Reg1 + Num2
Example: Tiling a tiny 4-node tree
Load variable <c, @G, 12>
The 4-node tree: ◆ (+ (Lab @G, Num 12))
• The tree above shows code to access the variable c, stored at offset 12 bytes from label @G
• How many ways can we tile this tree into equivalent ILOC code?
Potential Matches
There are eight potential tilings, each written as the list of rules it uses:
<6,11>   <8,14>   <6,8,10>   <8,6,10>   <6,16,9>   <8,19,9>   <6,8,15,9>   <8,6,15,9>

ILOC for the first four tilings:

<6,11>:
    loadI  @G     => ri
    loadAI ri, 12 => rj

<8,14>:
    loadI  12     => ri
    loadAI ri, @G => rj

<6,8,10>:
    loadI  @G     => ri
    loadI  12     => rj
    loadAO ri, rj => rk

<8,6,10>:
    loadI  12     => ri
    loadI  @G     => rj
    loadAO ri, rj => rk
Tree-Pattern Matching
A given tiling "implements" a tree if it:
• covers every node in the tree, and
• overlaps between any two tiles (trees) are limited to a single node
If <node,op> is in the tiling, then node is also covered by a leaf in another operation tree in the tiling – unless it is the root
Where two operation trees meet, they must be compatible (ie: expect the same value in the same location)
IR AST for Simple Expression
a = b - 2 × c
    a: local var, offset 4 from arp
    b: call-by-reference parameter, offset -16 from arp
    c: var, offset 12 from label @G

IR tree (nested prefix form): = ( + (Val arp, Num 4), - ( + (Val arp, Num -16), × (Num 2, + (Lab @G, Num 12)) ) )

Prefix form - same info as the IR tree:
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
IR AST as Prefix Text String
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
No parentheses? We don't need them: evaluate this expression from right to left, using a simple stack machine (a sketch follows below).
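A minimal sketch of that right-to-left stack evaluation, assuming every operator here is binary; the slide's × is written as x in the code, and the output is just a fully parenthesized rendering of the recovered tree:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class PrefixToTree {
        // Read the prefix string right-to-left with a stack:
        // operands push themselves; an operator pops its two children.
        public static void main(String[] args) {
            String[] tokens =
                "= + Val1 Num1 - + Val2 Num2 x Num3 + Lab1 Num4".split(" ");
            Deque<String> stack = new ArrayDeque<>();
            for (int i = tokens.length - 1; i >= 0; i--) {
                String t = tokens[i];
                boolean isOperator =
                    t.equals("=") || t.equals("+") || t.equals("-") || t.equals("x");
                if (isOperator) {
                    // every operator here is binary, so pop exactly two subtrees
                    String left = stack.pop();
                    String right = stack.pop();
                    stack.push("(" + t + " " + left + " " + right + ")");
                } else {
                    stack.push(t);             // leaf: a Val/Num/Lab operand
                }
            }
            System.out.println(stack.pop());   // the whole tree, parenthesized
        }
    }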
Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 7  Reg → Val1
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...
Rewrite Rules - Another BNF, but including ambiguity!
Select Instructions with LR Parser
= + Val1 Num1 - + Val2 Num2 × Num3 + Lab1 Num4
Production                      ILOC Template
 5  ...
 6  Reg → Lab1                  loadI  l1     => rn
 7  Reg → Val1
 8  Reg → Num1                  loadI  n1     => rn
 9  Reg → ◆ Reg1                load   r1     => rn
10  Reg → ◆ (+ Reg1 Reg2)       loadAO r1, r2 => rn
11  Reg → ◆ (+ Reg1 Num2)       loadAI r1, n2 => rn
14  Reg → ◆ (+ Lab1 Reg2)       loadAI r2, l1 => rn
15  Reg → + Reg1 Reg2           add    r1, r2 => rn
16  Reg → + Reg1 Num2           addI   r1, n2 => rn
19  Reg → + Lab1 Reg2           addI   r2, l1 => rn
20  ...
• Use an LR parser for this grammar to parse the prefix expression
• The grammar is ambiguous, so there are lots of parse-action conflicts; resolve them with tie-breaker rules:
  • Lowest cost (need to augment the grammar productions with costs)
  • Favor large reductions over smaller ones ("maximal munch"):
    • reduce-reduce conflict - choose the longer reduction
    • shift-reduce conflict - choose the shift
  • => the largest number of prefix ops is translated into each machine instruction
(A small sketch of the maximal-munch idea follows below.)
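The LR-parsing machinery itself is more than a sketch can show, but the maximal-munch preference - grab the biggest tile that matches before falling back to smaller ones - can be illustrated with a tiny greedy tree matcher. Everything in this Java sketch (the Node type, the two tile shapes it knows) is a made-up simplification of the rules table above, not the course's implementation:

    import java.util.ArrayList;
    import java.util.List;

    class MaximalMunch {
        // Tiny address tree: a leaf (Lab/Num) or a "+" over two leaves -
        // just enough to tile the <Lab @G, Num 12> example from the slides.
        static class Node {
            String kind, value;        // kind: "+", "Lab", "Num"
            Node left, right;
            Node(String k, String v) { kind = k; value = v; }
            Node(Node l, Node r) { kind = "+"; left = l; right = r; }
        }

        int next = 5;
        List<String> code = new ArrayList<>();
        String reg() { return "r" + next++; }

        // Greedy "maximal munch": try the biggest pattern first; fall back
        // to materializing each operand separately only if it fails.
        String munchAddress(Node n) {
            if (n.kind.equals("+") && n.left.kind.equals("Lab")
                                   && n.right.kind.equals("Num")) {
                // Biggest tile: fold the numeric offset into one loadAI
                String base = reg();
                code.add("loadI  " + n.left.value + " => " + base);
                String res = reg();
                code.add("loadAI " + base + ", " + n.right.value + " => " + res);
                return res;
            }
            if (n.kind.equals("+")) {
                // Smaller tiles: one loadI per operand, then loadAO
                String l = munchAddress(n.left), r = munchAddress(n.right), res = reg();
                code.add("loadAO " + l + ", " + r + " => " + res);
                return res;
            }
            String res = reg();                      // Lab or Num leaf
            code.add("loadI  " + n.value + " => " + res);
            return res;
        }

        public static void main(String[] args) {
            Node tree = new Node(new Node("Lab", "@G"), new Node("Num", "12"));
            MaximalMunch mm = new MaximalMunch();
            mm.munchAddress(tree);
            mm.code.forEach(System.out::println);    // 2 instructions, not 3
        }
    }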
Peephole Optimization

Originally devised as the last optimization pass of a compiler.
Examine a small, sliding window of target code (a few adjacent instructions) and optimize it.

Original (Cooper & Torczon ILOC):          Reminder:
    storeAI r1      => rarp, 8             Memory[rarp + 8] = r1
    loadAI  rarp, 8 => r15                 r15 = Memory[rarp + 8]

Optimized:
    storeAI r1      => rarp, 8             Memory[rarp + 8] = r1
    i2i     r1      => r15                 r15 = r1
Peeps, 2

Original:                                  Reminder:
    addI r2, 0  => r7                      r7  = r2 + 0
    mult r4, r7 => r10                     r10 = r4 * r7
Optimized:
    mult r4, r2 => r10                     r10 = r4 * r2

Original:
     jumpI L10
L10: jumpI L20
Optimized:
     jumpI L20
L10: jumpI L20
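Here is a hand-rolled sketch of a peephole matcher for the addI-of-zero pattern above. The string-based "ILOC" and the two-instruction window are illustrative assumptions only; note that it simply assumes the folded register has no later uses (see the liveness caveat in the Simplifier notes below):

    import java.util.ArrayList;
    import java.util.List;

    class Peephole {
        // One peephole pattern over a 2-instruction window:
        //   addI rX, 0 => rY   followed by an instruction that reads rY
        // becomes the second instruction with rY replaced by rX.
        static List<String> optimize(List<String> code) {
            List<String> out = new ArrayList<>(code);
            for (int i = 0; i + 1 < out.size(); i++) {
                String a = out.get(i), b = out.get(i + 1);
                // match "addI <src>, 0 => <dst>"
                if (a.startsWith("addI ") && a.contains(", 0 =>")) {
                    String src = a.substring(5, a.indexOf(',')).trim();
                    String dst = a.substring(a.indexOf("=>") + 2).trim();
                    if (b.contains(dst)) {           // dst is read by the next instruction
                        out.set(i + 1, b.replace(dst, src));
                        out.remove(i);               // the addI is now dead (assumes no later uses)
                        i--;                         // re-examine the window
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> code = List.of(
                "addI r2, 0 => r7",
                "mult r4, r7 => r10");
            optimize(code).forEach(System.out::println); // prints: mult r4, r2 => r10
        }
    }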
Modern Peephole Optimizer
Expander (IR → LIR)  →  Simplifier (LIR → LIR)  →  Matcher (LIR → target code)

• Modern ISAs are enormous
• Linear search for a match is no longer fast enough - so . . .
• . . . structure the peephole optimizer like a miniature compiler
• Use it for both peeps and instruction selection
Instruction Selection via Peeps
Storage:
    a: 4(arp)    local variable
    b: -16(arp)  call-by-reference parameter
    c: 12(@G)    variable

IR:
    t1 = 2 * c
    a  = b - t1

Expanded LIR (14 instructions):
    r10 = 2
    r11 = @G
    r12 = 12
    r13 = r11 + r12
    r14 = M[r13]
    r15 = r10 * r14
    r16 = -16
    r17 = rarp + r16
    r18 = M[r17]
    r19 = M[r18]
    r20 = r19 - r15
    r21 = 4
    r22 = rarp + r21
    M[r22] = r20

Simplified LIR (8 instructions):
    r10 = 2
    r11 = @G
    r14 = M[r11 + 12]
    r15 = r10 * r14
    r18 = M[rarp - 16]
    r19 = M[r18]
    r20 = r19 - r15
    M[rarp + 4] = r20

Final ILOC (8 instructions):
    loadI   2         => r10
    loadI   @G        => r11
    loadAI  r11, 12   => r14
    mult    r10, r14  => r15
    loadAI  rarp, -16 => r18
    load    r18       => r19
    sub     r19, r15  => r20
    storeAI r20       => rarp, 4
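To make the expander stage concrete, here is a minimal sketch that expands the first IR statement, t1 = 2 * c, into exactly the naive LIR shown above. The operand descriptors (a label base plus a byte offset) and the string-based LIR are assumptions for illustration, not the actual expander:

    import java.util.ArrayList;
    import java.util.List;

    class Expander {
        // Sketch of the "expander": every IR operand becomes the most naive
        // LIR possible - one virtual register per constant, explicit address
        // arithmetic, explicit loads. The simplifier cleans it up later.
        int next = 10;
        List<String> lir = new ArrayList<>();
        String vr() { return "r" + next++; }

        // constant operand: materialize it in its own register
        String expandConst(int value) {
            String r = vr();
            lir.add(r + " = " + value);
            return r;
        }

        // memory operand at base+offset: compute the address, then load
        String expandMem(String base, int offset) {
            String rBase = vr();  lir.add(rBase + " = " + base);
            String rOff  = vr();  lir.add(rOff  + " = " + offset);
            String rAddr = vr();  lir.add(rAddr + " = " + rBase + " + " + rOff);
            String rVal  = vr();  lir.add(rVal  + " = M[" + rAddr + "]");
            return rVal;
        }

        public static void main(String[] args) {
            // Expand "t1 = 2 * c", with c stored at offset 12 from label @G
            Expander ex = new Expander();
            String two = ex.expandConst(2);
            String c   = ex.expandMem("@G", 12);
            String t1  = ex.vr();
            ex.lir.add(t1 + " = " + two + " * " + c);
            ex.lir.forEach(System.out::println);  // first 6 LIR lines of the slide
        }
    }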
The Simplifier
The simplifier rolls a 3-instruction window over the LIR, applying constant propagation and dead-code elimination at each step. Successive window contents:

 1:  r10 = 2            r11 = @G            r12 = 12
 2:  r11 = @G           r12 = 12            r13 = r11 + r12
 3:  r11 = @G           r13 = r11 + 12      r14 = M[r13]
 4:  r11 = @G           r14 = M[r11 + 12]   r15 = r10 * r14
 5:  r14 = M[r11 + 12]  r15 = r10 * r14     r16 = -16
 6:  r15 = r10 * r14    r16 = -16           r17 = rarp + r16
 7:  r15 = r10 * r14    r17 = rarp - 16     r18 = M[r17]
 8:  r15 = r10 * r14    r18 = M[rarp - 16]  r19 = M[r18]
 9:  r18 = M[rarp - 16] r19 = M[r18]        r20 = r19 - r15
10:  r19 = M[r18]       r20 = r19 - r15     r21 = 4
11:  r20 = r19 - r15    r21 = 4             r22 = rarp + r21
12:  r20 = r19 - r15    r22 = rarp + 4      M[r22] = r20
13:  r20 = r19 - r15    M[rarp + 4] = r20
This example is simplified - it never re-uses a virtual register, so we can delete a definition as soon as its constant has been propagated. In general, we would have to compute liveness info for each virtual register to perform safe DCE.
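Under that same single-definition, single-use assumption, the constant-propagation + DCE part of the simplifier can be sketched as below. It works on the string LIR used above, omits the folding of address arithmetic into load/store addressing, and its register matching is naive - all assumptions for illustration only:

    import java.util.ArrayList;
    import java.util.List;

    class Simplifier {
        // Forward-propagate constant definitions of virtual registers and
        // delete the now-dead definition. Safe here only because every
        // virtual register is defined once and used once (no liveness needed).
        static List<String> simplify(List<String> code) {
            List<String> out = new ArrayList<>(code);
            for (int i = 0; i < out.size(); i++) {
                String[] parts = out.get(i).split(" = ", 2);
                if (parts.length == 2 && parts[1].matches("-?\\d+")) {
                    String reg = parts[0].trim(), constant = parts[1].trim();
                    // substitute into the single later use, then drop the def
                    for (int j = i + 1; j < out.size(); j++) {
                        if (out.get(j).contains(reg)) {
                            out.set(j, out.get(j).replace(reg, constant));
                            out.remove(i);
                            i--;                       // the window "rolls" back one slot
                            break;
                        }
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> lir = List.of(
                "r10 = 2",
                "r11 = @G",
                "r12 = 12",
                "r13 = r11 + r12",
                "r16 = -16",
                "r17 = rarp + r16");
            simplify(lir).forEach(System.out::println);
            // r12 and r16 disappear: r13 = r11 + 12, r17 = rarp + -16
        }
    }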
Next
• Instruction Scheduling
• Register Allocation
• And more . . .