Static Code Scheduling

CS 671, April 1, 2008

2 CS 671 – Spring 2008

Code Scheduling

Scheduling or reordering instructions to improve performance and/or guarantee correctness
• Important for dynamically-scheduled architectures
• Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC

Takes into account anticipated latencies
• Machine-specific, performed later in the optimization pass

How does this contrast with our earlier exploration of code motion?

Why Must the Compiler Schedule?

Many machines are pipelined and expose some aspects of pipelining to the user (compiler)

Examples:
• Branch delay slots!
• Memory-access delays
• Multi-cycle operations

Some machines don’t have scheduling hardware

Example

Assume loads take 2 cycles and branches have a delay slot.

instruction        start time

r2 ← [r1]
r3 ← [r1+4]
r4 ← r2 + r3
r5 ← r2 + 1
goto L1
nop

____ cycles

Example

Assume loads take 2 cycles and branches have a delay slot.

instruction        start time

r2 ← [r1]
r3 ← [r1+4]
r5 ← r2 + 1
goto L1
r4 ← r2 + r3

____ cycles
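
The two orderings above can be compared with a small timing model. The Python sketch below is only an illustration under assumed rules: the machine issues one instruction per cycle in order, a loaded value becomes usable 2 cycles after the load starts, and every other instruction takes 1 cycle. The names `start_times`, `LOAD_LATENCY`, and `ALU_LATENCY` are hypothetical, not from the slides.

```python
# Assumed in-order, single-issue timing model (illustrative only).
LOAD_LATENCY = 2  # cycles until a loaded value can be used (from the slide)
ALU_LATENCY = 1   # assumed latency for all other instructions


def start_times(block):
    """block: list of (text, dest, sources, latency). Returns [(text, start)]."""
    ready = {}   # register -> first cycle in which its value can be read
    starts = []
    cycle = 0
    for text, dest, sources, latency in block:
        cycle += 1  # at most one instruction starts per cycle
        # Stall until every source operand is available.
        cycle = max([cycle] + [ready.get(r, 1) for r in sources])
        starts.append((text, cycle))
        if dest is not None:
            ready[dest] = cycle + latency
    return starts


original = [
    ("r2 <- [r1]",    "r2", ["r1"],       LOAD_LATENCY),
    ("r3 <- [r1+4]",  "r3", ["r1"],       LOAD_LATENCY),
    ("r4 <- r2 + r3", "r4", ["r2", "r3"], ALU_LATENCY),
    ("r5 <- r2 + 1",  "r5", ["r2"],       ALU_LATENCY),
    ("goto L1",       None, [],           ALU_LATENCY),
    ("nop",           None, [],           ALU_LATENCY),
]
# Rescheduled: r5 fills the load delay, r4 fills the branch delay slot.
scheduled = [original[0], original[1], original[3], original[4], original[2]]

for name, block in (("original", original), ("scheduled", scheduled)):
    times = start_times(block)
    print(name, [t for _, t in times], "last start:", times[-1][1])
```

Under these assumptions the last instruction of the first ordering starts in cycle 7 and the last instruction of the second ordering starts in cycle 5. A different latency convention would shift the numbers.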

Code Scheduling Strategy

Get resources operating in parallel
• Integer data path
• Integer multiply / divide hardware
• FP adder, multiplier, divider

Method
• Fill the cycles between starting an op ("Start Op") and using its result ("Use Op") with computations that do not require the result or the same hardware resources

Drawbacks
• Highly hardware dependent

Scheduling Approaches

Local
• Branch scheduling
• Basic-block scheduling

Global
• Cross-block scheduling
• Software pipelining
• Trace scheduling
• Percolation scheduling

Branch Scheduling

Two problems:
• Branches often take some number of cycles to complete
• There can be a delay between a compare and its associated branch

A compiler will try to fill these slots with valid instructions (rather than nop)

Delay slots – present in PA-RISC, SPARC, MIPS

Condition delay – PowerPC, Pentium

Recall from Architecture…

IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back

Three instructions overlapped in the pipeline, each starting one cycle after the previous:

IF ID EX MA WB
   IF ID EX MA WB
      IF ID EX MA WB

Control Hazards

Pipeline stages per instruction (--- marks a stage in which the instruction does no work):

Taken Branch:       IF ID EX MA WB
Instr + 1:          IF --- --- --- ---
Branch Target:      IF ID EX MA WB
Branch Target + 1:  IF ID EX MA WB

Data Dependences

If two operations access the same register and at least one of them writes it, they are dependent

Types of data dependences

Flow:
r1 = r2 + r3
r4 = r1 * 6

Output:
r1 = r2 + r3
r1 = r4 * 6

Anti:
r1 = r2 + r3
r2 = r5 * 6
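
The three cases above can be told apart mechanically by comparing destination and source registers. The following Python sketch is a minimal illustration; the function name `classify` and the (dest, sources) encoding are assumptions made for this example.

```python
def classify(first, second):
    """Return the kinds of register dependence from `first` to `second`.

    Each instruction is a pair (dest, sources): the register it writes and
    the set of registers it reads. `first` precedes `second` in program order.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("flow")    # second reads what first wrote
    if dest2 in srcs1:
        kinds.add("anti")    # second overwrites what first read
    if dest1 == dest2:
        kinds.add("output")  # both write the same register
    return kinds


add = ("r1", {"r2", "r3"})                # r1 = r2 + r3
print(classify(add, ("r4", {"r1"})))      # r4 = r1 * 6
print(classify(add, ("r1", {"r4"})))      # r1 = r4 * 6
print(classify(add, ("r2", {"r5"})))      # r2 = r5 * 6
```

An empty result means the pair can be reordered as far as registers are concerned; memory dependences would need a separate check.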

Data Hazards

lw R1,0(R2)      IF ID EX MA WB
add R3,R1,R4        IF ID stall EX MA WB

Memory latency: data not ready

Data Hazards

addf R3,R1,R2    IF ID EX EX MA WB
addf R3,R3,R4       IF ID stall EX EX MA WB

Assumes floating point ops take 2 execute cycles

Instruction latency: execute takes > 1 cycle

Multi-cycle Instructions

• Scheduling is particularly important for multi-cycle operations
• Alpha instructions with > 1 cycle latency (partial list)

Instruction                                Latency (cycles)
mull (32-bit integer multiply)             8
mulq (64-bit integer multiply)             16
addt (fp add)                              4
mult (fp multiply)                         4
divs (fp single-precision divide)          10
divt (fp double-precision divide)          23

Avoiding data hazards

• Move loads earlier and stores later (assuming this does not violate correctness)
• Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
• How can we do this in a systematic way?
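
To see why the re-association helps, compare the length of the longest chain of dependent additions in each form. The Python sketch below is an illustration only; the tuple encoding of expressions and the name `depth` are assumptions for this example.

```python
def depth(expr):
    """Longest chain of dependent additions in an expression tree.

    An expression is either a variable name (str) or a pair (left, right)
    meaning left + right.
    """
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))


serial = ((("a", "b"), "c"), "d")      # ((a+b)+c)+d
balanced = (("a", "b"), ("c", "d"))    # (a+b)+(c+d)

print(depth(serial))    # 3: each add waits for the previous one
print(depth(balanced))  # 2: a+b and c+d are independent
```

Both forms perform three additions, but the balanced form lets two of them overlap. For floating-point values the re-association can change the rounded result, so a compiler may only apply it when that is permitted.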

Example: Without Scheduling

Code                 Start Time

lw r1,w
add r1,r1,r1
lw r2,x
mult r1,r1,r2
lw r2,y
mult r1,r1,r2
lw r2,z
mult r1,r1,r2
sw r1,a

Assume:
• memory instrs take 3 cycles
• mult takes 2 cycles (to have result in register)
• rest take 1 cycle

____ cycles

Basic Block Dependence DAGs

Nodes - instructions

Edges - dependence between I1 and I2
• When we cannot determine whether there is a dependence, we must assume there is one

a) lw R2, (R1)
b) lw R3, 4(R1)
c) R4 ← R2 + R3
d) R5 ← R2 - 1

DAG edges (labeled with latency): a → c (2), b → c (2), a → d (2)
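
The DAG for this four-instruction block can be built by comparing every earlier instruction with every later one. The Python sketch below is a minimal illustration that looks at register dependences only; the names `build_dag` and `LATENCY`, the tuple encoding, and the latency of 1 given to anti and output edges are assumptions, not part of the slides.

```python
# Assumed latencies: loads 2 cycles, ALU ops 1 cycle.
LATENCY = {"lw": 2, "add": 1, "sub": 1}

# (label, opcode, destination register, source registers)
block = [
    ("a", "lw",  "R2", ["R1"]),
    ("b", "lw",  "R3", ["R1"]),
    ("c", "add", "R4", ["R2", "R3"]),
    ("d", "sub", "R5", ["R2"]),
]


def build_dag(block):
    """Return {(earlier, later): latency} for register dependences."""
    edges = {}
    for i, (lab1, op1, dest1, srcs1) in enumerate(block):
        for lab2, _op2, dest2, srcs2 in block[i + 1:]:
            if dest1 in srcs2:
                # Flow dependence: wait for the producer's result.
                edges[(lab1, lab2)] = LATENCY[op1]
            elif dest2 in srcs1 or dest1 == dest2:
                # Anti or output dependence: only the order matters.
                edges[(lab1, lab2)] = 1  # assumed ordering latency
    return edges


print(build_dag(block))
```

For the block above this yields the three edges a → c, a → d, and b → c, each with latency 2. Comparing all pairs is conservative and may add redundant edges in longer blocks; a store or a load whose address cannot be analyzed would additionally need an edge to every other memory operation.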

Example – Build the DAG

Code

a lw r1,w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1,a

Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle

Creating a schedule

• Create a DAG of dependences
• Determine priority
• Schedule instructions with
  – Ready operands
  – Highest priority
• Heuristics: If multiple possibilities, fall back on other priority functions

Operation Priority

Priority – Need a mechanism to decide which ops to schedule first (when you have choices)

Common priority functions
• Height – Distance from exit node
  – Give priority to amount of work left to do
• Slackness – inversely proportional to slack
  – Give priority to ops on the critical path
• Register use – priority to nodes with more source operands and fewer destination operands
  – Reduces number of live registers
• Uncover – high priority to nodes with many children
  – Frees up more nodes
• Original order – when all else fails

Computing Priorities

Height(n) =
• exec(n), if n is a leaf
• max(height(m)) + exec(n), taken over all m where m is a successor of n, otherwise

Critical path(s) = path(s) through the dependence DAG with the longest latency
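
The height definition above translates directly into a recursive function. The Python sketch below applies it to a small hypothetical DAG (two loads `a` and `b` feeding ALU ops `c` and `d`, with assumed execution times of 2 and 1 cycles); the names `EXEC`, `SUCC`, `height`, and `critical_path` are illustrative.

```python
from functools import lru_cache

# Hypothetical DAG: a and b are loads (2 cycles), c and d are ALU ops (1 cycle).
EXEC = {"a": 2, "b": 2, "c": 1, "d": 1}
SUCC = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}


@lru_cache(maxsize=None)
def height(n):
    """exec(n) for a leaf, else exec(n) plus the tallest successor."""
    if not SUCC[n]:
        return EXEC[n]
    return max(height(m) for m in SUCC[n]) + EXEC[n]


def critical_path(start):
    """Follow the tallest successor from `start` down to a leaf."""
    path = [start]
    while SUCC[path[-1]]:
        path.append(max(SUCC[path[-1]], key=height))
    return path


for n in EXEC:
    print(n, height(n))
print(critical_path("a"))
```

Here `a` and `b` both have height 3 and the leaves have height 1, so either load may be scheduled first. When several successors tie, `critical_path` reports only one of the equally long paths.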

Example – Determine Height and CP

Code

a lw r1,w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1,a

Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle

Critical path: _______

Example – List Scheduling

Code

a lw r1,w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1,a

Schedule             start

_____ cycles

Scheduling vs. Register Allocation

Code

a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r1 ← r1+r2
d stw (r12) ← r1
e lw r1 ← (r12+8)
f lw r2 ← (r12+12)
g r2 ← r1+r2

Register Renaming

Code

a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r3 ← r1+r2
d stw (r12) ← r3
e lw r4 ← (r12+8)
f lw r5 ← (r12+12)
g r6 ← r4+r5
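
One way to obtain a renaming like the one above is to give every definition a fresh register and rewrite later uses accordingly. The Python sketch below is a minimal illustration for a single basic block; the function `rename`, the (dest, sources) encoding, and the assumption that the fresh names never collide with live-in registers (here only r12) are all hypothetical.

```python
def rename(block, first_free):
    """Give each definition a fresh register r<first_free>, r<first_free+1>, ...

    block: list of (dest, sources); dest is None for instructions that
    write no register (such as a store). Assumes the fresh names do not
    collide with registers that are live on entry to the block.
    """
    current = {}        # original register -> most recent fresh name
    renamed = []
    next_reg = first_free
    for dest, sources in block:
        # Uses refer to the most recent definition; live-ins stay unchanged.
        new_sources = [current.get(s, s) for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = f"r{next_reg}"
            next_reg += 1
            current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


block = [
    ("r1", ["r12"]),        # a: lw  r1 <- (r12)
    ("r2", ["r12"]),        # b: lw  r2 <- (r12+4)
    ("r1", ["r1", "r2"]),   # c: r1 <- r1+r2
    (None, ["r12", "r1"]),  # d: stw (r12) <- r1
    ("r1", ["r12"]),        # e: lw  r1 <- (r12+8)
    ("r2", ["r12"]),        # f: lw  r2 <- (r12+12)
    ("r2", ["r1", "r2"]),   # g: r2 <- r1+r2
]

for dest, sources in rename(block, 1):
    print(dest, sources)
```

After renaming, loads e and f no longer overwrite registers that c and d still need, so they can be scheduled before them. The cost is higher register pressure: six registers instead of two.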

VLIW

• Very Long Instruction Word
• Compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple HW)
• Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code

Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
c = a + b    d = a - b    e = a * b    ld j = [x]    nop
g = c + d    h = c - d    nop          ld k = [y]    nop
nop          nop          i = j * c    ld f = [z]    br g

VLIW processor: 5 issue
• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycle, unpipelined)
• 1 LD/ST unit (2 cycle, pipelined)
• 1 Branch unit (no delay slots)

Multi-Issue Scheduling Example

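Machine: 2 issue, 1 memory port, 1 ALU
• Memory port = 2 cycles, non-pipelined
• ALU = 1 cycle

[Figure: dependence DAG over ops 1, 2m, 3m, 4, 5m, 6, 7m, 8, 9, 10]

RU_map: columns time, ALU, MEM; one row per time step 0 to 9

Schedule: columns time, Ready, Placed; one row per time step 0 to 9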

Earliest Latest Sets

Machine: 2 issue, 1 memory port, 1 ALU
• Memory port = 2 cycles, pipelined
• ALU = 1 cycle

[Figure: dependence DAG over ops 1m, 2m, 3, 4m, 5, 6, 7, 8, 9m, 10]

List Scheduling Algorithm

Build dependence graph, calculate priority
Add all ops to UNSCHEDULED set
time = 0
while (UNSCHEDULED is not empty)
    time++
    READY = UNSCHEDULED ops whose incoming deps have been satisfied
    Sort READY using priority function
    For each op in READY (highest to lowest priority)
        op can be scheduled at current time? (resources free?)
            Yes: schedule it, op.issue_time = time
                 Mark resources busy in RU_map relative to issue time
                 Remove op from UNSCHEDULED/READY sets
            No: continue
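
The loop above can be written out as a short program. The Python sketch below is one possible reading of it under simplifying assumptions: the machine has one pipelined unit of each kind, so the resource map only needs to track the current cycle, and ties in priority are broken by op name. The function `list_schedule` and the small four-op DAG are hypothetical examples, not taken from the slides.

```python
def list_schedule(ops, edges, priority):
    """Greedy list scheduling.

    ops:      {name: functional unit it needs}
    edges:    {(pred, succ): latency in cycles}
    priority: {name: number}, higher is scheduled first
    Assumes one pipelined unit of each kind (one op per unit per cycle).
    """
    unscheduled = set(ops)
    issue_time = {}
    time = 0
    while unscheduled:
        time += 1
        # Ready: every predecessor has issued and its latency has elapsed.
        ready = [
            n for n in unscheduled
            if all(p in issue_time and issue_time[p] + lat <= time
                   for (p, s), lat in edges.items() if s == n)
        ]
        ready.sort(key=lambda n: (-priority[n], n))
        busy = set()  # units already claimed in this cycle
        for n in ready:
            if ops[n] not in busy:
                issue_time[n] = time
                busy.add(ops[n])
                unscheduled.remove(n)
    return issue_time


# Hypothetical block: loads a and b feed ALU ops c (needs a, b) and d (needs a).
ops = {"a": "MEM", "b": "MEM", "c": "ALU", "d": "ALU"}
edges = {("a", "c"): 2, ("a", "d"): 2, ("b", "c"): 2}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}  # heights, as defined earlier

print(list_schedule(ops, edges, priority))
```

On this input the sketch issues a in cycle 1, b in cycle 2, d in cycle 3, and c in cycle 4. Handling non-pipelined units would require the resource map to remember how long each unit stays busy after an op issues.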

Improving Basic Block Scheduling

• Loop unrolling – creates longer basic blocks
• Register renaming – can change register usage in blocks to remove immediate reuse of registers

Summary
• Static scheduling complements (or replaces) dynamic scheduling by the hardware
