Static Code Scheduling
CS 671, April 1, 2008
2 CS 671 – Spring 2008
Code Scheduling
Scheduling or reordering instructions to improve performance and/or guarantee correctness
• Important for dynamically-scheduled architectures
• Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC
Takes into account anticipated latencies
• Machine-specific, so performed late in the sequence of optimization passes
How does this contrast with our earlier exploration of code motion?
Why Must the Compiler Schedule?
Many machines are pipelined and expose some aspects of pipelining to the user (compiler)
Examples:
• Branch delay slots!
• Memory-access delays
• Multi-cycle operations
Some machines don’t have scheduling hardware
Example
Assume loads take 2 cycles and branches have a delay slot.
instruction        start time
r2 ← [r1]
r3 ← [r1+4]
r4 ← r2 + r3
r5 ← r2 + 1
goto L1
nop
____ cycles
Example
Assume loads take 2 cycles and branches have a delay slot.
instruction        start time
r2 ← [r1]
r3 ← [r1+4]
r5 ← r2 + 1
goto L1
r4 ← r2 + r3
____ cycles
Code Scheduling Strategy
Get resources operating in parallel
• Integer data path
• Integer multiply / divide hardware
• FP adder, multiplier, divider
Method
• Fill delays with computations that do not require the result or the same hardware resources
Drawbacks
• Highly hardware dependent
[Figure: timeline from "Start Op" to "Use Op"; the cycles in between are the slots to try to fill]
Scheduling Approaches
Local
• Branch scheduling
• Basic-block scheduling
Global
• Cross-block scheduling
• Software pipelining
• Trace scheduling
• Percolation scheduling
Branch Scheduling
Two problems:
• Branches often take some number of cycles to complete
• There can be a delay between a compare and its associated branch
A compiler will try to fill these slots with valid instructions (rather than nops)
• Delay slots – present in PA-RISC, SPARC, MIPS
• Condition delay – PowerPC, Pentium
Recall from Architecture…
IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back
[Figure: three instructions overlapped in the five-stage pipeline]
IF   ID   EX   MA   WB
     IF   ID   EX   MA   WB
          IF   ID   EX   MA   WB
Control Hazards
Taken Branch        IF   ID   EX   MA   WB
Instr + 1                IF   ---  ---  ---  ---
Branch Target                 IF   ID   EX   MA   WB
Branch Target + 1                  IF   ID   EX   MA   WB
Data Dependences
If two operations access the same register and at least one of them writes it, they are dependent
Types of data dependences
Flow:    r1 = r2 + r3
         r4 = r1 * 6
Output:  r1 = r2 + r3
         r1 = r4 * 6
Anti:    r1 = r2 + r3
         r2 = r5 * 6
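The three dependence types can be told apart mechanically from the destination and source registers of an instruction pair. The following is a minimal sketch, not part of the original slides; the `(dest, sources)` tuple encoding and the function name `classify` are assumptions made for illustration.

```python
def classify(first, second):
    """Return the kinds of register dependence that `second` has on `first`.

    Each instruction is a hypothetical (dest, sources) pair, where dest is a
    register name and sources is a tuple of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:      # second reads what first wrote
        kinds.add("flow")
    if dest1 == dest2:      # both write the same register
        kinds.add("output")
    if dest2 in srcs1:      # second overwrites what first read
        kinds.add("anti")
    return kinds


# The three instruction pairs from the slide (constants are not registers).
flow_pair = (("r1", ("r2", "r3")), ("r4", ("r1",)))
output_pair = (("r1", ("r2", "r3")), ("r1", ("r4",)))
anti_pair = (("r1", ("r2", "r3")), ("r2", ("r5",)))
```

A pair can carry more than one kind at once, which is why the function returns a set rather than a single label.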
Data Hazards
lw  R1,0(R2)    IF    ID    EX    MA    WB
add R3,R1,R4          IF    ID    stall EX    MA    WB
Memory latency: data not ready
Data Hazards
addf R3,R1,R2   IF    ID    EX    EX    MA    WB
addf R3,R3,R4         IF    ID    stall EX    EX    MA    WB
Assumes floating point ops take 2 execute cycles
Instruction latency: execute takes > 1 cycle
Multi-cycle Instructions
• Scheduling is particularly important for multi-cycle operations
• Alpha instructions with > 1 cycle latency (partial list)
mull (32-bit integer multiply)           8
mulq (64-bit integer multiply)           16
addt (fp add)                            4
mult (fp multiply)                       4
divs (fp single-precision divide)        10
divt (fp double-precision divide)        23
Avoiding data hazards
• Move loads earlier and stores later (assuming this does not violate correctness)
• Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
• How can we do this in a systematic way?
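The benefit of rewriting ((a+b)+c)+d as (a+b)+(c+d) shows up as a shorter dependence chain. The sketch below is an illustration only: the nested-tuple expression encoding and the helper names `depth` and `balance` are assumptions, and the rewrite is only safe when the operator is associative (floating-point addition is not exactly associative).

```python
def depth(expr):
    """Length of the longest chain of dependent operations in an expression.

    An expression is either a variable name (str) or a hypothetical
    (op, left, right) tuple.
    """
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + max(depth(left), depth(right))


def balance(operands, op="+"):
    """Build a balanced tree over the operands of an associative operator."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return (op, balance(operands[:mid], op), balance(operands[mid:], op))


chain = ("+", ("+", ("+", "a", "b"), "c"), "d")   # ((a+b)+c)+d
balanced = balance(["a", "b", "c", "d"])          # (a+b)+(c+d)
```

Here `depth(chain)` is 3 while `depth(balanced)` is 2, so the two inner additions of the balanced form can execute in parallel or overlap in a pipeline.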
Example: Without Scheduling
Assume:
• memory instrs take 3 cycles
• mult takes 2 cycles (to have result in register)
• rest take 1 cycle
Start Time   Code
             lw r1, w
             add r1,r1,r1
             lw r2,x
             mult r1,r1,r2
             lw r2,y
             mult r1,r1,r2
             lw r2,z
             mult r1,r1,r2
             sw r1, a
____ cycles
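One way to fill in the start-time column is to simulate the block. The sketch below is not from the slides and rests on several assumptions: a single-issue, in-order machine; cycles numbered from 1; a result issued at cycle t with latency L becomes usable at cycle t + L; and a store occupies its full 3 cycles. The tuple encoding and the names `LATENCY`, `BLOCK`, and `start_times` are hypothetical.

```python
# Latencies stated on the slide: memory = 3, mult = 2, everything else = 1.
LATENCY = {"lw": 3, "sw": 3, "mult": 2, "add": 1}

# (opcode, destination register or None, source registers), in program order.
BLOCK = [
    ("lw", "r1", ()),
    ("add", "r1", ("r1",)),
    ("lw", "r2", ()),
    ("mult", "r1", ("r1", "r2")),
    ("lw", "r2", ()),
    ("mult", "r1", ("r1", "r2")),
    ("lw", "r2", ()),
    ("mult", "r1", ("r1", "r2")),
    ("sw", None, ("r1",)),
]


def start_times(block):
    """Return (issue cycle of each instruction, last busy cycle)."""
    ready = {}     # register -> first cycle in which its value can be read
    starts = []
    cycle = 1      # earliest cycle the next instruction may issue
    finish = 0
    for op, dest, srcs in block:
        # Stall until every source operand is available.
        start = max([cycle] + [ready.get(reg, 1) for reg in srcs])
        starts.append(start)
        if dest is not None:
            ready[dest] = start + LATENCY[op]
        finish = max(finish, start + LATENCY[op] - 1)
        cycle = start + 1          # in-order, one issue per cycle
    return starts, finish
```

Under these assumptions `start_times(BLOCK)` gives start times 1, 4, 5, 8, 9, 12, 13, 16, 18 and a total of 20 cycles; a different cycle-numbering convention would shift these values.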
Basic Block Dependence DAGs
Nodes - instructions
Edges - dependence between I1 and I2
• When we cannot determine whether there is a dependence, we must assume there is one
a) lw R2, (R1)
b) lw R3, (R1) 4
c) R4 ← R2 + R3
d) R5 ← R2 - 1
[Figure: DAG with edges a → c, b → c, and a → d, each labeled with latency 2]
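A dependence DAG for a basic block can be built by comparing every instruction with each earlier one. The following is a minimal sketch under stated assumptions, not the construction used in the lecture: instructions are hypothetical `(name, opcode, dest, sources)` tuples, every edge is labeled with the latency of its source instruction (conservative for anti and output dependences), and, lacking alias information, every store is kept ordered against all other memory operations.

```python
def build_dag(block, latency):
    """Return {(pred_name, succ_name): latency label} for a basic block."""
    edges = {}
    for j, (name_j, op_j, dest_j, srcs_j) in enumerate(block):
        for name_i, op_i, dest_i, srcs_i in block[:j]:
            flow = dest_i is not None and dest_i in srcs_j
            anti = dest_j is not None and dest_j in srcs_i
            output = dest_i is not None and dest_i == dest_j
            # Unknown memory addresses: assume a store may conflict with
            # any other load or store, so keep their original order.
            memory = "sw" in (op_i, op_j) and {op_i, op_j} <= {"lw", "sw"}
            if flow or anti or output or memory:
                edges[(name_i, name_j)] = latency[op_i]
    return edges


# The four-instruction block from the slide; loads take 2 cycles here.
latency = {"lw": 2, "sw": 2, "add": 1, "sub": 1}
block = [
    ("a", "lw", "R2", ("R1",)),
    ("b", "lw", "R3", ("R1",)),
    ("c", "add", "R4", ("R2", "R3")),
    ("d", "sub", "R5", ("R2",)),
]
dag = build_dag(block, latency)
```

For this block the result contains exactly the edges a → c, b → c, and a → d, each with label 2. The pairwise comparison also records transitive edges, which is harmless for scheduling but makes larger graphs denser than necessary.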
Example – Build the DAG
Code
a lw r1, w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1, a
Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle
Creating a schedule
• Create a DAG of dependences
• Determine priority
• Schedule instructions with
  – Ready operands
  – Highest priority
• Heuristics: If multiple possibilities, fall back on other priority functions
Operation Priority
Priority – Need a mechanism to decide which ops to schedule first (when you have choices)
Common priority functions
• Height – Distance from exit node
  – Give priority to amount of work left to do
• Slackness – inversely proportional to slack
  – Give priority to ops on the critical path
• Register use – priority to nodes with more source operands and fewer destination operands
  – Reduces number of live registers
• Uncover – high priority to nodes with many children
  – Frees up more nodes
• Original order – when all else fails
Computing Priorities
Height(n) =
• exec(n) if n is a leaf
• max(height(m)) + exec(n) over all m, where m is a successor of n
Critical path(s) = path(s) through the dependence DAG with the longest latency
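The height definition translates directly into a memoized recursion. The sketch below applies it to the nine-instruction running example, with one loud assumption: the successor lists contain only the flow (true) dependences, as if the reuse of r2 had been renamed away, so anti and output edges are ignored. The names `EXEC`, `SUCC`, `height`, and `critical_path` are hypothetical.

```python
from functools import lru_cache

# exec(n): memory instrs = 3, mult = 2, rest = 1 (from the slide).
EXEC = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Flow-dependence successors only (assumption: anti/output edges omitted).
SUCC = {
    "a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
    "f": ["h"], "g": ["h"], "h": ["i"], "i": [],
}


@lru_cache(maxsize=None)
def height(n):
    """exec(n) for a leaf, else exec(n) plus the tallest successor."""
    if not SUCC[n]:
        return EXEC[n]
    return EXEC[n] + max(height(m) for m in SUCC[n])


def critical_path():
    """Follow the tallest successor from the tallest node to a leaf."""
    node = max(EXEC, key=height)
    path = [node]
    while SUCC[node]:
        node = max(SUCC[node], key=height)
        path.append(node)
    return path
```

With these edges `height("a")` is 13 and the critical path is a, b, d, f, h, i. Adding the anti and output edges on r2 would change the heights, so treat the numbers as specific to this simplified graph.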
Example – Determine Height and CP
Code
a lw r1, w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1, a
Assume: memory instrs = 3 cycles, mult = 2 cycles (to have result in register), rest = 1 cycle
Critical path: _______
Example – List Scheduling
Code
a lw r1, w
b add r1,r1,r1
c lw r2,x
d mult r1,r1,r2
e lw r2,y
f mult r1,r1,r2
g lw r2,z
h mult r1,r1,r2
i sw r1, a
start   Schedule
_____ cycles
Scheduling vs. Register Allocation
Code
a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r1 ← r1+r2
d stw (r12) ← r1
e lw r1 ← (r12+8)
f lw r2 ← (r12+12)
g r2 ← r1+r2
Register Renaming
Code
a lw r1 ← (r12)
b lw r2 ← (r12+4)
c r3 ← r1+r2
d stw (r12) ← r3
e lw r4 ← (r12+8)
f lw r5 ← (r12+12)
g r6 ← r4+r5
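The renamed code can be produced by giving every redefinition of a register a fresh name and redirecting later uses to it. This is a minimal sketch, not the lecture's algorithm: the `(opcode, dest, sources)` encoding and the function name `rename` are assumptions, address offsets are dropped because they do not affect register names, and values that are live after the block would need copies back to their original registers, which the sketch does not handle.

```python
def rename(block):
    """Rename redefined registers in a basic block to remove reuse."""
    used = {reg for _, dest, srcs in block
            for reg in ([dest] if dest else []) + list(srcs)}
    current = {}      # original register -> register now holding its value
    defined = set()   # original registers already written in this block
    counter = 1

    def fresh():
        nonlocal counter
        while f"r{counter}" in used:
            counter += 1
        used.add(f"r{counter}")
        return f"r{counter}"

    renamed = []
    for op, dest, srcs in block:
        new_srcs = tuple(current.get(reg, reg) for reg in srcs)
        new_dest = dest
        if dest is not None:
            if dest in defined:        # second or later write: new name
                new_dest = fresh()
            defined.add(dest)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


original = [
    ("lw", "r1", ("r12",)), ("lw", "r2", ("r12",)),
    ("add", "r1", ("r1", "r2")), ("stw", None, ("r12", "r1")),
    ("lw", "r1", ("r12",)), ("lw", "r2", ("r12",)),
    ("add", "r2", ("r1", "r2")),
]
renamed = rename(original)
```

On this block the sketch reproduces the register assignment shown on the slide (r3 for c, r4 and r5 for the second pair of loads, r6 for g), after which loads e and f no longer have to wait for c and d.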
VLIW
• Very Long Instruction Word
• Compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple HW)
• Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)
Sample VLIW code
Add/Sub     Add/Sub     Mul/Div     Ld/St        Branch
c = a + b   d = a - b   e = a * b   ld j = [x]   nop
g = c + d   h = c - d   nop         ld k = [y]   nop
nop         nop         i = j * c   ld f = [z]   br g
VLIW processor: 5 issue
2 Add/Sub units (1 cycle)
1 Mul/Div unit (2 cycle, unpipelined)
1 LD/ST unit (2 cycle, pipelined)
1 Branch unit (no delay slots)
Multi-Issue Scheduling Example
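Machine: 2 issue, 1 memory port, 1 ALU
Memory port = 2 cycles, non-pipelined
ALU = 1 cycle
[Figure: dependence DAG with nodes 1, 2m, 3m, 4, 5m, 6, 7m, 8, 9, 10; edges not recoverable]
RU_map: one row per time step 0-9, columns ALU and MEM
Schedule: one row per time step 0-9, columns Ready and Placed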
Earliest Latest Sets
Machine: 2 issue, 1 memory port, 1 ALU
Memory port = 2 cycles, pipelined
ALU = 1 cycle
[Figure: dependence DAG with nodes 1m, 2m, 3, 4m, 5, 6, 7, 8, 9m, 10; edges not recoverable]
List Scheduling Algorithm
Build dependence graph, calculate priority
Add all ops to UNSCHEDULED set
time = 0
while (UNSCHEDULED is not empty)
    time++
    READY = UNSCHEDULED ops whose incoming deps have been satisfied
    Sort READY using priority function
    For each op in READY (highest to lowest priority)
        op can be scheduled at current time? (resources free?)
            Yes: schedule it, op.issue_time = time
                 Mark resources busy in RU_map relative to issue time
                 Remove op from UNSCHEDULED/READY sets
            No: continue
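The loop above can be sketched in runnable form. This version is a simplification under stated assumptions, not the lecture's implementation: the resource map is reduced to an issue width (no per-unit RU_map), priority is height with original program order as the tie-break, every edge waits for the full latency of its predecessor, and the names `list_schedule`, `latency`, and `edges` are hypothetical.

```python
def list_schedule(latency, edges, issue_width=1):
    """Return {op: issue cycle}; cycles start at 1 as in the pseudocode.

    latency: dict op -> cycles, listed in original program order.
    edges:   set of (pred, succ) dependence pairs.
    """
    preds = {op: [p for p, s in edges if s == op] for op in latency}
    succs = {op: [s for p, s in edges if p == op] for op in latency}

    def height(op):
        return latency[op] + max((height(s) for s in succs[op]), default=0)

    order = list(latency)
    unscheduled = set(latency)
    issue = {}
    time = 0
    while unscheduled:
        time += 1
        # READY: every predecessor has issued and its result is available.
        ready = [op for op in order if op in unscheduled and all(
            p in issue and issue[p] + latency[p] <= time for p in preds[op])]
        # Stable sort keeps program order among ops of equal height.
        ready.sort(key=height, reverse=True)
        for op in ready[:issue_width]:   # resources: issue slots only
            issue[op] = time
            unscheduled.remove(op)
    return issue


# Four-instruction DAG: two 2-cycle loads a, b feeding c; a also feeds d.
latency = {"a": 2, "b": 2, "c": 1, "d": 1}
edges = {("a", "c"), ("b", "c"), ("a", "d")}
single = list_schedule(latency, edges)
dual = list_schedule(latency, edges, issue_width=2)
```

On a single-issue machine the sketch issues a, b, d, c in cycles 1 to 4, placing d in the slot where c would otherwise wait for b. With `issue_width=2` both loads issue in cycle 1 and both consumers in cycle 3.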
Improving Basic Block Scheduling
• Loop unrolling – creates longer basic blocks
• Register renaming – can change register usage in blocks to remove immediate reuse of registers
Summary
• Static scheduling complements (or replaces) dynamic scheduling by the hardware