Static Code Scheduling CS 671 April 1, 2008


Page 1: Static Code Scheduling

Static Code Scheduling

CS 671
April 1, 2008

Page 2: Static Code Scheduling

2 CS 671 – Spring 2008

Code Scheduling

Scheduling or reordering instructions to improve performance and/or guarantee correctness
• Important for dynamically-scheduled architectures
• Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC

Takes into account anticipated latencies
• Machine-specific, performed later in the optimization pass

How does this contrast with our earlier exploration of code motion?

Page 3: Static Code Scheduling


Why Must the Compiler Schedule?

Many machines are pipelined and expose some aspects of pipelining to the user (compiler)

Examples:
• Branch delay slots!
• Memory-access delays
• Multi-cycle operations

Some machines don’t have scheduling hardware

Page 4: Static Code Scheduling


Example

Assume loads take 2 cycles and branches have a delay slot.

start time   instruction
_____        r2 ← [r1]
_____        r3 ← [r1+4]
_____        r4 ← r2 + r3
_____        r5 ← r2 + 1
_____        goto L1
_____        nop

____ cycles

Page 5: Static Code Scheduling


Example

Assume loads take 2 cycles and branches have a delay slot.

start time   instruction
_____        r2 ← [r1]
_____        r3 ← [r1+4]
_____        r5 ← r2 + 1
_____        goto L1
_____        r4 ← r2 + r3

____ cycles
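
The two orderings above can be compared with a small timing model. This is only a sketch under assumptions that are not spelled out on the slides: one instruction issues per cycle, in order; an instruction waits until its source registers are ready; a load issued in cycle t makes its result readable in cycle t+2; everything else has latency 1. The names `start_times`, `LATENCY`, `original`, and `scheduled` are made up for the illustration.

```python
# Assumed result latencies in cycles (illustrative, not taken from a real machine).
LATENCY = {"load": 2, "alu": 1, "branch": 1, "nop": 1}

def start_times(instrs, latency=LATENCY):
    """Return the 1-based issue cycle of each instruction.

    instrs is a list of (dest, sources, kind) tuples in program order.
    The model is in-order and single-issue.
    """
    ready = {}      # register -> first cycle in which its value can be read
    times = []
    cycle = 0       # cycle in which the previous instruction issued
    for dest, sources, kind in instrs:
        start = cycle + 1                      # at most one issue per cycle
        for reg in sources:
            start = max(start, ready.get(reg, 1))  # stall until operands are ready
        times.append(start)
        cycle = start
        if dest is not None:
            ready[dest] = start + latency[kind]
    return times

# Original order: the add that needs r3 stalls, and the delay slot holds a nop.
original = [
    ("r2", ["r1"], "load"),
    ("r3", ["r1"], "load"),
    ("r4", ["r2", "r3"], "alu"),
    ("r5", ["r2"], "alu"),
    (None, [], "branch"),
    (None, [], "nop"),
]

# Reordered: the independent add fills the load delay, r4's add fills the delay slot.
scheduled = [
    ("r2", ["r1"], "load"),
    ("r3", ["r1"], "load"),
    ("r5", ["r2"], "alu"),
    (None, [], "branch"),
    ("r4", ["r2", "r3"], "alu"),
]

print(start_times(original))   # [1, 2, 4, 5, 6, 7]
print(start_times(scheduled))  # [1, 2, 3, 4, 5]
```

Under these assumptions the last instruction starts in cycle 7 for the original order and in cycle 5 for the reordered one; a different latency model would give different numbers.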

Page 6: Static Code Scheduling


Code Scheduling Strategy

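Get resources operating in parallel
• Integer data path
• Integer multiply / divide hardware
• FP adder, multiplier, divider

Method
• Fill the gap between starting an operation and using its result with computations that need neither the result nor the same hardware resources

Drawbacks
• Highly hardware dependent

[Figure: timeline from "Start Op" to "Use Op"; the scheduler tries to fill the gap between them]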

Page 7: Static Code Scheduling


Scheduling Approaches

Local
• Branch scheduling
• Basic-block scheduling

Global
• Cross-block scheduling
• Software pipelining
• Trace scheduling
• Percolation scheduling

Page 8: Static Code Scheduling


Branch Scheduling

Two problems:
• Branches often take some number of cycles to complete
• There can be a delay between a compare and its associated branch

A compiler will try to fill these slots with valid instructions (rather than nop)

Delay slots – present in PA-RISC, SPARC, MIPS

Condition delay – PowerPC, Pentium

Page 9: Static Code Scheduling


Recall from Architecture…

IF – Instruction Fetch

ID – Instruction Decode

EX – Execute

MA – Memory access

WB – Write back

[Figure: three consecutive instructions overlapping in the pipeline]

Instr 1    IF  ID  EX  MA  WB
Instr 2        IF  ID  EX  MA  WB
Instr 3            IF  ID  EX  MA  WB

Page 10: Static Code Scheduling


Control Hazards

Taken Branch         IF  ID  EX  MA  WB
Instr + 1                IF  --- --- --- ---
Branch Target                IF  ID  EX  MA  WB
Branch Target + 1                IF  ID  EX  MA  WB

Page 11: Static Code Scheduling


Data Dependences

If two operations access the same register, and at least one of them writes it, they are dependent

Types of data dependences

Flow            Output          Anti
r1 = r2 + r3    r1 = r2 + r3    r1 = r2 + r3
r4 = r1 * 6     r1 = r4 * 6     r2 = r5 * 6
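
The three cases can be told apart mechanically from the registers each instruction reads and writes. The sketch below is illustrative only; the `(dest, sources)` encoding and the function name `classify` are assumptions made for this example, and the constant 6 is an immediate, so it does not appear among the sources.

```python
def classify(first, second):
    """Return the kinds of register dependence from `first` to `second`.

    Each instruction is a (dest, sources) pair; `first` precedes `second`
    in program order.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("flow")    # read after write: second reads what first wrote
    if dest1 == dest2:
        kinds.add("output")  # write after write: both write the same register
    if dest2 in srcs1:
        kinds.add("anti")    # write after read: second overwrites a source of first
    return kinds

add = ("r1", ("r2", "r3"))              # r1 = r2 + r3

print(classify(add, ("r4", ("r1",))))   # {'flow'}
print(classify(add, ("r1", ("r4",))))   # {'output'}
print(classify(add, ("r2", ("r5",))))   # {'anti'}
```

Only flow dependences carry a value from one instruction to the next; output and anti dependences come from reusing a register name.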

Page 12: Static Code Scheduling


Data Hazards

lw  R1,0(R2)     IF    ID    EX    MA    WB
add R3,R1,R4           IF    ID    stall EX    MA    WB

Memory latency: data not ready

Page 13: Static Code Scheduling


Data Hazards

addf R3,R1,R2    IF    ID    EX    EX    MA    WB
addf R3,R3,R4          IF    ID    stall EX    EX    MA    WB

Assumes floating point ops take 2 execute cycles

Instruction latency: execute takes > 1 cycle

Page 14: Static Code Scheduling


Multi-cycle Instructions

• Scheduling is particularly important for multi-cycle operations
• Alpha instructions > 1 cycle latency (partial list)

Instruction   Operation                       Latency (cycles)
mull          32-bit integer multiply         8
mulq          64-bit integer multiply         16
addt          fp add                          4
mult          fp multiply                     4
divs          fp single-precision divide      10
divt          fp double-precision divide      23

Page 15: Static Code Scheduling


Avoiding data hazards

• Move loads earlier and stores later (assuming this does not violate correctness)
• Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
• How can we do this in a systematic way?
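
To see why the re-association helps, compare the longest chain of dependent additions in the two forms. The snippet below is a small illustration; the tuple encoding of expressions and the helper name `depth` are assumptions made for this example.

```python
def depth(expr):
    """Length of the longest chain of dependent additions in an expression tree.

    A variable is a string; an addition is a (left, right) tuple.
    """
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

chain = ((("a", "b"), "c"), "d")       # ((a+b)+c)+d
balanced = (("a", "b"), ("c", "d"))    # (a+b)+(c+d)

print(depth(chain), depth(balanced))   # 3 2
```

Both forms perform three additions, but in the balanced form a+b and c+d are independent and can overlap, so the dependent chain is two additions long instead of three. Floating-point addition is not associative, so this rewrite can change results for floating-point operands.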

Page 16: Static Code Scheduling


Example: Without Scheduling

Start Time   Code
_____        lw r1,w
_____        add r1,r1,r1
_____        lw r2,x
_____        mult r1,r1,r2
_____        lw r2,y
_____        mult r1,r1,r2
_____        lw r2,z
_____        mult r1,r1,r2
_____        sw r1,a

Assume:
• memory instrs take 3 cycles
• mult takes 2 cycles (to have result in register)
• rest take 1 cycle

____ cycles

Page 17: Static Code Scheduling


Basic Block Dependence DAGS

Nodes - instructions

Edges - dependence between I1 and I2
• When we cannot determine whether there is a dependence, we must assume there is one

a) lw R2, (R1)
b) lw R3, (R1) 4
c) R4 ← R2 + R3
d) R5 ← R2 - 1

[Figure: dependence DAG with edges a → c, a → d, and b → c, each labeled with latency 2]
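
One way to build such a DAG for a basic block is to walk the instructions in order while remembering the last writer and the recent readers of each register. The following is a sketch, not the course's reference algorithm. It tracks register dependences only (memory dependences are ignored), labels flow edges with the producer's latency, and gives anti and output edges an assumed weight of 1. The names `build_dag` and `LATENCY` and the tuple encoding are made up for the example.

```python
LATENCY = {"load": 2, "alu": 1}   # assumed latencies for the small example

def build_dag(instrs, latency=LATENCY):
    """Return {(pred, succ): min issue distance} for one basic block.

    instrs is a list of (name, dest, sources, kind) in program order.
    """
    edges = {}
    last_writer = {}   # register -> (name, kind) of the most recent writer
    readers = {}       # register -> names that read it since that write

    def add_edge(pred, succ, weight):
        if pred != succ:
            edges[(pred, succ)] = max(edges.get((pred, succ), 0), weight)

    for name, dest, sources, kind in instrs:
        for reg in sources:                      # flow: reader after writer
            if reg in last_writer:
                writer, writer_kind = last_writer[reg]
                add_edge(writer, name, latency[writer_kind])
        if dest is not None:
            if dest in last_writer:              # output: writer after writer
                add_edge(last_writer[dest][0], name, 1)
            for reader in readers.get(dest, []): # anti: writer after reader
                add_edge(reader, name, 1)
        for reg in sources:
            readers.setdefault(reg, []).append(name)
        if dest is not None:
            last_writer[dest] = (name, kind)
            readers[dest] = []
    return edges

block = [
    ("a", "R2", ["R1"], "load"),
    ("b", "R3", ["R1"], "load"),
    ("c", "R4", ["R2", "R3"], "alu"),
    ("d", "R5", ["R2"], "alu"),
]

print(build_dag(block))
# {('a', 'c'): 2, ('b', 'c'): 2, ('a', 'd'): 2}
```

A real scheduler would also add edges between memory operations that may alias, following the rule above that an undecidable dependence must be assumed to exist.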

Page 18: Static Code Scheduling


Example – Build the DAG

Code

a  lw r1,w
b  add r1,r1,r1
c  lw r2,x
d  mult r1,r1,r2
e  lw r2,y
f  mult r1,r1,r2
g  lw r2,z
h  mult r1,r1,r2
i  sw r1,a

Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle

Page 19: Static Code Scheduling


Creating a schedule

• Create a DAG of dependences
• Determine priority
• Schedule instructions with
  – Ready operands
  – Highest priority
• Heuristics: If multiple possibilities, fall back on other priority functions

Page 20: Static Code Scheduling


Operation Priority

Priority – Need a mechanism to decide which ops to schedule first (when you have choices)

Common priority functions
• Height – Distance from exit node
  – Give priority to amount of work left to do
• Slackness – inversely proportional to slack
  – Give priority to ops on the critical path
• Register use – priority to nodes with more source operands and fewer destination operands
  – Reduces number of live registers
• Uncover – high priority to nodes with many children
  – Frees up more nodes
• Original order – when all else fails

Page 21: Static Code Scheduling


Computing Priorities

Height(n) =
• exec(n) if n is a leaf
• max(height(m)) + exec(n), taken over every successor m of n

Critical path(s) = path through the dependence DAG with longest latency
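
The height recurrence translates directly into a memoized function. The sketch below applies it to a four-node DAG with edges a → c, a → d, and b → c, assuming exec = 2 for the two loads and 1 for the two ALU operations. The function names and the dictionary encoding are illustrative choices, and ties are broken by taking the first maximal node.

```python
from functools import lru_cache

def heights(succs, exec_time):
    """Height(n) = exec(n) + max height over successors (0 for a leaf)."""
    @lru_cache(maxsize=None)
    def height(node):
        below = max((height(m) for m in succs.get(node, ())), default=0)
        return exec_time[node] + below
    return {node: height(node) for node in exec_time}

def critical_path(succs, exec_time):
    """Follow the tallest successor from the tallest node down to a leaf."""
    h = heights(succs, exec_time)
    node = max(h, key=h.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=h.get)
        path.append(node)
    return path

succs = {"a": ("c", "d"), "b": ("c",)}          # a -> c, a -> d, b -> c
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}    # loads take 2, ALU ops take 1

print(heights(succs, exec_time))        # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
print(critical_path(succs, exec_time))  # ['a', 'c']
```

Here several paths tie at length 3 (a → c, a → d, b → c), which is why the slide speaks of critical path(s) in the plural; the function reports just one of them.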

Page 22: Static Code Scheduling


Example – Determine Height and CP

Code

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle

Critical path: _______

Page 23: Static Code Scheduling


Example – List Scheduling

Code

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

Schedule

start   instruction

_____ cycles

Page 24: Static Code Scheduling


Scheduling vs. Register Allocation

Code

a  lw r1 ← (r12)
b  lw r2 ← (r12+4)
c  r1 ← r1+r2
d  stw (r12) ← r1
e  lw r1 ← (r12+8)
f  lw r2 ← (r12+12)
g  r2 ← r1+r2

Page 25: Static Code Scheduling


Register Renaming

Code

a  lw r1 ← (r12)
b  lw r2 ← (r12+4)
c  r3 ← r1+r2
d  stw (r12) ← r3
e  lw r4 ← (r12+8)
f  lw r5 ← (r12+12)
g  r6 ← r4+r5
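
The effect of renaming can be checked by listing the anti and output dependences in each version of the code. The sketch below is illustrative only: it considers registers and ignores memory dependences between the store and the later loads, and the `(name, dest, sources)` encoding and the name `false_deps` are assumptions for the example.

```python
def false_deps(instrs):
    """Return the set of (earlier, later, kind) anti/output register dependences.

    instrs is a list of (name, dest, sources) in program order.
    """
    last_writer = {}   # register -> name of its most recent writer
    readers = {}       # register -> names that read it since that write
    found = set()
    for name, dest, sources in instrs:
        if dest is not None:
            if dest in last_writer:
                found.add((last_writer[dest], name, "output"))
            for reader in readers.get(dest, []):
                found.add((reader, name, "anti"))
        for reg in sources:
            readers.setdefault(reg, []).append(name)
        if dest is not None:
            last_writer[dest] = name
            readers[dest] = []       # earlier readers no longer constrain later writes
    return found

original = [
    ("a", "r1", ["r12"]), ("b", "r2", ["r12"]), ("c", "r1", ["r1", "r2"]),
    ("d", None, ["r12", "r1"]),                     # store reads address and value
    ("e", "r1", ["r12"]), ("f", "r2", ["r12"]), ("g", "r2", ["r1", "r2"]),
]
renamed = [
    ("a", "r1", ["r12"]), ("b", "r2", ["r12"]), ("c", "r3", ["r1", "r2"]),
    ("d", None, ["r12", "r3"]),
    ("e", "r4", ["r12"]), ("f", "r5", ["r12"]), ("g", "r6", ["r4", "r5"]),
]

print(sorted(false_deps(original)))
print(sorted(false_deps(renamed)))   # []
```

In the original code, the anti dependence from d to e and the output dependence from c to e keep the second pair of loads from moving above the store; after renaming no such register constraint remains, at the cost of using six registers instead of two.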

Page 26: Static Code Scheduling


VLIW

• Very Long Instruction Word
• Compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple HW)
• Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Page 27: Static Code Scheduling


Sample VLIW code

Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
c = a + b    d = a - b    e = a * b    ld j = [x]    nop
g = c + d    h = c - d    nop          ld k = [y]    nop
nop          nop          i = j * c    ld f = [z]    br g

VLIW processor: 5 issue
• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycle, unpipelined)
• 1 LD/ST unit (2 cycle, pipelined)
• 1 Branch unit (no delay slots)

Page 28: Static Code Scheduling


Multi-Issue Scheduling Example

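[Figure: dependence DAG over operations 1–10; operations 2m, 3m, 5m, and 7m are memory operations]

RU_map (one row per time 0–9): time | ALU | MEM

Schedule (one row per time 0–9): time | Ready | Placed

Machine: 2 issue, 1 memory port, 1 ALU
• Memory port = 2 cycles, non-pipelined
• ALU = 1 cycle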

Page 29: Static Code Scheduling


Earliest Latest Sets

Machine: 2 issue, 1 memory port, 1 ALU
• Memory port = 2 cycles, pipelined
• ALU = 1 cycle

[Figure: dependence DAG over operations 1–10; operations 1m, 2m, 4m, and 9m are memory operations]

Page 30: Static Code Scheduling


List Scheduling Algorithm

Build dependence graph, calculate priority
Add all ops to UNSCHEDULED set
time = 0
while (UNSCHEDULED is not empty)
    time++
    READY = UNSCHEDULED ops whose incoming deps have been satisfied
    Sort READY using priority function
    For each op in READY (highest to lowest priority)
        op can be scheduled at current time? (resources free?)
            Yes: schedule it, op.issue_time = time
                 Mark resources busy in RU_map relative to issue time
                 Remove op from UNSCHEDULED/READY sets
            No: continue
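
The loop above can be turned into a short runnable sketch. The version below is one possible reading, not the course's reference implementation: a dependence counts as satisfied once the predecessor has issued and its latency has elapsed, each op occupies one unit of its resource for `busy` cycles (1 for a pipelined unit), and ties in priority are broken by name. The function name, the argument layout, and the tiny machine with one ALU and one pipelined memory port are all assumptions for the illustration.

```python
def list_schedule(ops, preds, priority, units):
    """Greedy list scheduling.

    ops:      name -> (resource, latency, busy)
    preds:    name -> names this op depends on
    priority: name -> number (higher schedules first)
    units:    resource -> number of identical units
    Returns name -> issue time (the first cycle is 1, as in the pseudocode).
    """
    issue = {}
    ru_map = {}                 # (time, resource) -> units already in use
    unscheduled = set(ops)
    time = 0
    while unscheduled:
        time += 1
        ready = [op for op in unscheduled
                 if all(p in issue and issue[p] + ops[p][1] <= time
                        for p in preds.get(op, ()))]
        ready.sort(key=lambda op: (-priority[op], op))
        for op in ready:
            resource, _, busy = ops[op]
            slots = range(time, time + busy)
            if all(ru_map.get((t, resource), 0) < units[resource] for t in slots):
                issue[op] = time
                for t in slots:                     # mark the unit busy
                    ru_map[(t, resource)] = ru_map.get((t, resource), 0) + 1
                unscheduled.remove(op)
            # otherwise the op stays in UNSCHEDULED and is retried next cycle
    return issue

ops = {"a": ("mem", 2, 1), "b": ("mem", 2, 1), "c": ("alu", 1, 1), "d": ("alu", 1, 1)}
preds = {"c": ["a", "b"], "d": ["a"]}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}         # heights used as priorities
units = {"mem": 1, "alu": 1}

print(list_schedule(ops, preds, priority, units))
# {'a': 1, 'b': 2, 'd': 3, 'c': 4}
```

With a single memory port the two loads issue back to back; d becomes ready one cycle before c because it does not wait on the second load.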

Page 31: Static Code Scheduling


Improving Basic Block Scheduling

• Loop unrolling – creates longer basic blocks
• Register renaming – can change register usage in blocks to remove immediate reuse of registers

Summary
• Static scheduling complements (or replaces) dynamic scheduling by the hardware