
Spring 2014 Jim Hogg - UW - CSE - P501 O-1

CSE P501 – Compiler Construction

Instruction Scheduling Issues

Latencies

List scheduling

Instruction Scheduling is . . .

Schedule

• Execute in-order to get correct answer: abcdefgh
• Issue in new order: badfcghf
    • eg: memory fetch is slow
    • eg: divide is slow
    • Overall faster
    • Still get correct answer!

Originally devised for super-computers. Now used everywhere:

• in-order procs - older ARM
• out-of-order procs - newer x86

Compiler does 'heavy lifting' - reduce chip power

Chip Complexity, 1

Following factors make scheduling complicated:

• Different kinds of instruction take different times (in clock cycles) to complete
• Modern chips have multiple functional units, so they can issue several operations per cycle ("super-scalar")
• Loads are non-blocking: ~50 in-flight loads and ~50 in-flight stores

Typical Instruction Timings

Instruction      Time in Clock Cycles
int + int        1
int * int        3
float + float    3
float * float    5
float / float    15
int / int        30

Load Latencies

[Figure: memory hierarchy with two cores; L1 = 64 KB per core; L2 = 256 KB per core; L3 = 2-8 MB shared; DRAM]

Instruction    ~5 per cycle
Register       1 cycle
L1 Cache       ~4 cycles
L2 Cache       ~10 cycles
L3 Cache       ~40 cycles
DRAM           ~100 ns


Super-Scalar

Chip Complexity, 2

• Branch costs vary (branch predictor)
• Branches on some processors have delay slots (eg: Sparc)
• Modern processors have branch-predictor logic in hardware
    • heuristics predict whether branches are taken or not
    • keeps pipelines full

GOAL: Scheduler should reorder instructions to

• hide latencies
• take advantage of multiple function units (and delay slots)
• help the processor effectively pipeline execution

However, many chips schedule on-the-fly too

• eg: Haswell out-of-order window = 192 ops

Data Dependence Graph

[Figure: dependence graph over instructions a-i, with a leaf and the root labelled]

Label  Instruction
a      loadAI  rarp, @a => r1
b      add     r1, r1 => r1
c      loadAI  rarp, @b => r2
d      mult    r1, r2 => r1
e      loadAI  rarp, @c => r2
f      mult    r1, r2 => r1
g      loadAI  rarp, @d => r2
h      mult    r1, r2 => r1
i      storeAI r1 => rarp, @a

read-after-write = RAW = true dependence = flow dependence
write-after-read = WAR = anti-dependence
write-after-write = WAW = output-dependence

The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies
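The three kinds of dependence can be read mechanically off the instruction list a-i. The following is a minimal Python sketch, not the course's implementation: the tuple encoding of each instruction (label, registers read, register written) and the function name dependences are assumptions made for illustration, and only register dependences are tracked (the load and store of @a through memory are ignored).

```python
# Each instruction: (label, registers read, register written or None).
# This encoding is an illustrative assumption, not part of the slides.
INSTRS = [
    ("a", ("rarp",), "r1"),        # loadAI  rarp, @a => r1
    ("b", ("r1",), "r1"),          # add     r1, r1 => r1
    ("c", ("rarp",), "r2"),        # loadAI  rarp, @b => r2
    ("d", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("e", ("rarp",), "r2"),        # loadAI  rarp, @c => r2
    ("f", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("g", ("rarp",), "r2"),        # loadAI  rarp, @d => r2
    ("h", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("i", ("r1", "rarp"), None),   # storeAI r1 => rarp, @a (no register written)
]


def dependences(instrs):
    """Return a set of (earlier, later, kind) register dependences."""
    last_writer = {}   # register -> label of the most recent writer
    readers = {}       # register -> labels that read it since that write
    edges = set()
    for label, reads, write in instrs:
        for reg in reads:
            if reg in last_writer:
                edges.add((last_writer[reg], label, "RAW"))
            readers.setdefault(reg, []).append(label)
        if write is not None:
            for reader in readers.get(write, []):
                if reader != label:
                    edges.add((reader, label, "WAR"))
            if write in last_writer:
                edges.add((last_writer[write], label, "WAW"))
            last_writer[write] = label
            readers[write] = []   # later reads see the new value
    return edges


if __name__ == "__main__":
    for edge in sorted(dependences(INSTRS)):
        print(edge)
```

The RAW edges found this way (a to b, b and c to d, d and e to f, f and g to h, h to i) are the ones a scheduler must always honour. The WAR and WAW edges come only from reusing r1 and r2.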

Scheduling Really Works ...

Original

Start Cycle  Instruction
1            loadAI  rarp, @a => r1
4            add     r1, r1 => r1
5            loadAI  rarp, @b => r2
8            mult    r1, r2 => r1
10           loadAI  rarp, @c => r2
13           mult    r1, r2 => r1
15           loadAI  rarp, @d => r2
18           mult    r1, r2 => r1
20           storeAI r1 => rarp, @a

Scheduled

Start Cycle  Instruction
1            loadAI  rarp, @a => r1
2            loadAI  rarp, @b => r2
3            loadAI  rarp, @c => r3
4            add     r1, r1 => r1
5            mult    r1, r2 => r1
6            loadAI  rarp, @d => r2
7            mult    r1, r3 => r1
9            mult    r1, r2 => r1
11           storeAI r1 => rarp, @a

1 Functional Unit
Load or Store: 3 cycles
Multiply: 2 cycles
Otherwise: 1 cycle

a = 2*a*b*c*d
New schedule uses extra register: r3
Preserves (WAW) output-dependency
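To put a number on the improvement, the short Python sketch below computes the last cycle in which each schedule is still busy. It assumes that an instruction starting in cycle s with a delay of d occupies cycles s through s + d - 1; the names DELAY_BY_OP and last_busy_cycle are illustrative only.

```python
# Latencies from the slide: load/store 3 cycles, multiply 2, otherwise 1.
DELAY_BY_OP = {"loadAI": 3, "storeAI": 3, "mult": 2, "add": 1}

# (start cycle, opcode) pairs copied from the two tables.
ORIGINAL = [(1, "loadAI"), (4, "add"), (5, "loadAI"), (8, "mult"),
            (10, "loadAI"), (13, "mult"), (15, "loadAI"), (18, "mult"),
            (20, "storeAI")]
SCHEDULED = [(1, "loadAI"), (2, "loadAI"), (3, "loadAI"), (4, "add"),
             (5, "mult"), (6, "loadAI"), (7, "mult"), (9, "mult"),
             (11, "storeAI")]


def last_busy_cycle(schedule):
    """Last cycle in which any instruction of the schedule is executing."""
    return max(start + DELAY_BY_OP[op] - 1 for start, op in schedule)


if __name__ == "__main__":
    print("original: ", last_busy_cycle(ORIGINAL))    # 22
    print("scheduled:", last_busy_cycle(SCHEDULED))   # 13
```

Under that assumption the original order finishes in cycle 22 and the scheduled order in cycle 13, for the same nine instructions.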

Scheduler: Job Description

The Job

• Given code for some machine, and latencies for each instruction, reorder to minimize execution time

Constraints

• Produce correct code
• Minimize wasted cycles
• Avoid spilling registers
• Don't take forever to reach an answer

Job Description - Part 2

foreach instruction in dependence graph
    Denote current instruction as ins
    Denote number of cycles to execute as ins.delay
    Denote cycle number in which ins should start as ins.start
    foreach instruction dep that is dependent on ins
        Ensure ins.start + ins.delay <= dep.start

What if the scheduler makes a mistake?

On-chip hardware stalls the pipeline until operands become available: so slower, but still correct!
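The constraint ins.start + ins.delay <= dep.start can be checked directly against a proposed schedule. Below is a minimal Python sketch, assuming the flow dependences and latencies of the a-i example used elsewhere in these slides; the dictionaries DELAY, SUCCS and SCHEDULED and the function name violations are illustrative, not part of the course material.

```python
# Latencies: load/store 3 cycles, multiply 2, otherwise 1.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Flow dependences of the a-i example: instruction -> dependent instructions.
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

# Start cycles of the scheduled version of the example.
SCHEDULED = {"a": 1, "c": 2, "e": 3, "b": 4, "d": 5, "g": 6, "f": 7, "h": 9, "i": 11}


def violations(start, delay, succs):
    """List (ins, dep) pairs where dep starts before ins has finished."""
    return [(ins, dep)
            for ins in succs
            for dep in succs[ins]
            if start[ins] + delay[ins] > start[dep]]


if __name__ == "__main__":
    print(violations(SCHEDULED, DELAY, SUCCS))       # no violations
    too_early = dict(SCHEDULED, b=3)                 # start the add one cycle early
    print(violations(too_early, DELAY, SUCCS))       # a -> b is violated
```

A non-empty result marks the places where the hardware would have to stall, which costs cycles rather than correctness.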

Dependence Graph + Timings

[Figure: dependence graph with each node's path length as a superscript: a=13, b=10, c=12, d=9, e=10, f=7, g=8, h=5, i=3]

Label  Instruction
a      loadAI  rarp, @a => r1
b      add     r1, r1 => r1
c      loadAI  rarp, @b => r2
d      mult    r1, r2 => r1
e      loadAI  rarp, @c => r2
f      mult    r1, r2 => r1
g      loadAI  rarp, @d => r2
h      mult    r1, r2 => r1
i      storeAI r1 => rarp, @a

1 Functional Unit
Load or Store: 3 cycles
Multiply: 2 cycles
Otherwise: 1 cycle

• Superscripts show path length to end of computation
• a-b-d-f-h-i is critical path
• Can schedule leaves any time - no constraints
• Since a has longest delay, schedule it first; then c; then ...
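The path lengths follow from the latencies by working backwards from the store: each node's value is its own delay plus the largest value among its dependents. A minimal Python sketch under the same example follows; the dictionaries and function names are illustrative assumptions.

```python
# Latencies: load/store 3 cycles, multiply 2, otherwise 1.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Flow dependences: instruction -> instructions that depend on it.
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}


def path_lengths(delay, succs):
    """Longest latency-weighted path from each node to the end."""
    memo = {}

    def walk(node):
        if node not in memo:
            longest_below = max((walk(s) for s in succs[node]), default=0)
            memo[node] = delay[node] + longest_below
        return memo[node]

    for node in delay:
        walk(node)
    return memo


def critical_path(delay, succs):
    """Follow the largest path length from the highest-priority node."""
    length = path_lengths(delay, succs)
    node = max(length, key=length.get)
    path = [node]
    while succs[node]:
        node = max(succs[node], key=length.get)
        path.append(node)
    return path


if __name__ == "__main__":
    print(path_lengths(DELAY, SUCCS))
    print("-".join(critical_path(DELAY, SUCCS)))
```

For example, h = 2 + 3 = 5, f = 2 + 5 = 7, d = 2 + 7 = 9, b = 1 + 9 = 10 and a = 3 + 10 = 13, which reproduces the superscripts and the critical path a-b-d-f-h-i.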

List Scheduling

• Build a precedence graph D
• Compute a priority function over the nodes in D
    • typical: longest latency-weighted path
• Rename registers to remove WAW conflicts
• Create schedule, one cycle at a time
    • Use queue of operations that are Ready
    • At each cycle:
        • Choose a Ready operation and schedule it
        • Update Ready queue
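The renaming step can be sketched in a few lines of Python. This is one simple approach, assuming an unlimited supply of fresh names (v1, v2, ...) and the same illustrative tuple encoding (label, registers read, register written) of the a-i example. A real compiler has to balance renaming against the constraint of not spilling registers.

```python
# Each instruction: (label, registers read, register written or None).
INSTRS = [
    ("a", ("rarp",), "r1"),        # loadAI  rarp, @a => r1
    ("b", ("r1",), "r1"),          # add     r1, r1 => r1
    ("c", ("rarp",), "r2"),        # loadAI  rarp, @b => r2
    ("d", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("e", ("rarp",), "r2"),        # loadAI  rarp, @c => r2
    ("f", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("g", ("rarp",), "r2"),        # loadAI  rarp, @d => r2
    ("h", ("r1", "r2"), "r1"),     # mult    r1, r2 => r1
    ("i", ("r1", "rarp"), None),   # storeAI r1 => rarp, @a
]


def rename(instrs):
    """Give every written value a fresh name so no register is written twice."""
    current = {}   # original register -> name holding its latest value
    counter = 0
    renamed = []
    for label, reads, write in instrs:
        # Reads refer to the latest value; never-written registers keep their name.
        new_reads = tuple(current.get(reg, reg) for reg in reads)
        new_write = None
        if write is not None:
            counter += 1
            new_write = f"v{counter}"
            current[write] = new_write
        renamed.append((label, new_reads, new_write))
    return renamed


if __name__ == "__main__":
    for instr in rename(INSTRS):
        print(instr)
```

After renaming, every name is written exactly once, so no WAW or WAR conflicts remain and only the flow dependences constrain the schedule.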

List Scheduling Algorithm

cycle = 1                  // clock cycle number
Ready = leaves of D        // ready to be scheduled
Active = { }               // being executed

while Ready ∪ Active ≠ {} do
    foreach ins ∈ Active do
        if ins.start + ins.delay <= cycle then
            remove ins from Active
            foreach successor suc of ins in D do
                if suc is ready (all its predecessors have completed) then
                    Ready = Ready ∪ {suc}
                endif
            enddo
        endif
    endforeach
    if Ready ≠ {} then
        remove an instruction, ins, from Ready
        ins.start = cycle
        Active = Active ∪ {ins}
    endif
    cycle++
endwhile
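A runnable Python sketch of this loop for a single functional unit is shown below, applied to the a-i example. It rests on several assumptions: only flow dependences are modelled, the priorities are the path lengths from the dependence graph, the highest-priority Ready instruction is chosen each cycle, and an instruction is retired once ins.start + ins.delay <= cycle, which matches the constraint ins.start + ins.delay <= dep.start. The names DELAY, SUCCS, PRIORITY and list_schedule are illustrative.

```python
# Latencies: load/store 3 cycles, multiply 2, otherwise 1.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Flow dependences: instruction -> instructions that depend on it.
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

# Priority = latency-weighted path length to the end of the computation.
PRIORITY = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10, "f": 7, "g": 8, "h": 5, "i": 3}


def list_schedule(delay, succs, priority):
    """Return {instruction: start cycle} for one functional unit."""
    preds_left = {ins: 0 for ins in delay}
    for ins in succs:
        for suc in succs[ins]:
            preds_left[suc] += 1

    ready = {ins for ins in delay if preds_left[ins] == 0}   # leaves of D
    active = set()                                           # being executed
    start = {}
    cycle = 1
    while ready or active:
        for ins in list(active):
            if start[ins] + delay[ins] <= cycle:             # ins has finished
                active.remove(ins)
                for suc in succs[ins]:
                    preds_left[suc] -= 1
                    if preds_left[suc] == 0:                 # all operands available
                        ready.add(suc)
        if ready:
            # Highest priority first; the label breaks ties deterministically.
            ins = max(ready, key=lambda n: (priority[n], n))
            ready.remove(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start


if __name__ == "__main__":
    schedule = list_schedule(DELAY, SUCCS, PRIORITY)
    for ins in sorted(schedule, key=schedule.get):
        print(schedule[ins], ins)
```

On this input the sketch issues a, c, e, b, d, g, f, h, i in cycles 1, 2, 3, 4, 5, 6, 7, 9 and 11. Because anti-dependences are not modelled, correctness here relies on the third load having been renamed to a separate register.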

Beyond Basic Blocks

List scheduling dominates, but moving beyond basic blocks can improve quality of the code. Possibilities:

• Schedule extended basic blocks (EBBs)
    • Watch for exit points: they limit reordering or require compensating code
• Trace scheduling
    • Use profiling information to select regions for scheduling, using traces (paths) through the code