compiler optimisation - 6 instruction scheduling
![Page 1: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/1.jpg)
Compiler Optimisation
6 - Instruction Scheduling
Hugh Leather
IF 1.18a
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh
2019
![Page 2: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/2.jpg)
Introduction
This lecture:
- Scheduling to hide latency and exploit ILP
- Dependence graph
- Local list scheduling + priorities
- Forward versus backward scheduling
- Software pipelining of loops
![Page 3: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/3.jpg)
Latency, functional units, and ILP
- Instructions take clock cycles to execute (latency)
- Modern machines issue several operations per cycle
- Cannot use results until ready, can do something else
- Execution time is order-dependent
- Latencies not always constant (cache, early exit, etc)

| Operation | Cycles |
|---|---|
| load, store | 3 |
| load ∉ cache | 100s |
| loadI, add, shift | 1 |
| mult | 2 |
| div | 40 |
| branch | 0 to 8 |
![Page 4: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/4.jpg)
Machine types
- In order: Deep pipelining allows multiple instructions
- Superscalar: Multiple functional units, can issue > 1 instruction
- Out of order: Large window of instructions can be reordered dynamically
- VLIW: Compiler statically allocates to FUs
![Page 12: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/12.jpg)
Effect of scheduling
Superscalar, 1 FU: New op each cycle if operands ready

Simple schedule¹ a := 2*a*b*c

| Cycle | Operations | Operands waiting |
|---|---|---|
| 1 | loadAI r_arp, @a ⇒ r1 | r1 |
| 2 | | r1 |
| 3 | | r1 |
| 4 | add r1, r1 ⇒ r1 | r1 |
| 5 | loadAI r_arp, @b ⇒ r2 | r2 |
| 6 | | r2 |
| 7 | | r2 |
| 8 | mult r1, r2 ⇒ r1 | r1 |
| 9 | loadAI r_arp, @c ⇒ r2 | r1, r2 |
| 10 | | r2 |
| 11 | | r2 |
| 12 | mult r1, r2 ⇒ r1 | r1 |
| 13 | | r1 |
| 14 | storeAI r1 ⇒ r_arp, @a | store to complete |
| 15 | | store to complete |
| 16 | | store to complete |

Done

¹ loads/stores 3 cycles, mults 2, adds 1
![Page 20: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/20.jpg)
Effect of scheduling
Superscalar, 1 FU: New op each cycle if operands ready

Schedule loads early² a := 2*a*b*c

| Cycle | Operations | Operands waiting |
|---|---|---|
| 1 | loadAI r_arp, @a ⇒ r1 | r1 |
| 2 | loadAI r_arp, @b ⇒ r2 | r1, r2 |
| 3 | loadAI r_arp, @c ⇒ r3 | r1, r2, r3 |
| 4 | add r1, r1 ⇒ r1 | r1, r2, r3 |
| 5 | mult r1, r2 ⇒ r1 | r1, r3 |
| 6 | | r1 |
| 7 | mult r1, r3 ⇒ r1 | r1 |
| 8 | | r1 |
| 9 | storeAI r1 ⇒ r_arp, @a | store to complete |
| 10 | | store to complete |
| 11 | | store to complete |

Done
Uses one more register
11 versus 16 cycles: 31% fewer cycles!

² loads/stores 3 cycles, mults 2, adds 1
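The two cycle counts can be reproduced with a small simulation. This is only an illustrative sketch: it assumes a single-issue, in-order machine that stalls until every source register is available, and uses the latencies from the footnote. The function name `run_in_order` and the tuple encoding of operations are invented here and are not part of the lecture.

```python
# Latencies from the slide footnote: loads/stores 3 cycles, mults 2, adds 1.
LATENCY = {"loadAI": 3, "storeAI": 3, "mult": 2, "add": 1}


def run_in_order(ops):
    """Simulate a single-issue, in-order machine (an assumed model).

    Each op is (opcode, source_registers, destination_register_or_None).
    An op issues in the first cycle, at or after the next free issue slot,
    in which all of its source registers are available.
    Returns (issue_cycles, last_busy_cycle).
    """
    ready_at = {}       # register -> first cycle in which it can be read
    next_slot = 1       # earliest cycle in which the next op may issue
    issue_cycles = []
    last_busy = 0
    for opcode, srcs, dst in ops:
        start = max([next_slot] + [ready_at.get(r, 1) for r in srcs])
        done = start + LATENCY[opcode]      # first cycle after completion
        if dst is not None:
            ready_at[dst] = done
        issue_cycles.append(start)
        last_busy = max(last_busy, done - 1)
        next_slot = start + 1
    return issue_cycles, last_busy


simple = [
    ("loadAI", ["r_arp"], "r1"),
    ("add", ["r1", "r1"], "r1"),
    ("loadAI", ["r_arp"], "r2"),
    ("mult", ["r1", "r2"], "r1"),
    ("loadAI", ["r_arp"], "r2"),
    ("mult", ["r1", "r2"], "r1"),
    ("storeAI", ["r1"], None),
]
loads_early = [
    ("loadAI", ["r_arp"], "r1"),
    ("loadAI", ["r_arp"], "r2"),
    ("loadAI", ["r_arp"], "r3"),
    ("add", ["r1", "r1"], "r1"),
    ("mult", ["r1", "r2"], "r1"),
    ("mult", ["r1", "r3"], "r1"),
    ("storeAI", ["r1"], None),
]

print(run_in_order(simple))
print(run_in_order(loads_early))
```

Under these assumptions the issue cycles agree with the two tables, finishing after 16 and 11 cycles respectively; (16 - 11) / 16 is roughly 31%.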
![Page 21: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/21.jpg)
Scheduling problem
- Schedule maps operations to cycles: $\forall a \in \mathit{Ops},\ S(a) \in \mathbb{N}$
- Respect latency: $\forall a, b \in \mathit{Ops},\ a \text{ depends on } b \implies S(a) \ge S(b) + \lambda(b)$
- Respect functional units: no more ops per type per cycle than FUs can handle
- Length of schedule: $L(S) = \max_{a \in \mathit{Ops}} (S(a) + \lambda(a))$
- Schedule $S$ is time-optimal if $\forall S_1,\ L(S) \le L(S_1)$
- Problem: Find a time-optimal schedule³
- Even local scheduling with many restrictions is NP-complete

³ A schedule might also be optimal in terms of registers, power, or space
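The constraints above translate directly into a checker. The following is a minimal sketch, assuming a schedule is a dictionary from operation name to issue cycle; the names `is_valid`, `schedule_length`, and the operation labels are hypothetical.

```python
from collections import Counter


def schedule_length(S, lat):
    """L(S) = max over ops a of S(a) + lambda(a)."""
    return max(S[a] + lat[a] for a in S)


def is_valid(S, deps, lat, fu_type, fu_count):
    """Check the latency and functional-unit constraints for schedule S.

    deps maps each op to the list of ops it depends on.
    """
    for a, needed in deps.items():
        for b in needed:
            if S[a] < S[b] + lat[b]:        # a would issue before b's result
                return False
    per_cycle = Counter((S[a], fu_type[a]) for a in S)
    return all(n <= fu_count[kind] for (_, kind), n in per_cycle.items())


# The "schedule loads early" example, with hypothetical op names.
lat = {"ld_a": 3, "ld_b": 3, "ld_c": 3, "add": 1, "mul1": 2, "mul2": 2, "st": 3}
deps = {"add": ["ld_a"], "mul1": ["add", "ld_b"],
        "mul2": ["mul1", "ld_c"], "st": ["mul2"]}
fu_type = {a: "any" for a in lat}           # a single generic FU
fu_count = {"any": 1}
S = {"ld_a": 1, "ld_b": 2, "ld_c": 3, "add": 4, "mul1": 5, "mul2": 7, "st": 9}

print(is_valid(S, deps, lat, fu_type, fu_count), schedule_length(S, lat))
```

With cycles numbered from 1, the length computed here is 12: by the definition of L(S) it is the first cycle after the store completes, one more than the 11 busy cycles counted in the earlier table.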
![Page 22: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/22.jpg)
List scheduling
Local greedy heuristic to produce schedules for single basic blocks
1. Rename to avoid anti-dependences
2. Build dependency graph
3. Prioritise operations
4. For each cycle
   1. Choose the highest priority ready operation & schedule it
   2. Update ready queue
![Page 25: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/25.jpg)
List scheduling
Dependence/Precedence graph
- Schedule operation only when operands ready
- Build dependency graph of read-after-write (RAW) deps
  - Label with latency and FU requirements
- Anti-dependences (WAR) restrict movement; renaming removes them

Example: a = 2*a*b*c
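As a rough illustration of how the RAW and WAR edges for this example could be found, the sketch below scans a basic block once and records the last definition and current readers of each register. The tuple encoding and the name `build_deps` are assumptions made here; output (WAW) dependences and memory dependences are deliberately ignored.

```python
def build_deps(ops):
    """Return (raw, war) edge sets for a basic block.

    ops is a list of (opcode, source_registers, destination_or_None).
    Each edge is a pair (earlier_index, later_index).
    """
    raw, war = set(), set()
    last_def = {}       # register -> index of its most recent definition
    readers = {}        # register -> ops that read the current value
    for i, (_, srcs, dst) in enumerate(ops):
        for r in srcs:
            if r in last_def:
                raw.add((last_def[r], i))       # read after write
            readers.setdefault(r, []).append(i)
        if dst is not None:
            for j in readers.get(dst, []):
                if j != i:
                    war.add((j, i))             # write after read
            last_def[dst] = i
            readers[dst] = []
    return raw, war


# a = 2*a*b*c with r2 reused for both b and c.
reuse_r2 = [
    ("loadAI", [], "r1"), ("add", ["r1", "r1"], "r1"),
    ("loadAI", [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("loadAI", [], "r2"), ("mult", ["r1", "r2"], "r1"),
    ("storeAI", ["r1"], None),
]
# The same block after renaming the second load's target to r3.
renamed = list(reuse_r2)
renamed[4] = ("loadAI", [], "r3")
renamed[5] = ("mult", ["r1", "r3"], "r1")

print(build_deps(reuse_r2))
print(build_deps(renamed))
```

With r2 reused, the second load of r2 (index 4) has a WAR edge from the first mult (index 3), so it cannot move above it. After renaming to r3 that edge disappears and the load is free to move to the top of the block.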
![Page 26: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/26.jpg)
List scheduling
List scheduling algorithm

    Cycle ← 1
    Ready ← leaves of D
    Active ← ∅
    while (Ready ∪ Active ≠ ∅)
        ∀ a ∈ Active where S(a) + λ(a) ≤ Cycle
            Active ← Active − a
            ∀ b ∈ succs(a) where isready(b)
                Ready ← Ready ∪ b
        if ∃ a ∈ Ready and ∀ b ∈ Ready, a.priority ≥ b.priority
            Ready ← Ready − a
            S(a) ← Cycle
            Active ← Active ∪ a
        Cycle ← Cycle + 1
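The pseudocode above can be sketched in Python as follows. This is one possible rendering, not the lecture's implementation: it assumes one operation is issued per cycle, takes priorities as a precomputed dictionary, and breaks ties by original program order. The name `list_schedule` and the operation labels are hypothetical.

```python
def list_schedule(ops, preds, latency, priority):
    """Forward list scheduling of one basic block, one op issued per cycle.

    ops: op names in original order; preds: op -> ops it depends on (RAW).
    Returns S, mapping each op to its issue cycle (cycles start at 1).
    """
    succs = {a: [] for a in ops}
    for a in ops:
        for b in preds.get(a, []):
            succs[b].append(a)

    S, finished = {}, set()
    ready = {a for a in ops if not preds.get(a)}    # leaves of the graph
    active = set()
    cycle = 1
    while ready or active:
        # Retire ops whose results are available in this cycle.
        for a in [a for a in active if S[a] + latency[a] <= cycle]:
            active.remove(a)
            finished.add(a)
            for b in succs[a]:
                if b not in S and all(p in finished for p in preds[b]):
                    ready.add(b)
        # Issue the highest-priority ready op; ties go to original order.
        if ready:
            a = max(ready, key=lambda x: (priority[x], -ops.index(x)))
            ready.remove(a)
            S[a] = cycle
            active.add(a)
        cycle += 1
    return S


# a = 2*a*b*c after renaming; priorities are critical-path lengths.
ops = ["ld_a", "add", "ld_b", "mul1", "ld_c", "mul2", "st"]
preds = {"add": ["ld_a"], "mul1": ["add", "ld_b"],
         "mul2": ["mul1", "ld_c"], "st": ["mul2"]}
latency = {"ld_a": 3, "ld_b": 3, "ld_c": 3, "add": 1, "mul1": 2, "mul2": 2, "st": 3}
priority = {"ld_a": 11, "ld_b": 10, "ld_c": 8, "add": 8, "mul1": 7, "mul2": 5, "st": 3}

print(list_schedule(ops, preds, latency, priority))
```

Starting from the original (simple) instruction order, the sketch hoists the three loads to cycles 1 to 3 and issues the store in cycle 9, which is the "schedule loads early" order.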
![Page 27: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/27.jpg)
List scheduling
Priorities
- Many different priorities used
  - Quality of schedules depends on good choice
- The longest latency path or critical path is a good priority
- Tie breakers
  - Last use of a value: decreases demand for registers as it moves the use nearer the def
  - Number of descendants: encourages scheduler to pursue multiple paths
  - Longer latency first: others can fit in its shadow
  - Random
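A critical-path priority can be computed as the latency-weighted longest path from each operation to the end of the block. The sketch below assumes the dependence graph is given as a predecessor map; the name `critical_path_priority` and the operation labels are illustrative.

```python
def critical_path_priority(ops, preds, latency):
    """Priority of an op = its latency plus the longest path through its users."""
    succs = {a: [] for a in ops}
    for a in ops:
        for b in preds.get(a, []):
            succs[b].append(a)

    priority = {}

    def longest(a):
        # Memoised longest path from a to any op with no successors.
        if a not in priority:
            priority[a] = latency[a] + max((longest(b) for b in succs[a]), default=0)
        return priority[a]

    for a in ops:
        longest(a)
    return priority


# a = 2*a*b*c after renaming: loads/stores 3 cycles, mults 2, adds 1.
ops = ["ld_a", "add", "ld_b", "mul1", "ld_c", "mul2", "st"]
preds = {"add": ["ld_a"], "mul1": ["add", "ld_b"],
         "mul2": ["mul1", "ld_c"], "st": ["mul2"]}
latency = {"ld_a": 3, "ld_b": 3, "ld_c": 3, "add": 1, "mul1": 2, "mul2": 2, "st": 3}

print(critical_path_priority(ops, preds, latency))
```

For this block the load of a gets the highest priority (11), because it heads the longest chain: load, add, two mults, store.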
![Page 28: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/28.jpg)
List scheduling
Example: Schedule with priority by critical path length
![Page 40: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/40.jpg)
List scheduling
Forward vs backward
- Can schedule from root to leaves (backward)
- May change schedule time
- List scheduling cheap, so try both, choose best
![Page 41: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/41.jpg)
List scheduling
Forward vs backward

| Opcode | loadI | lshift | add | addI | cmp | store |
|---|---|---|---|---|---|---|
| Latency | 1 | 1 | 2 | 1 | 1 | 4 |
![Page 42: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/42.jpg)
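List scheduling
Forward vs backward

Forwards

| Cycle | Int | Int | Stores |
|---|---|---|---|
| 1 | loadI1 | lshift | |
| 2 | loadI2 | loadI3 | |
| 3 | loadI4 | add1 | |
| 4 | add2 | add3 | |
| 5 | add4 | addI | store1 |
| 6 | cmp | | store2 |
| 7 | | | store3 |
| 8 | | | store4 |
| 9 | | | store5 |
| 10 | | | |
| 11 | | | |
| 12 | | | |
| 13 | cbr | | |

Backwards

| Cycle | Int | Int | Stores |
|---|---|---|---|
| 1 | loadI4 | | |
| 2 | addI | lshift | |
| 3 | add4 | loadI3 | |
| 4 | add3 | loadI2 | store5 |
| 5 | add2 | loadI1 | store4 |
| 6 | add1 | | store3 |
| 7 | | | store2 |
| 8 | | | store1 |
| 9 | | | |
| 10 | | | |
| 11 | cmp | | |
| 12 | cbr | | |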
![Page 43: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/43.jpg)
Scheduling Larger Regions
- Schedule extended basic blocks (EBBs)
- Superblock cloning
- Schedule traces
- Software pipelining
![Page 44: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/44.jpg)
Scheduling Larger Regions
Extended basic blocks
Extended basic block
- EBB is maximal set of blocks such that
  - Set has a single entry, $B_i$
  - Each block $B_j$ other than $B_i$ has exactly one predecessor
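The definition can be turned into a short procedure: every block that is the entry or has a predecessor count other than one starts an EBB, and the EBB then absorbs all reachable blocks with exactly one predecessor. The sketch below uses a hypothetical six-block CFG (the slide's figure is not reproduced in this transcript), and `find_ebbs` is a name chosen here.

```python
def find_ebbs(succs, entry):
    """Return {root_block: [blocks in that EBB]} for a CFG given as successor lists."""
    preds = {b: [] for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)

    # A block starts an EBB if it is the entry or does not have exactly one predecessor.
    roots = [b for b in succs if b == entry or len(preds[b]) != 1]
    ebbs = {}
    for root in roots:
        members, work = [root], [root]
        while work:
            b = work.pop()
            for t in succs[b]:
                if t != entry and len(preds[t]) == 1 and t not in members:
                    members.append(t)
                    work.append(t)
        ebbs[root] = members
    return ebbs


# Hypothetical CFG: B1 branches to B2 and B3; B4 and B6 are join points.
cfg = {
    "B1": ["B2", "B3"],
    "B2": ["B4"],
    "B3": ["B4", "B5"],
    "B4": ["B6"],
    "B5": ["B6"],
    "B6": [],
}

print(find_ebbs(cfg, "B1"))
```

In this assumed CFG the join points B4 and B6 each form their own EBB, while B1, B2, B3 and B5 form one EBB containing the paths B1, B2 and B1, B3, B5, both of which share B1.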
![Page 46: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/46.jpg)
Scheduling Larger Regions
Extended basic blocks
- Schedule entire paths through EBBs
- Example has four EBB paths
- Having B1 in both causes conflicts
  - Moving an op out of B1 causes problems
    - Must insert compensation code
  - Moving an op into B1 causes problems
![Page 53: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/53.jpg)
Scheduling Larger Regions
Superblock cloning
- Join points create context problems
- Clone blocks to create more context
- Merge any simple control flow
- Schedule EBBs
![Page 57: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/57.jpg)
Scheduling Larger Regions
Trace scheduling
- Edge frequency from profile (not block frequency)
- Pick "hot" path
- Schedule with compensation code
- Remove from CFG
- Repeat
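The "pick hot path, remove, repeat" loop can be sketched as a greedy selection over profiled edge counts. This is a simplified illustration under stated assumptions: the CFG is acyclic, the edge counts are made up, scheduling and compensation code are not modelled, and `pick_traces` is a hypothetical name. Each trace is seeded with the hottest remaining edge and grown forward and backward along the hottest edges to blocks not yet placed.

```python
def pick_traces(edge_freq):
    """Greedily partition blocks into traces using profiled edge frequencies.

    edge_freq maps (source_block, target_block) to an execution count.
    """
    blocks = sorted({b for edge in edge_freq for b in edge})
    placed, traces = set(), []

    def hottest(edges):
        # Highest count wins; ties are broken by edge name for determinism.
        return max(edges, key=lambda e: (edge_freq[e], e), default=None)

    while len(placed) < len(blocks):
        free = [e for e in edge_freq
                if e[0] not in placed and e[1] not in placed and e[0] != e[1]]
        seed = hottest(free)
        if seed is None:
            # No edge joins two unplaced blocks: the rest are singleton traces.
            traces.extend([b] for b in blocks if b not in placed)
            break
        trace = list(seed)
        placed.update(seed)
        while True:     # grow forward from the end of the trace
            e = hottest([e for e in edge_freq
                         if e[0] == trace[-1] and e[1] not in placed])
            if e is None:
                break
            trace.append(e[1])
            placed.add(e[1])
        while True:     # grow backward from the start of the trace
            e = hottest([e for e in edge_freq
                         if e[1] == trace[0] and e[0] not in placed])
            if e is None:
                break
            trace.insert(0, e[0])
            placed.add(e[0])
        traces.append(trace)
    return traces


# Made-up profile: the B1, B2, B4, B5 path is taken 90% of the time.
profile = {("B1", "B2"): 90, ("B1", "B3"): 10,
           ("B2", "B4"): 90, ("B3", "B4"): 10, ("B4", "B5"): 100}

print(pick_traces(profile))
```

For this profile the first trace is B1, B2, B4, B5 and the cold block B3 is left as a trace of its own.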
![Page 61: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/61.jpg)
Loop scheduling
- Loop structures can dominate execution time
- Specialist technique: software pipelining
- Allows application of list scheduling to loops
- Why not loop unrolling?
  - Allows loop effect to become arbitrarily small, but
  - Code growth, cache pressure, register pressure
![Page 63: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/63.jpg)
Software pipelining
Consider simple loop to sum array
![Page 64: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/64.jpg)
Software pipelining
Schedule on 1 FU - 5 cycles
load 3 cycles, add 1 cycle, branch 1 cycle
![Page 65: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/65.jpg)
Software pipelining
Schedule on VLIW 3 FUs - 4 cycles
![Page 66: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/66.jpg)
Software pipelining
A better steady state schedule exists
![Page 67: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/67.jpg)
Software pipelining
Requires prologue and epilogue (may schedule others in epilogue)
![Page 68: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/68.jpg)
Software pipelining
Respect dependences and latency, including loop-carried dependences
![Page 69: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/69.jpg)
Software pipelining
Complete code
![Page 70: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/70.jpg)
Software pipelining
Some definitions
- Initiation interval (ii): Number of cycles between initiating loop iterations
  - Original loop had ii of 5 cycles
  - Final loop had ii of 2 cycles
- Recurrence: Loop-based computation whose value is used in a later loop iteration
  - Might be several iterations later
  - Has dependency chain(s) on itself
  - Recurrence latency is latency of dependency chain
![Page 71: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/71.jpg)
Software pipelining
Algorithm
- Choose an initiation interval, ii
  - Compute lower bounds on ii
  - Shorter ii means faster overall execution
- Generate a loop body that takes ii cycles
  - Try to schedule into ii cycles, using modulo scheduler
  - If it fails, increase ii by one and try again
- Generate the needed prologue and epilogue code
  - For prologue, work backward from upward exposed uses in the scheduled loop body
  - For epilogue, work forward from downward exposed definitions in the scheduled loop body
![Page 72: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/72.jpg)
Software pipelining
Initial initiation interval (ii)
Starting value for ii based on minimum resource and recurrence constraints

Resource constraint
- ii must be large enough to issue every operation
- Let $N_u$ = number of FUs of type $u$
- Let $I_u$ = number of operations of type $u$
- $\lceil I_u / N_u \rceil$ is lower bound on ii for type $u$
- $\max_u(\lceil I_u / N_u \rceil)$ is lower bound on ii
![Page 73: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/73.jpg)
Software pipelining
Initial initiation interval (ii)
Starting value for ii based on minimum resource and recurrence constraints

Recurrence constraint
- ii cannot be smaller than longest recurrence latency
- Recurrence $r$ is over $k_r$ iterations with latency $\lambda_r$
- $\lceil \lambda_r / k_r \rceil$ is lower bound on ii for recurrence $r$
- $\max_r(\lceil \lambda_r / k_r \rceil)$ is lower bound on ii
![Page 74: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/74.jpg)
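Software pipelining
Initial initiation interval (ii)
Starting value for ii based on minimum resource and recurrence constraints

Start value = $\max(\max_u(\lceil I_u / N_u \rceil),\ \max_r(\lceil \lambda_r / k_r \rceil))$

For simple loop

    a = A[i]
    b = b + a
    i = i + 1
    if i < n goto
    end

Resource constraint

| | Memory | Integer | Branch |
|---|---|---|---|
| $I_u$ | 1 | 2 | 1 |
| $N_u$ | 1 | 1 | 1 |
| $\lceil I_u / N_u \rceil$ | 1 | 2 | 1 |

Recurrence constraint

| | b | i |
|---|---|---|
| $k_r$ | 1 | 1 |
| $\lambda_r$ | 2 | 1 |
| $\lceil \lambda_r / k_r \rceil$ | 2 | 1 |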
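The start value for the simple loop can be checked with a few lines of arithmetic. The sketch below copies the table values as given on the slide; the function and variable names are chosen here for illustration.

```python
from math import ceil


def resource_bound(op_count, fu_count):
    """max over FU types u of ceil(I_u / N_u)."""
    return max(ceil(op_count[u] / fu_count[u]) for u in op_count)


def recurrence_bound(recurrences):
    """max over recurrences r of ceil(lambda_r / k_r); 0 if there are none."""
    return max((ceil(lam / k) for lam, k in recurrences.values()), default=0)


def start_ii(op_count, fu_count, recurrences):
    """Larger of the resource and recurrence lower bounds."""
    return max(resource_bound(op_count, fu_count), recurrence_bound(recurrences))


# Values copied from the tables for the simple loop.
op_count = {"memory": 1, "integer": 2, "branch": 1}     # I_u
fu_count = {"memory": 1, "integer": 1, "branch": 1}     # N_u
recurrences = {"b": (2, 1), "i": (1, 1)}                # r -> (lambda_r, k_r)

print(resource_bound(op_count, fu_count),
      recurrence_bound(recurrences),
      start_ii(op_count, fu_count, recurrences))
```

Both bounds come out as 2 for this loop, so the search for a schedule starts at ii = 2.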
![Page 75: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/75.jpg)
Software pipelining
Modulo scheduling
Schedule with cycle modulo initiation interval
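One way to picture "cycle modulo initiation interval" is a reservation table with ii rows: an operation placed at cycle c occupies row c mod ii of its functional unit in every iteration. The sketch below is a deliberately simplified modulo scheduler, not the lecture's algorithm: it places operations greedily without backtracking, checks loop-carried dependences only after placement, ignores anti-dependences (renaming is assumed), and starts the search at ii = 1 purely to show a failed attempt. All names, the operation labels, and the FU assignment are assumptions.

```python
def modulo_schedule(ops, preds, latency, fu_type, fu_count, ii):
    """Try to place every op of one iteration using a modulo reservation table.

    ops must be listed in an order that respects preds (same-iteration deps).
    Returns {op: cycle} with 0-based cycles, or None if some op cannot be
    placed within ii consecutive candidate cycles.
    """
    table = {}                      # (cycle mod ii, FU type) -> ops placed
    S = {}
    for a in ops:
        earliest = max((S[b] + latency[b] for b in preds.get(a, [])), default=0)
        for c in range(earliest, earliest + ii):
            slot = (c % ii, fu_type[a])
            if table.get(slot, 0) < fu_count[fu_type[a]]:
                table[slot] = table.get(slot, 0) + 1
                S[a] = c
                break
        else:
            return None             # every row for this FU type is taken
    return S


def software_pipeline(ops, preds, carried, latency, fu_type, fu_count,
                      start_ii, max_ii=32):
    """Increase ii from start_ii until a schedule satisfies all constraints.

    carried holds loop-carried deps as (producer, consumer, distance).
    """
    for ii in range(start_ii, max_ii + 1):
        S = modulo_schedule(ops, preds, latency, fu_type, fu_count, ii)
        if S is not None and all(S[c] + d * ii >= S[p] + latency[p]
                                 for p, c, d in carried):
            return ii, S
    return None


# The simple array-sum loop: load 3 cycles, add 1 cycle, branch 1 cycle.
ops = ["load", "add_i", "add_b", "branch"]
preds = {"add_b": ["load"], "branch": ["add_i"]}
carried = [("add_i", "load", 1), ("add_i", "add_i", 1), ("add_b", "add_b", 1)]
latency = {"load": 3, "add_i": 1, "add_b": 1, "branch": 1}
fu_type = {"load": "mem", "add_i": "int", "add_b": "int", "branch": "br"}
fu_count = {"mem": 1, "int": 1, "br": 1}

print(software_pipeline(ops, preds, carried, latency, fu_type, fu_count, start_ii=1))
```

With ii = 1 the two integer adds compete for the single integer row and placement fails; with ii = 2 they land in different rows, giving an initiation interval of 2 cycles for this loop.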
![Page 80: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/80.jpg)
Software pipelining
Current research
- Much research in different software pipelining techniques
- Difficult when there is general control flow in the loop
- Predication in IA64 for example really helps here
- Some recent work in exhaustive scheduling, i.e. solve the NP-complete problem for basic blocks
![Page 81: Compiler Optimisation - 6 Instruction Scheduling](https://reader030.vdocuments.net/reader030/viewer/2022012620/61a0c429aaa3f82fdf154ce3/html5/thumbnails/81.jpg)
Summary
- Scheduling to hide latency and exploit ILP
- Dependence graph: dependences between instructions + latency
- Local list scheduling + priorities
- Forward versus backward scheduling
- Scheduling EBBs, superblock cloning, trace scheduling
- Software pipelining of loops