Optimizing Compilers, CISC 673, Spring 2011: Global Instruction Scheduling
University of Delaware • Computer & Information Sciences Department
Optimizing Compilers
CISC 673
Spring 2011
Global Instruction Scheduling
John Cavazos (Ben Perry)
University of Delaware
Overview
Introduction
Pipelining
Instruction Pipeline
Pipeline Execution
Constraints and Dependences
Current Processors
Can execute several operations in a single cycle
“How fast can a program run on a processor with instruction-level parallelism?” The answer depends on:
  Potential parallelism in the program
  Available parallelism on the processor
  Ability to parallelize a sequential program
  Ability to find the best schedule given the constraints
Best targets
Programs whose operations are all dependent on one another are poor targets: no amount of scheduling can run them in parallel.
  Focus on constraints instead of scheduling
Numeric applications with large aggregate data structures are good targets.
Pipelines
Instruction pipelines are found in practically every processor
Instructions go through multiple steps in the pipeline, from fetch to write result
  Fetch, decode, execute, access memory, write result
Pipelined processors: a new instruction can be fetched while the current instruction is still being processed.
Each step in the pipeline takes one clock cycle
Example pipeline
Cycle  i         i+1       i+2       i+3       i+4
1      Fetch
2      Identify  Fetch
3      Execute   Identify  Fetch
4      Read      Execute   Identify  Fetch
5      Write     Read      Execute   Identify  Fetch
6                Write     Read      Execute   Identify
7                          Write     Read      Execute
8                                    Write     Read
9                                              Write
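The staggered pattern in the example pipeline can be generated mechanically: instruction i+k enters the pipeline k cycles after instruction i and then advances one stage per cycle. The sketch below is only an illustration of that rule; the function name `pipeline_table` is hypothetical, and the stage names are copied from the slide.

```python
# Stage names as they appear on the slide (five-stage pipeline).
STAGES = ["Fetch", "Identify", "Execute", "Read", "Write"]

def pipeline_table(num_instructions, stages=STAGES):
    """Return rows[cycle][instr]: the stage occupied, or "" if not in the pipeline."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for instr in range(num_instructions):
            # Each instruction starts one cycle after its predecessor.
            stage_index = cycle - instr
            in_pipeline = 0 <= stage_index < len(stages)
            row.append(stages[stage_index] if in_pipeline else "")
        rows.append(row)
    return rows

if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_table(5), start=1):
        print(cycle, " ".join(f"{stage:<9}" for stage in row))
```

With five instructions and five stages the table has 5 + 5 - 1 = 9 rows, which matches the nine cycles shown on the slide.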
Pipelines – Speculative Computing
Fetch the next instruction even if a branch may skip over it (speculation)
When a branch is taken, the pipeline is emptied and the branch target must be fetched, which causes a delay
Hardware can predict which way a branch will go, but the prediction may be wrong
Pipeline Execution
Execution of an instruction is pipelined if succeeding instructions that do not depend on its result are allowed to proceed.
Hardware can often detect dependences (superscalar machines) and pause execution if an operand isn’t available
Pipeline Execution
Some processors (perhaps the one in an Android phone) leave the grouping of parallel instructions to the compiler.
The compiler builds very long instruction words (VLIW), each of which names a batch of operations to execute in parallel.
Advanced hardware schedulers can execute instructions out of order, but because of hardware limitations scheduling is often best done in software
Code-scheduling Constraints
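Control dependence: all operations executed in the original program must be executed in the optimized one
Data dependence: the optimized program must produce the same results as the original
Resource: the schedule must not oversubscribe the resources of the machine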
Data dependence
X = 5; Y = 6
Obviously, we can reorder these operations.
X = 5; Y = X
Obviously, we cannot reorder these.
Data dependence
RAW: read after write. True dependence.
  If a write is followed by a read of the same location, the read depends on the value written.
WAR: write after read. Anti-dependence.
  If the two are reordered so that the write happens before the read, the read will get the wrong value.
Dependence
WAW: write after write. Output dependence.
  If two writes to the same location are reordered, the location ends up holding the wrong value.
WAR and WAW can be eliminated by using different locations to store different values.
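The three kinds of dependence can be read directly off the read and write sets of two instructions. The following is a minimal sketch, assuming each instruction is summarized by the set of locations it reads and the set it writes; the `Instr` class and both function names are hypothetical and not part of the course material.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset
    writes: frozenset

def dependences(first, second):
    """Kinds of data dependence of `second` on `first`.

    `first` is the instruction that comes earlier in the original order.
    """
    kinds = []
    if first.writes & second.reads:
        kinds.append("RAW")   # true dependence
    if first.reads & second.writes:
        kinds.append("WAR")   # anti-dependence
    if first.writes & second.writes:
        kinds.append("WAW")   # output dependence
    return kinds

def can_reorder(first, second):
    """Two instructions may be swapped only if no dependence links them."""
    return not dependences(first, second)

x_is_5 = Instr("X = 5", frozenset(), frozenset({"X"}))
y_is_6 = Instr("Y = 6", frozenset(), frozenset({"Y"}))
y_is_x = Instr("Y = X", frozenset({"X"}), frozenset({"Y"}))
```

Here `can_reorder(x_is_5, y_is_6)` holds, while `dependences(x_is_5, y_is_x)` reports a RAW dependence, matching the two examples on the earlier data-dependence slide.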
Finding dependences
Compiler: GUILTY until proven innocent! (Always assume that two operations may refer to the same location unless it can be proven otherwise.)
Pointers p and (p + 10) cannot possibly refer to the same location
Array data-dependence analysis: for i = 0 to n: a[2i] = a[2i + 1]. No dependence between the array accesses in this loop, because the writes touch only even elements and the reads only odd ones.
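One standard way to prove independence for affine array subscripts is the GCD test; the slide does not say which test the course uses, so this is only a sketch of one option. For a write to a[c1*i + k1] and a read of a[c2*j + k2], the two can touch the same element only if c1*i - c2*j = k2 - k1 has an integer solution, which requires gcd(c1, c2) to divide k2 - k1. The function name is hypothetical, and loop bounds are ignored, so a True result means only "may depend".

```python
from math import gcd

def gcd_test_may_depend(c1, k1, c2, k2):
    """Conservative test for a[c1*i + k1] versus a[c2*j + k2].

    Returns False only when the accesses provably never overlap.
    Loop bounds are ignored, so True means "cannot rule it out".
    """
    g = gcd(c1, c2)
    if g == 0:
        # Both subscripts are constants: they overlap only if equal.
        return k1 == k2
    return (k2 - k1) % g == 0

# The loop from the slide: a[2i] = a[2i + 1]
slide_loop_may_depend = gcd_test_may_depend(2, 0, 2, 1)
```

For the slide's loop, gcd(2, 2) = 2 does not divide 1, so the test reports no dependence and the iterations can be reordered freely.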
Finding dependences
Pointer alias analysis
  Two pointers are aliased if they refer to the same object. Difficult problem.
Interprocedural analysis
  Needed when parameters are passed by reference or when globals are passed, because the same variable may then be reachable under two names
Register allocation
LD temporary_register1, a
ST b, temporary_register1
LD temporary_register2, c
ST d, temporary_register2
Each load/store pair contains a RAW dependence, but the two pairs are independent of each other and can be reordered.
If temporary_register1 and temporary_register2 get mapped to the same physical register, we create another dependence
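The effect of register allocation on this example can be checked with a small independence test: two groups of instructions can be interleaved freely only if neither group writes a location that the other reads or writes. This is a sketch under the assumption that a, b, c and d are four distinct memory locations; the function name and the physical register names `r1` and `r2` are hypothetical.

```python
def pairs_independent(reg_of):
    """True if the two load/store pairs touch disjoint locations.

    reg_of maps each temporary to a physical register name.
    Assumes a, b, c and d are distinct memory locations.
    """
    t1 = reg_of["temporary_register1"]
    t2 = reg_of["temporary_register2"]
    # LD t1, a ; ST b, t1
    reads1, writes1 = {"a", t1}, {t1, "b"}
    # LD t2, c ; ST d, t2
    reads2, writes2 = {"c", t2}, {t2, "d"}
    # Neither pair may write anything the other pair reads or writes.
    no_conflict_1 = not (writes1 & (reads2 | writes2))
    no_conflict_2 = not (writes2 & (reads1 | writes1))
    return no_conflict_1 and no_conflict_2

separate = {"temporary_register1": "r1", "temporary_register2": "r2"}
shared = {"temporary_register1": "r1", "temporary_register2": "r1"}
```

With `separate` the pairs are independent; with `shared` they are not, because both pairs now write and read `r1`. This is why scheduling before register allocation leaves the scheduler more freedom.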
Control dependence
Once a basic block is entered, all of its operations are guaranteed to execute. But basic blocks are small, and their operations are often highly related.
Optimizing across basic blocks is therefore crucial.
Control dependence
An instruction i1 is control dependent on instruction i2 if the outcome of i2 determines whether i1 is to be executed
Speculatively execute across different basic-blocks
Speculative computing
Prefetching
  Bring data from memory to the cache before it is needed
Poison bits
  Don’t throw exceptions when speculatively computing. Instead, set a poison bit on the result register. If the poisoned register is really used, then throw the exception.
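The poison-bit idea can be mimicked in software to see how the exception is deferred. This is a toy model only, not a description of any real instruction set: `Register`, `speculative_divide`, `use` and `PoisonedValueUsed` are all hypothetical names, and division by zero stands in for whatever fault the speculative operation might raise.

```python
from dataclasses import dataclass

class PoisonedValueUsed(Exception):
    """Raised when a poisoned register is consumed non-speculatively."""

@dataclass
class Register:
    value: float = 0.0
    poisoned: bool = False

def speculative_divide(dest, numerator, denominator):
    """Speculative operation: on a fault, poison dest instead of raising."""
    try:
        dest.value = numerator / denominator
        dest.poisoned = False
    except ZeroDivisionError:
        dest.poisoned = True

def use(reg):
    """Non-speculative use: the deferred exception surfaces only here."""
    if reg.poisoned:
        raise PoisonedValueUsed("result of a faulting speculative operation")
    return reg.value
```

If the branch that needed the result is never taken, `use` is never called and the fault is silently discarded, which is exactly the behavior speculation needs.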
Speculative computing
Predicated execution
Change
  if (a == 0) b = c
to
  st r4, r3
  movif r2, r4, r1
The processor supports a conditional store, enabling basic blocks to be combined
Basic Block List Scheduling
NP-complete, but don’t give up. Basic blocks are typically small.
Start with the data-dependence graph
  Nodes are instructions, annotated with the resources they use
  Edges are data dependences, labeled with the delay the destination has to wait after the source (some instructions may take 10 cycles, others only 1).
List Scheduling
A data-dependence graph cannot have cycles
Build a topological ordering of the nodes
  Several such orderings may exist, though some are better than others
Choose an ordering of the nodes such that every node comes after all of the nodes it depends on.
List Scheduling
RT = an empty reservation table
Foreach n in SortedNodes:
  - Find the earliest time the instruction could begin
  - Delay the instruction until resources are available
  - Schedule the node after all delays
  - Claim resources
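The loop above can be sketched in Python as follows. This is a simplified illustration, not the course's implementation: it assumes every instruction occupies exactly one unit of one resource during its issue cycle only, and the names `list_schedule`, `preds`, `uses` and `capacity`, as well as the tiny example graph, are made up for the sketch.

```python
from collections import defaultdict

def list_schedule(sorted_nodes, preds, uses, capacity):
    """Greedy list scheduling for one basic block.

    sorted_nodes: instructions in a topological order of the dependence graph
    preds[n]:     list of (predecessor, delay) edges into n
    uses[n]:      the resource n occupies in its issue cycle
    capacity[r]:  units of resource r available per cycle
    Returns {node: start_cycle}.
    """
    reservation = defaultdict(int)  # (cycle, resource) -> units claimed
    start = {}
    for n in sorted_nodes:
        # Earliest time allowed by the data dependences.
        t = max((start[p] + delay for p, delay in preds.get(n, [])), default=0)
        # Delay the instruction until its resource is free.
        while reservation[(t, uses[n])] >= capacity[uses[n]]:
            t += 1
        start[n] = t
        reservation[(t, uses[n])] += 1  # claim the resource
    return start

# Hypothetical block: e = d + d plus an unrelated load, one memory unit, one ALU.
example = list_schedule(
    ["load_d", "load_b", "add", "store_e"],
    {"add": [("load_d", 2)], "store_e": [("add", 1)]},
    {"load_d": "mem", "load_b": "mem", "add": "alu", "store_e": "mem"},
    {"mem": 1, "alu": 1},
)
```

In the example the second load is pushed to cycle 1 because the single memory unit is busy in cycle 0, and the add waits until cycle 2 for the two-cycle load it depends on.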
List Scheduling – better topologies
Without resource constraints, the shortest schedule is given by the longest (critical) path through the data-dependence graph.
Available resources also constrain the schedule; the critical resource is the one with the largest ratio of uses to the number of units of that resource available.
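Both quantities are easy to compute from the dependence graph. The sketch below assumes each node has a positive latency and uses one resource; `heights`, `critical_resource` and the small example graph are hypothetical names chosen for illustration. Sorting nodes by decreasing height is one common way to pick a better topological order, since a node always has a greater height than its successors.

```python
def heights(nodes_in_topological_order, succs, latency):
    """height[n]: length of the longest path from n to the end of the block."""
    height = {}
    for n in reversed(nodes_in_topological_order):
        longest_below = max((height[s] for s in succs.get(n, [])), default=0)
        height[n] = latency[n] + longest_below
    return height

def critical_resource(uses, capacity):
    """Resource with the largest ratio of uses to available units."""
    counts = {}
    for resource in uses.values():
        counts[resource] = counts.get(resource, 0) + 1
    return max(counts, key=lambda r: counts[r] / capacity[r])

nodes = ["load_d", "load_b", "add", "store_e"]
succs = {"load_d": ["add"], "add": ["store_e"]}
latency = {"load_d": 2, "load_b": 2, "add": 1, "store_e": 1}
uses = {"load_d": "mem", "load_b": "mem", "add": "alu", "store_e": "mem"}

h = heights(nodes, succs, latency)
priority_order = sorted(nodes, key=lambda n: -h[n])
```

Here `load_d` heads the critical path of length 4, so it is scheduled first, and the memory unit is the critical resource because three of the four instructions compete for it.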
Global Code Scheduling
Optimize use of resources across blocks.
Global code scheduling: moving instructions from one basic block to another
Must respect data AND control dependences.
  All instructions executed in the original program must still be executed
  Speculatively executed instructions must not have unwanted side effects.
Global Code Scheduling example
if (!a) { c = b; }
e = d + d
What are the data dependences?
What are the control dependences?
What can intuitively be run in parallel?
Global Code Scheduling Example
if (!a) { c = b; }
e = d + d
Loads take two clock ticks and always hit.
R1 = a, R2 = b, …,
The processor can execute two instructions per clock tick
if (!a) { c = b; }
e = d + d
Block 1              Block 2              Block 3
load r6, r1   idle   load r7, r2   idle   load r8, r4    idle
noop          idle   noop          idle   noop           idle
jumpz r6, b3  idle   store r3, r7  idle   add r8,r8,r8   idle
                                          st r5, r8      idle
if (!a) { c = b; }
e = d + d
Before global scheduling:
Block 1              Block 2              Block 3
load r6, r1   idle   load r7, r2   idle   load r8, r4    idle
noop          idle   noop          idle   noop           idle
jumpz r6, b3  idle   store r3, r7  idle   add r8,r8,r8   idle
                                          st r5, r8      idle
After global scheduling:
Block 1                       Block 2            Block 3
load r6, r1    load r8, r4    st r5, r8  idle    st r5, r8  st r3, r7
load r7, r2    idle
add r8,r8,r8   jumpz r6, b3
Code movement
Definitions:
  Dominates: A dominates B if every path from the entry to B passes through A.
  Post-dominates: B post-dominates A if every path from A to the exit passes through B.
  Downward: move an operation down along a control-flow path
  Upward: move an operation up along a control-flow path
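Dominator sets can be computed with the classic iterative data-flow formulation, and post-dominators fall out by running the same computation on the reversed graph. The sketch below assumes a control-flow graph given as a dictionary from block name to successor list with a single entry and a single exit; the function names and the three-block example (shaped like the if-statement used earlier in the slides) are illustrative only.

```python
def dominators(cfg, entry):
    """cfg: {block: [successors]}. Returns {block: set of its dominators}."""
    blocks = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = {b: set() for b in blocks}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            if not preds[b]:
                continue  # unreachable block: leave as is
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def post_dominators(cfg, exit_block):
    """Post-dominators are dominators of the reversed graph."""
    reverse = {}
    for b, succs in cfg.items():
        reverse.setdefault(b, [])
        for s in succs:
            reverse.setdefault(s, []).append(b)
    return dominators(reverse, exit_block)

cfg = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": []}
dom = dominators(cfg, "B1")
pdom = post_dominators(cfg, "B3")
```

In this graph B1 dominates B3 and B3 post-dominates B1, so the two blocks always execute together, whereas B2 does not post-dominate B1 because the branch can skip it.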
Upward Code Movement
Moving an instruction from block src to block dest. Block src comes after block dest in the topologically sorted graph. Assume the move violates no data dependences.
If dest dominates src and src post-dominates dest, then we’re done: the instruction executes exactly when it did before.
Upward Code Movement
If src does not post-dominate dest, then we have to compute speculatively
  Only desirable if the operation is cheap
  Only useful if src is reached.
If dest does not dominate src, copies of the instruction are needed on the other paths that reach src
Downward Code Movement
Moving an instruction from block src to block dest. Block src comes before block dest in the topologically sorted graph. Assume the move violates no data dependences.
If src dominates dest and dest post-dominates src, we’re done.
Downward Code Movement
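If src does not dominate dest,
  The operation would also execute on paths that reach dest without passing through src
  Downward motion is often applied to writes, which overwrite old values, so extra operations will be needed.
  Replicate the basic blocks along the path from src to dest and place the operation in the new copy of dest
  Alternatively, use predicated instructions (speculative)
If dest does not post-dominate src,
  Compensation code is needed on the paths that leave src without reaching dest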
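The cases for upward and downward code movement reduce to two yes/no questions each. The helper functions below are a hypothetical summary of those rules, not an implementation of a scheduler; they take the dominance facts as booleans and return the extra work each move requires, with an empty list meaning the move is free.

```python
def upward_move_requirements(dest_dominates_src, src_postdominates_dest):
    """Extra work needed to move an operation up from src to dest."""
    needs = []
    if not src_postdominates_dest:
        # The operation may now run on paths that never reach src.
        needs.append("speculation")
    if not dest_dominates_src:
        # Paths that reach src without passing dest need their own copy.
        needs.append("copies")
    return needs

def downward_move_requirements(src_dominates_dest, dest_postdominates_src):
    """Extra work needed to move an operation down from src to dest."""
    needs = []
    if not src_dominates_dest:
        # dest can be reached without src: replicate blocks or predicate.
        needs.append("replicate blocks or predicate")
    if not dest_postdominates_src:
        # Some paths leave src without reaching dest.
        needs.append("compensation code")
    return needs
```

The boolean inputs would normally come from a dominator and post-dominator analysis of the control-flow graph.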
Conclusion
Processors can execute several instructions in parallel
We take advantage of this by moving code
Code can be moved if no dependencies occur, but sometimes at a cost.