loop scheduling and software pipelining - capsl scheduling and software pipelining ... formulate and...
TRANSCRIPT
Loop Scheduling and Loop Scheduling and Loop Scheduling and Loop Scheduling and
Software PipeliningSoftware PipeliningSoftware PipeliningSoftware Pipelining
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 1
Software PipeliningSoftware PipeliningSoftware PipeliningSoftware Pipelining
Reading ListReading ListReading ListReading List
• Slides: Topic 7 and 7a
• Other papers as assigned in class
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 2
or homework:
ABET Outcome
• Ability to apply knowledge of basic code generation techniques, e.g. Loop
scheduling e.g. software pipelining techniques to solve code generation
problems.
• An ability to identify, formulate and solve loops scheduling problems using
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 3
• An ability to identify, formulate and solve loops scheduling problems using
software pipelining techniques
• Ability to analyze the basic algorithms on the above techniques and
conduct experiments to show their effectiveness.
• Ability to use a modern compiler development platform and tools for the
practice of above.
• A Knowledge on contemporary issues on this topic.
OutlineOutlineOutlineOutline
• Brief overview
• Problem formulation of the modulo
scheduling problem
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 4
scheduling problem
• Solution methods
• Summary
• Good IPO
• Good LNO
• Good global optimization
• Good integration of IPO/LNO/OPT
• Smooth information passing between FE and CG
Inter-ProceduralOptimization (IPA)
Loop NestOptimization (LNO)
Global Optimization
Source
General Compiler Framework
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 5
between FE and CG
• Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation
Global Optimization(OPT)
InnermostLoop
scheduling
Global instscheduling
Reg alloc
Local instscheduling
Executable
Arch
Models
BE/CG
ME
• How to formulate the loop
scheduling problem?
Questions ?
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 6
• How to model it?
• How to solve it?
Questions (cont’d)Questions (cont’d)Questions (cont’d)Questions (cont’d)
• Instruction scheduling for code without loops – a
review
• Dependence graphs may become cyclic! So,
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 7
• Dependence graphs may become cyclic! So,
“critical path length” is less obvious!
• Is it becoming harder!(?)
• What new insights are required to formulate and
solve it?
Challenges of Loop SchedulingChallenges of Loop SchedulingChallenges of Loop SchedulingChallenges of Loop Scheduling
A DDG With Cycles
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 8
Right strategy?
ObservationsObservationsObservationsObservations
• Execution of “good loops” tend to be “regular”
and “repetitive” - a “pattern” may appear
• This gives “cyclic scheduling” problem a new
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 9
• This gives “cyclic scheduling” problem a new
twist!
• How to efficiently derive a “pattern”?
Problem Formulation (I)Problem Formulation (I)Problem Formulation (I)Problem Formulation (I)
Given a weighted dependence graph, derive a schedule
which is “time-optimal” under a machine model M.
Def: A schedule S of a loop L is time-optimal if among
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 10
Def: A schedule S of a loop L is time-optimal if among
all “legal” schedules of L, no other schedule that is faster
than S.
Note: There may be more than one time-optimal
schedule.
A Short Tour on Data A Short Tour on Data
Dependence Graphs for LoopsDependence Graphs for Loops
A Short Tour on Data A Short Tour on Data
Dependence Graphs for LoopsDependence Graphs for Loops
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 11
Dependence Graphs for LoopsDependence Graphs for LoopsDependence Graphs for LoopsDependence Graphs for Loops
Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation
• Data dependence between 2 accesses
The same memory location
Exist an execution path between them
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 12
Exist an execution path between them
One of them is a write
• Three types of data dependence
• Dependence graphs
• Things are not simple when dealing with loops
Types of Data DependenceTypes of Data DependenceTypes of Data DependenceTypes of Data Dependence
• Flow dependenceX =
= X... δ
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 13
• Anti-dependence
• Output dependenceX =
X =
... 0δ
= X
X =
... -1δ--
Data DependenceData DependenceData DependenceData Dependence
Example 1:
S1: A = 0
S2: B = A
S1
S2
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 14
S2: B = A
S3: C = A + D
S4: D = 2
Sx δ Sy ⇒ Sy depends on Sx
S3
S4
Example 2:
S1: A = 0
S2: B = A
S3: A = B + 1
S4: C = A
Con’d
S1
S2
S3
Data DependenceData DependenceData DependenceData Dependence
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 15
S4: C = A
S10 S3 : Output-dep
S2-1 S3 : anti-dep
δ
δ
S3
S4
Should we consider input Should we consider input dependence?dependence?Should we consider input Should we consider input dependence?dependence?
= XIs the reading of the
same X important?
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 16
= XWell, it may be!
(if we intend to
group the 2 reads
together for
cache optimization!)
Subscript VariablesSubscript VariablesSubscript VariablesSubscript Variables
• Extension of def-use chains to employ a more precise treatment of arrays, especially in iterative loops.
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 17
DO I = 1, N
A(I + 1) = X(I + 1) + B(I)
X(I) = A(I) * 5
ENDDO
Applications- register allocation- instruction scheduling
Con’d
Dependence GraphDependence GraphDependence GraphDependence Graph
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 18
- instruction scheduling- loop scheduling- vectorization- parallelization - memory hierarchy optimization
Data Dependence in LoopsData Dependence in LoopsData Dependence in LoopsData Dependence in Loops
An Example
Find the dependence relations due to the array X in the program below:
(1) for I = 2 to 9 do
(2) X[I] = Y[I] + Z[I]
(3) A[I] = X[I-1] + 1
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 19
(3) A[I] = X[I-1] + 1
(4) end for
Solution
I = 2 I = 3 I = 4
(2) X[2]=Y[2]+Z[2] (3) A[2]=X[1]+1
To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others:
X[3] =Y[3]+Z[3]A[3] =X[2]+1
X[4]=Y[4]+Z[4]A[4]=X[3]+1
In our example, there is a loop-carried, lexically forward flow
dependence relation.
Data Dependence in LoopsData Dependence in LoopsData Dependence in LoopsData Dependence in LoopsCon’d
S
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 20
S2
S3
(1)
Data dependence graph for statements in a loop.
- Loop-carried vs loop-independent
- Lexical-forward vs lexical backward
Dependence distance = 1
An ExampleAn ExampleAn ExampleAn Example
for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;
time i = 0 i = 1 i = 2
0 a
1 b
i = 3
iterations
. . .
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 21
end; 1 b
2 c a
3 b
4 c a
a
b
c
II
So, iteration interval II = 2tim
e
. . .
Assume each operation takes 1 cycle and there
is only one addition unit!
Note: We use a token here to represent a flow dependence of
distance = 1
Software Pipeline: ConceptSoftware Pipeline: ConceptSoftware Pipeline: ConceptSoftware Pipeline: Concept
Software Pipeline is a technique that reduces the execution
time of important loops by interweaving operations
from many iterations to optimize the use of resources.
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 22
0 1 2 3 4 5 6 7 8 9 10 11 12 16151413 time
ldffadds
stf
sub
cmp
bg
% prologue
a[0] = a[-1] + R[0];
% pattern
for i = 0 to N-2 do
b[i] = a[i] + c[i -1];
a[i + 1] = a[i] + R[i + 1];
prologue
b ,
The Structrure of the SWP code
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 23
a[i + 1] = a[i] + R[i + 1];
c[i] = b[i] + 1;
end;
% epilogue
b[N - 1] = a[N - 1] + c[N - 2];
c[N - 1] = b[N - 1] + 1
bi ,
ci , a i+1
epilogue
Software Pipeline (Cont’d)Software Pipeline (Cont’d)Software Pipeline (Cont’d)Software Pipeline (Cont’d)
What limits the speed of a loop?
• Data dependencies: recurrence initiation interval (rec_mii)
• Processor resources: resource initiation interval (res_mii)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 24
0 1 2 3 4 5 6 7 8 9 10 11 12 16151413 time
ldffadds
stf
sub
cmp
bg
Initiation interval
Previous ApproachesPrevious ApproachesPrevious ApproachesPrevious Approaches
• Approach I (Operational):
“Emulate” the loop execution under the machine model and a
“pattern” will eventually occur
[AikenNic88, EbciogluNic89, GaoEtAl91]
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 25
[AikenNic88, EbciogluNic89, GaoEtAl91]
• Approach II (Periodic scheduling):
Specify the scheduling problem into a periodical scheduling
problem and find optimal solution
[Lam88, RauEtAl81,GovindAltmanGao94]
Periodic SchedulePeriodic SchedulePeriodic SchedulePeriodic Schedule(Modulo Scheduling)
The time (cycle) when the I-th instance of
the operation v is scheduled :
t(i, v) = T * i+ A[v] where T = II
so t(i + 1, v) -t(i, v) = T*(i + 1) – T*(i) = T
⋅
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 26
so t(i + 1, v) -t(i, v) = T*(i + 1) – T*(i) = T
For our example:
t(i, v) = 2i + A[v]
where A(a) = 1
A(b) = 0
A(c) = 1
Question: Is this an optimal schedule?
Yes, the schedule
t(i, v) = 2i + A[v]
is time-optimal!
Periodic Schedule (cont’d)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 27
is time-optimal!
With II = 2
Given a DDG of a loop L, how to determine the fastest
computation rate of L -- also called minimum initiation
interval (MII) ?
Restate the problem
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 28
interval (MII) ?
Hint: Consider the “Critical Cycles” as well
as critical resource usage in L
An Example (Revisit) : RecMIIAn Example (Revisit) : RecMIIAn Example (Revisit) : RecMIIAn Example (Revisit) : RecMII
for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;
time i = 0 i = 1 i = 2
0 a
1 b
i = 3
iterations
. . .
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 30
1 b
2 c a
3 b
4 c a
a
b
c
II
So, RecMII = 2tim
e
. . .
Assume each operation takes 1 cycle and there
is only one addition unit!
Theorem: The maximum computation rate of a
loop is bounded by the following ratio
ropt = min{ Dc/Wc}
where C is a dependence cycle, Dc is the total
(C)
Maximum Computation RateMaximum Computation RateMaximum Computation RateMaximum Computation Rate
Hint: now one must think about deadlines.
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 31
where C is a dependence cycle, Dc is the total
dependence distance along C and Wc is total
execution time of C. i.e.
Dc di
c
=∑
Wc wi
c
=∑
(di is the dependence distance along the edge i in C)
(wi is the edge weight along theedge i in C)
And the optimal period
MII = Topt = 1/ropt
Def: A cycle is critical if the period of the cycle
RecMII (Contn’d)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 32
Def: A cycle is critical if the period of the cycle
equal to MII
(We should write it as RecMII)
Note: a loop may have multiple critical cycles!
An Example (Revisit) : RecMIIAn Example (Revisit) : RecMIIAn Example (Revisit) : RecMIIAn Example (Revisit) : RecMII
for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;
time i = 0 i = 1 i = 2
0 a
1 b
i = 3
iterations
. . .
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 33
1 b
2 c a
3 b
4 c a
a
b
c
II
So, RecMII = 2tim
e
. . .
Assume each operation takes 1 cycle and there
is only one addition unit!
How About Machine How About Machine How About Machine How About Machine Resource and Register Resource and Register Resource and Register Resource and Register
Constraints?Constraints?Constraints?Constraints?
• Numbers and types of FUs, etc,
(ResMII)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 34
(ResMII)
• Number of registers
Software PipeliningSoftware PipeliningSoftware PipeliningSoftware Pipelining
• Review of previous work
[RauFisher93] [Rau94]
• Minimum initiation interval [MII]
MII = max {RecMII, ResMII}
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 35
MII = max {RecMII, ResMII}
where
RecMII: determined by “critical’
recurrence cycles in DDG
ResMII: determined by resource
constraints (#s and types of FUs)
The Resource Constraint MII (ResMII)
• Can be calculated by totaling, for each resource, the usage
requirement imposed by one iteration of the loop
• can be done by bin-packing a resource reservation table
Consider a simple example of n nodes and 1 FU -- what is ResMII ?
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 36
• can be done by bin-packing a resource reservation table
(expensive)
• usually derive a lower-bound is enough as a starting point to begin
the search process
• More price with complex hardware pipelines ?
Example 1: Reservation Table Example 1: Reservation Table Example 1: Reservation Table Example 1: Reservation Table ---- An ExampleAn ExampleAn ExampleAn Example
1
543210Stage
XXX1
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 37
2
3XXX1
XX2
X3
(a) a Pipeline M (b) The Reservation Table of M
A Quiz ?
• What is the RT of a fully pipelined adder
with 3 stages ?
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 38
• How about an unpipelined adder ?
• How to computer ResMII for each case ?
Modulo Reservation Table
• The modulo reservation table only has length = II
• Each entry in the table records a sequence of
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 39
• Each entry in the table records a sequence of
reservations corresponds a sequence of slots every II
cycles
• Hints: think about your weekly calendar
Modulo SchedulingModulo SchedulingModulo SchedulingModulo Scheduling
MII
II = II + 1
Choose Next Operation X
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 40
Remove from L
is L =<>?
succeed
failed try to place x in
a time slot i in II
No
A legal schedule
is found
Heuristic Method for Heuristic Method for Heuristic Method for Heuristic Method for
Modulo SchedulingModulo SchedulingModulo SchedulingModulo Scheduling
Why a simple variant list
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 41
Why a simple variant list
scheduling may not work?
Hint: consider the deadline constraints of operations ina cycle.
4 2
2 4
[1]
B D
A C(a)
Counter Example I: List Scheduling May Fail !
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 42
0
1
2
3
A B
C
D
0
1
2
3
A
B
C
D
(b) MII = RecMII = 4
Note: if simple list scheduling is used
B cannot be scheduled due to the
deadline set by scheduling C: deadlock!
(c)
MII = RecMII = 4
Note: we cannot fire C as early as possible!
In previous figure,
We show an example demonstrating the problem with
greedy scheduling in the presence of recurrences.
Example I (Cont’d)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 43
(a) The data dependence graph with a cycle.
(b) The resulting partial schedule when C is scheduled
greedily. B cannot be scheduled.
(c) The resulting valid schedule when C is not scheduled
greedyly – delayed to two cycles later.
2
1
2
3M6A1
C2
A3
A4
0
1
2
3
M6A1
C2
A3
A4
0
1
2
3
A1
A4
M6C2
Adder Mult Bus Adder Mult Bus
Example 2: Example 2: Example 2: Example 2: Problems with Greedy schedule due to ResMIIProblems with Greedy schedule due to ResMIIProblems with Greedy schedule due to ResMIIProblems with Greedy schedule due to ResMII
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 44
2
3
A4
M5
3
4
5
A3 3
4
5
A3 M5
(a) DDG (b) A greedy schedule which (c) a non-greedy schedule
cannot schedule A4 with which achieves ResMII
ResMII
ResMII = 6 ResMII = 6
A: non-pipelined adders
M: non-pipelined multipliers
In previous figure,
We show an example demonstrating the problem with greedy
scheduling in the presence of complex reservation tables.
(a) The data dependence graph without cycle.
Example 2 (Cont’d)
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 45
(a) The data dependence graph without cycle.
(b) The resulting partial schedule when A1, M6, C2, and A3
are scheduled greedily. A4 cannot be scheduled.
(c) The resulting valid schedule when A3 is scheduled one
cycle later.
Example 3: infeasibility of MIIExample 3: infeasibility of MIIExample 3: infeasibility of MIIExample 3: infeasibility of MII
2A1
Adder Mult Bus
3M2
Adder Mult Bus
5MA3
Adder Mult Bus
Con’d
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 46
(a)
Adder Mult Bus Adder Mult Bus Adder Mult Bus
Note: ResMII = 2. But is there a legal schedule under II = 2 ?
Note: The presence of complex reservation tables
Example 3 (cont’d)Example 3 (cont’d)Example 3 (cont’d)Example 3 (cont’d)
Adder Mult Bus
A1 M2 M2
A1
Adder Mult Bus
A1 M2
MA3 A1
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 47
A1 MA3
MA3 M2
A1
MII = 2
II = 2
(b) cannot fit MA3
under II=2
MII = 2
II = 3
(c) A feasible schedule
for II=3
Note: It is possible that there is no valid schedule at MII!
The previous slide shows an example demonstrating
the infeasibility of the MII in the presence of complex
reservation tables. (a) The three operations and their
Infeasibility of MIIInfeasibility of MIIInfeasibility of MIIInfeasibility of MII Con’d
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 48
reservation tables. (b) The MRT corresponding to the
dead-end partial schedule for an II of 2 after A1 and
M2 have been scheduled. (c) The MRT
corresponding to a valid schedule for an II of 3.
2
2
2
A1
A2
A3
[2]
2
2
A3
A4
Adder Mult Bus
2
2
Example 4: Infeasibility of MIIExample 4: Infeasibility of MIIExample 4: Infeasibility of MIIExample 4: Infeasibility of MII
Assume: fully pipelined adders
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 49
2
2
A3
A4
(a)
2A4 2
ResMII = 4
RecMII = 4
(b)
Note: It is possible that there is no valid schedule at MII!
And, the reservation table here is simple!
Example 4 (cont’d)Example 4 (cont’d)Example 4 (cont’d)Example 4 (cont’d)
The previous slides shows an example demonstrating the
infeasibility of the MII due to the interaction between the
recurrence constraints and the resource usage constraints.
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 50
(a) The data dependence graph.
(b) The MRT corresponding to the dead-end partial
schedule for an II of 4 after A1 and A2 have been
scheduled.
How to derive a best feasible
schedule ?
• It is possible do so via exhaustive search.
• But, it is expensive!
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 51
Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)OperationalApproach
Formal Model (GaoWonNin91)PLDI'91
SoftwarePipelining
In-Exact Method (Heuristic)(RauGla81, Lam88, RauEtAI92, Huff93, DehnertTow93,Rau94, WanEis93, StouchininEtAI00)
Basic Formulation(DongenGao92)CONPAR'92
PeriodicScheduling
(Modulo Scheduling)
Register Optimal(Ninggao91, NingGao93, Ning93)
POPL'93
Resource ConstrainedExact Method ILP based (GovindAITGao94)
A Taxonomy of Software PipeliningA Taxonomy of Software PipeliningA Taxonomy of Software PipeliningA Taxonomy of Software Pipelining
2008-04-24 \course\cpeg421-08s\Topic-7.ppt 52
Method ILP based (GovindAITGao94)Micro27
Resource & Register"Showdown"(GovindAITGao95 , Altman95,(RuttenbergGao
PLDI'95StouchininWoody96)
Exhaustive
EichenbergerDav95)PLDI'96
SearchEuroPar'96
(Altman95)FSA Co-Schedulingformulation(GovindAltmanGao'96)
FSA BasedMethod FSA Construction Method
(Model hardware (GovindAltmanGao98)
pipeline withsharing and hazards)
FSA Heuristic/optimization(ZhangGovindRyanGao99)
Theory of Co-Scheduling(GovindAltmanGao00)