Processor Architectures and Program Mapping
TU/e 5kk10
Henk Corporaal
Jef van Meerbergen
Bart Mesman
Exploiting ILP
part 2: code generation
04/19/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
– C6
– TM
– TTA
• Clustering
• Code generation
• Hands-on
Compiler basics
• Overview
– Compiler trajectory / structure / passes
– Control Flow Graph (CFG)
– Mapping and Scheduling
– Basic block list scheduling
– Extended scheduling scope
– Loop scheduling
Compiler basics: trajectory
[Figure: compilation trajectory. Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program. Error messages are an output of the trajectory; library code is an additional input.]
Compiler basics: structure / passes
[Figure: compiler passes, from source code via intermediate code and sequential code to object code]
• Lexical analyzer: token generation
• Parsing: check syntax, check semantics, parse tree generation
• Code optimization: data flow analysis, local optimizations, global optimizations
• Code generation: code selection, peephole optimizations
• Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
• Scheduling and allocation: exploiting ILP
Compiler basics: structure
Simple compilation example

Source:
position := initial + rate * 60

Lexical analyzer:
id1 := id2 + id3 * 60

Syntax analyzer:
[Parse tree: := has children id1 and +; + has children id2 and *; * has children id3 and 60]

Intermediate code generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1

Code generator:
movf id3, r2
mulf #60, r2, r2
movf id2, r1
addf r2, r1
movf r1, id1
Compiler basics: Control flow graph (CFG)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }

CFG:
1: sub t1, a, b
   bgz t1, 2, 3
2: rem r, a, b
   goto 4
3: rem r, b, a
   goto 4
4: …

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports, ...
Mapping / Scheduling: placing operations in space and time
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
[Figure: Data Dependence Graph (DDG) of the code above. The inputs a, b, 2, z and y feed two * operations, three + operations and one - operation, which produce d, e, f, r and x.]
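The DDG of the five statements can be derived mechanically from definitions and uses. The following is a minimal sketch, assuming the code is first rewritten in three-address form; the temporary name `t1` (for `2 * b`) and the function `build_ddg` are illustrative and do not come from the slides. Only true (read-after-write) dependences are tracked, and every name is assumed to be assigned once.

```python
# Three-address form of the example; t1 is a hypothetical temporary for 2 * b.
OPS = [
    ("d",  "*", "a",  "b"),
    ("e",  "+", "a",  "d"),
    ("t1", "*", "2",  "b"),
    ("f",  "+", "t1", "d"),
    ("r",  "-", "f",  "e"),
    ("x",  "+", "z",  "y"),
]

def build_ddg(ops):
    """Return {result: set of earlier results it reads} (true dependences only)."""
    defined = set()
    ddg = {}
    for dest, _op, left, right in ops:
        # An operand creates an edge only if an earlier operation produced it;
        # program inputs such as a, b, z, y create no edge.
        ddg[dest] = {src for src in (left, right) if src in defined}
        defined.add(dest)
    return ddg

ddg = build_ddg(OPS)
for node, preds in ddg.items():
    print(node, "<-", sorted(preds))
```

The operation producing `x` has no predecessors and no successors, which is why a scheduler is free to place it in any cycle.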
How to map these operations?
[Figure: the DDG of the previous slide]
Architecture constraints:
• One Function Unit
• All operations single cycle latency
Schedule:
cycle 1: *
cycle 2: *
cycle 3: +
cycle 4: +
cycle 5: -
cycle 6: +
How to map these operations?
[Figure: the same DDG]
Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency
Schedule:
cycle  Mul  Add-sub
1      *
2      *    +
3           +
4           -
5           +
6
There are many mapping solutions
[Figure: Pareto curve (solution space). Each x is one mapping solution, plotted as execution time (vertical axis) against cost (horizontal axis).]
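Which of the many mapping solutions lie on the Pareto curve can be determined by filtering out every point that another point beats in both cost and execution time. The sketch below assumes solutions are given as `(cost, time)` pairs; the function name and the sample points are hypothetical.

```python
def pareto_front(points):
    """Keep the (cost, time) points that no other point dominates."""
    front = []
    best_time = float("inf")
    # After sorting by cost (then time), a point is on the front only if it
    # is strictly faster than every cheaper-or-equal point seen so far.
    for cost, time in sorted(set(points)):
        if time < best_time:
            front.append((cost, time))
            best_time = time
    return front

# Hypothetical design points: (cost, execution time in cycles)
solutions = [(1, 6), (2, 4), (2, 5), (3, 4), (4, 3), (5, 3), (6, 2)]
print(pareto_front(solutions))
```

Points such as `(3, 4)` are dropped because `(2, 4)` is cheaper and equally fast.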
Basic Block Scheduling
• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in first cycle with sufficient resources
Note:
– Scheduling order sequential
– Priority determined by the heuristic used, e.g. slack
Basic Block Scheduling
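[Figure: dependence graph with inputs A, B and C and results X, y and z, built from three ADD, two LD, one MUL, one SUB and one NEG operation. Each operation is annotated with <ASAP cycle, ALAP cycle>; the annotations in the figure are <1,1>, <1,3>, <1,4>, <2,2>, <2,3>, <2,4>, <3,3> and <4,4>. The slack of an operation is its ALAP cycle minus its ASAP cycle.]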
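ASAP, ALAP and slack can be computed with one forward and one backward pass over the dependence graph. The sketch below is an illustration under stated assumptions: it uses the small DDG of the earlier example (d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y) rather than the graph in the figure, `t1` is a hypothetical name for `2*b`, all latencies are 1, and cycles are counted from 1.

```python
# Predecessor lists, written in topological order (an assumption of this sketch).
PREDS = {"d": [], "t1": [], "e": ["d"], "f": ["t1", "d"], "r": ["f", "e"], "x": []}

def asap_alap(preds, latency=1):
    order = list(preds)
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    asap = {}
    for v in order:                 # earliest cycle: right after all predecessors
        asap[v] = max((asap[p] + latency for p in preds[v]), default=1)
    length = max(asap.values())     # minimal schedule length (critical path)
    alap = {}
    for v in reversed(order):       # latest cycle that still meets `length`
        alap[v] = min((alap[s] - latency for s in succs[v]), default=length)
    slack = {v: alap[v] - asap[v] for v in order}
    return asap, alap, slack

asap, alap, slack = asap_alap(PREDS)
print(slack)
```

Every operation on the critical path (d, t1, e, f, r) has slack 0; only x can move freely, between cycles 1 and 3.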
Cycle based list scheduling

proc Schedule(DDG = (V,E))
beginproc
  ready = { v | ¬∃ (u,v) ∈ E }
  ready’ = ready
  sched = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready’ do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
    ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
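A runnable counterpart of the cycle based list scheduler might look as follows. This is a simplified sketch, not the course's implementation: it assumes a single `delay` for all edges instead of `delay(u,v)`, uses dictionary order as the priority, counts cycles from 1, and models resources as a count per unit type. The names `list_schedule`, `PREDS`, `KIND` and `t1` are hypothetical.

```python
def list_schedule(preds, unit_of, units, delay=1):
    """Cycle-based list scheduling.

    preds:   {op: [ops it depends on]}
    unit_of: {op: unit type that executes it}
    units:   {unit type: number of instances usable per cycle}
    Returns {op: cycle}.
    """
    cycle_of = {}
    cycle = 1
    while len(cycle_of) < len(preds):
        free = dict(units)                        # resources left in this cycle
        for op in preds:                          # dict order acts as priority
            if op in cycle_of:
                continue
            ready = all(p in cycle_of and cycle_of[p] + delay <= cycle
                        for p in preds[op])
            if ready and free[unit_of[op]] > 0:   # no resource conflict
                cycle_of[op] = cycle
                free[unit_of[op]] -= 1
        cycle += 1
    return cycle_of

# DDG of d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y (t1 stands for 2*b)
PREDS = {"d": [], "t1": [], "e": ["d"], "f": ["t1", "d"], "r": ["f", "e"], "x": []}

one_fu = list_schedule(PREDS, {op: "fu" for op in PREDS}, {"fu": 1})
KIND = {"d": "mul", "t1": "mul", "e": "addsub",
        "f": "addsub", "r": "addsub", "x": "addsub"}
two_fu = list_schedule(PREDS, KIND, {"mul": 1, "addsub": 1})
print(max(one_fu.values()), max(two_fu.values()))
```

With one function unit the six operations need six cycles. With one Mul and one Add-sub unit this particular priority order places x in the otherwise idle Add-sub slot of cycle 1 and finishes in four cycles; a different priority heuristic can give a longer schedule.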
Extended basic block scheduling: Code Motion
A: a) add r4, r4, 4
   b) beq . . .
B: c) add r1, r1, r2
C: d) sub r1, r1, r2
D: e) st r1, 8(r4)
• Downward code motions?
– a → B, a → C, a → D, c → D, d → D
• Upward code motions?
– c → A, d → A, e → B, e → C, e → A
Extended Scheduling scope
Code:
A;
If cond Then B Else C;
D;
If cond Then E Else F;
G;
CFG (Control Flow Graph):
[Figure: A branches to B and C, which join in D; D branches to E and F, which join in G]
Scheduling scopes
Trace Superblock Decision tree Hyperblock/region
Code movement (upwards) within regions
[Figure: upward code movement of an instruction I from a source block to a destination block within a region. Legend: code movement; copy needed; intermediate block; check for off-liveness.]
Extended basic block scheduling: Code Motion
• A dominates B ⇒ A is always executed before B
– Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇒ B is always executed after A
– Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
[Figure: CFG with blocks A, B, C, D, E and F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
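Dominance questions like these can be answered mechanically. The sketch below uses the classic iterative set-based algorithm; since the edges of the exercise graph are not recoverable from the text, it is applied to the CFG of the code "A; If cond Then B Else C; D; If cond Then E Else F; G;" instead. Post-dominators are obtained by running the same routine on the reversed graph. Function names are illustrative, and every block other than the entry is assumed to have at least one predecessor.

```python
def dominators(succs, entry):
    """dom[n] = {n} united with the intersection of dom[p] over predecessors p."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates n"
    dom[entry] = {entry}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def reverse(succs):
    rev = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            rev[s].append(n)
    return rev

CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
dom = dominators(CFG, "A")
pdom = dominators(reverse(CFG), "G")       # post-dominators
print(sorted(dom["D"]), sorted(pdom["B"]))
```

Here B does not dominate D, so moving code from D up into B would also require a copy in C; E does not post-dominate D, so moving code from E up into D would be speculative.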
Scheduling: Loops
[Figure: loop optimizations. Loop peeling and loop unrolling applied to a CFG with blocks A, B, C and D; the transformed graphs contain the copies C’ and C’’ of the loop body C.]
Scheduling: Loops
Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion
[Figure: resource utilization over time for three approaches: basic block scheduling, basic block scheduling with unrolling, and software pipelining]
Software pipelining
• Software pipelining a loop is:
– Scheduling the loop such that iterations start before preceding iterations have finished
Or:
– Moving operations across the backedge

Example: y = a.x

3 cycles/iteration:
LD
ML
ST

Unrolling, 5/3 cycles/iteration:
LD
LD ML
LD ML ST
ML ST
ST

Software pipelining, 1 cycle/iteration:
LD
LD ML
LD ML ST
ML ST
ST
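The three figures of merit follow from counting cycles. As a sketch, assume a loop body of k single-cycle stages (k = 3 for LD, ML, ST) and one new iteration started per cycle inside an overlapped block; the symbols k, u and n are introduced here for illustration.

$$\text{no overlap:}\quad k \ \text{cycles/iteration} \qquad (k = 3)$$

$$\text{unrolled } u \text{ times:}\quad \frac{u + k - 1}{u} \ \text{cycles/iteration} \qquad \left(u = 3,\ k = 3 \Rightarrow \tfrac{5}{3}\right)$$

$$\text{software pipelined, } n \text{ iterations:}\quad \frac{n + k - 1}{n} \to 1 \ \text{cycle/iteration as } n \to \infty$$

Unrolling pays the fill and drain cost of k - 1 cycles once per unrolled block, whereas software pipelining pays it once for the whole loop.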
Software pipelining (cont’d)
Basic techniques:
• Modulo scheduling (Rau, Lam)
– list scheduling with modulo resource constraints
• Kernel recognition techniques
– unroll the loop
– schedule the iterations
– identify a repeating pattern
– Examples: Perfect pipelining (Aiken and Nicolau), URPR (Su, Ding and Xia), Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
– fill first cycle of iteration
– copy this instruction over the backedge
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop
for (i = 0; i < n; i++)
  a[i+6] = 3 * a[i] - 1;

(b) Code without loop control
ld r1,(r2)
mul r3,r1,3
sub r4,r3,1
st r4,(r5)

(c) Software pipeline (each column is one iteration)
Prologue:  ld r1,(r2)
           mul r3,r1,3   ld r1,(r2)
           sub r4,r3,1   mul r3,r1,3   ld r1,(r2)
Kernel:    st r4,(r5)    sub r4,r3,1   mul r3,r1,3   ld r1,(r2)
Epilogue:                st r4,(r5)    sub r4,r3,1   mul r3,r1,3
                                       st r4,(r5)    sub r4,r3,1
                                                     st r4,(r5)

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
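The split into prologue, kernel and epilogue can be generated from the stage list alone. The sketch below assumes single-cycle stages, an initiation interval of 1, and at least as many iterations as stages; it only shows which operation of which iteration runs in each cycle and ignores register and address renaming. The function name and the `op[iteration]` notation are illustrative.

```python
def software_pipeline(stages, iterations):
    """Return one (phase, ops) row per cycle for II = 1."""
    depth = len(stages)
    rows = []
    for cycle in range(iterations + depth - 1):
        # Iteration `it` starts in cycle `it`, so it is in stage cycle - it.
        ops = [f"{stages[cycle - it]}[{it}]"
               for it in range(iterations) if 0 <= cycle - it < depth]
        if cycle < depth - 1:
            phase = "prologue"      # pipeline still filling
        elif cycle < iterations:
            phase = "kernel"        # all stages busy
        else:
            phase = "epilogue"      # pipeline draining
        rows.append((phase, ops))
    return rows

rows = software_pipeline(["ld", "mul", "sub", "st"], iterations=4)
for phase, ops in rows:
    print(f"{phase:9s}", " ".join(ops))
```

For four iterations there is exactly one kernel row; with more iterations only the number of kernel rows grows, which is why the kernel can be emitted once as the new loop body.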
Software pipelining: determine II, Initiation Interval

For (i=0;.....)
A[i+6]= 3*A[i]-1

Cyclic data dependences, each edge labeled (delay, distance):
ld r1, (r2) → mul r3, r1, 3: (1,0)
mul r3, r1, 3 → ld r1, (r2): (0,1)
mul r3, r1, 3 → sub r4, r3, 1: (1,0)
sub r4, r3, 1 → mul r3, r1, 3: (0,1)
sub r4, r3, 1 → st r4, (r5): (1,0)
st r4, (r5) → sub r4, r3, 1: (0,1)
st r4, (r5) → ld r1, (r2): (1,6)

$$\mathit{cycle}(v) \ge \mathit{cycle}(u) + \mathit{delay}(u,v) - II \cdot \mathit{distance}(u,v)$$
Modulo scheduling constraints
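MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

$$MII = \max\{\, ResMII,\ RecMII \,\}$$

Resources:

$$ResMII = \max_{r \in \mathit{resources}} \left\lceil \frac{\mathit{used}(r)}{\mathit{available}(r)} \right\rceil$$

Cycles:

$$\mathit{cycle}(v) \ge \mathit{cycle}(v) + \sum_{e \in c} \big( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \big)$$

Therefore:

$$RecMII = \min \Big\{\, II \in \mathbb{N} \;\Big|\; \forall c \in \mathit{cycles}:\ 0 \ge \sum_{e \in c} \big( \mathit{delay}(e) - II \cdot \mathit{distance}(e) \big) \Big\}$$

Or:

$$RecMII = \max_{c \in \mathit{cycles}} \left\lceil \frac{\sum_{e \in c} \mathit{delay}(e)}{\sum_{e \in c} \mathit{distance}(e)} \right\rceil$$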
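Both bounds can be evaluated directly from their definitions. The sketch below computes RecMII as the smallest II for which no dependence cycle has a positive sum of delay - II·distance, detecting positive cycles with a longest-path Floyd-Warshall pass instead of enumerating cycles. The edge list is one reading of the (delay, distance) labels for the loop a[i+6] = 3*a[i] - 1, and the resource counts (one load/store unit, one multiplier, one ALU) are assumptions made purely for illustration.

```python
def res_mii(used, available):
    """max over resources of ceil(used / available)."""
    return max(-(-used[r] // available[r]) for r in used)

def has_positive_cycle(nodes, edges, ii):
    neg_inf = float("-inf")
    longest = {(u, v): neg_inf for u in nodes for v in nodes}
    for u, v, delay, dist in edges:
        longest[u, v] = max(longest[u, v], delay - ii * dist)
    for k in nodes:                      # Floyd-Warshall on longest paths
        for u in nodes:
            for v in nodes:
                via = longest[u, k] + longest[k, v]
                if via > longest[u, v]:
                    longest[u, v] = via
    return any(longest[n, n] > 0 for n in nodes)

def rec_mii(nodes, edges):
    """Smallest II >= 1 such that every cycle c has sum(delay - II*distance) <= 0."""
    limit = sum(delay for _, _, delay, _ in edges) + 1
    for ii in range(1, limit + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle with total distance 0 cannot be scheduled")

NODES = ["ld", "mul", "sub", "st"]
EDGES = [  # (u, v, delay, distance)
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]
rec = rec_mii(NODES, EDGES)
res = res_mii({"lsu": 2, "mul": 1, "alu": 1}, {"lsu": 1, "mul": 1, "alu": 1})
print(rec, res, max(rec, res))
```

Under these assumptions the recurrences allow II = 1 (the longest cycle has total delay 4 and distance 6), while the single load/store unit shared by `ld` and `st` forces II = 2, so MII = 2.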
The Role of the Compiler
9 steps required to translate an HLL program
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: Scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports
Division of responsibilities between hardware and compiler
[Figure: division of responsibilities between compiler and hardware. The steps Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports and Execute are split between compiler and hardware at a different point for each class of architecture: Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW and TTA.]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
– C6
– TM
– TTA
• Clustering
• Code generation
• Hands-on
Hands-on (not this year)
• Map JPEG to a TTA processor
– see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks
Hands-on
• Let’s look at DSE: Design Space Exploration
• We will use the Imagine processor
• http://cva.stanford.edu/projects/imagine/
Mapping applications to processors
MOVE framework
[Figure: MOVE framework for a TTA based system. Architecture parameters and user interaction feed an optimizer, which exchanges feedback with a parametric compiler (producing parallel object code) and a hardware generator (producing the chip). The explored solutions form a Pareto curve (solution space) of execution time against cost.]
Code generation trajectory for TTAs
[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The sequential code feeds a sequential simulation and the parallel code a parallel simulation, each with input/output; the figure also shows profiling data and an architecture description.]
• Frontend: GCC or SUIF (adapted)
Exploration: TTA resource reduction
Exploration: TTA connectivity reduction
[Figure: execution time against number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]
Can we do better?
How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors
[Figure: execution time against cost]
Transforming the specification
[Figure: two adder trees of three + operations each, before and after the transformation]
Based on associativity of the + operation:
a + (b + c) = (a + b) + c
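The benefit of reassociation shows up as a shorter critical path. The sketch below compares a linear chain of additions with a balanced tree for n operands; both use n - 1 additions, but the tree exposes parallelism. The function names are illustrative, and the equivalence assumes exact (integer) arithmetic, since floating-point addition is not associative in general.

```python
def chain_depth(n):
    """Additions on the critical path of ((x1 + x2) + x3) + ... + xn."""
    return n - 1

def balanced_depth(n):
    """Additions on the critical path of a balanced tree: ceil(log2(n))."""
    return (n - 1).bit_length()

def balanced_sum(values):
    """Sum by pairwise reduction; each pass is one level of parallel adders."""
    values = list(values)
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element passes through to the next level
            pairs.append(values[-1])
        values = pairs
    return values[0]

print(chain_depth(4), balanced_depth(4), balanced_sum([1, 2, 3, 4]))
```

For four operands the chain needs three dependent additions while the balanced tree needs two levels, given at least two adders.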
Transforming the specification
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

Transformed (d cancels, because r = (2*b + d) - (a + d)):
r = 2*b - a;
x = z + y;
[Figure: DDG of the transformed code: r = (b << 1) - a and x = z + y]
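That the transformed specification computes the same r and x can be checked by brute force. The sketch below compares both versions on random inputs, writing 2*b as a left shift by 1; it assumes unbounded integers, so overflow behaviour of a real datapath is not covered. The function names are hypothetical.

```python
import random

def original(a, b, z, y):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x

def transformed(a, b, z, y):
    r = (b << 1) - a      # 2*b as a shift by 1; d cancels out of f - e
    x = z + y
    return r, x

random.seed(0)
for _ in range(1000):
    args = [random.randint(-1000, 1000) for _ in range(4)]
    assert original(*args) == transformed(*args)
print(original(3, 5, 1, 2), transformed(3, 5, 1, 2))
```

The transformed version needs one shift, one subtraction and one addition instead of two multiplications, three additions and a subtraction.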
Changing the architecture
adding SFUs: special function units
[Figure: an adder tree of three + operations replaced by a single 4-input adder]
4-input adder: why is this faster?
Changing the architecture
adding SFUs: special function units
In the extreme case put everything into one unit!
Spatial mapping: no control flow
However: no flexibility / programmability!!
SFUs: fine grain patterns
• Why use fine grain SFUs:
– Code size reduction
– Register file #ports reduction
– Could be cheaper and/or faster
– Transport reduction
– Power reduction (avoid charging non-local wires)
– Supports whole application domain!
Which patterns need support?
• Detection of recurring operation patterns needed
SFUs: covering results
Exploration: resulting architecture
• 9 buses
• 4 RFs
• 4 Addercmp FUs
• 2 Multiplier FUs
• 2 Diffadd FUs
• stream input
• stream output
Architecture for image processing
• Note the reduced connectivity
Conclusions
• Billions of embedded processing systems
– how to design these systems quickly, cheaply, correctly, at low power, ...?
– what will their processing platform look like?
• VLIWs are very powerful and flexible
– can be easily tuned to application domain
• TTAs are even more flexible, scalable, and lower power
Conclusions
• Compilation for ILP architectures is maturing and is entering the commercial area
• However
– Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP
Bottom line: