The Molen Compiler Backend for Reconfigurable Architectures
Computer Engineering, TU Delft
The Netherlands
Elena Moscu Panainte, Carlo Galuzzi, Yana Yankova,
Koen Bertels, Stamatis Vassiliadis
Elena Moscu Panainte
OUTLINE
• Background
  – Molen machine organization
  – Molen programming paradigm
• Molen Compiler
• Optimizations for Dynamic Reconfiguration
  – Intra/Interprocedural instruction scheduling
  – Compiler-driven FPGA area allocation
• Results
• Conclusions
The Molen Machine Organization
Main components:
• GPP
• Reconfigurable Processor
• Arbiter
• Exchange Registers
The Molen Prototype
[Figure: Molen machine organization; Molen prototype implemented on Virtex II Pro]
The Molen Programming Paradigm (I)
A one-time architectural extension of a few instructions:
– Two* instructions for controlling the FPGA
  • SET <address>: for hardware configuration
  • EXECUTE <address>: for controlling the execution on the FPGA
– Two move instructions for passing values between the GPP register file and the FPGA
The FPGA has an associated special set of registers: the Exchange Registers (XRs)
The Molen Programming Paradigm (II)
Example: C code: res = alpha(param1, param2);
movtx XR1 ← param1             (send parameters)
movtx XR2 ← param2
set <address_alpha_set>        (HW reconfiguration)
exec <address_alpha_exec>      (HW execution)
movfx res ← XR3                (return result)
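The expansion of the alpha call can be sketched as a small Python helper. This is an illustration only, not the Molen compiler's code: `lower_call` is a hypothetical name, `<-` stands in for the ← of the slide, and returning the result in the XR that follows the last parameter is an assumption read off the single example (res ← XR3).

```python
def lower_call(op_name, params, result):
    """Expand `result = op_name(params...)` into a Molen-style instruction list."""
    code = []
    # Send parameters: one movtx per argument, into XR1, XR2, ...
    for index, param in enumerate(params, start=1):
        code.append(f"movtx XR{index} <- {param}")
    # Hardware reconfiguration, then hardware execution.
    code.append(f"set <address_{op_name}_set>")
    code.append(f"exec <address_{op_name}_exec>")
    # Assumption: the result comes back in the XR after the last parameter.
    code.append(f"movfx {result} <- XR{len(params) + 1}")
    return code


for line in lower_call("alpha", ["param1", "param2"], "res"):
    print(line)
```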
The Molen Compiler
[Figure: Molen compiler structure. Labels: C application (MAIN.c, File_n.c); SUIF frontend; Machine SUIF backend framework; PowerPC backend; Molen extensions: ISA extension (SET/EXEC), register extension; Molen optimizations; Compiler FCCM]
PowerPC Backend
• PowerPC instruction generation
• PowerPC register allocation
• PowerPC EABI stack frame allocation
+
• SET/EXECUTE - ISA extension
• XRs - Register extension
Molen Compiler: Optimizations
Challenge: huge reconfiguration latency (for the SET instruction)
• Repetitive reconfiguration:
  – performance decrease of one order of magnitude
• Hardware kernel executions:
  – speedup of one order of magnitude
Solutions
Hardware solutions:
– Partial configurations
– Configuration prefetching
Compiler solutions:
– Scheduling of SET instructions
  • Intraprocedural level
  • Interprocedural level
– Compiler-driven FPGA area allocation
Software solution:
– Application rewriting (code transformation)
Instruction Scheduling
Compiler Optimizations
[Figure: two code listings.
a) Repetitive reconfigurations: SET op1, EXEC op1, …
b) Single reconfiguration: SET op1, EXEC op1, … and a separate SET op1]
Instruction Scheduling
Compiler Optimizations
[Figure: two code listings.
a) Repetitive reconfigurations: SET op1, EXEC op1, …
b) Multiple hardware operations: SET op1, EXEC op1, SET op2, EXEC op2]
Speculation-Based Instruction Scheduling
Algorithm based on:
– Edge profiling
– Speculation
  • The SET instruction does not cause any exception
– Information about FPGA area conflicts
[Figure: control-flow graph with nodes A, B, C, D, edge execution counts 25 and 1000, and positions of SET op1]
[Figure: OP1 and OP2 placed on the FPGA]
Instruction Scheduling
STEP 1: Anticipation
Iterative backward data-flow analysis for partial anticipability
Local information:
– Gen(n)
– Kill(n)
Global information:
$$\mathrm{PANTin}(i) = \mathrm{Gen}(i) \cup \bigl(\mathrm{PANTout}(i) - \mathrm{Kill}(i)\bigr)$$
$$\mathrm{PANTout}(i) = \bigcup_{j \in \mathrm{Succ}(i)} \mathrm{PANTin}(j)$$
[Figure: node i with Gen(i), Kill(i), IN(i) and OUT(i); OUT(i) is fed by IN(s1), IN(s2), IN(s3), IN(s4) of its successors]
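A minimal Python sketch of this backward iteration, assuming the control-flow graph is given as a successor map. The meaning given to Gen and Kill here (Gen(n): hardware operations executed in block n; Kill(n): operations whose configuration is overwritten in n) is an assumption for illustration, and the four-block graph is hypothetical.

```python
def partial_anticipability(succ, gen, kill):
    """Iterate PANTin/PANTout to a fixed point over a CFG given as succ[n]."""
    pant_in = {n: set() for n in succ}
    pant_out = {n: set() for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            # PANTout(n) = union of PANTin(j) over successors j
            out = set()
            for s in succ[n]:
                out |= pant_in[s]
            # PANTin(n) = Gen(n) U (PANTout(n) - Kill(n))
            new_in = gen[n] | (out - kill[n])
            if out != pant_out[n] or new_in != pant_in[n]:
                pant_out[n], pant_in[n] = out, new_in
                changed = True
    return pant_in, pant_out


# Hypothetical diamond CFG: A -> B, A -> C, B -> D, C -> D; op1 runs only in B.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
gen = {"A": set(), "B": {"op1"}, "C": set(), "D": set()}
kill = {n: set() for n in succ}
pant_in, pant_out = partial_anticipability(succ, gen, kill)
print(pant_in["A"], pant_in["C"])
```

In this example op1 is partially anticipable at the entry of A (it is used on the path through B) but not at C.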
Instruction Scheduling
STEP 2: Availability
Iterative forward data-flow analysis for availability
Local information:
– Gen(n)
– Kill(n)
Global information:
$$\mathrm{AVALout}(i) = \mathrm{Gen}(i) \cup \bigl(\mathrm{AVALin}(i) - \mathrm{Kill}(i)\bigr)$$
$$\mathrm{AVALin}(i) = \bigcap_{j \in \mathrm{Pred}(i)} \mathrm{AVALout}(j)$$
[Figure: node i with Gen(i), Kill(i), IN(i) and OUT(i); IN(i) is fed by OUT(p1), OUT(p2), OUT(p3), OUT(p4) of its predecessors]
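A matching Python sketch of the forward iteration. Because the meet is an intersection, the sketch starts every AVALout at the full set of operations and treats blocks without predecessors as having an empty AVALin; both choices are the usual ones for this kind of analysis but are assumptions here, as are the example graph and the Gen/Kill sets (Gen(n): operations configured in n; Kill(n): operations whose configuration is overwritten in n).

```python
def availability(pred, gen, kill, universe):
    """Iterate AVALin/AVALout to a fixed point over a CFG given as pred[n]."""
    avail_in = {n: set() for n in pred}
    avail_out = {n: set(universe) for n in pred}  # optimistic start for the intersection
    changed = True
    while changed:
        changed = False
        for n in pred:
            if pred[n]:
                # AVALin(n) = intersection of AVALout(j) over predecessors j
                new_in = set(universe)
                for p in pred[n]:
                    new_in &= avail_out[p]
            else:
                new_in = set()  # nothing is configured before the entry block
            # AVALout(n) = Gen(n) U (AVALin(n) - Kill(n))
            new_out = gen[n] | (new_in - kill[n])
            if new_in != avail_in[n] or new_out != avail_out[n]:
                avail_in[n], avail_out[n] = new_in, new_out
                changed = True
    return avail_in, avail_out


# Hypothetical diamond CFG: op1 is configured in A and overwritten in C.
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
gen = {"A": {"op1"}, "B": set(), "C": set(), "D": set()}
kill = {"A": set(), "B": set(), "C": {"op1"}, "D": set()}
avail_in, avail_out = availability(pred, gen, kill, {"op1"})
print(avail_in["B"], avail_in["D"])
```

Here op1 is available on entry to B but not on entry to D, because the path through C destroys the configuration.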
Instruction Scheduling
STEP 3: Minimum s-t Cut
Anticipation graph for each HW op:
$$\mathrm{ESS} = \{(u, v) \mid op \notin \mathrm{AVALout}(u) \wedge op \in \mathrm{PANTin}(v)\}$$
Minimum s-t cut for finding the best insertion edges
[Figure: anticipation graph with source s, sink t and nodes B7, B8, B9, B10, B13, B14; edge weights 10, 10, 10, 200 and INF; the minimum s-t cut for op2 is marked]
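A minimum s-t cut can be computed from a maximum flow. The sketch below uses the Edmonds-Karp algorithm, which is a choice made for this illustration (the slides do not name a specific algorithm). The graph topology is hypothetical; only the node names and the weights 10, 200 and INF are borrowed from the figure. The sketch assumes every s-t path contains at least one finite-weight edge.

```python
from collections import deque

INF = float("inf")


def min_st_cut(edges, s, t):
    """edges: {(u, v): capacity}. Returns (cut value, sorted list of cut edges)."""
    residual, adj = {}, {}
    for (u, v), cap in edges.items():
        residual[(u, v)] = residual.get((u, v), 0) + cap
        residual.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs_parents():
        # Breadth-first search over edges with remaining capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if t not in parent:
            break
        # Find the bottleneck on the augmenting path, then push flow along it.
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck

    # Cut edges go from the side still reachable from s to the other side.
    reachable = set(bfs_parents())
    cut = sorted(e for e in edges if e[0] in reachable and e[1] not in reachable)
    return flow, cut


# Hypothetical anticipation graph; weights stand for edge execution counts.
edges = {("s", "B7"): INF, ("B7", "B8"): 10, ("B7", "B9"): 200,
         ("B8", "B10"): 10, ("B9", "B10"): 10, ("B10", "t"): INF}
cut_value, cut_edges = min_st_cut(edges, "s", "t")
print(cut_value, cut_edges)
```

The INF edges can never be part of a finite cut, so the result only contains edges on which inserting a SET is allowed.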
Interprocedural Instruction Scheduling
Goal: anticipation of SET instructions at interprocedural level
Initial: SAD – 117084, DCT – 1152, IDCT – 1152
Final: SAD – 1, DCT – 1, IDCT – 1
[Figure: FPGA area allocation for SAD, DCT and IDCT]
Step 1: Construction of the Call Graph
• We use the suifbrowser package
• No indirect procedure calls
• The call graph is a DAG
[Figure: MPEG2 encoder call graph over motion.c, transform.c and putseq.c, with the hardware procedures int sad(..), void dct(..) and void idct(..)]
Step 2: Propagation of Hardware Reconfigurations
• Interprocedural data-flow analysis
• Backward propagation
• For each procedure compute LRMOD and RMOD
$$\mathrm{LRMOD}(p) = \begin{cases} \mathrm{Rop}, & \text{if } p \text{ is executed on the FPGA} \\ \emptyset, & \text{otherwise} \end{cases}$$
$$\mathrm{RMOD}(p) = \mathrm{LRMOD}(p) \cup \bigcup_{s \in \mathrm{Succ}(p)} \mathrm{RMOD}(s)$$
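A short Python sketch of the backward propagation over an acyclic call graph. The call graph below is a simplified, assumed shape for the MPEG2 encoder (putseq calling motion_estimation, transform and itransform, each of which calls one hardware procedure); `compute_rmod` and the dictionary layout are hypothetical.

```python
def compute_rmod(call_graph, hw_ops):
    """call_graph: {procedure: [callees]}; hw_ops: {procedure: operation name}."""
    rmod = {}

    def visit(p):
        if p not in rmod:
            # LRMOD(p): the procedure's own operation if it runs on the FPGA.
            result = {hw_ops[p]} if p in hw_ops else set()
            # Union with RMOD of every callee (the call graph is a DAG).
            for callee in call_graph.get(p, []):
                result |= visit(callee)
            rmod[p] = result
        return rmod[p]

    for p in call_graph:
        visit(p)
    return rmod


call_graph = {
    "putseq": ["motion_estimation", "transform", "itransform"],
    "motion_estimation": ["sad"],
    "transform": ["dct"],
    "itransform": ["idct"],
    "sad": [], "dct": [], "idct": [],
}
hw_ops = {"sad": "sad", "dct": "dct", "idct": "idct"}
rmod = compute_rmod(call_graph, hw_ops)
print(sorted(rmod["putseq"]))
```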
Step 3: Conflict Propagation and Instruction Scheduling
Compute CF for each procedure:
$$\mathrm{CF}(p) = \{\, op_i \in \mathrm{RMOD}(p) \mid \exists\, op_j \in \mathrm{RMOD}(p),\ op_i \neq op_j \,\}$$
for each edge <pi, pj> in the call graph
    for each op in CF(pi) and [RMOD(pj) − CF(pj)]
        insert SET op in pi where pj is called
for each op in RMOD(root) − CF(root)
    insert SET op at the application entry point
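The two loops of Step 3 can be sketched in Python as follows. The sketch assumes that area conflicts are supplied as a set of unordered operation pairs and that CF(p) collects the operations in RMOD(p) that conflict with another operation in RMOD(p); that reading of the CF definition, the RMOD values and the call graph shape are assumptions for illustration.

```python
def schedule_sets(call_graph, rmod, conflicts, root):
    """Return (SET insertions at call sites, SETs at the application entry point)."""
    def cf(p):
        # Operations of RMOD(p) that conflict with another operation of RMOD(p).
        return {a for a in rmod[p] for b in rmod[p]
                if a != b and frozenset((a, b)) in conflicts}

    insertions = []
    for pi, callees in call_graph.items():
        for pj in callees:
            # op in CF(pi) and in RMOD(pj) - CF(pj): configure just before the call.
            for op in sorted(cf(pi) & (rmod[pj] - cf(pj))):
                insertions.append((pi, pj, op))
    entry = sorted(rmod[root] - cf(root))
    return insertions, entry


call_graph = {
    "putseq": ["motion_estimation", "transform", "itransform"],
    "motion_estimation": ["sad"], "transform": ["dct"], "itransform": ["idct"],
    "sad": [], "dct": [], "idct": [],
}
rmod = {
    "putseq": {"sad", "dct", "idct"},
    "motion_estimation": {"sad"}, "transform": {"dct"}, "itransform": {"idct"},
    "sad": {"sad"}, "dct": {"dct"}, "idct": {"idct"},
}
all_conflict = {frozenset(("sad", "dct")), frozenset(("sad", "idct")),
                frozenset(("dct", "idct"))}

with_conflicts = schedule_sets(call_graph, rmod, all_conflict, "putseq")
no_conflicts = schedule_sets(call_graph, rmod, set(), "putseq")
print(with_conflicts)
print(no_conflicts)
```

With all three operations in conflict, each SET lands in putseq just before the corresponding call; with no conflicts, all three SETs move to the application entry point.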
putseq:
    …
    SET sad
    call motion_estimation
    …
    SET dct
    call transform
    …
    SET idct
    call itransform
Compiler-driven FPGA area allocation
Example:
• FPGA: area units 1 … 12
• ROP 1: area units 1 2 3
• ROP 2: area units 1 2 3 4
• ROP 3: area units 1 … 8
• Trace: n(Rop1) = 4; n(Rop2) = 2; n(Rop3) = 1
[Figure: the FPGA area units 1 … 12 divided into FIX and RW regions, with Rop1, Rop2 and Rop3 placed]
FIX/RW Algorithm
Rops selection: FIX/RW
0-1 integer linear programming problem
– min:
$$\sum_{Rop_i \in ROP} x_i \cdot A_i \cdot n(T)$$
– constraints:
$$\begin{cases}
A_1 \cdot x_1 + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
A_2 \cdot x_2 + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
\quad\vdots \\
A_i \cdot x_i + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
\quad\vdots \\
A_n \cdot x_n + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S
\end{cases}$$
FIX/RW/SW Algorithm
– min:
$$\sum_{i=1}^{n} cost\_fix_i \cdot xfix_i + \sum_{i=1}^{n} cost\_rw_i \cdot xrw_i + \sum_{i=1}^{n} cost\_sw_i \cdot xsw_i$$
– constraints:
$$\begin{cases}
A_1 \cdot xrw_1 + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
A_2 \cdot xrw_2 + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
\quad\vdots \\
A_i \cdot xrw_i + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
\quad\vdots \\
A_n \cdot xrw_n + \sum_{j=1}^{n} A_j \cdot xfix_j \le S
\end{cases}$$
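For a handful of operations the FIX/RW/SW problem can be solved by plain enumeration, which makes the objective and the area constraints easy to inspect. The sketch below is not an ILP solver and is not the authors' implementation. It assumes that each operation takes exactly one of the three options (a constraint not visible on the slide), reads the areas 3, 4, 8 and S = 12 off the earlier FPGA example, and uses made-up cost values.

```python
from itertools import product


def allocate(areas, cost_fix, cost_rw, cost_sw, total_area):
    """Return (best cost, best choice) over all FIX/RW/SW assignments."""
    costs = {"FIX": cost_fix, "RW": cost_rw, "SW": cost_sw}
    best_cost, best_choice = None, None
    for choice in product(("FIX", "RW", "SW"), repeat=len(areas)):
        # Sum over j of A_j * xfix_j: area held permanently by FIX operations.
        fixed_area = sum(a for a, c in zip(areas, choice) if c == "FIX")
        # Constraint i: A_i * xrw_i + fixed_area <= S, for every operation i.
        feasible = all(
            (a if c == "RW" else 0) + fixed_area <= total_area
            for a, c in zip(areas, choice)
        )
        if not feasible:
            continue
        cost = sum(costs[c][i] for i, c in enumerate(choice))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


areas = [3, 4, 8]         # Rop1, Rop2, Rop3 (assumed from the example FPGA)
cost_fix = [0, 0, 0]      # hypothetical: no recurring cost once fixed
cost_rw = [12, 8, 8]      # hypothetical: n(Rop_i) * A_i with n = 4, 2, 1
cost_sw = [40, 20, 10]    # hypothetical software execution costs
best_cost, best_choice = allocate(areas, cost_fix, cost_rw, cost_sw, 12)
print(best_cost, best_choice)
```

With these made-up costs, keeping Rop1 and Rop2 fixed and running Rop3 in software is cheapest, because Rop3 does not fit next to the two fixed operations.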
Intraprocedural Instruction Scheduling Algorithm
• Optimization implemented as a Machine SUIF pass
• Target application: M-JPEG encoder multimedia benchmark
• GPP included in the Molen prototype: IBM PowerPC 405 at 250 MHz
• Functions executed on the FPGA:
  – DCT (2D Discrete Cosine Transform)
  – Quantization
  – VLC (Variable Length Coding)
Intraprocedural Instruction Scheduling Algorithm
Xilinx IP cores for DCT, Quant and VLC
Simple scheduling: 10x slowdown for DCT

         HW Execution                   SW Execution
Op name  EXEC     Area     SET          One call   % of total
         [cycle]  [slice]  [cycle]      [cycle]    M-JPEG
DCT      416      848      431771       44396      80 %
Quant    73       397      202073       1494       3 %
VLC      272      193      98237        6921       12.5 %
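The "10x slowdown" figure follows directly from the cycle counts in the table: if every hardware DCT call is preceded by its own SET, one call costs 431771 + 416 cycles against 44396 cycles in software, a ratio of roughly 9.7. The short Python check below reproduces that arithmetic for all three operations. It ignores the cost of the parameter moves, which is an assumption of this sketch.

```python
# (EXEC cycles, SET cycles, software cycles per call), taken from the results table
ops = {
    "DCT": (416, 431771, 44396),
    "Quant": (73, 202073, 1494),
    "VLC": (272, 98237, 6921),
}


def slowdown_with_set(exec_cycles, set_cycles, sw_cycles):
    """Hardware time relative to software when every call pays for a SET."""
    return (set_cycles + exec_cycles) / sw_cycles


def speedup_without_set(exec_cycles, set_cycles, sw_cycles):
    """Software time relative to hardware when the SET cost is not paid per call."""
    return sw_cycles / exec_cycles


for name, cycles in ops.items():
    print(f"{name}: slowdown with SET per call = {slowdown_with_set(*cycles):.2f}, "
          f"speedup without it = {speedup_without_set(*cycles):.2f}")
```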
Interprocedural Instruction Scheduling Algorithm
M-JPEG encoder:
– input: 30 frames from "tennis", 256x256
– Hardware operations: DCT, Quantization, VLC
MPEG2 encoder:
– input: 3 standard test frames
– Hardware operations: SAD, DCT, IDCT
Interprocedural Instruction Scheduling Algorithm
M-JPEG encoder:

         Initial   With interprocedural optimization
HW op    [#SET]    No cf   DCT–Quant cf   DCT–VLC cf   Quant–VLC cf   All cf
DCT      61440     1       15360          15360        1              15360
Quant    15360     1       15360          1            15360          15360
VLC      15360     1       1              15360        15360          15360
Interprocedural Instruction Scheduling Algorithm
MPEG2 encoder:

         Initial   With interprocedural optimization
HW op    [#SET]    No cf   SAD–DCT cf   SAD–IDCT cf   DCT–IDCT cf   All cf
SAD      117084    1       3            3             1             3
DCT      1152      1       3            1             3             3
IDCT     1152      1       1            3             3             3
Conclusions
• The proposed compiler optimization can significantly reduce the number of performed reconfigurations and improve the overall performance
• The anticipation of the SET instructions will allow the hardware reconfigurations to be performed in parallel with the GPP execution
Thank you!