
The Molen Compiler Backend for Reconfigurable Architectures

Computer Engineering, TU Delft

The Netherlands

Elena Moscu Panainte, Carlo Galuzzi, Yana Yankova, Koen Bertels, Stamatis Vassiliadis

Elena Moscu Panainte

OUTLINE

• Background
  – Molen machine organization
  – Molen programming paradigm
• Molen Compiler
• Optimizations for Dynamic Reconfiguration
  – Intra/Interprocedural instruction scheduling
  – Compiler-driven FPGA area allocation
• Results
• Conclusions


The Molen Machine Organization

Main components:
• GPP
• Reconfigurable Processor
• Arbiter
• Exchange Registers


The Molen Prototype

[Figure: Molen machine organization]

Molen prototype implemented on Virtex II Pro


The Molen Programming Paradigm (I)

A one-time architectural extension of a few instructions:
– Two* instructions for controlling the FPGA
  • SET <address>: for hardware configuration
  • EXECUTE <address>: for controlling the execution on the FPGA
– Two move instructions for passing values to and from the GPP register file and the FPGA

The FPGA has an associated special set of registers: the Exchange Registers (XRs)


The Molen Programming Paradigm (II)

Example: C code: res = alpha(param1, param2);

movtx XR1 ← param1              Send parameters
movtx XR2 ← param2
set <address_alpha_set>         HW reconfiguration
exec <address_alpha_exec>       HW execution
movfx res ← XR3                 Return result
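The call sequence above can be sketched as a small generator. This is an illustrative sketch only, not the compiler's actual code: the function name `molen_call_sequence` is hypothetical, and it assumes that parameters go to consecutive exchange registers and that the result comes back in the register after the last parameter, which generalizes the single example on the slide.

```python
def molen_call_sequence(func, params, result):
    """Return the Molen instruction sequence for `result = func(*params)`."""
    seq = []
    # Send each parameter through the next exchange register (XR1, XR2, ...).
    for k, param in enumerate(params, start=1):
        seq.append(f"movtx XR{k} <- {param}")
    seq.append(f"set <address_{func}_set>")    # hardware reconfiguration
    seq.append(f"exec <address_{func}_exec>")  # hardware execution
    # Assumption: the result is returned in the XR following the last parameter.
    seq.append(f"movfx {result} <- XR{len(params) + 1}")
    return seq


sequence = molen_call_sequence("alpha", ["param1", "param2"], "res")
for line in sequence:
    print(line)
```

For the slide's example this reproduces the five instructions listed above, with `<-` standing in for the arrow.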




The Molen Compiler

[Figure: structure of the Molen compiler. Labels: C application (MAIN.c … File_n.c); SUIF frontend; Machine SUIF backend framework; PowerPC backend; Molen Extensions: ISA extension (SET/EXEC), Register extension; Molen Optimizations; Compiler; FCCM]


PowerPC Backend

• PowerPC instruction generation
• PowerPC register allocation
• PowerPC EABI stack frame allocation

+
• SET/EXECUTE: ISA extension
• XRs: Register extension




Molen Compiler: Optimizations

Challenge: huge reconfiguration latency (for the SET instruction)

• Repetitive reconfiguration:
  – Performance decrease of one order of magnitude
• Hardware kernel executions:
  – Speedup of one order of magnitude


Solutions

Hardware solutions:
– Partial configurations
– Configuration prefetching

Compiler solutions:
– Scheduling of SET instructions
  • Intraprocedural level
  • Interprocedural level
– Compiler-driven FPGA area allocation

Software solution:
– Application rewriting (code transformation)


Instruction Scheduling

Compiler Optimizations

[Figure: a) Repetitive reconfigurations (SET op1, EXEC op1, …); b) Single reconfiguration (SET op1)]

Instruction Scheduling

Compiler Optimizations


[Figure: a) Repetitive reconfigurations (SET op1, EXEC op1, SET op2, EXEC op2); b) Multiple hardware operations]


Speculation-Based Instruction Scheduling

Algorithm based on:
– Edge profiling
– Speculation
  • The SET instruction does not cause any exception
– Information about FPGA area conflicts

[Figure: control-flow graph with nodes A, B, C, D, edge-profile counts 25 and 1000, and candidate positions for SET op1; FPGA containing OP1 and OP2]


Instruction Scheduling, STEP 1: Anticipation

STEP 1: Iterative backward data-flow analysis for partial anticipability

Local information:
– Gen(n)
– Kill(n)

Global information:

\[ PANTin(i) = Gen(i) \cup \bigl( PANTout(i) - Kill(i) \bigr) \]

\[ PANTout(i) = \bigcup_{j \in Succ^{+}(i)} PANTin(j) \]

[Figure: node i with Gen(i), Kill(i), IN(i) and OUT(i); OUT(i) is fed by IN(s1), IN(s2), IN(s3), IN(s4) of its successors]
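A minimal sketch of the backward analysis: PANTin and PANTout are iterated to a fixed point, taking the union over successors. The CFG, the contents of Gen and Kill, and the function name are hypothetical. Here Gen(n) is assumed to hold the hardware operations executed in node n, Kill(n) the operations whose configuration node n would overwrite, and plain CFG successors are used.

```python
def partial_anticipability(succ, gen, kill):
    """Iterate PANTin/PANTout to a fixed point (backward, union over successors)."""
    nodes = list(succ)
    pant_in = {n: set() for n in nodes}
    pant_out = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            # PANTout(n): union of PANTin over the successors of n.
            out = set()
            for s in succ[n]:
                out |= pant_in[s]
            # PANTin(n) = Gen(n) U (PANTout(n) - Kill(n))
            new_in = gen[n] | (out - kill[n])
            if out != pant_out[n] or new_in != pant_in[n]:
                pant_out[n], pant_in[n] = out, new_in
                changed = True
    return pant_in, pant_out


# Hypothetical CFG: A -> B, A -> C, B -> D, C -> D; op1 is executed only in B.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
gen = {"A": set(), "B": {"op1"}, "C": set(), "D": set()}
kill = {n: set() for n in succ}
pant_in, pant_out = partial_anticipability(succ, gen, kill)
print(pant_in)
```

In this example op1 is partially anticipated at the entry of A, because one of the two paths leaving A reaches its execution in B.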


Instruction Scheduling, STEP 2: Availability

STEP 2: Iterative forward data-flow analysis for availability

Local information:
– Gen(n)
– Kill(n)

Global information:

\[ AVALout(i) = Gen(i) \cup \bigl( AVALin(i) - Kill(i) \bigr) \]

\[ AVALin(i) = \bigcap_{j \in Pred(i)} AVALout(j) \]

[Figure: node i with Gen(i), Kill(i), IN(i) and OUT(i); IN(i) is fed by OUT(p1), OUT(p2), OUT(p3), OUT(p4) of its predecessors]
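The forward counterpart can be sketched the same way, with intersection over predecessors. This is a sketch under assumptions rather than the actual MachineSUIF pass: the CFG is hypothetical, Gen(n) is assumed to hold the operations whose configuration is present after node n, and entry nodes are assumed to start with nothing available.

```python
def availability(pred, gen, kill, universe):
    """Iterate AVALin/AVALout to a fixed point (forward, intersection over predecessors)."""
    nodes = list(pred)
    # Optimistic start for an intersection problem: everything is available,
    # except at entry nodes (no predecessors), where nothing is.
    avail_in = {n: (set(universe) if pred[n] else set()) for n in nodes}
    avail_out = {n: gen[n] | (avail_in[n] - kill[n]) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if pred[n]:
                new_in = set(universe)
                for p in pred[n]:
                    new_in &= avail_out[p]
            else:
                new_in = set()
            # AVALout(n) = Gen(n) U (AVALin(n) - Kill(n))
            new_out = gen[n] | (new_in - kill[n])
            if new_in != avail_in[n] or new_out != avail_out[n]:
                avail_in[n], avail_out[n] = new_in, new_out
                changed = True
    return avail_in, avail_out


# Hypothetical CFG: A -> B, A -> C, B -> D, C -> D; op1 is configured only in B.
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
gen = {"A": set(), "B": {"op1"}, "C": set(), "D": set()}
kill = {n: set() for n in pred}
avail_in, avail_out = availability(pred, gen, kill, {"op1"})
print(avail_out)
```

Because op1 reaches D along only one of its two incoming paths, it is not available at the entry of D.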


Instruction Scheduling, STEP 3: Minimum s-t Cut

Anticipation graph for each HW op:

\[ ESS = \{ (u, v) \mid op \notin AVALout(u) \wedge op \in PANTin(v) \} \]

Minimum s-t cut for finding the best insertion edges

[Figure: anticipation graph for op2 with source s, sink t, nodes B7, B8, B9, B10, B13, B14, edge weights 10, 200 and INF, and the minimum s-t cut marked]
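One way to compute a minimum s-t cut is through maximum flow. The sketch below uses Edmonds-Karp, which is a standard choice and not necessarily the algorithm used in the Molen compiler. The graph is hypothetical (the topology of the slide's figure is not recoverable) and edge weights are assumed to be profiled execution frequencies, with INF marking edges that must not be cut.

```python
from collections import deque

INF = float("inf")


def min_st_cut(capacity, s, t):
    """Edmonds-Karp max-flow; returns (cut value, sorted list of cut edges).

    Assumes no s-t path consists only of INF edges.
    """
    # Residual capacities, with reverse edges initialised to 0.
    res = {}
    for (u, v), c in capacity.items():
        res.setdefault(u, {})
        res.setdefault(v, {})
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        # Walk back from t to collect the path, then push the bottleneck.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push
    # Nodes still reachable from s in the residual graph form the s side.
    reachable = set(parent)
    cut = [(u, v) for (u, v) in capacity if u in reachable and v not in reachable]
    return flow, sorted(cut)


# Hypothetical anticipation graph; weights are assumed edge frequencies.
capacity = {
    ("s", "B7"): INF, ("B7", "B8"): 200,
    ("B8", "B9"): 10, ("B8", "B10"): 10,
    ("B9", "t"): INF, ("B10", "t"): INF,
}
value, cut_edges = min_st_cut(capacity, "s", "t")
print(value, cut_edges)
```

In this example, placing the SET on the two low-frequency edges leaving B8 (total weight 20) is cheaper than placing it on the edge B7 → B8 (weight 200).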


Interprocedural Instruction Scheduling

Goal: anticipation of SET instructions at interprocedural level

        Initial    Final
SAD     117084     1
DCT     1152       1
IDCT    1152       1

[Figure: FPGA area allocation for SAD, IDCT and DCT]


Step 1: Construction of the Call Graph

• We use the suifbrowser package
• No indirect procedure calls
• The call graph is a DAG

[Figure: MPEG2 Encoder call graph across the source files motion.c, transform.c and putseq.c, with the procedures int sad(..), void dct(..) and void idct(..)]


Step 2: Propagation of Hardware Reconfigurations

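• Interprocedural data-flow analysis
• Backward propagation
• For each procedure, compute LRMOD and RMOD

\[ LRMOD(p) = \begin{cases} Rop, & \text{if } p \text{ is executed on the FPGA} \\ \emptyset, & \text{otherwise} \end{cases} \]

\[ RMOD(p) = LRMOD(p) \cup \bigcup_{s \in Succ(p)} RMOD(s) \]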
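Since the call graph is a DAG, RMOD can be computed with a memoized traversal from each procedure down to its callees. The sketch below is illustrative: `compute_rmod` is a hypothetical name, and the call edges from motion_estimation, transform and itransform to sad, dct and idct are assumptions about the MPEG2 encoder, not taken from the slides.

```python
def compute_rmod(calls, hw_ops):
    """calls: procedure -> list of callees (a DAG); hw_ops: procedure -> its Rop."""
    rmod = {}

    def visit(p):
        if p not in rmod:
            # LRMOD(p): the procedure's own operation if it runs on the FPGA.
            result = {hw_ops[p]} if p in hw_ops else set()
            # Union with RMOD of every callee (backward propagation).
            for s in calls.get(p, []):
                result |= visit(s)
            rmod[p] = result
        return rmod[p]

    for p in calls:
        visit(p)
    return rmod


# Hypothetical fragment of the MPEG2 encoder call graph.
calls = {
    "putseq": ["motion_estimation", "transform", "itransform"],
    "motion_estimation": ["sad"],
    "transform": ["dct"],
    "itransform": ["idct"],
}
hw_ops = {"sad": "SAD", "dct": "DCT", "idct": "IDCT"}
rmod = compute_rmod(calls, hw_ops)
print(rmod["putseq"])
```

Every hardware operation reachable from putseq ends up in RMOD(putseq), which is what the later scheduling step needs at the root.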


Step 3: Conflict Propagation and Instruction Scheduling

Compute CF for each procedure:

\[ CF(p) = \{\, op_i \in RMOD(p) \mid \exists\, op_j \in RMOD(p),\ op_i \neq op_j \,\} \]

for each edge <pi,pj> in the call graph
    for each op in CF(pi) and [RMOD(pj) - CF(pj)]
        insert SET op in pi where pj is called
for each op in RMOD(root) - CF(root)
    insert SET op at the application entry point

Example:

putseq:
    …..
    SET sad
    call motion_estimation
    ……
    SET dct
    call transform
    …….
    SET idct
    call itransform
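The insertion rules can be sketched as follows. This is a sketch under stated assumptions, not the actual pass: the area-conflict relation between operations is passed in explicitly as a set of unordered pairs, the RMOD sets are given as literals, and the call graph is the same hypothetical MPEG2 fragment as in the `putseq` example.

```python
def schedule_sets(calls, rmod, conflicts, root):
    """Return ([(caller, callee, op) SET insertions], [ops to SET at program entry])."""

    def cf(p):
        # Operations of RMOD(p) that conflict with another operation of RMOD(p).
        ops = rmod[p]
        return {a for a in ops for b in ops
                if a != b and frozenset((a, b)) in conflicts}

    inserted = []
    for pi, callees in calls.items():
        for pj in callees:
            # Ops in conflict at pi but conflict-free within pj: SET before the call.
            for op in sorted(cf(pi) & (rmod[pj] - cf(pj))):
                inserted.append((pi, pj, op))
    # Conflict-free ops at the root are configured once, at the entry point.
    entry = sorted(rmod[root] - cf(root))
    return inserted, entry


calls = {
    "putseq": ["motion_estimation", "transform", "itransform"],
    "motion_estimation": ["sad"], "transform": ["dct"], "itransform": ["idct"],
}
rmod = {
    "putseq": {"sad", "dct", "idct"},
    "motion_estimation": {"sad"}, "transform": {"dct"}, "itransform": {"idct"},
    "sad": {"sad"}, "dct": {"dct"}, "idct": {"idct"},
}
# Assumption: all three operations compete for the same FPGA area.
all_conflict = {frozenset(p) for p in [("sad", "dct"), ("sad", "idct"), ("dct", "idct")]}
inserted, entry = schedule_sets(calls, rmod, all_conflict, "putseq")
no_cf_inserted, no_cf_entry = schedule_sets(calls, rmod, set(), "putseq")
print(inserted, entry)
```

With all operations in conflict, each SET lands in putseq just before the corresponding call, as in the example above. With no conflicts, every SET moves to the application entry point.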


Compiler-driven FPGA area allocation

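[Figure: example FPGA with area units 1 … 12; ROP 1 spans units 1 to 3, ROP 2 spans units 1 to 4, ROP 3 spans units 1 … 8; the FPGA area is divided into a FIX part and an RW part holding Rop1, Rop2 and Rop3]

Trace: n(Rop1) = 4; n(Rop2) = 2; n(Rop3) = 1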


FIX/RW Algorithm

Rops selection: FIX / RW

0-1 integer linear programming problem
– min:

\[ \sum_{Rop_i \in ROP} x_i \cdot A_i \cdot n(T) \]

– constraints:

\[
\begin{cases}
A_1 \cdot x_1 + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
A_2 \cdot x_2 + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
\quad \vdots \\
A_i \cdot x_i + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S \\
\quad \vdots \\
A_n \cdot x_n + \sum_{Rop_j \in ROP} A_j \cdot x_j \le S
\end{cases}
\]


FIX/RW/SW Algorithm

min:

\[ \sum_{i=1}^{n} xfix_i \cdot cost\_fix_i + \sum_{i=1}^{n} xrw_i \cdot cost\_rw_i + \sum_{i=1}^{n} xsw_i \cdot cost\_sw_i \]

constraints:

\[
\begin{cases}
A_1 \cdot xrw_1 + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
A_2 \cdot xrw_2 + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
\quad \vdots \\
A_i \cdot xrw_i + \sum_{j=1}^{n} A_j \cdot xfix_j \le S \\
\quad \vdots \\
A_n \cdot xrw_n + \sum_{j=1}^{n} A_j \cdot xfix_j \le S
\end{cases}
\]
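For a handful of operations, the FIX/RW/SW selection can be explored by brute force instead of an ILP solver. The sketch below is illustrative only: all areas and costs are hypothetical, and it assumes that each operation receives exactly one of the three allocations, which the slide does not state explicitly. The area check is the constraint family above: every RW operation must fit next to all FIX operations within the total area S.

```python
from itertools import product


def allocate(areas, cost_fix, cost_rw, cost_sw, S):
    """Exhaustively pick FIX/RW/SW per operation; return (min cost, choice)."""
    n = len(areas)
    cost = {"fix": cost_fix, "rw": cost_rw, "sw": cost_sw}
    best = None
    for choice in product(("fix", "rw", "sw"), repeat=n):
        fixed_area = sum(a for a, c in zip(areas, choice) if c == "fix")
        rw_areas = [a for a, c in zip(areas, choice) if c == "rw"]
        # A_i * xrw_i + sum_j A_j * xfix_j <= S must hold for every i.
        if fixed_area > S or any(a + fixed_area > S for a in rw_areas):
            continue
        total = sum(cost[c][i] for i, c in enumerate(choice))
        if best is None or total < best[0]:
            best = (total, choice)
    return best


# Hypothetical instance: three operations on an FPGA with S = 12 area units.
areas = [3, 4, 8]
cost_fix = [3, 4, 8]
cost_rw = [12, 8, 8]
cost_sw = [40, 30, 50]
best_cost, best_choice = allocate(areas, cost_fix, cost_rw, cost_sw, S=12)
print(best_cost, best_choice)
```

Exhaustive search visits 3^n assignments, so it only serves to illustrate the formulation. A 0-1 ILP solver is the natural tool for realistic sizes.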




Intraprocedural Instruction Scheduling Algorithm

• Optimization implemented as a MachineSUIF pass
• Target application: M-JPEG encoder multimedia benchmark
• GPP included in the Molen prototype: IBM PowerPC 405 at 250 MHz
• Functions executed on the FPGA:
  – DCT (2D Discrete Cosine Transform)
  – Quantization
  – VLC (Variable Length Coding)



Xilinx IP cores for DCT, Quant and VLC

Simple scheduling: 10x slowdown for DCT

          HW Execution                                SW Execution
Op Name   EXEC [cycle]   Area [slice]   SET [cycle]   One call [cycle]   %Total M-JPEG
DCT       416            848            431771        44396              80 %
Quant     73             397            202073        1494               3 %
VLC       272            193            98237         6921               12.5 %
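The slowdown under simple scheduling can be checked against the table. This is a back-of-the-envelope sketch: it assumes the cycle counts are directly comparable and ignores parameter-transfer overhead, so it only compares one SET plus one EXEC per call against one software call.

```python
# Cycle counts copied from the table: EXEC, SET, and one software call.
TABLE = {
    "DCT":   {"exec": 416, "set": 431771, "sw_call": 44396},
    "Quant": {"exec": 73,  "set": 202073, "sw_call": 1494},
    "VLC":   {"exec": 272, "set": 98237,  "sw_call": 6921},
}


def slowdown_with_repeated_set(op):
    """Ratio of (SET + EXEC) to one software call, i.e. reconfiguring on every call."""
    row = TABLE[op]
    return (row["set"] + row["exec"]) / row["sw_call"]


slowdowns = {op: round(slowdown_with_repeated_set(op), 1) for op in TABLE}
print(slowdowns)
```

For DCT the ratio comes out just under 10, in line with the "10x slowdown" quoted for simple scheduling.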


Interprocedural Instruction Scheduling Algorithm

M-JPEG encoder:
– input: 30 frames from “tennis”, 256x256
– Hardware operations: DCT, Quantization, VLC

MPEG2 encoder:
– input: 3 standard test frames
– Hardware operations: SAD, DCT, IDCT



M-JPEG encoder:

                    With interprocedural optimization
HW op   Initial     No cf   DCT–Quant cf   DCT–VLC cf   Quant–VLC cf   All cf
        [#SET]
DCT     61440       1       15360          15360        1              15360
Quant   15360       1       15360          1            15360          15360
VLC     15360       1       1              15360        15360          15360



MPEG2 encoder:

                    With interprocedural optimization
HW op   Initial     No cf   SAD–DCT cf   SAD–IDCT cf   DCT–IDCT cf   All cf
        [#SET]
SAD     117084      1       3            3             1             3
DCT     1152        1       3            1             3             3
IDCT    1152        1       1            3             3             3


Conclusions

• The proposed compiler optimization can significantly reduce the number of performed reconfigurations and improve the overall performance
• The anticipation of the SET instructions will allow the hardware reconfigurations to be performed in parallel with the GPP execution


Thank you!
