WARP PROCESSORS

Roman Lysecky, Greg Stitt, Frank Vahid, Warp Processors, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.11 n.3, p.659-681, July 2006

MOTIVATION

Wish to overcome barriers to FPGA acceleration:
  - Integrating tools into SW flows
  - Non-conformance to the standard binary concept

Aim to make FPGAs invisible to the SW developer:
  - Dynamically determine critical regions
  - Re-implement them as custom HW
  - Communicate between HW and SW

SYSTEM OVERVIEW

Initially execute application in SW only

Profile application to determine critical regions

Partition critical regions to HW

Program configurable logic and update SW binary

Partitioned application’s speed “warps” as accelerator takes over critical region
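How much the application "warps" depends on how much of the runtime the critical region covers. A rough estimate, using Amdahl's law rather than anything stated on the slides, with hypothetical numbers and a hypothetical function name:

```python
def overall_speedup(critical_fraction, kernel_speedup):
    """Amdahl's law: only the critical region runs faster in HW."""
    remaining = 1.0 - critical_fraction
    return 1.0 / (remaining + critical_fraction / kernel_speedup)

# Hypothetical case: a loop taking 90% of runtime, accelerated 10x in HW.
print(round(overall_speedup(0.9, 10.0), 2))
```

Even a large kernel speedup is bounded by the part of the application that stays in SW.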

PROFILING

Typical profilers instrument code
  - Changes program behaviour
  - Requires extra tools

Warp profiler monitors instruction addresses seen on instruction memory bus

Maintains cache of 16 8-bit entries to store backward branch frequencies
  - Maintains relative frequencies
  - Accurately selects kernels within 10 saturations
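A minimal software model of such a profiler cache may help. The sketch below assumes that all counters are halved when one saturates (one way to keep relative frequencies) and that the least-frequent entry is evicted when the cache is full. Neither policy, nor the class and method names, comes from the slides.

```python
class BranchProfiler:
    def __init__(self, entries=16, bits=8):
        self.entries = entries
        self.max_count = (1 << bits) - 1   # 255 for 8-bit counters
        self.counts = {}                   # branch target -> counter
        self.saturations = 0

    def observe(self, pc, target):
        """Record one taken branch; only backward branches (loops) count."""
        if target >= pc:
            return
        if target not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed policy: evict the least-frequent entry.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[target] = 0
        self.counts[target] += 1
        if self.counts[target] >= self.max_count:
            # Assumed policy: halve every counter so that relative
            # frequencies survive a saturation.
            self.counts = {a: c >> 1 for a, c in self.counts.items()}
            self.saturations += 1

    def hottest(self):
        """Target address of the most frequent backward branch."""
        return max(self.counts, key=self.counts.get)
```

Feeding it a trace where one loop branch is taken far more often than another leaves that loop's target as `hottest()`.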

ON-CHIP CAD

On-chip CAD module implemented on separate ARM7 processor

In multi-processor environments, only one CAD module is necessary

Stages:
  - Decompilation
  - Partitioning
  - Behavioral & RT Synthesis
  - JIT FPGA Compilation

DECOMPILATION

Used to recover high-level constructs, e.g. loops, if statements, arrays

Decompiles critical region into a control/data flow graph (CDFG)
  1) Intermediate code creation
  2) High-level construct recovery
  3) Map into statements/expressions

Use techniques to undo compiler optimizations
  - Loop re-rolling
  - Strength promotion
  - Compare-with-zero optimization
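As an illustration of strength promotion, the sketch below assumes the compiler has strength-reduced a multiplication by a constant into shifts and adds, and folds them back into a single multiply. The term representation and the function name are hypothetical, not the tool's actual data structures.

```python
from collections import defaultdict

def promote_strength(terms):
    """terms: (variable, shift) pairs whose sum forms the expression,
    e.g. [("x", 3), ("x", 1)] stands for (x << 3) + (x << 1).
    Returns {variable: constant}, i.e. a sum of variable * constant."""
    coeffs = defaultdict(int)
    for var, shift in terms:
        coeffs[var] += 1 << shift   # x << k is x * 2**k
    return dict(coeffs)

# (x << 3) + (x << 1) is promoted back to x * 10.
print(promote_strength([("x", 3), ("x", 1)]))
```

A recovered multiplication can then be mapped to a hardware multiplier instead of a chain of shifters and adders.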

PARTITIONING

Determines which software kernels are most suitable for implementation in HW

Uses heuristic [assumed the 0-1 knapsack heuristic] to choose kernels to maximize speedup while reducing energy
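One possible reading of a 0-1 knapsack heuristic is a greedy pick by benefit per unit of area. The sketch below is a guess at that idea, not the actual partitioner. The kernel tuples, the numbers and the capacity are hypothetical, and energy is not modelled.

```python
def select_kernels(kernels, capacity):
    """kernels: list of (name, time_saved, area) tuples.
    Greedily picks kernels by time saved per unit area until the
    configurable logic capacity is used up."""
    chosen, used = [], 0
    ranked = sorted(kernels, key=lambda k: k[1] / k[2], reverse=True)
    for name, _saved, area in ranked:
        if used + area <= capacity:
            chosen.append(name)
            used += area
    return chosen

# Hypothetical estimates: (kernel, cycles saved, area in CLBs).
candidates = [("fir", 60, 40), ("idct", 50, 70), ("brev", 20, 10)]
print(select_kernels(candidates, capacity=100))
```

The greedy choice is not always optimal, but it is cheap enough to run on a small on-chip processor.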

BEHAVIORAL & RT SYNTHESIS

Converts CDFG for each critical kernel to HW circuit description

Then converts into netlist format

JIT COMPILATION: LOGIC SYNTHESIS

Optimizes hardware circuit

Creates acyclic graph of Boolean logic network

Nodes correspond to any simple 2-input logic gate

Uses Riverside on-chip minimizer (ROCM), a simple two-level logic minimizer
  - Traverses the graph in breadth-first manner, applying logic minimization at each node

15x faster & 3x less memory than Espresso-II
  - 2% increase in circuit size

JIT COMPILATION: TECHNOLOGY MAPPING

Maps hardware onto CLBs and LUTs of the routing-oriented configurable logic fabric (RCLF)

3-phase greedy hierarchical graph-clustering algorithm

1) Breadth-first traversal of input acyclic graph creates 3-input 1-output LUT nodes

2) Breadth-first traversal combines nodes where possible to form final 3-input 2-output LUTs

3) Traverses graph final time, packs LUTs into CLBs

25X faster than commercial algorithms
  - Only minimally impacts circuit delay
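The first phase can be pictured as a greedy clustering of 2-input gates into 3-input LUT nodes. The sketch below is a simplified stand-in, not the published algorithm: it walks breadth-first from the outputs and absorbs a single-fanout fan-in gate whenever the cluster still has at most 3 inputs. The netlist format and the function name are assumptions.

```python
from collections import deque

def map_to_luts(gates, outputs, k=3):
    """gates: {gate: (in_a, in_b)}; names not in gates are primary inputs.
    Returns {lut_root: sorted list of that LUT's inputs}."""
    fanout = {}
    for ins in gates.values():
        for s in ins:
            fanout[s] = fanout.get(s, 0) + 1
    for o in outputs:                      # an output counts as a use
        fanout[o] = fanout.get(o, 0) + 1
    luts, queue, seen = {}, deque(outputs), set()
    while queue:
        root = queue.popleft()
        if root in seen or root not in gates:
            continue
        seen.add(root)
        inputs = set(gates[root])
        merged = True
        while merged:
            merged = False
            for s in sorted(inputs):
                # Absorb a single-fanout gate if the LUT keeps <= k inputs.
                if s in gates and fanout.get(s, 0) == 1:
                    candidate = (inputs - {s}) | set(gates[s])
                    if len(candidate) <= k:
                        inputs, merged = candidate, True
                        break
        luts[root] = sorted(inputs)
        queue.extend(sorted(inputs))       # continue towards the inputs
    return luts

# Hypothetical chain of three 2-input gates: g3 = f(f(f(a, b), c), d).
chain = {"g1": ("a", "b"), "g2": ("g1", "c"), "g3": ("g2", "d")}
print(map_to_luts(chain, ["g3"]))
```

In this example g2 is absorbed into the LUT rooted at g3, while g1 stays separate because merging it would need four inputs.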

JIT COMPILATION: PLACEMENT

Places network of CLBs onto configurable logic

Greedy dependency-based positional algorithm
  - Places critical path nodes on a single horizontal row of the RCLF
  - Analyzes dependencies between placed/unplaced nodes
  - Based on dependencies, places each node above (input to placed node) or below (uses output from placed node)

Attempts to utilize routing resources between adjacent CLBs

Superimposes and aligns relative placement onto RCLF
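The dependency-based idea can be sketched as a relative placement: row 0 holds the critical path, and every other node goes above or below a node it is connected to. This is an illustrative approximation. The coordinate scheme, the tie-breaking and the function name are assumptions, and routing resources are ignored.

```python
def place(critical_path, edges):
    """critical_path: ordered list of nodes; edges: (src, dst) pairs.
    Returns {node: (row, col)}; row 0 is the critical path, negative
    rows are above it, positive rows below."""
    pos = {n: (0, col) for col, n in enumerate(critical_path)}
    occupied = set(pos.values())
    progress = True
    while progress:
        progress = False
        for src, dst in edges:
            if src not in pos and dst in pos:
                anchor, node, step = pos[dst], src, -1   # input: above
            elif dst not in pos and src in pos:
                anchor, node, step = pos[src], dst, +1   # consumer: below
            else:
                continue
            row, col = anchor[0] + step, anchor[1]
            while (row, col) in occupied:                # next free slot
                row += step
            pos[node] = (row, col)
            occupied.add((row, col))
            progress = True
    return pos

# Hypothetical netlist: a -> b -> c is critical; x feeds b, b feeds y.
netlist = [("a", "b"), ("b", "c"), ("x", "b"), ("b", "y")]
print(place(["a", "b", "c"], netlist))
```

Shifting all coordinates so that the smallest row and column become zero would then align the relative placement with the fabric.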

JIT COMPILATION: ROUTING

Rips up illegal routes, adjusts routing costs of entire routing resource graph

Uses general approach of VPR’s routability-driven router
  - Allows both overuse of routing resources and illegal routes

Constructs routing conflict graph
  - Two routes conflict when both pass through a switch matrix and assigning them the same channel would result in illegal routing
  - Uses vertex coloring algorithm to assign routing channels

If any routes cannot be assigned a legal channel, rips them up, re-adjusts routing costs, and reroutes
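Channel assignment by vertex coloring can be sketched with a simple greedy coloring of the conflict graph. The highest-degree-first ordering and the function name are assumptions; the slides only say that a vertex coloring algorithm is used.

```python
def assign_channels(conflicts, num_channels):
    """conflicts: {route: set of routes it conflicts with}.
    Returns (assignment, failed): a channel per route, plus the routes
    that could not get a legal channel."""
    assignment, failed = {}, []
    # Assumed ordering: color the most constrained routes first.
    order = sorted(conflicts, key=lambda r: (-len(conflicts[r]), r))
    for route in order:
        taken = {assignment[n] for n in conflicts[route] if n in assignment}
        free = [c for c in range(num_channels) if c not in taken]
        if free:
            assignment[route] = free[0]
        else:
            failed.append(route)   # candidate for rip-up and reroute
    return assignment, failed

# Hypothetical conflict graph: n1, n2, n3 conflict pairwise; n4 only with n1.
graph = {
    "n1": {"n2", "n3", "n4"},
    "n2": {"n1", "n3"},
    "n3": {"n1", "n2"},
    "n4": {"n1"},
}
print(assign_channels(graph, num_channels=4))
```

With too few channels the same call reports the routes that would have to be ripped up and rerouted.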

JIT COMPILATION: BINARY UPDATER

Used to allow SW to communicate with accelerated HW kernel

Replaces original SW instructions for loop with a jump to HW init. code
  - Enables HW with memory-mapped register
  - Shuts down microprocessor to power-down sleep mode
  - HW asserts completion signal to cause SW interrupt to wake up microprocessor
  - Jumps back to end of SW loop
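The patching steps can be mimicked on a toy program held as a list of instruction strings. The mnemonics, the `HW_ENABLE` register name and the function name below are hypothetical; a real updater would rewrite encoded machine instructions.

```python
def patch_binary(program, loop_start, loop_end, hw_init_addr):
    """program: list of instruction strings indexed by address.
    Returns (patched program, HW init stub placed at hw_init_addr)."""
    patched = list(program)                       # leave the original intact
    patched[loop_start] = f"jump {hw_init_addr}"  # divert the loop to HW
    hw_stub = [
        "store HW_ENABLE, 1",     # enable HW via memory-mapped register
        "sleep",                  # power down until completion interrupt
        f"jump {loop_end + 1}",   # resume SW just past the original loop
    ]
    return patched, hw_stub

# Hypothetical program: addresses 1..3 form the critical loop.
sw = ["init", "load", "add", "branch 1", "store"]
patched, stub = patch_binary(sw, loop_start=1, loop_end=3, hw_init_addr=100)
print(patched)
print(stub)
```

Only the first loop instruction changes, so the rest of the binary keeps its original addresses.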

W-FPGAS

Data Address Generator (DADG)

Loop Control Hardware (LCH)

Multiplier-Accumulator (MAC)

All memory accesses handled through DADG

LCH for zero loop overhead

[Figure: W-FPGA block diagram with the DADG & LCH, registers Reg0, Reg1 and Reg2, a 32-bit MAC, and the Routing-Oriented Configurable Logic Fabric]

W-FPGAS: ROUTING-ORIENTED CONFIGURABLE LOGIC FABRIC

[Figure: Configurable Logic Fabric, an array of CLBs interleaved with SMs, connected to the DADG, LCH and 32-bit MAC]

RCLF consists of array of CLBs surrounded by switch matrices for routing between CLBs

Switch matrices (SMs) handle routing between CLBs
  - An SM can route signals to one of its 4 neighbouring SMs, or to an SM two rows/columns away
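The SM connectivity can be written down as a small neighbour function. The sketch assumes SMs sit on their own rows-by-columns grid and that the two-apart connections exist in all four directions; both points are assumptions, and the function name is hypothetical.

```python
def sm_neighbours(row, col, rows, cols):
    """Returns (short, long): SMs reachable from the SM at (row, col)
    over adjacent connections and over two-apart connections."""
    def inside(p):
        return 0 <= p[0] < rows and 0 <= p[1] < cols

    short = [(row + dr, col + dc)
             for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))]
    far = [(row + dr, col + dc)
           for dr, dc in ((-2, 0), (2, 0), (0, -2), (0, 2))]
    return [p for p in short if inside(p)], [p for p in far if inside(p)]

# Corner SM of a hypothetical 4x4 grid of switch matrices.
print(sm_neighbours(0, 0, 4, 4))
```

SMs at the edge of the fabric simply have fewer reachable neighbours.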

W-FPGAS: CONFIGURABLE LOGIC BLOCKS

Incorporates two 3-input 2-output LUTs
  - Equivalent to four 3-input 1-output LUTs with fixed internal routing
  - Reduces mapping complexity to increase speed
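A 3-input 2-output LUT is just an 8-row truth table with two output bits per row. The sketch below models one and, as an illustrative choice not taken from the slides, programs it as a full adder, which fits exactly because sum and carry share the same three inputs. The class name is hypothetical.

```python
class Lut3x2:
    def __init__(self, fn):
        # Precompute the 8 rows of (out0, out1) from a function of 3 bits.
        self.table = [fn((i >> 2) & 1, (i >> 1) & 1, i & 1)
                      for i in range(8)]

    def eval(self, a, b, c):
        """Look up both outputs for input bits a, b, c."""
        return self.table[(a << 2) | (b << 1) | c]

# One LUT holds a full adder: outputs are (sum, carry).
full_adder = Lut3x2(
    lambda a, b, c: (a ^ b ^ c, (a & b) | (a & c) | (b & c))
)
print(full_adder.eval(1, 1, 0))
```

Two such LUTs per CLB give the four single-output functions mentioned above.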

[Figure: CLB with two LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs, shown alongside the array of CLBs and SMs]

W-FPGAS: SWITCH MATRICES

All nets are routed using only a single pair of channels throughout the CLF
  - Each short channel is associated with a single long channel

Designed for fast, lean JIT FPGA routing

[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side, shown alongside the array of CLBs and SMs]

W-FPGAS

Lean place & route tools on RCLF can execute 10X faster using 18X less memory than existing tools
  - Results in lower clock frequencies for large circuits
  - Inclusion of DADG and MAC helps offset the low frequency

RESULTS: BENCHMARKS

Benchmark   Benchmark Suite   Description
brev        Powerstone        Bit reversal
g3fax       Powerstone        Group three fax decode
matmul      Powerstone        Matrix multiplication
mpeg2       MediaBench        MPEG-2 decoder
pktflow     EEMBC             IP header validation
bitmnp      EEMBC             Bit manipulation
canrdr      EEMBC             Controller area network (CAN)
tblook      EEMBC             Table lookup and interpolation
ttsprk      EEMBC             Engine spark controller
matrix      EEMBC             Matrix operations
idct        EEMBC             Inverse discrete cosine transform
fir         EEMBC             Finite impulse response filter
rocm        Warp              RDCAD logic minimizer
prewitt     Warp (MT)         Prewitt edge detection
search      Warp (MT)         Parallel search
moravec     Warp (MT)         Moravec image processing
wavelet     Warp (MT)         Wavelet transform
maxfilter   Warp (MT)         Maximum window image filter
N-body      Warp (MT)         Barnes-Hut N-body simulation

RESULTS: SINGLE CRITICAL REGION

RESULTS: OVERALL SPEEDUP (MAX 4 CRITICAL REGIONS)

RESULTS: IMPLEMENTATION WITH MICROBLAZE

[Figure: MicroBlaze implementation. Base MicroBlaze system: MicroBlaze with instruction and data BRAMs on i_lmb/d_lmb via lmb_cntrl, plus opb, opb_ddr and uartlite. Dynamic partitioning additions: profiler with prof_intf, W-FPGA with W-FPGA Interface, and a second MicroBlaze (ROCCAD) with its own instruction/data BRAM]
