Dynamic Hardware/Software Partitioning: A First Approach

Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

Introduction

• Dynamic optimizations an increasing trend
  – Examples
    • Dynamo: dynamic software optimizations
    • Transmeta Crusoe: dynamic code morphing
    • Just In Time Compilation: interpreted languages
  – Advantages
    • Transparent optimizations: no designer effort, no tool restrictions
    • Adapts to actual usage

Introduction

• Drawbacks of current dynamic optimizations
  – Currently limited to software optimizations
    • Limited speedup (1.1x to 1.3x common)
• Alternatively, we could perform hw/sw partitioning
  – Achieve large speedups (2x to 10x common)
  – However, presently dynamic optimization not possible

[Figure: Sw and Hw partitions. Labels: Profiler, Critical Regions, Processor, ASIC/FPGA]

Introduction

• Ideally, we would perform hardware/software partitioning dynamically
  – Transparent partitioning
    • Supports all sw languages/tools
    • Most partitioning approaches have complex tool flows
  – Achieves better results than software optimizations
    • >2x speedup, energy savings
  – Adapts to actual usage
• Appropriate architecture required
  – Requires a processor and configurable logic

Introduction

• Microprocessor/FPGA single-chip platforms make partitioning more attractive
  – More efficient communication, smaller size
    • Higher performance, low power
• Examples
  – Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
• Makes dynamic hw/sw partitioning more feasible
  – However, partitioning must be performed at binary level

[Figure: processor and FPGA, 1990s versus 2003]

Introduction

• Binary-level hw/sw partitioning
  – Binary is profiled and hardware candidates are determined
  – Regions to be partitioned are decompiled into a control/data flow graph (CDFG)
  – CDFG is synthesized to hardware
  – Binary is updated to use hardware
• Many advantages over source-level partitioning
  – Supports any language or software compiler
    • No change in tools
  – Better software size and performance estimation at binary level
• Enables dynamic hw/sw partitioning

[Figure: binary-level partitioning flow. Stages: Profiling, Hw Exploration, Decompilation, Behavioral Synthesis, Binary Updater. Input: Binary. Outputs: Netlist (FPGA), Updated Binary (Processor)]

Dynamic Hw/Sw Partitioning

[Animated figure: a chip with memory, a dynamic partitioning module, configurable logic, and several microprocessors. SW instructions (add, beq, ...) execute on a microprocessor and are observed by the module; frequent loops are identified in the SW and then appear as HW in the configurable logic. Final frame: bar chart of Time and Energy (scale 0 to 100) for SW versus HW/SW.]

Dynamic Partitioning Module

• Dynamic partitioning module executes partitioning tools on chip
  – Profiler, partitioning compiler, synthesis, place & route

[Figure: on-chip tool flow. Labels: SW Source, SW Binary, Profiler, Partitioning Compiler, Synthesis, Place & Route, HW. Chip with memory, dynamic partitioning module, configurable logic, and microprocessors]

Dynamic Partitioning Module

• Synthesis and place & route tools all moved on-chip
  – These tools typically execute on powerful workstations
  – Most people will cringe at the idea of moving these tools on-chip
• However, dynamic partitioning deals with small regions of code
  – Typically, small innermost loops
• Therefore, we can develop lean tools that work specifically for these small loops
  – Lean tools make on-chip execution possible
• Area overhead becoming less critical due to Moore's Law

System Architecture

• Microprocessors
  – MIPS (may be many)
• On-chip memory
• Configurable logic
• Dynamic partitioning module

[Figure: chip with memory, dynamic partitioning module, configurable logic, and four microprocessors]

Dynamic Partitioning Module

• Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
• Architectural components
  – Profiler
  – Additional processor and memory
    • But SOCs may have dozens anyways
    • Alternatively, we could share main processor

[Figure: dynamic partitioning module. Labels: Profiler, Partitioning Co-Processor, Memory]

Configurable Logic

• Greatly simplified in order to create lean place & route tools
• DMA used to access memory
• Two registers
  – R0_Input stores data from memory
  – R1_InOut stores temporary data & data to write back to memory
• Fabric
  – Supports combinational logic
  – Implies loops must have body implemented in single cycle (temporary restriction)

[Figure: configurable logic. Labels: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]

Configurable Logic Fabric

• Fabric
  – 3-input 2-output LUTs surrounded by switch matrices
• Switch Matrix
  – Connect wire to same channel on different side
• LUT
  – 3-input (8 word) 2-output SRAM

[Figure, three panels. Configurable Logic Fabric: grid of LUTs and switch matrices (SM). Switch Matrix: channels 0 to 3 on each side. LUT: Inputs, SRAM (8x2), Outputs]
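To make the "3-input (8 word) 2-output SRAM" description concrete, here is a minimal Python sketch of one such LUT. The class name `Lut3x2`, the input bit ordering, and the packing of the two outputs into one word are assumptions made for illustration; the slides do not specify them. As a usage example, the sketch programs the LUT as a full adder (sum and carry), which is one function that fits exactly in three inputs and two outputs.

```python
class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT: an 8-word by 2-bit SRAM."""

    def __init__(self, words):
        if len(words) != 8 or any(not 0 <= w <= 3 for w in words):
            raise ValueError("expected 8 words of 2 bits each")
        self.words = list(words)

    def evaluate(self, a, b, c):
        # Assumed ordering: input a is the most significant address bit.
        address = (a << 2) | (b << 1) | c
        word = self.words[address]
        # Assumed packing: bit 1 is output 0, bit 0 is output 1.
        return (word >> 1) & 1, word & 1


def full_adder_lut():
    """Program the LUT so that the outputs are (carry, sum) of a + b + c."""
    words = []
    for address in range(8):
        a, b, c = (address >> 2) & 1, (address >> 1) & 1, address & 1
        total = a + b + c
        carry, bit_sum = total >> 1, total & 1
        words.append((carry << 1) | bit_sum)
    return Lut3x2(words)
```

Because the SRAM holds the full truth table, any pair of Boolean functions of the same three inputs can be stored the same way.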

Tool Overview

• Tool flow slightly different from standard partitioning flow
  – Decompilation
  – Binary modification

[Figure: tool flow. Labels: Binary, Loop Profiling, Small, Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, HW, Binary Modification, Updated Binary]

Loop Profiling

• Non-intrusive profiler
  – Monitors instruction bus
• Very little overhead
  – Small cache (~16 entries) and 2,300 logic gates
  – Less than 1% power overhead

[Figure: profiler attached between the microprocessor and L1 memory. Labels: Frequent Loop Cache, Frequent Loop Cache Controller, incrementer (++), rd/wr, addr, data, sbb, saturation]
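The profiler's behavior can be sketched in software as a small table of saturating counters indexed by loop address. This is only an illustrative model: it assumes that the `sbb` signal in the figure marks a short backward branch, that each cache entry counts executions of one branch target, and that the least frequent entry is evicted when the cache is full. None of these details are stated on the slide, and the class and method names are hypothetical.

```python
class FrequentLoopCache:
    """Toy model of a small cache of saturating loop counters."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1  # counters saturate here
        self.counts = {}  # backward-branch target address -> count

    def observe_backward_branch(self, target_addr):
        """Record one taken backward branch seen on the instruction bus."""
        if target_addr in self.counts:
            # Saturating increment: never wrap around to zero.
            self.counts[target_addr] = min(self.counts[target_addr] + 1,
                                           self.max_count)
        elif len(self.counts) < self.entries:
            self.counts[target_addr] = 1
        else:
            # Assumed replacement policy: evict the least frequent loop.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
            self.counts[target_addr] = 1

    def frequent_loops(self, top=1):
        """Return the addresses of the most frequently executed loops."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:top]
```

Since the model only listens to branch events and never alters execution, it mirrors the non-intrusive property claimed for the hardware profiler.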

Decompilation

• Decompilation recovers high-level information
• Creates optimized CDFG
  – All instruction-set inefficiencies are removed
• Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
  – [Greg Stitt, Frank Vahid, ICCAD 2002]

DMA Configuration

• Maps memory accesses to our DMA architecture
  – Reads/writes
  – Increment/decrement address updates
  – Single/block request modes
• Optimizes DFG for DMA
  – Removes address calculations
  – Removes loop counters/exit conditions

[Figure: DFG before and after DMA optimization. Before: address update r1 + 1, memory Read into r2, and r2 + r3. After: a DMA Read node (memory read, increment address, block request) feeding r2 + r3]
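The effect of DMA configuration can be illustrated with a small Python model. The `DmaConfig` fields and function names below are hypothetical; they simply encode the three properties listed above (read access, increment or decrement address update, block request). The software loop carries its own address update and exit test, while the hardware version only accumulates, because the DMA model supplies the data stream.

```python
from dataclasses import dataclass


@dataclass
class DmaConfig:
    base: int    # starting address of the block request
    step: int    # +1 for increment, -1 for decrement address updates
    count: int   # number of words requested in the block


def dma_read(memory, cfg):
    """Yield the words a block read request would deliver, in order."""
    addr = cfg.base
    for _ in range(cfg.count):
        yield memory[addr]
        addr += cfg.step


def software_loop(memory, start, end):
    """Loop with explicit address calculation and exit condition."""
    r1, r3 = start, 0
    while r1 < end:
        r2 = memory[r1]   # memory read
        r1 += 1           # address update (removed by DMA configuration)
        r3 += r2          # the only operation left for the hardware
    return r3


def hardware_loop(memory, cfg):
    """Same computation once the DMA handles addresses and loop bounds."""
    r3 = 0
    for r2 in dma_read(memory, cfg):
        r3 += r2
    return r3
```

For example, with eight words starting at address 0, `hardware_loop(memory, DmaConfig(base=0, step=1, count=8))` returns the same sum as `software_loop(memory, 0, 8)`.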

Register Transfer Synthesis

• Maps DFG operations to hw library components
  – Adders, Comparators, Multiplexors, Shifters
• Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions

r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …
…

[Figure: DFG in which r4 = r1 + r2 maps to a 32-bit adder and r5 = r3 < 8 maps to a 32-bit comparator]
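The per-bit expressions for the adder follow the standard ripple-carry pattern. The Python sketch below evaluates those expressions bit by bit and can be checked against ordinary integer addition. The function name is hypothetical, and the general carry expression is the textbook full-adder carry, written out here because the slide elides it.

```python
def ripple_carry_add(r1, r2, width=32):
    """Evaluate r4 = r1 + r2 using one Boolean expression per output bit."""
    r4 = 0
    carry = 0  # carry into bit 0 is the constant 0
    for i in range(width):
        a = (r1 >> i) & 1
        b = (r2 >> i) & 1
        # r4[i] = (r1[i] xor r2[i]) xor carry[i-1]
        bit = (a ^ b) ^ carry
        # carry[i] = (r1[i] and r2[i]) or (carry[i-1] and (r1[i] xor r2[i]))
        carry = (a & b) | (carry & (a ^ b))
        r4 |= bit << i
    return r4
```

For bit 0 the carry-in is 0, so the expressions reduce to r4[0] = r1[0] xor r2[0] and carry[0] = r1[0] and r2[0], matching the first line of the example.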

Logic Synthesis

• Optimizes Boolean equations from RT synthesis
  – Large opportunity for logic minimization due to use of immediate values in the binary
• Simple on-chip 2-level logic minimization method
  – Lysecky/Vahid DAC'03, session 20.4 (9:45 Wed)

Example: r2 = r1 + 4

Before minimization:
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
…

After minimization:
r2[0] = r1[0]
r2[1] = r1[1] xor carry[0]
r2[2] = r1[2]' xor carry[1]
r2[3] = r1[3] xor carry[2]
…
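The opportunity created by immediate values can be illustrated with simple constant propagation over the adder equations. This Python sketch is not the 2-level minimization method cited above; it only applies the identities x xor 0 = x and x xor 1 = x', and it also drops carries that are known to be the constant 0, which goes one step further than the equations shown in the example. The function names are hypothetical.

```python
def simplify_xor(term, const_bit):
    """Apply x xor 0 = x and x xor 1 = x' (complement)."""
    return term if const_bit == 0 else term + "'"


def add_immediate_equations(width, imm):
    """Return simplified per-bit equations for r2 = r1 + imm."""
    equations = []
    carry_is_zero = True  # the carry into bit 0 is the constant 0
    for i in range(width):
        b = (imm >> i) & 1
        term = simplify_xor(f"r1[{i}]", b)
        if carry_is_zero:
            equations.append(f"r2[{i}] = {term}")
            # With a zero carry-in, carry[i] = r1[i] and b, which is 0 when b is 0.
            carry_is_zero = (b == 0)
        else:
            equations.append(f"r2[{i}] = {term} xor carry[{i - 1}]")
    return equations
```

Calling `add_immediate_equations(4, 4)` yields equations in which the low bits of r1 pass through unchanged and bit 2 is complemented, since the immediate 4 has only bit 2 set.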

Technology Mapping

• Maps logic operations to 3-input, 2-output LUTs
  1. Traverse logic network and combine nodes to determine single output LUTs
  2. Combine nodes to form two output LUTs

[Figure: logic network mapped to 3-input, 2-output LUTs]

Placement

• Nodes along critical path are placed in single horizontal row
• Build dependencies between remaining nodes and placed nodes
  – Use dependencies to place remaining nodes
    • Either above or below placed nodes

[Figure: rows of placed LUTs]




Routing

• Greedy algorithm
  1. At each switch matrix, choose direction to route
  2. Continue to route until reaching switch matrix that is already in use
  3. Backtrack to previous switch matrix, and try another direction
• Place and route most complex task; currently working on improvements
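The three routing steps can be sketched as a depth-first search over a grid of switch matrices. This is a simplified illustration under several assumptions not stated on the slide: switch matrices sit on a rectangular grid addressed by (x, y), a used switch matrix blocks the route entirely, and the greedy choice prefers the direction that ends closest to the destination. The function name `route` and its parameters are hypothetical.

```python
def route(src, dst, width, height, in_use):
    """Return a list of (x, y) switch matrices from src to dst, or None."""
    path = [src]
    visited = {src}

    def step(cur):
        if cur == dst:
            return True
        x, y = cur
        neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        # Step 1: greedy choice, try the direction closest to dst first.
        neighbors.sort(key=lambda p: abs(p[0] - dst[0]) + abs(p[1] - dst[1]))
        for nxt in neighbors:
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            # Step 2: stop at a switch matrix that is already in use.
            if nxt in in_use or nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            # Step 3: backtrack and try another direction.
            path.pop()
        return False

    return path if step(src) else None
```

On a 4x4 grid with the switch matrix at (1, 0) already in use, a route from (0, 0) to (3, 0) detours through the next row; if both neighbors of the source are in use, the function returns `None`.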

Bitfile Creation

• Combines placed and routed hardware description with DMA configuration into bitfile
  – Used to initialize the configurable logic

[Figure: HW Netlist and DMA Configuration feed Bitfile Creation, which produces the Bitfile. Labels: DMA, R0_Input, R1_InOut]

Binary Modification

• Updates the application binary in order to utilize the new hardware
  – Loop replaced with jump to hw initialization code
  – Wisconsin Architectural Research Tool Set (WARTS)
    • EEL (Executable Editing Library)
  – We assume memory is RAM or programmable ROM

Original binary:

loop:
    Load r2, 0(r1)
    Add  r1, r1, 1
    Add  r3, r3, r2
    Blt  r1, 8, loop
after_loop:
    …

Hardware initialization code:

hw_init:
    1. Initialize HW registers
    2. Enable HW
    3. Shutdown processor
       • Woken up by HW interrupt
    4. Store any results
    5. Jump to after_loop

Updated binary:

loop:
    Jump hw_init
    ..
after_loop:
    …
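The binary update itself amounts to overwriting the first instruction of the loop with a jump. The Python sketch below models the binary as a list of instruction strings; it does not use EEL or any real executable-editing API, and the function name and arguments are hypothetical. It assumes that the remaining loop instructions can be left in place because they become unreachable once the jump is inserted.

```python
def patch_loop(binary, loop_start, hw_init_label):
    """Return a copy of `binary` whose loop entry jumps to the hw init code.

    binary        -- list of instruction strings (toy stand-in for a real binary)
    loop_start    -- index of the first instruction of the loop
    hw_init_label -- label of the hardware initialization code
    """
    patched = list(binary)  # leave the original binary untouched
    patched[loop_start] = f"Jump {hw_init_label}"
    return patched


original = [
    "Load r2, 0(r1)",
    "Add r1, r1, 1",
    "Add r3, r3, r2",
    "Blt r1, 8, loop",
]
updated = patch_loop(original, 0, "hw_init")
```

Writing the jump into instruction memory is why the slide assumes the program is stored in RAM or programmable ROM.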

Tool Statistics

• Executed on SimpleScalar
  – Similar to a MIPS instruction set
  – Used 60 MHz clock (like Triscend A7 device)
• Statistics
  – Total run time of only 1.09 seconds
  – Requires less than ½ megabyte of RAM
  – Code size much smaller than standard synthesis tools

Tools: Decompilation, DMA Config., RT Synthesis, Logic Synthesis, Tech. Mapping, Place & Route

Code Size (Lines)  Binary Size (Kbytes)  Data Size (Kbytes)  Time (s)
4,695              88                    360                 1.04
7,203              125                   452                 0.05

Experiments

• Benchmark information
  – Powerstone (brev, g3fax1 & 2)
  – NetBench (url)
  – Logic minimization kernel (logmin)
• Statistics
  – 55% of total time spent in loops that are moved to hardware
  – Ideal speedup of 2.8
  – These loops were only 2.4% of the size of the original application

Example  Total Ins  Loop Ins  Loop Time%  Loop Size%  Ideal Speedup
brev     992        104       70.0%       10.5%       3.3
g3fax1   1094       6         31.4%       0.5%        1.5
g3fax2   1094       6         31.2%       0.5%        1.5
url      13526      17        79.9%       0.1%        5.0
logmin   8968       38        63.8%       0.4%        2.8
Avg:                          55.3%       2.4%        2.8
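The ideal speedup column is consistent with assuming that the loops moved to hardware take no time at all, so only the non-loop fraction of execution remains. That reading is an assumption, since the slide does not define "ideal", but it reproduces every value in the table. A short Python check, with hypothetical names:

```python
# Loop Time% for each benchmark, copied from the table.
loop_time_pct = {
    "brev": 70.0,
    "g3fax1": 31.4,
    "g3fax2": 31.2,
    "url": 79.9,
    "logmin": 63.8,
}


def ideal_speedup(loop_pct):
    """Speedup if the loop time dropped to zero (assumed definition)."""
    return 1.0 / (1.0 - loop_pct / 100.0)


speedups = {name: ideal_speedup(pct) for name, pct in loop_time_pct.items()}
average_speedup = sum(speedups.values()) / len(speedups)
```

Note that the 2.8 average is the mean of the per-benchmark speedups, not the speedup computed from the 55.3% average loop time.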

Experiments

• Results
  – Achieved average speedup of 2.6, close to ideal 2.8
  – Hardware loops were 20X faster than software loops
• Even with simple architecture and tools, large speedups were achieved

Example  Sw Time  Sw Loop Time  Hw Loop Time  Sw/Hw Time  Speedup
brev     0.05     0.03          0.001         0.02        3.1
g3fax1   23.50    7.35          0.82          16.98       1.4
g3fax2   23.50    7.39          1.49          17.61       1.3
url      379.90   303.74        13.29         89.45       4.2
logmin   16.32    10.42         0.21          6.12        2.7
Avg:              65.78         3.16          26.03       2.6
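The columns of the results table are related in a simple way, which the Python sketch below checks. It assumes that the Sw/Hw time is the software time with the software loop time replaced by the hardware loop time, and that speedup is Sw time divided by Sw/Hw time. Neither formula is stated on the slide, but both agree with the table to within rounding for every benchmark except brev, whose displayed times are too coarse to reproduce its speedup.

```python
# (sw_time, sw_loop_time, hw_loop_time, sw_hw_time), copied from the table.
results = {
    "brev":   (0.05, 0.03, 0.001, 0.02),
    "g3fax1": (23.50, 7.35, 0.82, 16.98),
    "g3fax2": (23.50, 7.39, 1.49, 17.61),
    "url":    (379.90, 303.74, 13.29, 89.45),
    "logmin": (16.32, 10.42, 0.21, 6.12),
}


def estimated_sw_hw_time(sw, sw_loop, hw_loop):
    """Total time once the loop runs in hardware (assumed relation)."""
    return sw - sw_loop + hw_loop


def speedup(sw, sw_hw):
    """Overall speedup of the partitioned system over software only."""
    return sw / sw_hw


# Ratio of the average software and hardware loop times from the Avg row.
loop_speedup = 65.78 / 3.16
```

The loop ratio comes out slightly above 20, matching the "20X faster" claim.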

Conclusion

• Dynamic hardware/software partitioning has advantages over other partitioning approaches
  – Completely transparent
  – Designers get performance/energy benefits of hw/sw partitioning by simply writing software
  – Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
• Achieved average speedup of 2.6
  – Very close to ideal speedup of 2.8
• Future work
  – More complex configurable logic fabric
    • Designed in close conjunction with on-chip CAD tools
    • Sequential logic and increased inputs/outputs
    • Support larger hardware regions, not just simple loops
    • Improved algorithms (especially place and route)
  – Handle more complex memory access patterns
