Dynamic Hardware/Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
Introduction
Dynamic optimizations an increasing trend
– Examples
  Dynamo: dynamic software optimizations
  Transmeta Crusoe: dynamic code morphing
  Just In Time Compilation: interpreted languages
– Advantages
  Transparent optimizations: no designer effort, no tool restrictions
  Adapts to actual usage
Introduction
Drawbacks of current dynamic optimizations
– Currently limited to software optimizations
  Limited speedup (1.1x to 1.3x common)
Alternatively, we could perform hw/sw partitioning
– Achieve large speedups (2x to 10x common)
– However, presently dynamic optimization not possible
[Figure: Sw, Profiler, Critical Regions, Hw; Processor and ASIC/FPGA]
Introduction
Ideally, we would perform hardware/software partitioning dynamically
– Transparent partitioning
  Supports all sw languages/tools
  Most partitioning approaches have complex tool flows
– Achieves better results than software optimizations
  >2x speedup, energy savings
– Adapts to actual usage
Appropriate architecture required
– Requires a processor and configurable logic
Introduction
Microprocessor/FPGA single-chip platforms make partitioning more attractive
– More efficient communication, smaller size
  Higher performance, low power
– Examples
  Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
Makes dynamic hw/sw partitioning more feasible
– However, partitioning must be performed at binary level
[Figure: Processor and FPGA, 1990s versus 2003]
Introduction
Binary-level hw/sw partitioning
– Binary is profiled and hardware candidates are determined
– Regions to be partitioned are decompiled into CDFG
– CDFG is synthesized to hardware
– Binary is updated to use hardware
Many advantages over source-level partitioning
– Supports any language or software compiler
  No change in tools
– Better software size and performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure: binary-level flow: Binary, Profiling, Hw Exploration, Decompilation, Behavioral Synthesis, and Binary Updater, producing a Netlist and an Updated Binary for the Processor and FPGA]
Dynamic Hw/Sw Partitioning
[Figure: animation of a chip with microprocessors, memory, configurable logic, and a dynamic partitioning module. The SW instruction stream (add, beq, ...) passes by the dynamic partitioning module, the frequent loops are identified, and the frequent loops move from SW to HW on the configurable logic. A bar chart (axis 0 to 100) compares Time and Energy for SW versus HW/SW.]
Dynamic Partitioning Module
Dynamic partitioning module executes partitioning tools on chip
– Profiler, partitioning compiler, synthesis, place & route
[Figure: SW Source, Compiler, SW Binary, Profiler, Partitioning, Synthesis, Place&Route, HW; chip with microprocessors, memory, configurable logic, and dynamic partitioning module]
Dynamic Partitioning Module
Synthesis and place & route tools all moved on-chip
– These tools typically execute on powerful workstations
– Most people will cringe at idea of moving these tools on-chip
However, dynamic partitioning deals with small regions of code
– Typically, small innermost loops
Therefore, we can develop lean tools that work specifically for these small loops
– Lean tools make on-chip execution possible
Area overhead becoming less critical due to Moore’s Law
System Architecture
Microprocessors
– MIPS (may be many)
On-chip memory
Configurable logic
Dynamic partitioning module
[Figure: chip with microprocessors, memory, configurable logic, and dynamic partitioning module]
Dynamic Partitioning Module
Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
Architectural components
– Profiler
– Additional processor and memory
  But SOCs may have dozens anyways
  Alternatively, we could share main processor
[Figure: Profiler, Memory, Partitioning Co-Processor]
Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
– R0_Input stores data from memory
– R1_InOut stores temporary data & data to write back to memory
Fabric
– Supports combinational logic
– Implies loops must have body implemented in single cycle (temporary restriction)
[Figure: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]
Configurable Logic Fabric
Fabric
– 3-input 2-output LUTs surrounded by switch matrices
Switch Matrix
– Connect wire to same channel on different side
LUT
– 3-input (8 word) 2-output SRAM
[Figure: three panels. Configurable Logic Fabric: LUTs interleaved with switch matrices (SM). Switch Matrix: channels 0 to 3 on each side. LUT: Inputs, SRAM (8x2), Outputs.]
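The LUT described above can be sketched as a small table lookup. This is only an illustrative model, assuming the three inputs form the address of an 8-word SRAM whose 2-bit word holds both outputs; the class name `Lut3x2` and the full-adder configuration are hypothetical and do not come from the slides.

```python
class Lut3x2:
    """A 3-input, 2-output LUT modeled as an 8-word, 2-bit-wide SRAM."""

    def __init__(self, out0_fn, out1_fn):
        # Precompute the 8 SRAM words from two Boolean functions of (a, b, c).
        self.sram = []
        for addr in range(8):
            a, b, c = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
            self.sram.append((out0_fn(a, b, c) & 1, out1_fn(a, b, c) & 1))

    def read(self, a, b, c):
        # The three inputs form the SRAM address; the word holds both outputs.
        return self.sram[(a << 2) | (b << 1) | c]


# Hypothetical configuration: a full adder (sum, carry-out) fits in one LUT.
full_adder = Lut3x2(
    out0_fn=lambda a, b, c: a ^ b ^ c,
    out1_fn=lambda a, b, c: (a & b) | (a & c) | (b & c),
)
```

Two outputs per LUT mean that a sum bit and its carry can share one cell, which is one reason a 2-output LUT suits adder-heavy loop bodies.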
Tool Overview
Tool flow slightly different from standard partitioning flow
– Decompilation
– Binary modification
[Figure: tool flow from Binary to HW and Updated Binary through Loop Profiling, Small Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, and Binary Modification]
Loop Profiling
Non-intrusive profiler
– Monitors instruction bus
Very little overhead
– Small cache (~16 entries) and 2,300 logic gates
Less than 1% power overhead
[Figure: profiler between the microprocessor and L1 memory: Frequent Loop Cache, Frequent Loop Cache Controller, incrementer (++), with rd/wr, addr, data, sbb, and saturation signals]
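A software model can make the profiler's behavior concrete. The sketch below is a guess at the general idea, not the hardware on the slide: it assumes the cache is keyed by the target address of a short backward branch (read here as the `sbb` signal), that counters saturate rather than wrap, and that the least-counted entry is evicted on a miss. The class name, the 8-bit counter width, and the eviction policy are all assumptions.

```python
class FrequentLoopCache:
    """Toy model of a small cache that counts loop iterations."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.counts = {}  # branch target address -> saturating counter

    def on_short_backward_branch(self, target_addr):
        if target_addr in self.counts:
            # Hit: increment, but saturate instead of wrapping around.
            if self.counts[target_addr] < self.max_count:
                self.counts[target_addr] += 1
            return
        if len(self.counts) >= self.entries:
            # Miss with a full cache: evict the least-counted entry (assumed policy).
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[target_addr] = 1

    def frequent_loops(self, top=1):
        # Loop start addresses ordered from most to least frequent.
        return sorted(self.counts, key=self.counts.get, reverse=True)[:top]
```

Because the model only observes branch targets, it needs no changes to the running program, which matches the non-intrusive goal stated above.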
Decompilation
Decompilation recovers high-level information
Creates optimized CDFG
– All instruction-set inefficiencies are removed
Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
– [Greg Stitt, Frank Vahid, ICCAD 2002]
DMA Configuration
Maps memory accesses to our DMA architecture
– Reads/writes
– Increment/decrement address updates
– Single/block request modes
Optimizes DFG for DMA
– Removes address calculations
– Removes loop counters/exit conditions
[Figure: a DFG with an address increment (r1 + 1), a Read producing r2, and an addition into r3 is reduced to a DMA Read (Memory Read, Increment Address, Block Request) feeding r3 + r2]
Register Transfer Synthesis
Maps DFG operations to hw library components
– Adders, Comparators, Multiplexors, Shifters
Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions
r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …
…
[Figure: r4 = r1 + r2 (32-bit adder); r5 = r3 < 8 (32-bit comparator)]
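The per-bit expansion of an adder can be checked in software. The sketch below assumes a ripple-carry structure, which matches the two bit equations shown; the carry expression for bits above 0 is the textbook full-adder carry and is an assumption, since the slide elides it. The function name `ripple_add_bits` is hypothetical.

```python
def ripple_add_bits(r1, r2, width=32):
    """Add two integers bit by bit using only xor/and/or operations."""
    result = 0
    carry = 0
    for i in range(width):
        a = (r1 >> i) & 1
        b = (r2 >> i) & 1
        # r4[i] = (r1[i] xor r2[i]) xor carry[i-1]; carry[-1] is 0.
        s = (a ^ b) ^ carry
        # Assumed textbook carry: generate (a and b) or propagate the old carry.
        carry = (a & b) | (carry & (a ^ b))
        result |= s << i
    return result  # the final carry-out is dropped, as in a fixed-width adder
```

For bit 0 the loop reduces to `r1[0] xor r2[0]` and `r1[0] and r2[0]`, the same expressions given above.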
Logic Synthesis
Optimizes Boolean equations from RT synthesis
– Large opportunity for logic minimization due to use of immediate values in the binary
Simple on-chip 2-level logic minimization method
– Lysecky/Vahid DAC’03, session 20.4 (9:45 Wed)
Example: r2 = r1 + 4
Before minimization:
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
…
After minimization:
r2[0] = r1[0]
r2[1] = r1[1] xor carry[0]
r2[2] = r1[2]’ xor carry[1]
r2[3] = r1[3] xor carry[2]
…
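The simplification in this example comes from folding the constant bits of the immediate into the xor expressions. The sketch below illustrates only that constant folding; it is not the two-level minimization method cited above. The function names and the list-of-terms representation are assumptions made for illustration.

```python
def simplify_xor(terms):
    """Fold constant 0/1 operands out of an xor of terms.

    terms: list of variable names (str) and constants (0 or 1).
    Returns (variables, inverted): the remaining variables and whether
    the overall result is complemented.
    """
    variables = [t for t in terms if isinstance(t, str)]
    # xor with 0 is a no-op; an odd number of 1s complements the result.
    inverted = sum(t for t in terms if not isinstance(t, str)) % 2 == 1
    return variables, inverted


def format_xor(variables, inverted):
    """Render the simplified expression, marking a complement with a quote."""
    if not variables:
        return "1" if inverted else "0"
    first = variables[0] + ("'" if inverted else "")
    return " xor ".join([first] + variables[1:])
```

Applied to the third equation, `format_xor(*simplify_xor(["r1[2]", 1, "carry[1]"]))` yields the complemented form with `r1[2]'`.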
Technology Mapping
Maps logic operations to 3-input, 2-output LUTs
1. Traverse logic network and combine nodes to determine single output LUTs
2. Combine nodes to form two output LUTs
[Figure: 3-input, 2-output LUTs]
Placement
Nodes along critical path are placed in single horizontal row
Build dependencies between remaining nodes and placed nodes
– Use dependencies to place remaining nodes
  Either above or below placed nodes
[Figure: grid of LUTs]
Routing
Greedy algorithm
1. At each switch matrix, choose direction to route
2. Continue to route until reaching switch matrix that is already in use
3. Backtrack to previous switch matrix, and try another direction
Place and route most complex task; currently working on improvements
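The three routing steps above can be sketched as a depth-first search with a greedy direction choice. This is a minimal sketch under stated assumptions, not the on-chip router: it models the fabric as a rectangular grid of switch matrices, treats a switch matrix as either free or in use, and prefers the direction that ends closest to the destination. The function name `route` and all parameters are hypothetical.

```python
def route(src, dst, width, height, in_use):
    """Greedy depth-first route from src to dst on a grid of switch matrices.

    in_use: set of (x, y) switch matrices already occupied by other routes.
    Returns the list of switch matrices on the path, or None if none exists.
    """
    path = [src]
    visited = {src}

    def step(node):
        if node == dst:
            return True
        x, y = node
        neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        # Greedy choice: try the direction that ends closest to dst first.
        neighbors.sort(key=lambda n: abs(n[0] - dst[0]) + abs(n[1] - dst[1]))
        for nxt in neighbors:
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            if nxt in in_use or nxt in visited:
                continue  # blocked: abandon this direction
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()  # backtrack to the previous switch matrix
        return False

    return path if step(src) else None
```

With an empty fabric, `route((0, 0), (2, 0), 3, 3, set())` runs straight along the row; marking `(1, 0)` as in use forces a detour through the row above.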
Bitfile Creation
Combines place&routed hardware description with DMA configuration into bitfile
– Used to initialize the configurable logic
[Figure: HW Netlist and DMA Configuration feed Bitfile Creation, which produces the Bitfile for the DMA, R0_Input, Configurable Logic Fabric, and R1_InOut]
Binary Modification
Updates the application binary in order to utilize the new hardware
– Loop replaced with jump to hw initialization code
– Wisconsin Architectural Research Tool Set (WARTS)
  EEL (Executable Editing Library)
– We assume memory is RAM or programmable ROM

Original binary:
loop:
  Load r2, 0(r1)
  Add r1, r1, 1
  Add r3, r3, r2
  Blt r1, 8, loop
after_loop:
  …..

Updated binary:
loop:
  Jump hw_init
  ..
after_loop:
  …..

hw_init:
  1. Initialize HW registers
  2. Enable HW
  3. Shutdown processor
     • Woken up by HW interrupt
  4. Store any results
  5. Jump to after_loop
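The loop replacement can be illustrated on a toy instruction list. The sketch below does not use EEL or any real binary-editing API; it assumes instructions are plain strings at fixed positions and that the rest of the loop body is overwritten with `nop` so that later addresses do not shift. Both the function name `patch_loop` and the nop padding are assumptions.

```python
def patch_loop(instructions, loop_start, loop_end, hw_init_label="hw_init"):
    """Replace a loop with a jump to hardware initialization code.

    instructions: list of instruction strings.
    loop_start, loop_end: inclusive indices of the loop body.
    Returns a new list; the input list is left unchanged.
    """
    patched = list(instructions)
    # The first loop instruction becomes the jump to the hw init routine.
    patched[loop_start] = f"jump {hw_init_label}"
    # Remaining loop slots are padded so that after_loop keeps its address.
    for i in range(loop_start + 1, loop_end + 1):
        patched[i] = "nop"
    return patched


original = [
    "load r2, 0(r1)",
    "add r1, r1, 1",
    "add r3, r3, r2",
    "blt r1, 8, loop",
    "after_loop_instruction",
]
patched = patch_loop(original, 0, 3)
```

Keeping the code size unchanged avoids having to relocate any branch targets elsewhere in the binary.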
Tool Statistics
Executed on SimpleScalar
– Similar to a MIPS instruction set
– Used 60 MHz clock (like Triscend A7 device)
Statistics
– Total run time of only 1.09 seconds
– Requires less than ½ megabyte of RAM
– Code size much smaller than standard synthesis tools

Tools: Decompilation, DMA Config., RT Synthesis, Logic Synthesis, Tech. Mapping, Place & Route

Code Size (Lines)  Binary size (Kbytes)  Data size (Kbytes)  Time (s)
4,695              88                    360                 1.04
7,203              125                   452                 0.05
Experiments
Benchmark Information
– Powerstone (Brev, g3fax1&2)
– NetBench (url)
– Logic minimization kernel (logmin)
Statistics
– 55% of total time spent in loops that are moved to hardware
– Ideal speedup of 2.8
– These loops were only 2.4% of the size of the original application

Example  Total Ins  Loop Ins  Loop Time%  Loop Size%  Ideal Speedup
brev     992        104       70.0%       10.5%       3.3
g3fax1   1094       6         31.4%       0.5%        1.5
g3fax2   1094       6         31.2%       0.5%        1.5
url      13526      17        79.9%       0.1%        5.0
logmin   8968       38        63.8%       0.4%        2.8
Avg:                          55.3%       2.4%        2.8
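The Ideal Speedup column is consistent with Amdahl's law when the loops moved to hardware are assumed to take zero time, so the speedup is 1 / (1 - loop time fraction). The sketch below recomputes the column from the Loop Time% values in the table; the function and variable names are illustrative only.

```python
def ideal_speedup(loop_time_fraction):
    """Amdahl's law with the accelerated part taking zero time."""
    return 1.0 / (1.0 - loop_time_fraction)


# Loop Time% values from the table, expressed as fractions.
loop_time = {
    "brev": 0.700,
    "g3fax1": 0.314,
    "g3fax2": 0.312,
    "url": 0.799,
    "logmin": 0.638,
}

ideal = {name: round(ideal_speedup(f), 1) for name, f in loop_time.items()}
average = round(sum(ideal.values()) / len(ideal), 1)
```

For example, brev spends 70.0% of its time in the loops, so at most 1 / 0.30, or about 3.3 times faster, is achievable no matter how fast the hardware is.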
Experiments
Results
– Achieved average speedup of 2.6, close to ideal 2.8
– Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved

Example  Sw Time  Sw Loop Time  Hw Loop Time  Sw/Hw Time  Speedup
brev     0.05     0.03          0.001         0.02        3.1
g3fax1   23.50    7.35          0.82          16.98       1.4
g3fax2   23.50    7.39          1.49          17.61       1.3
url      379.90   303.74        13.29         89.45       4.2
logmin   16.32    10.42         0.21          6.12        2.7
Avg:              65.78         3.16          26.03       2.6
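The columns of the results table appear to be related by simple arithmetic: the partitioned time is the software time with the software loop time replaced by the hardware loop time, and the speedup is the ratio of the two totals. The sketch below states that relation as an assumption and checks it against the url row; rows with heavily rounded entries, such as brev, will not reproduce exactly from the printed values.

```python
def partitioned_time(sw_time, sw_loop_time, hw_loop_time):
    """Total time once the loops run in hardware instead of software."""
    return sw_time - sw_loop_time + hw_loop_time


def speedup(sw_time, sw_loop_time, hw_loop_time):
    """Software-only time divided by the partitioned time."""
    return sw_time / partitioned_time(sw_time, sw_loop_time, hw_loop_time)


# url row from the table: Sw Time, Sw Loop Time, Hw Loop Time.
url_time = partitioned_time(379.90, 303.74, 13.29)
url_speedup = speedup(379.90, 303.74, 13.29)

# Ratio of the average software loop time to the average hardware loop time.
loop_ratio = 65.78 / 3.16
```

The last ratio comes out slightly above 20, in line with the statement that hardware loops were 20X faster than software loops.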
Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
– Completely transparent
– Designers get performance/energy benefits of hw/sw partitioning by simply writing software
– Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
Achieved average speedup of 2.6
– Very close to ideal speedup of 2.8
Future work
– More complex configurable logic fabric
  Designed in close conjunction with on-chip CAD tools
  Sequential logic and increased inputs/outputs
  Support larger hardware regions, not just simple loops
  Improved algorithms (especially place and route)
– Handle more complex memory access patterns