Automatic Processor Specialisation using
Ad-hoc Functional Units
[email protected], [email protected], [email protected]
EPFL – I&C – LAP
© Ienne 2002Automatic Processor Specialisation2
Classic Options for Systems-on-Chip
Design Gap!
Processor Specialisation: Get the Best of Both Options
Embedded!
VLIW Processor Specialisation
Two complementary specialisation strategies:
  Parametric Architecture
  Ad-hoc Functional Units (AFUs)
Automatically Collapsing Clusters of Instructions into New Ones
If the ad-hoc functional unit completes the job faster → GAIN
One ad-hoc complex operation instead of a long sequence of standard ones
General Goal
Automatically achieve processor specialisation through high-level application code analysis
Outline
  Introduction
  Motivational example
  Goals
  Opportunities for specialisation
  Challenges, further opportunities,…
Elementary Motivational Example
An Important Kernel…
    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        if (a & 0x8000) {
            a = (a << 1) + b;
        } else {
            a <<= 1;
        }
    }
    return a & 0xffff;
Shift-and-add unsigned 8 x 8-bit multiplication
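To make the behaviour of this kernel concrete, the following is a minimal sketch that wraps it in a function and compares it against the built-in `*` for every pair of 8-bit operands. The function names `mul8_shift_add` and `count_mismatches` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication of two unsigned 8-bit values,
 * following the kernel on the slide. */
static uint32_t mul8_shift_add(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);  /* operands must fit in 8 bits */
    a <<= 8;                         /* multiplier now sits in bits 15..8 */
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000) {
            a = (a << 1) + b;        /* top multiplier bit set: shift and add */
        } else {
            a <<= 1;                 /* top multiplier bit clear: shift only */
        }
    }
    return a & 0xffff;               /* 16-bit product */
}

/* Number of operand pairs for which the kernel disagrees with '*'. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a <= 0xff; a++)
        for (uint32_t b = 0; b <= 0xff; b++)
            if (mul8_shift_add(a, b) != a * b)
                bad++;
    return bad;
}
```

The partial product never reaches the multiplier bits that are still waiting to be examined, which is why a single register can hold both.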
Software Predication
    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        p1 = - ((a & 0x8000) >> 15);
        a = (a << 1) + (b & p1);
    }
    return a & 0xffff;
Predicate mask (0 or –1 = 0xffffffff)
Shift, Predicated Add
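As a sketch of why the predicated form is equivalent to the branching one, the step can be written as a small function and compared with its `if`/`else` counterpart. In C, `+` binds more tightly than `&`, so the masked operand needs its own parentheses. The function names below are illustrative only.

```c
#include <stdint.h>

/* One branch-free (software-predicated) shift-and-add step. */
static uint32_t step_predicated(uint32_t a, uint32_t b)
{
    uint32_t p1 = -((a & 0x8000u) >> 15);  /* 0 or 0xffffffff */
    return (a << 1) + (b & p1);            /* b is added only when the mask is all ones */
}

/* The same step written with an explicit branch, for comparison. */
static uint32_t step_branch(uint32_t a, uint32_t b)
{
    return (a & 0x8000u) ? (a << 1) + b : (a << 1);
}

/* 8 x 8-bit multiplication built from eight predicated steps. */
static uint32_t mul8_predicated(uint32_t a, uint32_t b)
{
    a <<= 8;
    for (int i = 0; i < 8; i++)
        a = step_predicated(a, b);
    return a & 0xffff;
}
```

Removing the branch turns the loop body into a single basic block, which is the form the data-flow analysis on the following slides works with.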
Loop Kernel DAG
[Figure: data-flow graph of the loop kernel, with inputs a and b, constants 0x8000, 15 and 1, and operators &, >>, -, &, <<, +. Annotations: in SW, ~6 cycles; in HW (AND gates, only wiring, ALU), 1-2 cycles!]
Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop
Register File
ALU   LD/ST   MSTEP
if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm
else Rn ← (Rn << 1)
1 ad-hoc instruction added
loop kernel reduced to 15-30%
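A minimal C model of the MSTEP semantics given above might look as follows. It assumes 32-bit registers, and, because MSTEP tests bit 31 rather than bit 15, it assumes the multiplier is placed in bits 31..24 before the eight steps. That placement and the function names are assumptions made for illustration, not details from the slides.

```c
#include <stdint.h>

/* Hypothetical model of the ad-hoc instruction:
 * if Rn[31] is set, Rn <- (Rn << 1) + Rm, else Rn <- (Rn << 1). */
static uint32_t mstep(uint32_t rn, uint32_t rm)
{
    return (rn >> 31) ? (rn << 1) + rm : (rn << 1);
}

/* 8 x 8-bit multiplication as eight MSTEP operations.
 * Assumption: the multiplier is pre-shifted into bits 31..24. */
static uint32_t mul8_mstep(uint32_t a, uint32_t b)
{
    a <<= 24;
    for (int i = 0; i < 8; i++)
        a = mstep(a, b);
    return a & 0xffff;
}
```

Each call to `mstep` stands for one issued instruction, replacing the mask, shift, negate, AND and add sequence of the software kernel.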
Loop Unrolling
    /* init */
    a <<= 8;
    /* no loop anymore */
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    return a & 0xffff;
Full DAG
[Figure: data-flow graph of the fully unrolled kernel, with inputs a and b (a << 8) and the & 0x8000, >> 15, -, &, << 1, + pattern repeated for each step. Labels: Arithmetic Optimiser, &-network, Column Compr., +. Annotations: in SW, ~50 cycles; in HW, ~3-4 cycles.]
Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…
Register File
ALU   LD/ST   MUL
Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)
1 ad-hoc instruction added
function reduced by a factor of 10-15
Classic “Ad-hoc” Customisation…
Altera Nios:
Can we do more of this, really ad-hoc?!
Mainstream SoC/FPGA Processors and Specialisation?
All the recent embedded processors offer some sort of specialisation:
  Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)
  Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)
But all assume an onerous manual study and design!
Summary of Gain Potentials in Ad-hoc FUs
Exploit data parallelism in hardware
Exploit constants for logic simplification
  Some operations reduce to wires in hardware
Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
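The carry-save idea mentioned above can be sketched in software terms: a 3:2 compressor reduces three operands to a sum word and a carry word using only bitwise logic, so a chain of additions needs just one carry-propagating add at the end. This is a generic illustration of the technique, not the deck's own arithmetic optimiser, and the function names are assumptions.

```c
#include <stdint.h>

/* 3:2 carry-save compressor: x + y + z == *sum + *carry (mod 2^32).
 * No carry propagates between bit positions here. */
static void csa(uint32_t x, uint32_t y, uint32_t z,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = x ^ y ^ z;                           /* per-bit sum */
    *carry = ((x & y) | (x & z) | (y & z)) << 1;  /* per-bit carry, moved up one position */
}

/* Four-operand addition using two compressors and a single
 * carry-propagating addition at the end. */
static uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t s, k;
    csa(a, b, c, &s, &k);
    csa(s, k, d, &s, &k);
    return s + k;  /* the only full adder in the chain */
}
```

In hardware each compressor has roughly the delay of a single full-adder cell, regardless of word width, which is where the gain over chained carry-propagate adders comes from.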
Goals
How much scope is there for AFU specialisation in typical multimedia code?
Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
What are the microarchitectural needs for exploiting the potentials well?
  Memory ports in the AFUs?
  Number of inputs from the register file? Are two enough?
  Number of outputs to the register file? Is one enough?
Related Work in Reconfigurable Computing
Most of the work is in reconfigurable computing; typically experiments are linked to a given microarchitecture:
  CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and no AFU-memory interface
  Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
  GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)
First, investigate where the potentials are; then fix the microarchitecture
Related Work in AFU Identification
Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
  MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
  [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
  [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters) but very limited cluster complexity (3 instructions) in the experiments
  ASIP synthesis: different problem (minimal covering)
First, investigate where the potentials are; then develop appropriate identification algorithms
Methodology
Concentrate on Data Flow
  Easier to capture automatically (no architecturally visible state in the AFUs)
  Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)
Measurements on Basic Blocks
  Represent the upper limit of the potential advantages
  Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)
Experimental Flow
Software Execution: Approximate RISC Model
  One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
  Exceptions: e.g., type casts (zero), divisions (N)
  All forwarding paths assumed to exist
  No data/instruction caches, or perfect hit rates, assumed
  Jumps accounted for with a fixed amount added to the cycle count of each basic block
[Figure: pipeline diagram of five instructions (1 to 5) passing through the IF, ID, EX and WB stages, including a multi-cycle execution (EX1, EX2, EX3).]
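The software cost model described on this slide can be sketched as a simple per-node sum. The slide does not state the division latency N or the fixed jump cost, so both are left as parameters here; the enum and function names are illustrative assumptions.

```c
#include <assert.h>

/* Node classes of the approximate model; the names are illustrative. */
enum op_kind { OP_SIMPLE, OP_CAST, OP_DIV };

/* Estimated software cycles for one basic block.
 * div_cycles stands for the "N" of the slide; jump_cycles is the fixed
 * per-block jump overhead. Both values are assumptions of the caller. */
static unsigned sw_cycles(const enum op_kind *ops, int count,
                          unsigned div_cycles, unsigned jump_cycles)
{
    unsigned total = jump_cycles;
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        switch (ops[i]) {
        case OP_CAST:   break;                       /* type casts cost zero cycles */
        case OP_DIV:    total += div_cycles; break;  /* divisions cost N cycles */
        case OP_SIMPLE: total += 1; break;           /* everything else: one cycle */
        }
    }
    return total;
}
```

For instance, a block with two simple operations, one cast and one division, with an assumed N of 8 and a jump cost of 2, would be counted as 1 + 0 + 8 + 1 + 2 = 12 cycles.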
Hardware Execution: Synthesis-based Model
CMOS 0.18µ, Synopsys Design Compiler + DesignWare
Operator                              Precision                     Relative Delay hw
Multiply-Accumulator                  32 bits x 32 bits + 64 bits   1.00
Adder                                 4 bits + 4 bits               0.11
Adder                                 8 bits + 8 bits               0.12
Adder                                 16 bits + 16 bits             0.20
Adder                                 24 bits + 24 bits             0.24
Adder                                 32 bits + 32 bits             0.25
Divider                               4 bits / 4 bits               0.38
Divider                               8 bits / 8 bits               1.22
Divider                               16 bits / 16 bits             3.68
Divider                               24 bits / 24 bits             6.33
Divider                               32 bits / 32 bits             9.61
Divider (by power of two)             any / any                     0.00
Barrel shifter                        8 bits                        0.08
Barrel shifter                        16 bits                       0.11
Barrel shifter                        32 bits                       0.16
Barrel shifter (by constant amount)   any                           0.00
Bitwise multiplexer                   any                           0.02
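One plausible way to use relative delays such as these is to accumulate them along the longest path of a data-flow graph, giving an estimate of the hardware latency of a cluster. The sketch below assumes nodes with at most two predecessors, listed in topological order; the data structure and names are illustrative and not taken from the slides.

```c
#include <assert.h>

#define MAX_NODES 64

/* A DFG node: its own relative delay and up to two predecessor
 * indices (-1 means "no predecessor"). Nodes must be listed in
 * topological order, predecessors first. */
struct node {
    double delay;
    int pred[2];
};

/* Longest accumulated delay from any graph input to any node. */
static double critical_path(const struct node *n, int count)
{
    double finish[MAX_NODES];
    double worst = 0.0;
    assert(count >= 0 && count <= MAX_NODES);
    for (int i = 0; i < count; i++) {
        double start = 0.0;
        for (int k = 0; k < 2; k++) {
            int p = n[i].pred[k];
            if (p >= 0) {
                assert(p < i);            /* enforce topological order */
                if (finish[p] > start)
                    start = finish[p];    /* wait for the slowest operand */
            }
        }
        finish[i] = start + n[i].delay;
        if (finish[i] > worst)
            worst = finish[i];
    }
    return worst;
}
```

As a usage example, a constant shift (0.00) feeding two chained 32-bit adders (0.25 each) yields a critical path of 0.50 in units of the multiply-accumulator delay.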
Partitioning of DFG
Mix of Hardware and Software
  AFU memory bandwidth issue
  On-AFU (Hardware) and Off-AFU (Software) instructions
  DFG partitioned in HW and SW layers
High Cost! Low Performance?
Example of Layering Hybrid DFGs
Hardware and software layers
Metrics and Measurements
Topological basic block information:
  Inputs, outputs, etc.
Saved cycles → speedup
  $\sum_{i \in \mathrm{all\_ops}} \mathit{Lat}_{SW}(i) - \mathit{CP}_{HW}$
  $\sum_{i \in \mathrm{all\_ops}} \mathit{Lat}_{SW}(i) - \sum_{i \in \mathrm{all\_AFU\_layers}} \mathit{CP}_{HW}(i) - \sum_{i \in \mathrm{all\_SW\_ops}} \mathit{Lat}_{SW}(i)$
Basic Blocks Characteristics: Examples
Benchmark     BB #   Weight   In   Out    Ld    St   Tsw    Thw   nhw   Thyb
adpcmdecode      5   22.84%    3     2     1     0     9   2.07     2      3
                22   17.77%    4     3     1     1     7   1.25     1      3
                 9   12.69%    2     3     0     0     5   0.33     1      1
                 4    7.61%    1     3     1     0     6   1.00     2      3
mpeg2decode      4   37.44%    5     2     2     0    13   2.49     2      4
                10   34.56%    4     2     2     0    12   2.49     2      4
pegwit           1   31.47%    2     0   296    36   811   3.65     3    335
                25    9.06%    5     2     0     0     7   0.83     1      1
                28    6.47%    2     0     2     1     5   2.29     2      5
                 9    6.45%    2     1     2     0     5   2.82     3      5
                13    6.45%    2     0     2     1     5   2.29     2      5
                10    5.16%    4     2     0     0     4   0.83     1      1
Column groups on the slide: Topology, Parallel Memory Access, Sequential Memory Access
Annotations:
  Execution concentrated in few BBs
  High RF pressure
  Few Ld/St… …well separated
  Small delays
Basic Blocks Characteristics
Moderate hardware resources for AFUs:
  Often, half of the execution time is concentrated in no more than 2-3 basic blocks
  Pressure on the register file higher than classically supported
  Limited importance of memory ports
    Except some dramatic cases…
  Small delay of typical basic blocks
Potential Basic Speedup: Examples
                                Cycle Savings
Benchmark     BB #     ILP   SeqMem   NoConst    Basic   PlusBitwidth   PlusArith
adpcmdecode      5    1.80   12.71%    12.71%   12.71%         14.82%      14.82%
                22    3.50    8.47%    10.59%   10.59%         10.59%      10.59%
                 9    1.67    8.47%     8.47%    8.47%          8.47%       8.47%
                 4    1.20    3.18%     4.24%    5.29%          5.29%       5.29%
mpeg2decode      4    2.17   24.18%    24.18%   26.87%              -           -
                10    2.40   21.50%    21.50%   24.18%              -           -
pegwit           1   81.10   16.72%    28.31%   28.34%              -           -
                25    1.75    7.03%     5.86%    7.03%              -           -
                28    1.25    0.00%     1.17%    2.34%              -           -
                 9    1.00    0.00%     0.00%    2.33%              -           -
                13    1.25    0.00%     1.17%    2.33%              -           -
                10    1.33    3.50%     3.50%    3.50%              -           -
Annotations:
  Good speedup with few BBs
  Not critical…
  BBs too simple to bring advantage
Inputs and Outputs of Basic Blocks
Speedup per # inputs Speedup per # outputs
>60% ~50%
Potential Basic Speedup
Limited available parallelism
  Top-ranking basic blocks: 10 to 50% cycle savings
Hardwired constants not a key advantage
  Small price for a reduction in design risk
Sequentialisation penalty not dramatic
  AFU memory ports not essential
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Basic blocks are too simple, ceiling effects,…
Effects of ILP Techniques: Examples
Total speedup, as a percentage and (in parentheses) as a factor:
Benchmark            Basic            Plus Predication   Plus Unrolling   Plus Bitwidth Analysis   Plus Arithmetic Opt.
adpcmdecode (par.)   49.45% (2.0 x)   88.11% (8.4 x)     88.11% (8.4 x)   88.11% (8.4 x)           88.11% (8.4 x)
adpcmdecode (seq.)   45.21% (1.8 x)   77.51% (4.5 x)     77.51% (4.5 x)   81.75% (5.5 x)           81.75% (5.5 x)
mpeg2decode (par.)   68.23% (3.1 x)   68.23% (3.1 x)     86.91% (7.6 x)   -                        87.92% (8.3 x)
mpeg2decode (seq.)   60.09% (2.5 x)   60.09% (2.5 x)     73.73% (3.8 x)   -                        74.40% (3.9 x)
pegwit (par.)        63.33% (2.7 x)   67.31% (3.1 x)     67.31% (3.1 x)   -                        -
pegwit (seq.)        38.99% (1.6 x)   42.96% (1.7 x)     42.96% (1.7 x)   -                        -
[The slide also gives, for each cell, the number of basic blocks needed to reach a 30% speedup; the mapping of these counts to cells is not recoverable from the transcript.]
Effects of ILP Techniques
Major improvements:
  Cumulative speedups between 1.7x and 6.3x
Register file pressure not significantly modified
Hardware complexity and Thw increased
  Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Baseline advantage already very large
Arithmetic Optimisation Impact
mpeg2decode basic block #7           Tsw   Thw   bb
Without arithmetic transformations    55     5   25,344,000   30.6%
With arithmetic transformations       55     3   26,357,760   31.4%
[Figure: the basic block without and with optimisation.]
Conclusions
DFG-level specialisation opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk
Larger number of AFU write ports (2-3) needed
Hardcoding of constants not essential
AFU memory interfaces also not essential
ILP techniques help, as expected
Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects
Ongoing Work
Measure advantages through a complete toolchain (notably, compiler):
  DSP microarchitecture:
    Validate simple model
    Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  VLIW microarchitecture:
    Go beyond simple software execution model
Develop novel speedup-driven identification algorithms
How to get more AFU specialisation potentials
Dynamic identification and configuration of AFUs
Typical Identification Algorithms
  Bottom-up greedy approaches to cluster instructions
  Topologically driven rather than speedup driven
  E.g., MaxMISO identification [Alippi et al., 1999]:
[Figure: a data-flow graph of + and * operations partitioned into three MaxMISOs, labelled MS1, MS2 and MS3.]
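The bottom-up clustering idea can be sketched as follows: starting from a node that has no assigned cluster, walk backwards and absorb every producer whose consumers all lie inside the cluster already, so the cluster keeps a single output. This is one common way to compute maximal single-output subgraphs and is offered as an illustration, not as the exact algorithm of [Alippi et al., 1999]. The adjacency-matrix representation, the topological numbering and the names are assumptions.

```c
#include <assert.h>

#define MAX_N 32

/* succ[u][v] != 0 means node v consumes the result of node u.
 * Assumption: nodes are numbered in topological order, so every
 * edge goes from a lower to a higher index.
 * On return, miso[i] holds the cluster id of node i; the function
 * returns the number of clusters found. */
static int max_misos(int n, unsigned char succ[MAX_N][MAX_N], int miso[MAX_N])
{
    int count = 0;
    assert(n >= 0 && n <= MAX_N);
    for (int i = 0; i < n; i++)
        miso[i] = -1;
    for (int root = n - 1; root >= 0; root--) {
        if (miso[root] != -1)
            continue;              /* already absorbed by an earlier cluster */
        miso[root] = count;        /* this node is the single output */
        for (int u = root - 1; u >= 0; u--) {
            int consumers = 0, inside = 0;
            if (miso[u] != -1)
                continue;
            for (int v = u + 1; v < n; v++) {
                if (succ[u][v]) {
                    consumers++;
                    if (miso[v] == count)
                        inside++;
                }
            }
            /* u joins only if its value is used solely inside this cluster */
            if (consumers > 0 && consumers == inside)
                miso[u] = count;
        }
        count++;
    }
    return count;
}
```

The grouping depends only on the graph topology; no latency or speedup figure enters the decision, which is the limitation the slide points out.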
Speedup-driven Identification
  Prune out an optimal set of low-speedup nodes to achieve the required input/output count
  SIMD-like and unconnected graphs
[Figure: example graph with inputs i0 to i7, outputs o0 and o1, a label k, and nodes annotated with values 0.1, 0.5, 1, 2 and 3.]
Open Issues and Perspectives
Power consumption advantages?
  Power down because:
    Fewer instruction fetches and decodes
    Fewer register reads and writebacks
  Power possibly up because:
    Reduced correlation of signals in the AFU
    Low efficiency of the implementation (in case of eFPGAs)
More opportunities to increase speedup?
  Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  Detect runtime constant values
Dynamic Specialisation?
Java Bytecode
JiT + Specialisation
ARM + RFU
Dynamic compilation and optimisation together with hardware specialisation
  DAISY, Crusoe, JiT, etc.
  Specialisation may profit from runtime information
  Identification in runtime conditions
  Dynamic reconfigurability challenge
Conclusions
Processor customisation opportunities are here: soft cores, FPGA processors, etc.
Very specific field of hardware/software codesign with a very large potential
  Do not give up versatility
  Get most of the performance of custom hardware
Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)