Automatic Processor Specialisation using Ad-hoc Functional Units


EPFL – I&C – LAP





© Ienne 2002, Automatic Processor Specialisation

Classic Options for Systems-on-Chip

Design Gap!


Processor Specialisation: Get the Best of Both Options

Embedded!


VLIW Processor Specialisation

Two complementary specialisation strategies:
  Parametric Architecture
  Ad-hoc Functional Units (AFUs)


Automatically Collapsing Clusters of Instructions into New Ones

If the ad-hoc functional unit completes the job faster → GAIN

One ad-hoc complex operation instead of a long sequence of standard ones


General Goal

Automatically achieve processor specialisation through high-level application code analysis


Outline

Introduction
Motivational example
Goals
Opportunities for specialisation
Challenges, further opportunities,…


Elementary Motivational Example

An Important Kernel…

    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        if (a & 0x8000) {
            a = (a << 1) + b;
        } else {
            a <<= 1;
        }
    }
    return a & 0xffff;

Shift-and-add unsigned 8 x 8-bit multiplication


Software Predication

    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        p1 = - ((a & 0x8000) >> 15);   /* predicate mask */
        a = (a << 1) + (b & p1);       /* shift, then predicated add */
    }
    return a & 0xffff;

Predicate mask (0 or –1 = 0xffffffff)

Shift, Predicated Add
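
The equivalence of the branching and the predicated kernels can be checked in software. The sketch below is a minimal, self-contained C version of both; the function names `mul8_branch` and `mul8_pred` are illustrative and not from the slides, and `a` and `b` are assumed to be unsigned 32-bit values holding 8-bit operands. Note the parentheses around `b & p1`: in C, `+` binds tighter than `&`.

```c
#include <stdint.h>

/* Branching shift-and-add: 8 x 8-bit unsigned multiplication. */
static uint32_t mul8_branch(uint32_t a, uint32_t b)
{
    a <<= 8;                        /* multiplier moves to bits 15..8 */
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000)
            a = (a << 1) + b;
        else
            a <<= 1;
    }
    return a & 0xffff;              /* product is in the low 16 bits */
}

/* Predicated form: no branch, the mask p1 is 0 or 0xffffffff. */
static uint32_t mul8_pred(uint32_t a, uint32_t b)
{
    a <<= 8;
    for (int i = 0; i < 8; i++) {
        uint32_t p1 = 0u - ((a & 0x8000) >> 15);
        a = (a << 1) + (b & p1);
    }
    return a & 0xffff;
}
```

Calling both functions for every pair of 8-bit operands and comparing against `a * b` is a cheap exhaustive check (65536 cases).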


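Loop Kernel DAG

[Figure: data-flow graph of one loop iteration. Inputs a and b; nodes: a & 0x8000, >> 15, negate (-), & b, a << 1, +; output a.]

In SW: ~6 cycles
In HW: AND gates, only wiring, ALU: 1-2 cycles!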


Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop

[Figure: register file feeding three functional units: ALU, LD/ST, MSTEP]

MSTEP:
    if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm
    else Rn ← (Rn << 1)

1 ad-hoc instruction added → loop kernel reduced to 15-30%
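
To make the MSTEP semantics concrete, here is a hedged software model in C. The function `mstep` follows the register-transfer description on the slide; the 16 x 16-bit wrapper `mul16_mstep` is an assumed usage pattern added for illustration (the multiplier is placed in the upper half of a 32-bit register so that bit 31 is the bit under test), not something the slides specify.

```c
#include <stdint.h>

/* Software model of one MSTEP:
 * if (Rn[31] == 1) Rn <- (Rn << 1) + Rm; else Rn <- (Rn << 1). */
static uint32_t mstep(uint32_t rn, uint32_t rm)
{
    uint32_t mask = 0u - (rn >> 31);    /* 0 or 0xffffffff */
    return (rn << 1) + (rm & mask);
}

/* Assumed usage: 16 x 16-bit unsigned multiply from 16 MSTEPs. */
static uint32_t mul16_mstep(uint16_t a, uint16_t b)
{
    uint32_t rn = (uint32_t)a << 16;    /* multiplier in the top half */
    for (int i = 0; i < 16; i++)
        rn = mstep(rn, b);              /* one ad-hoc instruction per bit */
    return rn;                          /* full 32-bit product */
}
```

Each loop iteration stands for a single instruction on the specialised processor, where the software-predicated kernel needs about six.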


Loop Unrolling

    /* init */
    a <<= 8;
    /* no loop anymore */
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    return a & 0xffff;


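Full DAG

[Figure: data-flow graph of the fully unrolled function. The software form is a chain of eight stages (& 0x8000, >> 15, -, &, << 1, +) starting from a << 8 and b. The arithmetic optimiser turns it into an &-network on a and b, followed by column compression and a final adder.]

In SW: ~50 cycles
In HW: ~3-4 cycles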


Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…

[Figure: register file feeding three functional units: ALU, LD/ST, MUL]

MUL: Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)

1 ad-hoc instruction added → function reduced by a factor 10-15


Classic “Ad-hoc” Customisation…

Altera Nios:

Can we do more of this, really ad-hoc?!


Mainstream SoC/FPGA Processors and Specialisation?

All the recent embedded processors offer some sort of specialisation:

Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)

Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)

But all assume an onerous manual study and design!


Summary of Gain Potentials in Ad-hoc FUs

Exploit data parallelism in hardware
Exploit constants for logic simplification
  Some operations reduce to wires in hardware
Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
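
As an illustration of the carry-save idea mentioned above, the following C sketch models a 3:2 compressor bit-parallel on 32-bit words. The names `csa` and `add4_carry_save` are hypothetical; the point is only that a chain of additions can be reduced to cheap compressor levels plus one carry-propagate adder at the end, which is what makes chained adders fast in an AFU.

```c
#include <stdint.h>

/* 3:2 carry-save compressor: three addends in, two out, no carry
 * propagation. x + y + z == *sum + *carry (modulo 2^32). */
static void csa(uint32_t x, uint32_t y, uint32_t z,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = x ^ y ^ z;                           /* per-bit sum */
    *carry = ((x & y) | (x & z) | (y & z)) << 1;  /* per-bit majority */
}

/* Adds four values with two compressor levels and one real adder. */
static uint32_t add4_carry_save(uint32_t a, uint32_t b,
                                uint32_t c, uint32_t d)
{
    uint32_t s, k;
    csa(a, b, c, &s, &k);
    csa(s, k, d, &s, &k);
    return s + k;   /* the only carry-propagate addition */
}
```

In hardware each compressor level costs roughly one full-adder delay regardless of word width, whereas each extra carry-propagate adder costs a full adder delay such as those in the synthesis-based model.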


Goals

How much scope for AFU specialisation in typical multimedia code?
Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
What are the microarchitectural needs for exploiting the potentials well?
  Memory ports in the AFUs?
  Number of inputs from the register file? Are two enough?
  Number of outputs to the register file? Is one enough?


Related Work in Reconfigurable Computing

Most of the work is in reconfigurable computing; typically experiments are linked to a given microarchitecture:
  CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and with no AFU-memory interface
  Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
  GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)

First, investigate where potentials are → fix microarchitecture


Related Work in AFU Identification

Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
  MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
  [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
  [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters) but very limited cluster complexity (3 instructions) in the experiments
  ASIP synthesis: different problem (minimal covering)

First, investigate where potentials are → develop appropriate identification algorithms


Methodology

Concentrate on Data Flow
  Easier to capture automatically (no architecturally visible state in the AFUs)
  Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)
Measurements on Basic Blocks
  Represent the upper limit of the potential advantages
  Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)


Experimental Flow


Software Execution: Approximate RISC Model

One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
  Exceptions: e.g., type casts (zero), divisions (N)
All forwarding paths assumed to exist
No data/instruction cache, or perfect hit rates assumed
Jumps accounted for by adding a fixed amount to the cycle count of each basic block

[Figure: pipeline diagram of five instructions (1 to 5) flowing through the IF, ID, EX and WB stages; one instruction occupies three execution cycles (EX1, EX2, EX3).]


Hardware Execution: Synthesis-based Model

Operator                              Precision                      Relative Delay hw
Multiply-Accumulator                  32 bits x 32 bits + 64 bits    1.00
Adder                                 4 bits + 4 bits                0.11
Adder                                 8 bits + 8 bits                0.12
Adder                                 16 bits + 16 bits              0.20
Adder                                 24 bits + 24 bits              0.24
Adder                                 32 bits + 32 bits              0.25
Divider                               4 bits / 4 bits                0.38
Divider                               8 bits / 8 bits                1.22
Divider                               16 bits / 16 bits              3.68
Divider                               24 bits / 24 bits              6.33
Divider                               32 bits / 32 bits              9.61
Divider (by power of two)             any / any                      0.00
Barrel shifter                        8 bits                         0.08
Barrel shifter                        16 bits                        0.11
Barrel shifter                        32 bits                        0.16
Barrel shifter (by constant amount)   any                            0.00
Bitwise multiplexer                   any                            0.02

CMOS 0.18µ
Synopsys Design Compiler + DesignWare


Partitioning of DFG

Mix of Hardware and Software
  AFU memory bandwidth issue
  On-AFU (Hardware) and Off-AFU (Software) instructions
  DFG partitioned in HW and SW layers

High Cost! Low Performance?


Example of Layering Hybrid DFGs

Hardware and software layers


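Metrics and Measurements

Topological basic block information: inputs, outputs, etc.

Saved cycles → speedup

All operations in hardware:

    \sum_{i \in \text{all\_ops}} Lat_{SW}(i) - CP_{HW}

Hardware and software layers:

    \sum_{i \in \text{all\_ops}} Lat_{SW}(i) - \sum_{i \in \text{all\_AFU\_layers}} CP_{HW}(i) - \sum_{i \in \text{all\_SW\_ops}} Lat_{SW}(i)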
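
A small worked example may help with the layered metric. The C sketch below computes the hybrid execution time as the sum of the AFU-layer critical paths plus the cycles of the operations left in software, and then the saved cycles. Two things are assumptions of this sketch and not stated on the slide: each AFU layer is rounded up to a whole number of processor cycles, and the per-layer delays used in the usage note are hypothetical. The function names are illustrative.

```c
#include <stddef.h>

/* Round a non-negative relative delay up to whole cycles.
 * Assumption: an AFU layer occupies an integer number of cycles. */
static unsigned cycles_ceil(double delay)
{
    unsigned c = (unsigned)delay;
    return ((double)c < delay) ? c + 1 : c;
}

/* Hybrid time: AFU layers plus the operations left in software. */
static unsigned t_hybrid(const double *layer_cp, size_t n_layers,
                         unsigned sw_cycles)
{
    unsigned t = sw_cycles;
    for (size_t i = 0; i < n_layers; i++)
        t += cycles_ceil(layer_cp[i]);
    return t;
}

/* Saved cycles for one basic block: software time minus hybrid time. */
static unsigned saved_cycles(unsigned t_sw, unsigned t_hyb)
{
    return t_sw - t_hyb;
}
```

For instance, mpeg2decode basic block 4 is listed with Tsw = 13, two AFU layers, two loads and Thyb = 4. With two hypothetical layer delays below one cycle each (say 0.9 and 0.8) and one cycle per load, `t_hybrid` gives 4 and `saved_cycles(13, 4)` gives 9.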


Basic Blocks Characteristics: Examples

                                  Topology            Parallel Mem. Access   Sequential Mem. Access
Benchmark     BB #   Weight    In   Out   Ld    St    Tsw    Thw             nhw   Thyb
adpcmdecode   5      22.84%    3    2     1     0     9      2.07            2     3
              22     17.77%    4    3     1     1     7      1.25            1     3
              9      12.69%    2    3     0     0     5      0.33            1     1
              4      7.61%     1    3     1     0     6      1.00            2     3
mpeg2decode   4      37.44%    5    2     2     0     13     2.49            2     4
              10     34.56%    4    2     2     0     12     2.49            2     4
pegwit        1      31.47%    2    0     296   36    811    3.65            3     335
              25     9.06%     5    2     0     0     7      0.83            1     1
              28     6.47%     2    0     2     1     5      2.29            2     5
              9      6.45%     2    1     2     0     5      2.82            3     5
              13     6.45%     2    0     2     1     5      2.29            2     5
              10     5.16%     4    2     0     0     4      0.83            1     1

Annotations on the slide:
  Execution concentrated in few BBs
  High RF pressure
  Few Ld/St… …well separated
  Small delays


Basic Blocks Characteristics

Moderate hardware resources for AFUs:
  Often, half of the execution time is concentrated in no more than 2-3 basic blocks
  Pressure on the register file higher than classically supported
  Limited importance of memory ports
    Except some dramatic cases…
  Small delay of typical basic blocks


Potential Basic Speedup: Examples

                               Cycle Savings
Benchmark     BB #   ILP      SeqMem   NoConst   Basic    PlusBitwidth   PlusArith
adpcmdecode   5      1.80     12.71%   12.71%    12.71%   14.82%         14.82%
              22     3.50     8.47%    10.59%    10.59%   10.59%         10.59%
              9      1.67     8.47%    8.47%     8.47%    8.47%          8.47%
              4      1.20     3.18%    4.24%     5.29%    5.29%          5.29%
mpeg2decode   4      2.17     24.18%   24.18%    26.87%   -              -
              10     2.40     21.50%   21.50%    24.18%   -              -
pegwit        1      81.10    16.72%   28.31%    28.34%   -              -
              25     1.75     7.03%    5.86%     7.03%    -              -
              28     1.25     0.00%    1.17%     2.34%    -              -
              9      1.00     0.00%    0.00%     2.33%    -              -
              13     1.25     0.00%    1.17%     2.33%    -              -
              10     1.33     3.50%    3.50%     3.50%    -              -

Annotations on the slide:
  Good speedup with few BBs
  Not critical…
  BBs too simple to bring advantage


Inputs and Outputs of Basic Blocks

Speedup per # inputs Speedup per # outputs

>60% ~50%


Potential Basic Speedup

Limited available parallelism
  Top-ranking basic blocks: 10 to 50% cycle savings
Hardwired constants not a key advantage
  Small price for a reduction in design risk
Sequentialisation penalty not dramatic
  AFU memory ports not essential
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Basic blocks are too simple, ceiling effects,…


Effects of ILP Techniques: Examples

Total speedup (cycle savings, with the speedup factor in parentheses):

Benchmark            Basic            Plus Predication   Plus Unrolling    Plus Bitwidth Analysis   Plus Arithmetic Opt.
adpcmdecode (par.)   49.45% (2.0 x)   88.11% (8.4 x)     88.11% (8.4 x)    88.11% (8.4 x)           88.11% (8.4 x)
adpcmdecode (seq.)   45.21% (1.8 x)   77.51% (4.5 x)     77.51% (4.5 x)    81.75% (5.5 x)           81.75% (5.5 x)
mpeg2decode (par.)   68.23% (3.1 x)   68.23% (3.1 x)     86.91% (7.6 x)    -                        87.92% (8.3 x)
mpeg2decode (seq.)   60.09% (2.5 x)   60.09% (2.5 x)     73.73% (3.8 x)    -                        74.40% (3.9 x)
pegwit (par.)        63.33% (2.7 x)   67.31% (3.1 x)     67.31% (3.1 x)    -                        -
pegwit (seq.)        38.99% (1.6 x)   42.96% (1.7 x)     42.96% (1.7 x)    -                        -

[The slide also annotates each cell with the number of basic blocks needed to reach a 30% speedup (values between 1 and 10); the assignment of these counts to cells is not recoverable from the extracted text.]
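
The speedup factors given in parentheses appear to follow from the cycle-saving percentages in the usual way: if a fraction s of the cycles is saved, the remaining time is 1 - s of the original. A quick check on the first two adpcmdecode (par.) entries:

```latex
\[
\text{speedup} = \frac{1}{1 - s}, \qquad
s = 0.4945 \;\Rightarrow\; \frac{1}{0.5055} \approx 1.98 \approx 2.0\times, \qquad
s = 0.8811 \;\Rightarrow\; \frac{1}{0.1189} \approx 8.4\times
\]
```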


Effects of ILP Techniques

Major improvements:
  Cumulative speedups between 1.7x and 6.3x
Register file pressure not significantly modified
Hardware complexity and Thw increased
  Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Baseline advantage already very large


Arithmetic Optimisation Impact

mpeg2decode basic block #7 Tsw Thw bb

Without arithmetic transformations 55 5 25,344,000 30.6%

With arithmetic transformations 55 3 26,357,760 31.4%

w/o optimisation with optimisation


Conclusions

DFG-level specialisation opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk
Larger number of AFU write ports (2-3) needed
Hardcoding of constants not essential
AFU memory interfaces also not essential
ILP techniques help, as expected
Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects


Ongoing Work

Measure advantages through a complete toolchain (notably, compiler):
  DSP microarchitecture:
    Validate simple model
    Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  VLIW microarchitecture:
    Go beyond simple software execution model
Develop novel speedup-driven identification algorithms
How to get more AFU specialisation potentials
Dynamic identification and configuration of AFUs


Typical Identification Algorithms

Bottom-up greedy approaches to cluster instructions
Topologically-driven rather than speedup-driven
E.g., MaxMISO identification [Alippi et al., 1999]:

[Figure: data-flow graph of + and * nodes partitioned into three MaxMISO clusters, MS1, MS2 and MS3.]
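
The bottom-up, topology-driven clustering described above can be sketched in a few lines of C. This is a simplified illustration in the spirit of MaxMISO identification, not the published algorithm: starting from one output node, it keeps absorbing any producer whose consumers are all already inside the cluster, so the cluster keeps a single output. The `Dag` type, `MAX_NODES` and the function names are hypothetical, and the graph is assumed to contain operation nodes only (register inputs are not nodes).

```c
#include <stdbool.h>

#define MAX_NODES 16

/* edge[u][v] is true when operation u produces an operand of v. */
typedef struct {
    int  n;
    bool edge[MAX_NODES][MAX_NODES];
} Dag;

/* True when every consumer of node u is already in the cluster. */
static bool all_consumers_inside(const Dag *g, int u, const bool *in)
{
    for (int v = 0; v < g->n; v++)
        if (g->edge[u][v] && !in[v])
            return false;
    return true;
}

/* Grow a single-output cluster upward from root; returns its size.
 * On return, in[i] tells whether node i belongs to the cluster. */
static int grow_miso(const Dag *g, int root, bool *in)
{
    for (int i = 0; i < g->n; i++)
        in[i] = false;
    in[root] = true;
    int size = 1;
    bool changed = true;
    while (changed) {                 /* iterate to a fixed point */
        changed = false;
        for (int u = 0; u < g->n; u++) {
            if (in[u])
                continue;
            bool feeds = false;       /* does u feed the cluster? */
            for (int v = 0; v < g->n; v++)
                if (g->edge[u][v] && in[v])
                    feeds = true;
            if (feeds && all_consumers_inside(g, u, in)) {
                in[u] = true;
                size++;
                changed = true;
            }
        }
    }
    return size;
}
```

Note that the growth rule looks only at connectivity: it never asks how much time a node would save in hardware, which is the limitation that speedup-driven identification addresses.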


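Speedup-driven Identification

Prune out the optimal set of low-speedup nodes to achieve the required input/output count

[Figure: data-flow graph with inputs i0 to i7 and outputs o0 and o1; nodes are annotated with their speedup contribution (values from 0.1 to 3), and one node is labelled k.]

SIMD-like and unconnected graphs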


Open Issues and Perspectives

Power consumption advantages?
  Power down because:
    Fewer instruction fetches and decodes
    Fewer register reads and writebacks
  Power possibly up because:
    Reduced correlation of signals in the AFU
    Low efficiency of the implementation (in case of eFPGAs)
More opportunities to increase speedup?
  Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  Detect runtime constant values


Dynamic Specialisation?

[Figure: Java Bytecode → JiT + Specialisation → ARM + RFU]

Dynamic compilation and optimisation together with hardware specialisation
  DAISY, Crusoe, JiT, etc.
  Specialisation may profit from runtime information
  Identification in runtime conditions
  Dynamic reconfigurability challenge


Conclusions

Processor customisation opportunities are here: soft cores, FPGA processors, etc.
Very specific field of hardware/software codesign with a very large potential
  Do not give up versatility
  Get most of the performance of custom hardware
Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)
