Automatic Processor Specialisation using Ad-hoc Functional Units [email protected] , [email protected] , Miljan [email protected] EPFL – I&C – LAP


Page 2: Automatic Processor Specialisation using  Ad-hoc Functional Units

© Ienne 2002, Automatic Processor Specialisation

Classic Options for Systems-on-Chip

Design Gap!

Processor Specialisation: Get the Best of Both Options

Embedded!

VLIW Processor Specialisation

Two complementary specialisation strategies:
- Parametric Architecture
- Ad-hoc Functional Units (AFUs)

Automatically Collapsing Clusters of Instructions into New Ones

- One ad-hoc complex operation instead of a long sequence of standard ones
- If the ad-hoc functional unit completes the job faster → GAIN

General Goal

Automatically achieve processor specialisation through high-level application code analysis

Outline

- Introduction
- Motivational example
- Goals
- Opportunities for specialisation
- Challenges, further opportunities, …

Elementary Motivational Example

An Important Kernel…

    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        if (a & 0x8000) {
            a = (a << 1) + b;
        } else {
            a <<= 1;
        }
    }
    return a & 0xffff;

Shift-and-add unsigned 8 x 8-bit multiplication

Software Predication

    /* init */
    a <<= 8;
    /* loop */
    for (i = 0; i < 8; i++) {
        p1 = -((a & 0x8000) >> 15);   /* predicate mask */
        a = (a << 1) + (b & p1);      /* shift, then predicated add */
    }
    return a & 0xffff;

Callouts: predicate mask (0 or –1 = 0xffffffff); shift; predicated add
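As a sanity check, the branching kernel and its predicated form can be wrapped into two functions and compared against the built-in multiply. This is a minimal sketch: the function names and the use of `uint32_t` are assumptions, not part of the slides. In C, `+` binds tighter than `&`, so the masked operand `(b & p1)` needs its own parentheses.

```c
#include <assert.h>
#include <stdint.h>

/* Branching shift-and-add kernel, wrapped in a function (name is hypothetical). */
uint32_t mul8_branch(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);   /* 8 x 8-bit operands only */
    a <<= 8;                          /* multiplier moves to bits 15..8 */
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000)
            a = (a << 1) + b;
        else
            a <<= 1;
    }
    return a & 0xffff;
}

/* Predicated kernel: the branch becomes an all-zeros / all-ones mask. */
uint32_t mul8_predicated(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);
    a <<= 8;
    for (int i = 0; i < 8; i++) {
        uint32_t p1 = 0u - ((a & 0x8000) >> 15);  /* 0 or 0xffffffff */
        a = (a << 1) + (b & p1);                  /* add b only when the mask is set */
    }
    return a & 0xffff;
}
```

Both functions should agree with `a * b` for every pair of 8-bit operands, because the partial product never grows into the bits still holding unprocessed multiplier bits.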

Loop Kernel DAG

[Figure: data-flow graph of the loop kernel, built from a, b, the constants 0x8000, 15 and 1, and the operators &, >>, -, &, << and +. In SW: ~6 cycles. In HW: AND gates, only wiring, and an ALU, for 1-2 cycles!]

Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop

[Figure: register file feeding three units: ALU, LD/ST and MSTEP]

    if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm
    else Rn ← (Rn << 1)

- 1 ad-hoc instruction added
- Loop kernel reduced to 15-30%
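The MSTEP semantics can be modelled in software to check that eight steps do perform the multiplication. This is only a sketch: `mstep` and `mul8_mstep` are hypothetical names, and aligning the multiplier to bit 31 (a shift by 24) is an assumption made so that the Rn[31] test lines up with an 8-bit operand.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the MSTEP semantics:
 * if Rn[31] == 1 then Rn <- (Rn << 1) + Rm else Rn <- (Rn << 1). */
uint32_t mstep(uint32_t rn, uint32_t rm)
{
    uint32_t shifted = rn << 1;
    return (rn >> 31) ? shifted + rm : shifted;
}

/* Hypothetical use: 8 x 8-bit multiply with the multiplier aligned to bit 31. */
uint32_t mul8_mstep(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);
    uint32_t rn = a << 24;            /* multiplier occupies bits 31..24 */
    for (int i = 0; i < 8; i++)
        rn = mstep(rn, b);            /* one ad-hoc instruction per iteration */
    return rn;                        /* multiplier bits have been shifted out */
}
```

With this alignment no final mask is needed, since the multiplier bits leave the 32-bit register as the product accumulates in the low half.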

Loop Unrolling

    /* init */
    a <<= 8;
    /* no loop anymore */
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    p1 = -((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
    return a & 0xffff;

Full DAG

[Figure: full data-flow graph of the unrolled kernel, starting from a << 8 and b and repeating the pattern &, 0x8000, >> 15, -, &, << 1, + for each step. In SW: ~50 cycles. An arithmetic optimiser turns the graph into an &-network, column compression and a final adder. In HW: ~3-4 cycles.]

Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…

[Figure: register file feeding three units: ALU, LD/ST and MUL]

    Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)

- 1 ad-hoc instruction added
- Function reduced by a factor 10-15

Classic “Ad-hoc” Customisation…

Altera Nios:

Can we do more of this, really ad-hoc?!

Mainstream SoC/FPGA Processors and Specialisation?

All the recent embedded processors offer some sort of specialisation:
- Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)
- Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)

But all assume an onerous manual study and design!

Summary of Gain Potentials in Ad-hoc FUs

- Exploit data parallelism in hardware
- Exploit constants for logic simplification
- Some operations reduce to wires in hardware
- Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
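The carry-save idea mentioned in the last point can be illustrated with a bit-level software model. This is a sketch of the general technique, not of the authors' tool; the type and function names are hypothetical. Each carry-save stage reduces three addends to two without propagating carries, so a chain of additions pays for only one carry-propagate adder at the end.

```c
#include <stdint.h>

/* One carry-save (3:2 compressor) stage: three addends in, two out,
 * with no carry propagation across bit positions. */
typedef struct { uint32_t sum; uint32_t carry; } csa_pair;

csa_pair csa(uint32_t x, uint32_t y, uint32_t z)
{
    csa_pair r;
    r.sum   = x ^ y ^ z;                          /* per-bit sum */
    r.carry = ((x & y) | (x & z) | (y & z)) << 1; /* per-bit majority, weight 2 */
    return r;
}

/* Add four values with two carry-save stages and a single carry-propagate add. */
uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    csa_pair s1 = csa(a, b, c);
    csa_pair s2 = csa(s1.sum, s1.carry, d);
    return s2.sum + s2.carry;   /* the only carry-propagate addition in the chain */
}
```

The result equals `a + b + c + d` modulo 2^32, the same as chaining three ordinary 32-bit additions.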

Goals

- How much scope for AFU specialisation in typical multimedia code?
- Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
- What are the microarchitectural needs for exploiting the potential well?
  - Memory ports in the AFUs?
  - Number of inputs from the register file? Are two enough?
  - Number of outputs to the register file? Is one enough?

Related Work in Reconfigurable Computing

Most of the work is in reconfigurable computing; typically, experiments are linked to a given microarchitecture:
- CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and with no AFU-memory interface
- Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
- GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)

First, investigate where the potentials are → then fix the microarchitecture

Related Work in AFU Identification

Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
- MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
- [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
- [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters), but with very limited cluster complexity (3 instructions) in the experiments
- ASIP synthesis: a different problem (minimal covering)

First, investigate where the potentials are → then develop appropriate identification algorithms

Methodology

Concentrate on Data Flow
- Easier to capture automatically (no architecturally visible state in the AFUs)
- Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)

Measurements on Basic Blocks
- Represent the upper limit of the potential advantages
- Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)


Experimental Flow

Software Execution: Approximate RISC Model

- One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
- Exceptions: e.g., type casts (zero), divisions (N)
- All forwarding paths assumed to exist
- No data/instruction cache, or perfect hit rates assumed
- Jumps accounted for by adding a fixed amount to the cycle count of each basic block

[Figure: pipeline diagram of five instructions (1 to 5) through the IF, ID, EX and WB stages, including one instruction with no EX stage and one with three execution stages (EX1, EX2, EX3)]
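The cost model above amounts to a per-node lookup plus a fixed jump overhead. The sketch below assumes hypothetical node kinds and placeholder values for the division latency N and the jump cost, neither of which is given on the slide.

```c
#include <stddef.h>

/* Hypothetical node kinds; the real flow works on SUIF nodes. */
typedef enum { NODE_ALU, NODE_CAST, NODE_DIV } node_kind;

/* Placeholder constants: the slide gives neither N nor the jump overhead. */
enum { DIV_CYCLES = 32, JUMP_CYCLES = 2 };

/* One cycle for most nodes, zero for type casts, N for divisions. */
unsigned node_cycles(node_kind k)
{
    switch (k) {
    case NODE_CAST: return 0;
    case NODE_DIV:  return DIV_CYCLES;
    default:        return 1;
    }
}

/* Software cycle estimate of one basic block: node costs plus a fixed jump cost. */
unsigned basic_block_cycles(const node_kind *nodes, size_t n)
{
    unsigned total = JUMP_CYCLES;
    for (size_t i = 0; i < n; i++)
        total += node_cycles(nodes[i]);
    return total;
}
```

Calling `basic_block_cycles` on an array of node kinds gives the Tsw-style estimate for that block under these placeholder constants.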

Hardware Execution: Synthesis-based Model

Operator                              Precision                      Relative Delay hw
Multiply-Accumulator                  32 bits x 32 bits + 64 bits    1.00
Adder                                 4 bits + 4 bits                0.11
Adder                                 8 bits + 8 bits                0.12
Adder                                 16 bits + 16 bits              0.20
Adder                                 24 bits + 24 bits              0.24
Adder                                 32 bits + 32 bits              0.25
Divider                               4 bits / 4 bits                0.38
Divider                               8 bits / 8 bits                1.22
Divider                               16 bits / 16 bits              3.68
Divider                               24 bits / 24 bits              6.33
Divider                               32 bits / 32 bits              9.61
Divider (by power of two)             any / any                      0.00
Barrel shifter                        8 bits                         0.08
Barrel shifter                        16 bits                        0.11
Barrel shifter                        32 bits                        0.16
Barrel shifter (by constant amount)   any                            0.00
Bitwise multiplexer                   any                            0.02

Technology: CMOS 0.18µ. Tools: Synopsys Design Compiler + DesignWare.
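One way to use these relative delays is to take the longest path through a cluster's data-flow graph. The sketch below assumes that delays along a path simply add, which the slide does not state; the `hw_node` layout and the function name are hypothetical. Nodes must be listed in topological order.

```c
#include <stddef.h>

/* A hardware operator with its relative delay and up to two producers.
 * pred[k] is the index of a producer node, or -1 if unused. */
typedef struct { double delay; int pred[2]; } hw_node;

/* Longest path through the graph; arrival[i] receives the delay at node i's output.
 * Nodes must be in topological order (producers before consumers). */
double critical_path(const hw_node *g, size_t n, double *arrival)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double start = 0.0;
        for (int p = 0; p < 2; p++) {
            int j = g[i].pred[p];
            if (j >= 0 && arrival[j] > start)
                start = arrival[j];   /* wait for the slowest producer */
        }
        arrival[i] = start + g[i].delay;
        if (arrival[i] > worst)
            worst = arrival[i];
    }
    return worst;
}
```

For example, a constant shift (0.00) and a 32-bit barrel shifter (0.16) feeding a 32-bit adder (0.25) followed by a multiplexer (0.02) gives about 0.43 of the multiply-accumulator delay.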

Partitioning of DFG

Mix of Hardware and Software
- AFU memory bandwidth issue
- On-AFU (Hardware) and Off-AFU (Software) instructions
- DFG partitioned in HW and SW layers

High Cost! Low Performance?

Example of Layering Hybrid DFGs

[Figure: a hybrid DFG split into hardware and software layers]

Metrics and Measurements

- Topological basic block information: inputs, outputs, etc.
- Saved cycles → speedup

Saved cycles when the whole basic block runs in hardware:

$$\sum_{i \in \text{all\_ops}} Lat_{SW}(i) - CP_{HW}$$

Saved cycles when the basic block is split into AFU layers and software operations:

$$\sum_{i \in \text{all\_ops}} Lat_{SW}(i) - \sum_{i \in \text{all\_AFU\_layers}} CP_{HW,i} - \sum_{i \in \text{all\_SW\_ops}} Lat_{SW}(i)$$
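The saved-cycles metric can be sketched in code under one reading of the slide's formulas: the sum of software latencies over all operations, minus the hardware critical path (pure hardware case), or minus the critical paths of all AFU layers and the latencies of the operations left in software (hybrid case). That reading, and all names below, are assumptions.

```c
#include <stddef.h>

static double sum(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* All operations in one AFU: software latencies minus the hardware critical path. */
double saved_cycles_hw(const double *lat_sw, size_t n_ops, double cp_hw)
{
    return sum(lat_sw, n_ops) - cp_hw;
}

/* Hybrid block: AFU layers alternate with operations left in software. */
double saved_cycles_hybrid(const double *lat_sw_all, size_t n_all,
                           const double *cp_hw_layer, size_t n_layers,
                           const double *lat_sw_left, size_t n_left)
{
    return sum(lat_sw_all, n_all)
         - sum(cp_hw_layer, n_layers)
         - sum(lat_sw_left, n_left);
}
```

With six one-cycle operations, a pure hardware critical path of 2 cycles saves 4 cycles, while two one-cycle AFU layers separated by one software operation save 3.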

Basic Blocks Characteristics

Examples

Benchmark     BB #   Weight   In   Out   Ld    St   Tsw   Thw    nhw   Thyb
adpcmdecode   5      22.84%   3    2     1     0    9     2.07   2     3
              22     17.77%   4    3     1     1    7     1.25   1     3
              9      12.69%   2    3     0     0    5     0.33   1     1
              4      7.61%    1    3     1     0    6     1.00   2     3
mpeg2decode   4      37.44%   5    2     2     0    13    2.49   2     4
              10     34.56%   4    2     2     0    12    2.49   2     4
pegwit        1      31.47%   2    0     296   36   811   3.65   3     335
              25     9.06%    5    2     0     0    7     0.83   1     1
              28     6.47%    2    0     2     1    5     2.29   2     5
              9      6.45%    2    1     2     0    5     2.82   3     5
              13     6.45%    2    0     2     1    5     2.29   2     5
              10     5.16%    4    2     0     0    4     0.83   1     1

Column groups on the slide: Topology, Parallel Memory Access, Sequential Memory Access.

Callouts: execution concentrated in few BBs; high RF pressure; few Ld/St…; small delays… well separated

Basic Blocks Characteristics

- Moderate hardware resources for AFUs: often, half of the execution time is concentrated in no more than 2-3 basic blocks
- Pressure on the register file higher than classically supported
- Limited importance of memory ports
  - Except some dramatic cases…
- Small delay of typical basic blocks

Potential Basic Speedup

Examples

                                     Cycle Savings
Benchmark     BB #   ILP     SeqMem   NoConst   Basic    PlusBitwidth   PlusArith
adpcmdecode   5      1.80    12.71%   12.71%    12.71%   14.82%         14.82%
              22     3.50    8.47%    10.59%    10.59%   10.59%         10.59%
              9      1.67    8.47%    8.47%     8.47%    8.47%          8.47%
              4      1.20    3.18%    4.24%     5.29%    5.29%          5.29%
mpeg2decode   4      2.17    24.18%   24.18%    26.87%   -              -
              10     2.40    21.50%   21.50%    24.18%   -              -
pegwit        1      81.10   16.72%   28.31%    28.34%   -              -
              25     1.75    7.03%    5.86%     7.03%    -              -
              28     1.25    0.00%    1.17%     2.34%    -              -
              9      1.00    0.00%    0.00%     2.33%    -              -
              13     1.25    0.00%    1.17%     2.33%    -              -
              10     1.33    3.50%    3.50%     3.50%    -              -

Callouts: good speedup with few BBs; not critical…; BBs too simple to bring advantage


Inputs and Outputs of Basic Blocks

Speedup per # inputs Speedup per # outputs

>60% ~50%

Potential Basic Speedup

- Limited available parallelism
- Top-ranking basic blocks: 10 to 50% cycle savings
- Hardwired constants not a key advantage
  - Small price for a reduction in design risk
- Sequentialisation penalty not dramatic
  - AFU memory ports not essential
- Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  - Basic blocks are too simple, ceiling effects, …

Effects of ILP Techniques

Examples. Each cell gives the total speedup as a percentage and, in parentheses, as a factor.

Benchmark            Basic            Plus Predication   Plus Unrolling   Plus Bitwidth Analysis   Plus Arithmetic Opt.
adpcmdecode (par.)   49.45% (2.0 x)   88.11% (8.4 x)     88.11% (8.4 x)   88.11% (8.4 x)           88.11% (8.4 x)
adpcmdecode (seq.)   45.21% (1.8 x)   77.51% (4.5 x)     77.51% (4.5 x)   81.75% (5.5 x)           81.75% (5.5 x)
mpeg2decode (par.)   68.23% (3.1 x)   68.23% (3.1 x)     86.91% (7.6 x)   -                        87.92% (8.3 x)
mpeg2decode (seq.)   60.09% (2.5 x)   60.09% (2.5 x)     73.73% (3.8 x)   -                        74.40% (3.9 x)
pegwit (par.)        63.33% (2.7 x)   67.31% (3.1 x)     67.31% (3.1 x)   -                        -
pegwit (seq.)        38.99% (1.6 x)   42.96% (1.7 x)     42.96% (1.7 x)   -                        -

[Each cell on the slide also carried the number of basic blocks needed to reach 30% speedup (values from 1 to 10); the mapping of those numbers to cells did not survive in this transcript.]

Effects of ILP Techniques

- Major improvements: cumulative speedups between 1.7x and 6.3x
- Register file pressure not significantly modified
- Hardware complexity and Thw increased
  - Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
- Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  - Baseline advantage already very large

Arithmetic Optimisation Impact

mpeg2decode basic block #7            Tsw   Thw   Saved cycles   Saving (bb)
Without arithmetic transformations    55    5     25,344,000     30.6%
With arithmetic transformations       55    3     26,357,760     31.4%

[Figure: two panels, w/o optimisation and with optimisation]
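The two large figures in this table are consistent with (Tsw − Thw) multiplied by the same execution count, 506,880, in both rows. That count is not on the slide; it is inferred here from the numbers, so the sketch below should be read as a consistency check under that assumption rather than as the authors' definition.

```c
/* Cycles saved over the whole run, assuming the basic block executes
 * `executions` times and each execution saves (t_sw - t_hw) cycles. */
long long total_saved_cycles(long long t_sw, long long t_hw, long long executions)
{
    return (t_sw - t_hw) * executions;
}
```

Under this reading, cutting Thw from 5 to 3 cycles adds only 2 saved cycles per execution on top of 50, which matches the small move from 30.6% to 31.4%.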

Conclusions

- DFG-level specialisation opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk
- Larger number of AFU write ports (2-3) needed
- Hardcoding of constants not essential
- AFU memory interfaces also not essential
- ILP techniques help, as expected
- Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects

Ongoing Work

- Measure advantages through a complete toolchain (notably, compiler):
  - DSP microarchitecture:
    - Validate simple model
    - Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  - VLIW microarchitecture:
    - Go beyond simple software execution model
- Develop novel speedup-driven identification algorithms
- How to get more AFU specialisation potential
- Dynamic identification and configuration of AFUs

Typical Identification Algorithms

- Bottom-up greedy approaches to cluster instructions
- Topologically driven rather than speedup driven
- E.g., MaxMISO identification [Alippi et al., 1999]

[Figure: data-flow graph of + and * operations partitioned into three MaxMISOs, labelled MS1, MS2 and MS3]
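A bottom-up, topology-driven clustering of this kind can be sketched as follows. This is one common way to grow a maximal single-output cluster from a chosen output node, not necessarily the exact procedure of the cited paper; the `dfg` type, `MAX_NODES` and the function names are hypothetical. A node joins the cluster only when every one of its consumers is already inside, so the chosen root stays the only output.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* edge[u][v] is true when operation u feeds operation v. */
typedef struct {
    size_t n;
    bool edge[MAX_NODES][MAX_NODES];
} dfg;

/* True if v has at least one consumer and all of them are already in the cluster. */
static bool only_feeds_cluster(const dfg *g, size_t v, const bool *in_cluster)
{
    bool has_consumer = false;
    for (size_t w = 0; w < g->n; w++) {
        if (!g->edge[v][w]) continue;
        if (!in_cluster[w]) return false;
        has_consumer = true;
    }
    return has_consumer;
}

/* Grow the largest single-output cluster whose output is `root`.
 * Returns the cluster size; membership is written to in_cluster. */
size_t grow_miso(const dfg *g, size_t root, bool *in_cluster)
{
    for (size_t v = 0; v < g->n; v++)
        in_cluster[v] = false;
    in_cluster[root] = true;
    size_t size = 1;
    bool changed = true;
    while (changed) {                 /* repeat until no node can be added */
        changed = false;
        for (size_t v = 0; v < g->n; v++) {
            if (!in_cluster[v] && only_feeds_cluster(g, v, in_cluster)) {
                in_cluster[v] = true;
                size++;
                changed = true;
            }
        }
    }
    return size;
}
```

The growth looks only at the graph's shape and never at the delay of the operations, which is the sense in which such methods are topologically driven rather than speedup driven.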

Speedup-driven Identification

- Prune out an optimal set of low-speedup nodes to achieve the required input/output count
- SIMD-like and unconnected graphs

[Figure: data-flow graph with inputs i0 to i7 and outputs o0 and o1, one node labelled k, and nodes annotated with values 0.1, 0.5, 1, 2 and 3]
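The slide does not spell out the pruning procedure, but any such algorithm needs to check a candidate cluster against the input/output limits. The sketch below shows only that check, under assumptions: the graph layout, the names, and the rule that a node without consumers counts as a block result are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OPS 16

typedef struct {
    size_t n;
    bool edge[MAX_OPS][MAX_OPS];   /* edge[u][v]: u produces a value used by v */
} cluster_graph;

/* Distinct outside nodes whose value is read by some node of the cluster. */
size_t count_inputs(const cluster_graph *g, const bool *in_cluster)
{
    size_t inputs = 0;
    for (size_t u = 0; u < g->n; u++) {
        if (in_cluster[u]) continue;
        for (size_t v = 0; v < g->n; v++) {
            if (in_cluster[v] && g->edge[u][v]) { inputs++; break; }
        }
    }
    return inputs;
}

/* Cluster nodes whose value leaves the cluster: some consumer is outside,
 * or there is no consumer at all (treated here as a result of the block). */
size_t count_outputs(const cluster_graph *g, const bool *in_cluster)
{
    size_t outputs = 0;
    for (size_t u = 0; u < g->n; u++) {
        if (!in_cluster[u]) continue;
        bool any_consumer = false, escapes = false;
        for (size_t v = 0; v < g->n; v++) {
            if (!g->edge[u][v]) continue;
            any_consumer = true;
            if (!in_cluster[v]) escapes = true;
        }
        if (escapes || !any_consumer) outputs++;
    }
    return outputs;
}

/* True if the cluster respects the register-file port limits. */
bool fits_ports(const cluster_graph *g, const bool *in_cluster,
                size_t max_in, size_t max_out)
{
    return count_inputs(g, in_cluster) <= max_in &&
           count_outputs(g, in_cluster) <= max_out;
}
```

A pruning loop could call `fits_ports` after removing each low-speedup node until the cluster meets the required counts.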

Open Issues and Perspectives

- Power consumption advantages?
  - Power down because:
    - Fewer instruction fetches and decodes
    - Fewer register reads and writebacks
  - Power possibly up because:
    - Reduced correlation of signals in the AFU
    - Low efficiency of the implementation (in case of eFPGAs)
- More opportunities to increase speedup?
  - Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  - Detect runtime constant values

Dynamic Specialisation?

[Figure: Java Bytecode → JiT + Specialisation → ARM + RFU]

- Dynamic compilation and optimisation together with hardware specialisation
  - DAISY, Crusoe, JiT, etc.
- Specialisation may profit from runtime information
- Identification in runtime conditions
- Dynamic reconfigurability challenge

Conclusions

- Processor customisation opportunities are here: soft cores, FPGA processors, etc.
- Very specific field of hardware/software codesign with a very large potential
  - Do not give up versatility
  - Get most of the performance of custom hardware
- Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)