System Partitioning in HW/SW Co-Design

*Outline
HW/SW Codesign for Embedded Systems
System Partitioning
  Heterogeneous Platforms
  Mapping Graphs to Platforms
  Heuristic Optimisation Methods for Multiple Objectives
Summary

*Embedded System Design
An embedded system is a computing device dedicated to a specific purpose. Its implementation is predominantly determined by this purpose, which usually entails complete encapsulation into the environment in which that purpose is served.
Examples:
  Automotive
  Phones/PDAs
  Transceivers (WIFI, WLAN, xDSL, ...)

*Embedded System Design Flow

*Outline
Embedded System Design
System Partitioning
  Heterogeneous Platforms
  Mapping Graphs to Platforms
  Heuristic Optimisation for Multiple Objectives
Summary

*Heterogeneous Platforms
Classical HW/SW Codesign Platform
  Has been around for ~20 years
  Served well to get a first grip on partitioning
  Has not gained any relevance for industrial design flows

*Heterogeneous Platforms
Modern rapid prototyping platforms
Prototyping board for real-time MIMO OFDM
  DSP + Microcontroller
  FPGAs
  Busses and Bridges
  RAM and Registers
  Interfaces

*Heterogeneous Platforms
Modern SoC/embedded platforms
UMTS baseband transceiver chip (2003)
  DSP + Microcontroller
  ASICs
  Busses and Bridges
  RAM and Registers
  Interfaces

*Heterogeneous Platforms
Library for
  DSPs: Cache/RAM, Schedules
  FPGA: RAM/Flash, Slices/Gates
  ASICs: Registers/Gates
  Channels: Fifo/Direct/Bus
  Memory: Schedules, Parallel read/write access




*Mapping Graphs to Platforms
System Graphs
[Figure: system graph of a baseband receiver. Recoverable block labels: ADC, Cell Searcher (MF, EnAcc, PeakDet, PSCH, SSCH, GroupCode; 16x), Rake receiver (Finger 1, Finger 2, ... Finger N; Finger select, De-spread Pilot, De-spread Data, FreqOffsetEst, Weight Gain, TPC/FBI), Delay Profile Estimator (Searcher, Path Select), TCU, Demodulator (4 QAM Demod, Deinterleaver 1, Desegment, Deinterleaver 2, Turbo Decoder, Viterbi Dec, Switch), Logical Data Processing, CRC; control-flow nodes root, BB1, exit]

*Mapping Graphs to Platforms
[Figure: process graph with vertices v1 to v6]

*Mapping Graphs to Platforms
NP-hard multi-objective optimisation problem
Proven to be NP-complete by restriction to the classical graph partitioning problem
[Figure: weighted graph partitioning example, shown twice with the same vertex and edge weights]




*Heuristic Optimisation
Multi-objective optimisation problem
A mapping of a problem instance I is called valid iff [...], with [...] being objective functions and [...] being constraints.
[...]: the mapping relation of a vertex i to the j-th implementation alternative A on resource r.
Objective functions:
  Area for HW in gates/slices/NAND2 equivalents ([...]): [...], with [...] for ASICs, [...] for FPGAs
  Code size for SW in bytes ([...]): [...], with [...] for code size on DSPs
  ...
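The area and code-size objectives can be illustrated with a minimal sketch. The formulas on the slide did not survive, so everything here is an assumption: each objective is taken to be a plain sum over the implementation alternatives chosen for the vertices, constraints are taken to be upper bounds, and the names `Alternative`, `area`, `code_size` and `is_valid` are hypothetical.

```python
from dataclasses import dataclass

HW_RESOURCES = {"ASIC", "FPGA"}   # assumed hardware resource names
SW_RESOURCES = {"DSP"}            # assumed software resource names


@dataclass(frozen=True)
class Alternative:
    """One implementation alternative of a vertex (hypothetical structure)."""
    resource: str     # resource the alternative runs on
    code_size: int    # bytes, relevant on software resources
    exec_time: int    # execution time in arbitrary units
    gate_count: int   # gates / slices / NAND2 equivalents, relevant on hardware


def area(mapping):
    """Sum of gate counts of all vertices mapped to hardware."""
    return sum(a.gate_count for a in mapping.values() if a.resource in HW_RESOURCES)


def code_size(mapping):
    """Sum of code sizes of all vertices mapped to software."""
    return sum(a.code_size for a in mapping.values() if a.resource in SW_RESOURCES)


def is_valid(mapping, max_area, max_code_size):
    """Assumed validity test: every objective stays within its upper bound."""
    return area(mapping) <= max_area and code_size(mapping) <= max_code_size


# Toy mapping of three vertices (values are illustrative only).
mapping = {
    "v1": Alternative("DSP", 50, 256, 0),
    "v2": Alternative("ASIC", 0, 92, 880),
    "v3": Alternative("DSP", 40, 340, 0),
}
print(area(mapping), code_size(mapping), is_valid(mapping, 1000, 100))
```

A real cost function would typically weight the individual objectives against each other; that weighting is not recoverable from the slide and is left out here.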

*Heuristic Optimisation
Objective function f_T: system delay (makespan)
Multi-core scheduling is NP-hard as well
[Figure: process graph (v1 to v6) mapped onto DSP, FPGA and ASIC resources; communication via a direct ASIC-FPGA channel, a bus, an ASIC-DSP FIFO and shared RAM/SDRAM; the resulting schedule distinguishes read, write and processing phases]
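Because optimal multi-core scheduling is NP-hard, the makespan of a given mapping is usually estimated with a scheduling heuristic. The following is a minimal sketch of such an estimate, assuming a greedy list scheduler in topological order, one process at a time per resource, and zero communication cost. The function name `makespan` and the toy data are hypothetical and not taken from the slides.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def makespan(preds, exec_time, resource_of):
    """Greedy estimate of the system delay for a fixed mapping.

    preds:       vertex -> set of predecessor vertices
    exec_time:   vertex -> execution time on its assigned resource
    resource_of: vertex -> name of the resource it is mapped to
    """
    finish = {}    # finish time of every scheduled vertex
    free_at = {}   # time at which each resource becomes idle
    for v in TopologicalSorter(preds).static_order():
        # A vertex can start once all predecessors are done ...
        ready = max((finish[p] for p in preds.get(v, ())), default=0)
        # ... and its resource is idle (resources are assumed sequential).
        start = max(ready, free_at.get(resource_of[v], 0))
        finish[v] = start + exec_time[v]
        free_at[resource_of[v]] = finish[v]
    return max(finish.values(), default=0)


preds = {"v1": set(), "v2": {"v1"}, "v3": {"v1"}, "v4": {"v2", "v3"}}
exec_time = {"v1": 2, "v2": 3, "v3": 4, "v4": 1}

parallel = {"v1": "DSP", "v2": "DSP", "v3": "FPGA", "v4": "DSP"}
serial = {v: "DSP" for v in preds}
print(makespan(preds, exec_time, parallel))  # v2 and v3 overlap
print(makespan(preds, exec_time, serial))    # everything runs back to back
```

The greedy schedule is only an estimate; a different vertex order on a shared resource may yield a shorter makespan.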

*Heuristic Optimisation
Definition: A heuristic is a robust technique for the design of (randomised) algorithms for optimisation problems. It provides (randomised) algorithms for which one cannot guarantee both the efficiency and the quality of the computed feasible solutions at once, not even with any bounded constant probability P > 0.

*Heuristic Optimisation
Partitioning is not solvable analytically
Use heuristic methods
  Simulated Annealing
  Tabu Search
  Kernighan-Lin min-cut
  Genetic Algorithm
  Particle Swarm
  Custom Heuristics (GCLP, RRES, etc.)
  ...

*Heuristic Optimisation
Classical Kernighan-Lin min-cut
Modifications
  More than two partitions
  Unbalanced partitions allowed
  Multiple objectives
  Omit change list
  ...
[Figure: example graph with vertices A to G; Partition 2 marked]
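For reference, one pass of the classical two-way Kernighan-Lin min-cut can be sketched as follows. This covers only the textbook algorithm (two balanced partitions, a single cut-cost objective), not the modifications listed on the slide. The edge weights and function names are hypothetical, and each undirected edge is assumed to be stored once.

```python
def cut_cost(weights, part_a):
    """Total weight of edges crossing the partition boundary."""
    return sum(w for (u, v), w in weights.items() if (u in part_a) != (v in part_a))


def kl_pass(weights, part_a, part_b):
    """One Kernighan-Lin pass; returns (new_a, new_b, achieved_gain)."""
    def w(u, v):
        return weights.get((u, v), 0) + weights.get((v, u), 0)

    a, b = set(part_a), set(part_b)
    side = {**{v: 0 for v in a}, **{v: 1 for v in b}}
    # D value = external cost minus internal cost of each vertex.
    d = {v: sum(w(v, u) * (1 if side[u] != side[v] else -1) for u in side if u != v)
         for v in side}
    free_a, free_b = set(a), set(b)
    swaps, gains = [], []
    while free_a and free_b:
        # Pick the unlocked pair with the highest swap gain, then lock it.
        gain, x, y = max(((d[x] + d[y] - 2 * w(x, y), x, y)
                          for x in free_a for y in free_b), key=lambda t: t[0])
        free_a.remove(x)
        free_b.remove(y)
        swaps.append((x, y))
        gains.append(gain)
        # Update D values as if x and y had been exchanged.
        for v in free_a:
            d[v] += 2 * w(v, x) - 2 * w(v, y)
        for v in free_b:
            d[v] += 2 * w(v, y) - 2 * w(v, x)
    # Apply only the prefix of swaps with the largest cumulative gain.
    best_k, best_sum, running = 0, 0, 0
    for k, g in enumerate(gains, start=1):
        running += g
        if running > best_sum:
            best_k, best_sum = k, running
    for x, y in swaps[:best_k]:
        a.remove(x); a.add(y)
        b.remove(y); b.add(x)
    return a, b, best_sum


weights = {("A", "B"): 3, ("A", "C"): 3, ("B", "C"): 3,
           ("D", "E"): 3, ("D", "F"): 3, ("E", "F"): 3, ("C", "D"): 1}
new_a, new_b, gain = kl_pass(weights, {"A", "B", "D"}, {"C", "E", "F"})
print(sorted(new_a), sorted(new_b), gain, cut_cost(weights, new_a))
```

In practice the pass is repeated until it no longer yields a positive gain.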

*Summary
Scheduling/partitioning is a hard optimisation problem
Heuristic methods have to be applied
Highly dependent on platform model and high-level estimation techniques
Many questions still unsolved
  Execution time profiles for processes (control flow)
  Estimation uncertainties
  Automated platform composition
  ...

*Outline

Thank you for your attention

Typical Graphs: Industry Design for xDSL Transceiver
[Figure: signal-flow graph of an xDSL transceiver with rx, tx, im and th paths. Recoverable elements: FIR filters (5 and 9 taps), WDF blocks labelled 1.O to 9.O (high-pass, low-pass, all-pass), gain and trim stages (trimming gain 0 dB +/- 1.x dB), scaling by 2, 4 and 8, rounding from 24 to 17 bits, and sample-rate conversions between 8k, 16k, 32k, 256k and 16M. Annotations: additional logic may be needed to reprogram the IM filter and the FIR filter for Flexi Slic.]

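Graph Properties
Degree of parallelism: $|V| / |V_{CP}| = 16 / 8 = 2$
Density: $|E| / |V| = 22 / 16 = 1.375$
Rank-locality: $r_{loc} = \frac{1}{|E|} \sum_{e \in E} \left( \mathrm{rank}(v_{head}) - \mathrm{rank}(v_{tail}) \right) = 27 / 22 = 1.227$
[Figure: example graph with |V| = 16 and |E| = 22, vertices arranged on ranks 0 to 7]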
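The three graph properties can be computed with a short sketch. It rests on assumptions not stated on the slide: the rank of a vertex is taken as the length of the longest path from any source, |V_CP| is taken as the number of vertices on a longest path (maximum rank plus one), and the degree of parallelism is taken as |V| / |V_CP|, which matches the worked value 16 / 8 = 2. The function names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def ranks(vertices, edges):
    """Rank = longest-path distance from a source vertex (assumed definition)."""
    preds = {v: set() for v in vertices}
    for tail, head in edges:
        preds[head].add(tail)
    rank = {}
    for v in TopologicalSorter(preds).static_order():
        rank[v] = max((rank[p] + 1 for p in preds[v]), default=0)
    return rank


def graph_properties(vertices, edges):
    """Return (degree of parallelism, density, rank-locality) of a DAG."""
    rank = ranks(vertices, edges)
    n_critical = max(rank.values()) + 1          # assumed meaning of |V_CP|
    parallelism = len(vertices) / n_critical
    density = len(edges) / len(vertices)
    rloc = sum(rank[head] - rank[tail] for tail, head in edges) / len(edges)
    return parallelism, density, rloc


# Toy diamond graph with one extra long edge a -> d (edges are (tail, head)).
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]
print(graph_properties(vertices, edges))
```

A rank-locality close to 1 means that most edges connect vertices on neighbouring ranks.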

Restricted Range Exhaustive Search
Create task graph
Create ordered vector of processes
Create initial mapping
Start exhaustive search on subset of processes (window)
Move window along the vector
Finally map process that leaves the window
Strong performance for typical graphs
  Degree of parallelism
  Density
  Locality
[Figure: task graph with vertices a to l and the corresponding vertex vector, showing finally mapped vertices behind the window and tentatively mapped vertices inside it]
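The window mechanism can be sketched in a few lines. This is a minimal interpretation of the steps listed above, not the authors' implementation: the cost function, the alternative sets and the name `rres` are hypothetical, and the window is assumed to advance by one position per step.

```python
from itertools import product


def rres(order, alternatives, cost, initial, window):
    """Restricted range exhaustive search over an ordered vertex vector.

    order:        ordered vector of processes
    alternatives: process -> list of implementation alternatives
    cost:         function(mapping) -> number, lower is better
    initial:      initial mapping, process -> alternative
    window:       number of processes searched exhaustively at once
    """
    mapping = dict(initial)
    for start in range(max(len(order) - window, 0) + 1):
        win = order[start:start + window]
        # Exhaustively try every combination inside the window,
        # keeping all processes outside the window fixed.
        best = min(product(*(alternatives[v] for v in win)),
                   key=lambda combo: cost({**mapping, **dict(zip(win, combo))}))
        # win[0] leaves the window next and keeps this assignment for good;
        # the others are only tentative and are searched again.
        mapping.update(zip(win, best))
    return mapping


# Toy example with a separable cost: SW costs time, HW costs area.
sw_time = {"a": 4, "b": 1, "c": 5, "d": 2}
hw_area = {"a": 3, "b": 3, "c": 2, "d": 3}


def toy_cost(mapping):
    return sum(sw_time[v] if r == "SW" else hw_area[v] for v, r in mapping.items())


order = ["a", "b", "c", "d"]
alts = {v: ["SW", "HW"] for v in order}
result = rres(order, alts, toy_cost, {v: "SW" for v in order}, window=2)
print(result, toy_cost(result))
```

The search effort grows exponentially with the window length but only linearly with the number of processes, which is the point of restricting the range.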

Results
Normalised relative cost = f(parallelism, locality)
[Figure: plots of normalised relative cost, averaged cost over window length W, and averaged validity, comparing GA, RRES, TS and ES]

*The Genome Coding
Arrange vertices on a string
String elements (alleles) indicate the implementation alternative
What about the order of the vertices? Does it matter?
[Figure: genome of |V| vertices v1, v2, ..., v|V|; each position holds the Id of the implementation alternative chosen for that vertex]
List of implementation alternatives for v2, in the form Id : Ai,j(vk) = (cs, et, gc)
  1 : A0,0(v2) = (50, 256, 0)
  2 : A0,1(v2) = (40, 340, 0)
  3 : A1,0(v2) = (88, 192, 0)
  4 : A1,1(v2) = (72, 224, 0)
  5 : A3,0(v2) = ( 0, 92, 880)
  ...
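A genome of this kind can be decoded with a simple lookup. The sketch below assumes that (cs, et, gc) stands for code size, execution time and gate count, which the slide does not spell out. The entries for v2 are the ones listed above; those for v1 and v3, as well as all names, are hypothetical.

```python
# Alternatives per vertex: Id -> (cs, et, gc).
ALTERNATIVES = {
    "v1": {1: (30, 100, 0), 2: (0, 40, 400)},            # hypothetical values
    "v2": {1: (50, 256, 0), 2: (40, 340, 0), 3: (88, 192, 0),
           4: (72, 224, 0), 5: (0, 92, 880)},             # values from the slide
    "v3": {1: (20, 80, 0), 2: (0, 30, 250)},              # hypothetical values
}
VERTEX_ORDER = ["v1", "v2", "v3"]   # position of each vertex on the string


def decode(genome):
    """Translate a list of allele Ids into vertex -> (cs, et, gc)."""
    if len(genome) != len(VERTEX_ORDER):
        raise ValueError("genome length must equal the number of vertices")
    return {v: ALTERNATIVES[v][allele] for v, allele in zip(VERTEX_ORDER, genome)}


def totals(genome):
    """Total code size and total gate count of a decoded genome."""
    chosen = list(decode(genome).values())
    return sum(cs for cs, _, _ in chosen), sum(gc for _, _, gc in chosen)


genome = [2, 5, 1]
print(decode(genome))
print(totals(genome))
```

Changing `VERTEX_ORDER` does not change what a genome can express, only which vertices sit next to each other on the string.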

*Recombination with chromosomes
1-point crossover
Multi-point crossover
Uniform crossover
Why does it work?
  Fundamental schema theorem and the building block hypothesis
  Schema theorem
    Short, low-order, above-average schemata (building blocks) proliferate
    Below-average schemata die off
What makes schemata fit in system partitioning?
[Figure: chromosome with positions a to m and an example schema; defining length = 7, order o = 5, wildcard * = unspecified]
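The three crossover operators and the two schema measures can be sketched as follows. These are the textbook definitions rather than code from the talk. The example schema is hypothetical and was chosen only so that it has order 5 and defining length 7 like the one on the slide.

```python
import random


def one_point(p1, p2, rng):
    """Exchange the tails of two parents behind one random cut."""
    cut = rng.randrange(1, len(p1))
    return list(p1[:cut]) + list(p2[cut:]), list(p2[:cut]) + list(p1[cut:])


def multi_point(p1, p2, points, rng):
    """Exchange every second segment between `points` random cuts."""
    cuts = sorted(rng.sample(range(1, len(p1)), points))
    c1, c2 = list(p1), list(p2)
    swap, prev = False, 0
    for cut in cuts + [len(p1)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = p2[prev:cut], p1[prev:cut]
        swap, prev = not swap, cut
    return c1, c2


def uniform(p1, p2, rng):
    """Decide for every gene independently which parent it comes from."""
    mask = [rng.random() < 0.5 for _ in p1]
    c1 = [b if m else a for a, b, m in zip(p1, p2, mask)]
    c2 = [a if m else b for a, b, m in zip(p1, p2, mask)]
    return c1, c2


def schema_order(schema):
    """Number of fixed (non-wildcard) positions."""
    return sum(g != "*" for g in schema)


def defining_length(schema):
    """Distance between the first and the last fixed position."""
    fixed = [i for i, g in enumerate(schema) if g != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


rng = random.Random(0)
parent1, parent2 = [1] * 13, [2] * 13
print(one_point(parent1, parent2, rng))
print(multi_point(parent1, parent2, 5, rng))
print(uniform(parent1, parent2, rng))

schema = "**1*6**412***"   # hypothetical example, 13 positions
print(schema_order(schema), defining_length(schema))
```

A schema with a short defining length is less likely to be cut by 1-point crossover, which is why the placement of related genes on the string matters.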

*Combinatorial vs. structural fitness
Combinatorial (area, code size, time)
  Low resource consumption is ensured for any single vertex
  Combinations of assignments utilise resources optimally
Structural (time)
  Exact graph matching between task and architecture subgraphs
  Parallel execution of processes and data transfers
Structural fitness requires a representation in the chromosome
Building blocks are short, low-order, and fit schemata
[Figure: subgraph with vertices e, f, g, h]

*Coding for structural exploitation
Locality-preserving chromosome coding
  Adjacent vertices in the task graph shall be adjacent in the chromosome
  Use two schedules
    As soon as possible (ASAP)
    As late as possible (ALAP)
  Arrange vertices vi in increasing average start times: stavg(vi) = stasap(vi) + stalap(vi)
[Figure: task graph with vertices a to n, its ASAP and ALAP schedules over ranks 0 to 7 with stasap(b) and stalap(b) marked, and the resulting chromosome order]
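The ordering rule can be sketched as below. The sketch assumes unit-free execution times, unlimited resources for both schedules, and an ALAP schedule that ends at the ASAP makespan. It uses the sum of the two start times as the sort key, as written on the slide; dividing by two would give the same order. The function name and the toy graph are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def locality_order(preds, exec_time):
    """Order vertices by st_asap + st_alap; returns (order, asap, alap)."""
    topo = list(TopologicalSorter(preds).static_order())
    succs = {v: set() for v in topo}
    for v in topo:
        for p in preds.get(v, ()):
            succs[p].add(v)

    # ASAP: start as soon as all predecessors have finished.
    asap = {}
    for v in topo:
        asap[v] = max((asap[p] + exec_time[p] for p in preds.get(v, ())), default=0)
    end = max(asap[v] + exec_time[v] for v in topo)

    # ALAP: start as late as possible without delaying any successor
    # or exceeding the ASAP makespan.
    alap = {}
    for v in reversed(topo):
        alap[v] = min((alap[s] for s in succs[v]), default=end) - exec_time[v]

    key = {v: asap[v] + alap[v] for v in topo}
    # Ties are broken by vertex name to keep the order deterministic.
    return sorted(topo, key=lambda v: (key[v], v)), asap, alap


preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": {"a"}}
exec_time = {v: 1 for v in preds}
order, asap, alap = locality_order(preds, exec_time)
print(order)
```

Vertex e has slack (ASAP start 1, ALAP start 2), so it is placed between the b/c level and d instead of directly next to b and c.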

*Results
Impact of genome coding
[Figure: cost for the new, rank and random genome codings]

*More results
Structural mutation
  1-gene mutation (M1g)
  Swap mutation (Msw)
  Multi-swap mutation (Mbb)

*More results
Comparison with other heuristics
  Penalty reward tabu search (pwTS)
  Simulated annealing (SA)
  Global criticality/local phase (GCLP)
[Figure: averaged cost and averaged validity]

*Conclusion
3-operator GA has been implemented and analysed
Structural problem components (time) have been exposed
  Genome coding: locality-preserving ordering
  Mutation: multi-swap mutation
  Crossover: depends heavily on building block size
Comparison with heuristics from literature showed superior performance of GA over pwTS
  In contrast to published work

*Results
Related to crossover recombination
[Figure: results for uniform, 10-point, 5-point and 1-point crossover with the new and the random genome coding]

*More results
Selection over mutation probability
  Binary tournament (BT)
  Survival of the fittest (SOTF)
  Roulette wheel (RW)
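The three selection schemes can be sketched in their common textbook form for a cost that is to be minimised. The slides do not define them in detail, so the fitness transform 1 / (1 + cost) in the roulette wheel, the assumption of non-negative costs, and all function names are illustrative choices.

```python
import random


def binary_tournament(population, cost, rng):
    """Draw two distinct individuals at random and keep the cheaper one."""
    a, b = rng.sample(population, 2)
    return a if cost(a) <= cost(b) else b


def roulette_wheel(population, cost, rng):
    """Draw one individual with probability proportional to its fitness.

    Fitness is assumed to be 1 / (1 + cost), which requires cost >= 0.
    """
    weights = [1.0 / (1.0 + cost(ind)) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]


def survival_of_the_fittest(population, cost, survivors):
    """Keep only the `survivors` cheapest individuals (truncation)."""
    return sorted(population, key=cost)[:survivors]


rng = random.Random(0)
population = [[1, 1], [2, 2], [3, 3]]
print(binary_tournament(population, sum, rng))
print(roulette_wheel(population, sum, rng))
print(survival_of_the_fittest(population, sum, 2))
```

Binary tournament depends only on the ranking of costs, whereas the roulette wheel is sensitive to how costs are scaled into fitness values.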
