HW/SW Co-Design: System Partitioning in HW/SW Co-Design


DESCRIPTION

Bastian Knerr, June 6th, 2008. HW/SW Co-Design: System Partitioning in HW/SW Co-Design. Christian Doppler Laboratory for Design Methodology of Signal Processing Algorithms.

TRANSCRIPT

  • Outline

    HW/SW Codesign for Embedded Systems
    System Partitioning
    Heterogeneous Platforms
    Mapping Graphs to Platforms
    Heuristic Optimisation Methods for Multiple Objectives
    Summary

  • Embedded System Design

    An embedded system is a computing device that generally serves a specific purpose. Its implementation is predominantly determined by this purpose, which usually entails complete encapsulation in the environment where the purpose is located.

    Examples: automotive, phones/PDAs, transceivers (WIFI, WLAN, xDSL, ...)

  • Embedded System Design Flow


  • Heterogeneous Platforms: Classical HW/SW Codesign Platform

    Has been around for ~20 years
    Served well to get a first grip on partitioning
    Has not gained any relevance for industrial design flows

  • Heterogeneous Platforms: Modern rapid prototyping platforms

    Prototyping board for real-time MIMO OFDM
    DSP + Microcontroller
    FPGAs
    Busses and Bridges
    RAM and Registers
    Interfaces

  • Heterogeneous Platforms: Modern SoC/embedded platforms

    UMTS baseband transceiver chip (2003)
    DSP + Microcontroller
    ASICs
    Busses and Bridges
    RAM and Registers
    Interfaces

  • Heterogeneous Platforms: Library for

    DSPs: Cache/RAM, Schedules
    FPGA: RAM/Flash, Slices/Gates
    ASICs: Registers/Gates
    Channels: Fifo/Direct/Bus
    Memory: Schedules, Parallel read/write access


  • Mapping Graphs to Platforms: System Graphs

    [Figure: system graph of a receiver chain. Recoverable block labels: ADC; cell searcher (MF, EnAcc, PeakDet, PSCH, SSCH, GroupCode, 16x); rake receiver with fingers 1 to N (finger select, de-spread pilot, de-spread data, frequency offset estimation, weight gain, TPC/FBI); searchers, path select, TCU, delay profile estimator; demodulator (4-QAM demod, deinterleaver 1, desegment, deinterleaver 2, turbo decoder, Viterbi decoder, decoder switch); logical data processing (CRC); graph nodes root, BB1, exit.]

  • Mapping Graphs to Platforms

    [Figure: process graph with vertices v1 to v6.]

  • Mapping Graphs to Platforms

    NP-hard multi-objective optimisation problem
    Proven to be NP-complete by restriction to the classical graph partitioning problem

    [Figure: example graph with numeric vertex and edge weights, shown twice; the weight layout is not recoverable from the transcript.]


  • Heuristic Optimisation: Multi-objective optimisation problem

    A mapping of a problem instance I is called valid iff a condition on the objective functions and the constraints holds [formula missing]. The mapping relation [symbol missing] assigns a vertex i to the jth implementation alternative A on resource r.

    Objective functions:
    Area for HW in gates/slices/NAND2 equivalents [formula missing], with separate terms for ASICs and for FPGAs
    Code size for SW in bytes [formula missing], with a term for code size on DSPs
    ...

  • Heuristic Optimisation

    Objective function $f_T$: system delay (makespan)
    Multi-core scheduling is NP-hard as well

    [Figure: process graph (v1 to v6) mapped onto DSP, FPGA and ASIC, connected by a direct ASIC-FPGA link, a bus, shared RAM and an ASIC-DSP FIFO; the resulting schedule shows read, processing and write phases per resource, including the SDRAM schedule.]
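One way to evaluate the makespan objective for a given mapping is a greedy list schedule. The following is a minimal sketch, not the scheduler used in the talk. It assumes that each resource runs one process at a time, that communication delays are zero, and that processes are started in topological order. The function name `makespan` and the example timing values are hypothetical.

```python
from collections import defaultdict

def makespan(edges, exec_time, mapping):
    """Finish time of the last process under a greedy list schedule.

    edges     : list of (pred, succ) pairs of the process graph (a DAG)
    exec_time : exec_time[v][r] = execution time of process v on resource r
    mapping   : mapping[v] = resource chosen for process v
    Assumptions of this sketch: one process per resource at a time,
    zero communication delay, processes started in topological order.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in mapping}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indeg[v] += 1

    ready = sorted(v for v in mapping if indeg[v] == 0)
    finish = {}
    free_at = defaultdict(int)  # time at which each resource becomes idle
    while ready:
        v = ready.pop(0)
        r = mapping[v]
        # A process starts once its resource is idle and all inputs exist.
        start = max([free_at[r]] + [finish[p] for p in preds[v]])
        finish[v] = start + exec_time[v][r]
        free_at[r] = finish[v]
        for s in succs[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return max(finish.values())

# Hypothetical diamond-shaped process graph and timing values.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
exec_time = {v: {"DSP": 4, "ASIC": 1} for v in ("v1", "v2", "v3", "v4")}
all_dsp = {v: "DSP" for v in exec_time}
mixed = {"v1": "DSP", "v2": "DSP", "v3": "ASIC", "v4": "DSP"}
print(makespan(edges, exec_time, all_dsp), makespan(edges, exec_time, mixed))
```

Moving v3 to the ASIC lets it run in parallel with v2, which shortens the schedule. A realistic model would also have to account for the channel and memory schedules shown on the slide.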

  • Heuristic Optimisation: Definition

    A heuristic is a robust technique for the design of (randomised) algorithms for optimisation problems. It provides (randomised) algorithms for which one cannot guarantee both the efficiency and the quality of the computed feasible solutions at the same time, not even with any bounded constant probability P > 0.

  • Heuristic Optimisation

    Partitioning is not analytically solvable
    Use heuristic methods:
    Simulated Annealing
    Tabu Search
    Kernighan-Lin min-cut
    Genetic Algorithm
    Particle Swarm
    Custom Heuristics (GCLP, RRES, etc.)
    ...
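As an illustration of the first method in the list, here is a minimal simulated annealing sketch for a mapping problem. It assumes a mapping is a list with one implementation alternative per vertex and that a user-supplied `cost` function scores it. The geometric cooling schedule and all parameter values are assumptions of this sketch, not settings from the talk.

```python
import math
import random

def simulated_annealing(n_vertices, n_alternatives, cost, rng,
                        t0=10.0, alpha=0.95, steps=2000):
    """Minimise cost(mapping) by random single-vertex moves.

    Worse moves are accepted with probability exp(-delta / t), and the
    temperature t is lowered geometrically (assumed cooling schedule).
    """
    current = [rng.randrange(n_alternatives) for _ in range(n_vertices)]
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    t = t0
    for _ in range(steps):
        cand = list(current)
        cand[rng.randrange(n_vertices)] = rng.randrange(n_alternatives)
        c = cost(cand)
        if c <= current_cost or rng.random() < math.exp((current_cost - c) / t):
            current, current_cost = cand, c
            if c < best_cost:
                best, best_cost = list(cand), c
        t = max(t * alpha, 1e-9)  # keep t positive to avoid division by zero
    return best, best_cost

# Hypothetical toy cost: number of vertices that differ from a target mapping.
target = [0, 2, 1, 1, 0, 2, 2, 1]
toy_cost = lambda m: sum(a != b for a, b in zip(m, target))
best, best_cost = simulated_annealing(len(target), 3, toy_cost, random.Random(1))
print(best, best_cost)
```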

  • Heuristic Optimisation: Classical Kernighan-Lin min-cut

    Modifications:
    More than two partitions
    Unbalanced partitions allowed
    Multiple objectives
    Omit change list
    ...

    [Figure: graph with vertices A to G split into partitions; "Partition 2" is labelled.]
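For reference, one pass of the classical, unmodified Kernighan-Lin heuristic on two equally sized partitions can be sketched as follows. The modifications listed on the slide (more than two partitions, unbalanced partitions, multiple objectives, no change list) are not covered. The edge-weight dictionary `w` and the function names are assumptions of this sketch.

```python
def cut_size(w, part_a, part_b):
    """Total weight of the edges crossing the two partitions."""
    return sum(w.get(frozenset((a, b)), 0) for a in part_a for b in part_b)

def kl_pass(w, part_a, part_b):
    """One Kernighan-Lin pass; w maps frozenset({u, v}) to an edge weight."""
    a, b = set(part_a), set(part_b)
    cost = lambda u, v: w.get(frozenset((u, v)), 0)
    # D[v] = external cost minus internal cost of vertex v.
    d = {}
    for v in a | b:
        own, other = (a, b) if v in a else (b, a)
        d[v] = sum(cost(v, x) for x in other) - sum(cost(v, x) for x in own)
    free_a, free_b = set(a), set(b)
    swaps, gains = [], []
    while free_a and free_b:
        # Pick the unlocked pair with the largest gain D[x] + D[y] - 2 c(x, y).
        g, x, y = max(((d[x] + d[y] - 2 * cost(x, y), x, y)
                       for x in free_a for y in free_b), key=lambda t: t[0])
        swaps.append((x, y))
        gains.append(g)
        free_a.remove(x)
        free_b.remove(y)
        # Update D as if x and y had been exchanged.
        for v in free_a:
            d[v] += 2 * cost(v, x) - 2 * cost(v, y)
        for v in free_b:
            d[v] += 2 * cost(v, y) - 2 * cost(v, x)
    # Apply only the prefix of swaps with the largest positive total gain.
    best_k, best_sum, running = 0, 0, 0
    for k, g in enumerate(gains, start=1):
        running += g
        if running > best_sum:
            best_k, best_sum = k, running
    for x, y in swaps[:best_k]:
        a.remove(x)
        b.remove(y)
        a.add(y)
        b.add(x)
    return a, b

# Hypothetical 4-vertex graph: heavy edges 1-2 and 3-4, light edge 2-3.
w = {frozenset((1, 2)): 3, frozenset((3, 4)): 3, frozenset((2, 3)): 1}
a, b = kl_pass(w, {1, 3}, {2, 4})
print(cut_size(w, {1, 3}, {2, 4}), cut_size(w, a, b))
```

In the toy graph the initial cut crosses all three edges. After one pass only the light edge remains in the cut.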

  • Summary

    Scheduling/partitioning is a hard optimisation problem
    Heuristic methods have to be applied
    Highly dependent on platform model and high-level estimation techniques
    Many questions are still unsolved:
    Execution time profiles for processes (control flow)
    Estimation uncertainties
    Automated platform composition
    ...

  • Outline

    Thank you for your attention

  • Typical Graphs: Industry Design for xDSL Transceiver

    [Figure: signal-flow graph of the transceiver. Recoverable labels: receive path (ac_rx_*), transmit path (ac_tx_*), IM path (ac_im_*) and TH path (ac_th_*), built from FIR filters (5 and 9 taps), WDF blocks (labelled 1.O to 9.O), gain, trim and scaling stages, hold stages and sample-rate conversions between 8k, 16k, 32k, 256k and 16M; rounding from 24 bits (MSB) to 17 bits; delay elements z^-1 and z^-n, n = 0..24.]

    Possibly additional logic needed to reprogram the IM filter for Flexi Slic
    Possibly additional logic needed to reprogram the FIR filter for Flexi Slic

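  • Graph Properties

    Degree of parallelism: $|V| / |V_{CP}|$
    Density: $|E| / |V|$
    Rank-locality: $r_{loc} = \frac{1}{|E|} \sum_{e \in E} \left( \mathrm{rank}(v_{head}) - \mathrm{rank}(v_{tail}) \right)$

    Example graph (ranks 0 to 7): density $= 22/16 = 1.375$, $r_{loc} = 27/22 = 1.227$, degree of parallelism $= 16/8 = 2$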
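The three graph properties can be computed directly from an edge list. The sketch below assumes that rank(v) is the longest-path distance of v from any source vertex and that |V_CP| is the number of vertices on a longest (critical) path; both readings are assumptions, since the slide does not define them. The example graph is hypothetical.

```python
from collections import defaultdict

def graph_properties(vertices, edges):
    """Return (parallelism, density, rank_locality) of a DAG.

    edges are (tail, head) pairs. rank(v) is taken as the longest-path
    distance from any source (assumption of this sketch).
    """
    preds = defaultdict(list)
    for tail, head in edges:
        preds[head].append(tail)

    rank = {}
    def rank_of(v):
        if v not in rank:
            rank[v] = 1 + max((rank_of(p) for p in preds[v]), default=-1)
        return rank[v]
    for v in vertices:
        rank_of(v)

    n_cp = max(rank.values()) + 1           # vertices on a longest path
    parallelism = len(vertices) / n_cp      # |V| / |V_CP|
    density = len(edges) / len(vertices)    # |E| / |V|
    rloc = sum(rank[h] - rank[t] for t, h in edges) / len(edges)
    return parallelism, density, rloc

# Hypothetical diamond graph with one extra edge that skips a rank.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]
parallelism, density, rloc = graph_properties(vertices, edges)
print(parallelism, density, rloc)
```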

  • Restricted Range Exhaustive Search

    Create task graph
    Create ordered vector of processes
    Create initial mapping
    Start exhaustive search on a subset of processes (window)
    Move window along the vector
    Finally map the process that leaves the window

    Strong performance for typical graphs, depending on:
    Degree of parallelism
    Density
    Locality

    [Figure: task graph with vertices a to l and the corresponding vertex vector; processes inside the window are tentatively mapped, processes that have left the window are finally mapped.]
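The window mechanism can be sketched as follows. This is a simplified reading of the steps above, not the published RRES algorithm: it assumes every process has the same number of implementation alternatives, a user-supplied `cost` function that scores a complete mapping, and a fixed window length. All names are hypothetical.

```python
from itertools import product

def rres(order, n_alternatives, cost, initial, window=3):
    """Sliding-window exhaustive search over an ordered process vector.

    order          : processes in vector order
    n_alternatives : alternatives per process (assumed equal for all)
    cost           : function scoring a complete mapping (dict)
    initial        : initial mapping, process -> alternative
    """
    mapping = dict(initial)
    for start in range(len(order)):
        win = order[start:start + window]
        best, best_cost = None, float("inf")
        # Try every combination of alternatives inside the window.
        for choice in product(range(n_alternatives), repeat=len(win)):
            trial = dict(mapping)
            trial.update(zip(win, choice))
            c = cost(trial)
            if c < best_cost:
                best, best_cost = choice, c
        # win[0] leaves the window next and is thereby finally mapped;
        # the remaining window entries stay tentative and are searched again.
        mapping.update(zip(win, best))
    return mapping

# Hypothetical toy cost: number of processes that differ from a target mapping.
order = list("abcde")
target = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}
toy_cost = lambda m: sum(m[p] != target[p] for p in order)
result = rres(order, 2, toy_cost, {p: 0 for p in order})
print(result)
```

The search effort grows exponentially with the window length but only linearly with the number of processes, which is why the ordering of the vertex vector matters.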

  • Results

    [Plots: normalised relative cost as a function of parallelism and locality, with minimum-cost regions for GA, RRES and TS; averaged cost over window length W for ES and RRES; averaged validity for ES and RRES.]

  • The Genome Coding

    Arrange vertices on a string
    String elements (alleles) indicate the implementation alternative
    What about the order of the vertices? Does it matter?

    [Figure: genome of |V| vertices v1, v2, ..., v|V|; each position holds the Id of the chosen implementation alternative (alleles shown: 4, 1, 2, 1, 3, 2, 2).]

    List of implementation alternatives for v2:
    Id : A_{i,j}(v_k) = (cs, et, gc)
    1 : A_{0,0}(v2) = (50, 256, 0)
    2 : A_{0,1}(v2) = (40, 340, 0)
    3 : A_{1,0}(v2) = (88, 192, 0)
    4 : A_{1,1}(v2) = (72, 224, 0)
    5 : A_{3,0}(v2) = (0, 92, 880)
    ...
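Decoding such a genome amounts to a table lookup per vertex. The sketch below reads (cs, et, gc) as code size, execution time and gate count, which is an assumption about the abbreviations. The rows for v2 are copied from the list of implementation alternatives; the rows for v1 are invented for illustration only.

```python
# (cs, et, gc) read as (code size, execution time, gate count): assumption.
# The v2 rows come from the slide; the v1 rows are hypothetical.
ALTERNATIVES = {
    "v1": {1: (30, 120, 0), 2: (0, 40, 500)},
    "v2": {1: (50, 256, 0), 2: (40, 340, 0), 3: (88, 192, 0),
           4: (72, 224, 0), 5: (0, 92, 880)},
}

def decode(genome, vertex_order):
    """Look up each allele and add up the additive costs (code size, gates)."""
    chosen = {v: ALTERNATIVES[v][allele]
              for v, allele in zip(vertex_order, genome)}
    code_size = sum(cs for cs, _, _ in chosen.values())
    gate_count = sum(gc for _, _, gc in chosen.values())
    return chosen, code_size, gate_count

chosen, code_size, gate_count = decode([1, 5], ["v1", "v2"])
print(code_size, gate_count)
```

Execution times are deliberately not summed here, because the system delay depends on the schedule and not on a plain sum.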

  • Recombination with chromosomes

    1-point crossover
    Multi-point crossover
    Uniform crossover

    Why does it work?
    Fundamental schema theorem and the building block hypothesis
    Schema theorem:
    Short, low-order, above-average schemata (building blocks) proliferate
    Below-average schemata die off

    What makes schemata fit in system partitioning?

    [Figure: chromosome with positions a to m and a schema with five fixed alleles; defining length = 7, order o = 5, wildcard * = unspecified.]
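The three crossover variants differ only in how the exchange positions are chosen. A minimal sketch on list-valued genomes follows; the function names and the 50 % exchange probability for uniform crossover are assumptions of this sketch.

```python
import random

def one_point(p1, p2, rng):
    """Exchange the tails of both parents behind one random cut."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def multi_point(p1, p2, n_points, rng):
    """Exchange every second segment between n_points distinct cuts."""
    cuts = sorted(rng.sample(range(1, len(p1)), n_points))
    c1, c2 = list(p1), list(p2)
    swap, prev = False, 0
    for cut in cuts + [len(p1)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = p2[prev:cut], p1[prev:cut]
        swap = not swap
        prev = cut
    return c1, c2

def uniform(p1, p2, rng):
    """Exchange each gene independently with probability 0.5 (assumed)."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if rng.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2

rng = random.Random(0)
p1, p2 = [1] * 8, [2] * 8
print(one_point(p1, p2, rng))
print(multi_point(p1, p2, 3, rng))
print(uniform(p1, p2, rng))
```

With fewer cut points, genes that sit close together on the chromosome are more likely to be inherited together, which is the link between crossover type and building block size.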

  • Combinatorial vs. structural fitness

    Combinatorial (area, code size, time):
    Low resource consumption is ensured for any single vertex
    Combinations of assignments utilise resources optimally

    Structural (time):
    Exact graph matching between task and architecture subgraphs
    Parallel execution of processes and data transfers

    Structural fitness requires a representation in the chromosome
    Building blocks are short, low-order, and fit schemata

    [Figure: chromosome excerpt with positions e, f, g, h.]

  • Coding for structural exploitation

    Locality-preserving chromosome coding
    Adjacent vertices in the task graph shall be adjacent in the chromosome
    Use two schedules:
    As soon as possible (ASAP)
    As late as possible (ALAP)
    Arrange vertices $v_i$ in order of increasing average start times: $st_{avg}(v_i) = st_{asap}(v_i) + st_{alap}(v_i)$

    [Figure: task graph with vertices a to n, its ASAP and ALAP schedules over ranks 0 to 7, and the start times $st_{asap}(b)$ and $st_{alap}(b)$ marked for vertex b.]
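The ordering rule can be sketched for a small task graph as follows. The sketch assumes unit execution times for all vertices and breaks ties by vertex name; neither detail is stated on the slide. It sorts by the sum of ASAP and ALAP start times exactly as the formula is written, which yields the same order as sorting by their mean.

```python
from collections import defaultdict

def locality_order(vertices, edges):
    """Order vertices by st_asap + st_alap, assuming unit execution times."""
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    asap = {}
    def asap_of(v):
        # Earliest start: one step after the latest predecessor.
        if v not in asap:
            asap[v] = max((asap_of(p) + 1 for p in preds[v]), default=0)
        return asap[v]

    length = max(asap_of(v) for v in vertices)  # latest ASAP start time

    alap = {}
    def alap_of(v):
        # Latest start that does not delay any successor.
        if v not in alap:
            alap[v] = min((alap_of(s) - 1 for s in succs[v]), default=length)
        return alap[v]

    key = {v: asap_of(v) + alap_of(v) for v in vertices}
    order = sorted(vertices, key=lambda v: (key[v], v))  # ties by name
    return order, asap, alap

# Hypothetical graph: chain a-b-c-d plus a side branch a-e-d with slack.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]
order, asap, alap = locality_order(vertices, edges)
print(order)
```

Vertex e can start at time 1 or 2 without delaying d, so it is placed between b and c in the chromosome.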

  • Results: Impact of genome coding

    [Plot: cost for the genome codings new, rank and random.]

  • More results: Structural mutation

    1-gene mutation (M1g)
    Swap mutation (Msw)
    Multi-swap mutation (Mbb)

  • More results: Comparison with other heuristics

    Penalty reward tabu search (pwTS)
    Simulated annealing (SA)
    Global criticality/local phase (GCLP)

    [Plots: averaged cost and averaged validity.]

  • Conclusion

    A 3-operator GA has been implemented and analysed
    Structural problem components (time) have been exposed
    Genome coding: locality-preserving ordering
    Mutation: multi-swap mutation
    Crossover: depends heavily on building block size
    Comparison with heuristics from the literature showed superior performance of the GA over pwTS, in contrast to published work

  • Results: Related to crossover recombination

    [Plot: uniform, 10-point, 5-point and 1-point crossover for the genome codings new and random.]

  • More results: Selection over mutation probability

    Binary tournament (BT)
    Survival of the fittest (SOTF)
    Roulette wheel (RW)
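Two of the selection schemes have widely used textbook forms, sketched below for a cost that is to be minimised. The weighting 1 / (1 + cost) in the roulette wheel is an assumption of this sketch and requires non-negative costs. Survival of the fittest is left out because the slides do not define it.

```python
import random

def binary_tournament(population, cost, rng):
    """Draw two distinct individuals and return the cheaper one."""
    a, b = rng.sample(population, 2)
    return a if cost(a) <= cost(b) else b

def roulette_wheel(population, cost, rng):
    """Draw one individual with probability proportional to 1 / (1 + cost)."""
    weights = [1.0 / (1.0 + cost(ind)) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]

# Hypothetical population of three genomes; cost is the sum of alleles.
rng = random.Random(0)
population = [[0, 0], [1, 1], [2, 2]]
print(binary_tournament(population, sum, rng))
print(roulette_wheel(population, sum, rng))
```

A binary tournament can never return the worst individual of the population, whereas the roulette wheel can, only with lower probability.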
