communication issues on mpsoc platforms: performance ... ws at...artist2 artist2 embedded systems...

Post on 17-Jan-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ARTIST2ARTIST2 Embedded Systems Design

Communication Issues on Communication Issues on MPSoCMPSoCPlatforms: Performance, Power Platforms: Performance, Power

and Predictabilityand Predictability

Luca Benini:ARTIST2 Activity LeaderLuca Benini:ARTIST2 Activity LeaderDEIS DEIS –– UniversitUniversitàà didi BolognaBologna

http://www.artist-embedded.org/FP6/

ARTIST Workshop at DATE’06W4: “Design Issues in Distributed,

Communication-Centric Systems”

ARTIST2ARTIST2 Embedded Systems Design

On-chip interconnect delay trend

Relative delay is growing even for optimized interconnects

ARTIST2ARTIST2 Embedded Systems Design

Wire delay vs. logic delay

Taken from W.J. Dally presentation: Computer architecture is all about interconnect (it is now and it will be more so in 2010) HPCA Panel February 4, 2002

ARTIST2ARTIST2 Embedded Systems Design

Unpredicatability & on-chip communication

Global behaviordepends on a set of interacting local behaviorscommunication is key for interaction

Unpredictable communicationunpredictable global behaviorEVEN IF local behavior is predictable

… … ……

interconnectmemarb.

interconnect

interconnect

ext.mem

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

cpu $ cpu$

Communication predictability is criticaleven in a NON-HRT context

ARTIST2ARTIST2 Embedded Systems Design

On-chip bus Architecture

Many alternativesLarge semiconductor firms (e.g. IBM Coreconnect, STMicroSTBus)Core vendors (e.g. ARM AMBA)Interconnect IP vendors (e.g. SiliconBackplane)

Same topology, different protocolsShared medium → contention → predictability losses

How do we support predictability in high-performance on-chip busses?

ARTIST2ARTIST2 Embedded Systems Design

AMBA bus

AHB: high-speed high-bandwidthmulti-master bus

APB: Simplified processor forgeneral purpose peripherals

System-PeripheralBusCPU

EU IO

EU MemMem

CPU

AMBA High-speed bus Bridge

Master portSlave port

ARTIST2ARTIST2 Embedded Systems Design

AMBA basic transfer

For a write

For a read

Pipelining increasesBus bandwidth

ARTIST2ARTIST2 Embedded Systems Design

Bus arbitraton

ARBITER

Dedicated wires

Shared address bus

HBREQ_M3

HBREQ_M2

HBREQ_M1

Arbitration Protocol is defined, but Arbitration Policy is not

Priority based protocols are the most common (round robin if priorities are not set)

ARTIST2ARTIST2 Embedded Systems Design

The price for arbitration

Time for arbitrationTime for handshaking

Wait state

ARTIST2ARTIST2 Embedded Systems Design

Critical analysis: bottlenecks

ProtocolLacks parallelism

In order completionNo multiple outstanding transactions: cannot hide slave wait states

High arbitration overhead (on single-transfers)Bus-centric vs. transaction-centric

Initiators and targets are exposed to bus architecture (e.g. arbiter)

Predictability losses come fromArbitrationBlocking behavior (in-order completion)

ARTIST2ARTIST2 Embedded Systems Design

STBUS

On-chip interconnect solution by STLevel 1-3: increasing complexity (and performance)

FeaturesHigher parallelism: 2 channels (M-S and S-M)Multiple outstanding transactions with out-of order completionSupports deep pipeliningSupports Packets (request and response) for multiple data transfersSupport for protection, caches, locking

Deployed in a number of large-scale SoCs in STM

ARTIST2ARTIST2 Embedded Systems Design

STBUS Protocol (Type 3)

Target

Initiator port Target port

Initiator

Request channel

Response channel

Transaction

Req Packet Resp Packet

Cell level

Packet level

Transaction level

Signal level

ARTIST2ARTIST2 Embedded Systems Design

STBUS bottlenecks

Protocol is not fully transaction-centricCannot connect initiator to target (e.g. initiator does not have control flow on the response channel)

Packets are atomic on the interconnectCannot initiate/receive multiple packets at the same timeLarge data transfers may starve other initiators

Predictability lossesCaused by congestion and by atomic blocking

ARTIST2ARTIST2 Embedded Systems Design

AMBA AXI

Latest (2003) evolution of AMBAAdvanced eXtensible Interface

FeaturesFully transaction centric: can connect M to S with nothing in betweenHigher parallelism: multiple channelsSupports bus-based power managementSupport for protection, caches, locking

DeploymentBecoming widespread

ARTIST2ARTIST2 Embedded Systems Design

Multi-channel M-S interface

Master

Slave

Address Channel

Write channel

Read channel

Write response ch.

VALID

DATA

READY

Channel hanshaking

4 parallel channels are available!

ARTIST2ARTIST2 Embedded Systems Design

Protocol scalability

2 wait states memory

AHB

STBUSlow buf

STBUShigh buf

AXI

Cannot hidearbitration and slave

responselatency

One new request processedwhile a response is in progress

More requests processedwhile a response is in progress

Interleaving of transfers onthe internal data lanes

More avanced protocols have less blocking…

ARTIST2ARTIST2 Embedded Systems Design

Execution time analysis

Advanced protocols provide better predictabilityBut: more area, higher latency, higher power consumption

AHB AXI STBus STBus (B)

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

2 Cores4 Cores6 Cores8 Cores

Rel

ativ

e ex

ecut

ion

time

AHB AXI STBus STBus (B)

0%10%20%30%40%50%60%70%80%90%

100%110%120%130%140%150%160%170%180%

2 Cores4 Cores6 Cores8 Cores

Rel

ativ

e ex

ecut

ion

time

1 kB cache (low bus traffic) 256 B cache (high bus traffic)

ARTIST2ARTIST2 Embedded Systems Design

The additive model

Unary model

Max bandwidthBus

time

t1 t2 t1 t2 t1 t2

Coarse-grain additive model

time

BusMax bandwidth

t1

t2Achievable bandwidth

Characterize the range of applicability of the modelForce the system to work under the additive regime.

ARTIST2ARTIST2 Embedded Systems Design

Tuning of the additive model

Requesting more than TH% of the theoretical maximum bandwidth causes the additive model to fail;

Generalized to a flow-based model for multi-hop networksBenefits of working in additive regime:

Task execution time almost independent of bus utilization;Performance predictability greatly enhanced.

Works well but imposes under-utilizationBreaks down for long, non-interruptible data bursts No deterministic guarantees on single transactions

Theoretical max BW

ARTIST2ARTIST2 Embedded Systems Design

Alternative approach: static slot allocation

Increase utilization – guaranteed delivery latencyCase study: Æthereal multi-hop interconnect (NoC)

Conceptually, two disjoint networksa network with slotted time contention-free arbitration (guaranteed services)a network with contention and buffering (best effort)

Several types of commitment in the networkcombine guaranteed worst-case behaviourwith good average resource usage

priority/arbitration

best-effortrouter

guaranteedrouter

programming

ARTIST2ARTIST2 Embedded Systems Design

Router architecture

Best-effort routerWorm-hole routingInput queueingSource routing

Guaranteed throughput routerContention-free routing

synchronous, using slot tablestime-division multiplexed circuits

Store-and-forward routingHeaderless packets

information is present in slot table

ARTIST2ARTIST2 Embedded Systems Design

Contention-free routing

Latency guarantees are easy in circuit switchingEmulate circuits with packet switchingSchedule packet injection in networksuch that they never contend for same link at same time

in space: disjoint pathsin time: time-division multiplexingor a combination

In a basic, shared bus context: static slot allocation protocol (e.g. SONICS)

t

ARTIST2ARTIST2 Embedded Systems Design

router 1

router 3

router 2networkinterface

networkinterface

networkinterface

1

1

2

1 2

1

3 3

input 2 for router 1 isoutput 1 for router 2

the input routed to the output at this slot

3 3

1

1

2

2

1

1

2

2

1

1

2

2

2

2

1

1

2

2

1

1

4 4

-

-

-

-

o1

-

i1

i1

-

o2

-

-

-

-

o3

-

-

-

-

o4

-

-

i4

-

o1

-

-

-

-

o2

-

-

-

-

o3

i1

-

-

-

o4 Use slots to• avoid contention• divide up bandwidth

-

-

-

-

o1

-

-

-

-

o3

-

i1

-

-

o4

i1

-

-

o2

i3

CFR by Example

ARTIST2ARTIST2 Embedded Systems Design

CFR setup

Use best-effort packets to set up connectionsset-up & tear-down packets like in ATM

Distributed, concurrent, pipelinedSafe: always consistentCompute slot assignment compile time, run time,or combinationConnection opening is guaranteed to complete(but without a latency guarantee)with commitment or rejection

ARTIST2ARTIST2 Embedded Systems Design

Router implementation

Memories (for packet storage)Register-based FIFOs are expensiveRAM-based FIFOs are as expensive

80% of router is memorySpecial hardware FIFOs are very useful

20% of router is memorySpeed of memories

registers are fast enoughRAMs may be too slowHardware FIFOs are fast enough

iqu iqu

iquiqu

switch

iqu

iqu

msu

stu

routers based onregister-file and hardware fifos

drawn to approximatelysame scale (1mm2, 0.26mm2)

ARTIST2ARTIST2 Embedded Systems Design

Critical Analysis

Latency of delivery is guaranteedBut the upper bound in insertion waiting time is loose

Can achieve 100% utilizationOnly if communication requirements do not fluctuate rapidly

Re-configuration of scheduling tables is long and expensiveCan do it only once in a while

Hardware (area and power) overhead is significantLarger switches and slot tables

ARTIST2ARTIST2 Embedded Systems Design

Conclusion and some thoughts

Advanced dynamic arbitration protocols enhance predictabiliyBut they add significant hardware complexityPrice for under-utilization

Static slot allocation protocols give deliver latency and bandwidth guarantees

But they may increase average latencyAnd they have very significant HW cost

Simple protocols with many physical channels (low utilization per channel) may represent a good alternative All break down with large atomic data transfers

These require explicit holistic scheduling approachesCan we combine bursty and fragmented traffic?

Yes, but it’s not easy to analyze these systems

top related