communication issues on mpsoc platforms: performance ... ws at...artist2 artist2 embedded systems...

27
ARTIST2 ARTIST2 Embedded Systems Design Communication Issues on Communication Issues on MPSoC MPSoC Platforms: Performance, Power Platforms: Performance, Power and Predictability and Predictability Luca Benini:ARTIST2 Activity Leader Luca Benini:ARTIST2 Activity Leader DEIS DEIS Universit Universit à à di di Bologna Bologna http://www.artist-embedded.org/FP6/ ARTIST Workshop at DATE’06 W4: “Design Issues in Distributed, Communication-Centric Systems”

Upload: others

Post on 17-Jan-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Communication Issues on Communication Issues on MPSoCMPSoCPlatforms: Performance, Power Platforms: Performance, Power

and Predictabilityand Predictability

Luca Benini:ARTIST2 Activity LeaderLuca Benini:ARTIST2 Activity LeaderDEIS DEIS –– UniversitUniversitàà didi BolognaBologna

http://www.artist-embedded.org/FP6/

ARTIST Workshop at DATE’06W4: “Design Issues in Distributed,

Communication-Centric Systems”

Page 2: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

On-chip interconnect delay trend

Relative delay is growing even for optimized interconnects

Page 3: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Wire delay vs. logic delay

Taken from W.J. Dally presentation: Computer architecture is all about interconnect (it is now and it will be more so in 2010) HPCA Panel February 4, 2002

Page 4: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Unpredicatability & on-chip communication

Global behaviordepends on a set of interacting local behaviorscommunication is key for interaction

Unpredictable communicationunpredictable global behaviorEVEN IF local behavior is predictable

… … ……

interconnectmemarb.

interconnect

interconnect

ext.mem

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

IP

cpu $ cpu$

Communication predictability is criticaleven in a NON-HRT context

Page 5: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

On-chip bus Architecture

Many alternativesLarge semiconductor firms (e.g. IBM Coreconnect, STMicroSTBus)Core vendors (e.g. ARM AMBA)Interconnect IP vendors (e.g. SiliconBackplane)

Same topology, different protocolsShared medium → contention → predictability losses

How do we support predictability in high-performance on-chip busses?

Page 6: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

AMBA bus

AHB: high-speed high-bandwidthmulti-master bus

APB: Simplified processor forgeneral purpose peripherals

System-PeripheralBusCPU

EU IO

EU MemMem

CPU

AMBA High-speed bus Bridge

Master portSlave port

Page 7: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

AMBA basic transfer

For a write

For a read

Pipelining increasesBus bandwidth

Page 8: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Bus arbitraton

ARBITER

Dedicated wires

Shared address bus

HBREQ_M3

HBREQ_M2

HBREQ_M1

Arbitration Protocol is defined, but Arbitration Policy is not

Priority based protocols are the most common (round robin if priorities are not set)

Page 9: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

The price for arbitration

Time for arbitrationTime for handshaking

Wait state

Page 10: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Critical analysis: bottlenecks

ProtocolLacks parallelism

In order completionNo multiple outstanding transactions: cannot hide slave wait states

High arbitration overhead (on single-transfers)Bus-centric vs. transaction-centric

Initiators and targets are exposed to bus architecture (e.g. arbiter)

Predictability losses come fromArbitrationBlocking behavior (in-order completion)

Page 11: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

STBUS

On-chip interconnect solution by STLevel 1-3: increasing complexity (and performance)

FeaturesHigher parallelism: 2 channels (M-S and S-M)Multiple outstanding transactions with out-of order completionSupports deep pipeliningSupports Packets (request and response) for multiple data transfersSupport for protection, caches, locking

Deployed in a number of large-scale SoCs in STM

Page 12: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

STBUS Protocol (Type 3)

Target

Initiator port Target port

Initiator

Request channel

Response channel

Transaction

Req Packet Resp Packet

Cell level

Packet level

Transaction level

Signal level

Page 13: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

STBUS bottlenecks

Protocol is not fully transaction-centricCannot connect initiator to target (e.g. initiator does not have control flow on the response channel)

Packets are atomic on the interconnectCannot initiate/receive multiple packets at the same timeLarge data transfers may starve other initiators

Predictability lossesCaused by congestion and by atomic blocking

Page 14: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

AMBA AXI

Latest (2003) evolution of AMBAAdvanced eXtensible Interface

FeaturesFully transaction centric: can connect M to S with nothing in betweenHigher parallelism: multiple channelsSupports bus-based power managementSupport for protection, caches, locking

DeploymentBecoming widespread

Page 15: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Multi-channel M-S interface

Master

Slave

Address Channel

Write channel

Read channel

Write response ch.

VALID

DATA

READY

Channel hanshaking

4 parallel channels are available!

Page 16: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Protocol scalability

2 wait states memory

AHB

STBUSlow buf

STBUShigh buf

AXI

Cannot hidearbitration and slave

responselatency

One new request processedwhile a response is in progress

More requests processedwhile a response is in progress

Interleaving of transfers onthe internal data lanes

More avanced protocols have less blocking…

Page 17: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Execution time analysis

Advanced protocols provide better predictabilityBut: more area, higher latency, higher power consumption

AHB AXI STBus STBus (B)

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

2 Cores4 Cores6 Cores8 Cores

Rel

ativ

e ex

ecut

ion

time

AHB AXI STBus STBus (B)

0%10%20%30%40%50%60%70%80%90%

100%110%120%130%140%150%160%170%180%

2 Cores4 Cores6 Cores8 Cores

Rel

ativ

e ex

ecut

ion

time

1 kB cache (low bus traffic) 256 B cache (high bus traffic)

Page 18: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

The additive model

Unary model

Max bandwidthBus

time

t1 t2 t1 t2 t1 t2

Coarse-grain additive model

time

BusMax bandwidth

t1

t2Achievable bandwidth

Characterize the range of applicability of the modelForce the system to work under the additive regime.

Page 19: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Tuning of the additive model

Requesting more than TH% of the theoretical maximum bandwidth causes the additive model to fail;

Generalized to a flow-based model for multi-hop networksBenefits of working in additive regime:

Task execution time almost independent of bus utilization;Performance predictability greatly enhanced.

Works well but imposes under-utilizationBreaks down for long, non-interruptible data bursts No deterministic guarantees on single transactions

Theoretical max BW

Page 20: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Alternative approach: static slot allocation

Increase utilization – guaranteed delivery latencyCase study: Æthereal multi-hop interconnect (NoC)

Conceptually, two disjoint networksa network with slotted time contention-free arbitration (guaranteed services)a network with contention and buffering (best effort)

Several types of commitment in the networkcombine guaranteed worst-case behaviourwith good average resource usage

priority/arbitration

best-effortrouter

guaranteedrouter

programming

Page 21: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Router architecture

Best-effort routerWorm-hole routingInput queueingSource routing

Guaranteed throughput routerContention-free routing

synchronous, using slot tablestime-division multiplexed circuits

Store-and-forward routingHeaderless packets

information is present in slot table

Page 22: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Contention-free routing

Latency guarantees are easy in circuit switchingEmulate circuits with packet switchingSchedule packet injection in networksuch that they never contend for same link at same time

in space: disjoint pathsin time: time-division multiplexingor a combination

In a basic, shared bus context: static slot allocation protocol (e.g. SONICS)

t

Page 23: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

router 1

router 3

router 2networkinterface

networkinterface

networkinterface

1

1

2

1 2

1

3 3

input 2 for router 1 isoutput 1 for router 2

the input routed to the output at this slot

3 3

1

1

2

2

1

1

2

2

1

1

2

2

2

2

1

1

2

2

1

1

4 4

-

-

-

-

o1

-

i1

i1

-

o2

-

-

-

-

o3

-

-

-

-

o4

-

-

i4

-

o1

-

-

-

-

o2

-

-

-

-

o3

i1

-

-

-

o4 Use slots to• avoid contention• divide up bandwidth

-

-

-

-

o1

-

-

-

-

o3

-

i1

-

-

o4

i1

-

-

o2

i3

CFR by Example

Page 24: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

CFR setup

Use best-effort packets to set up connectionsset-up & tear-down packets like in ATM

Distributed, concurrent, pipelinedSafe: always consistentCompute slot assignment compile time, run time,or combinationConnection opening is guaranteed to complete(but without a latency guarantee)with commitment or rejection

Page 25: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Router implementation

Memories (for packet storage)Register-based FIFOs are expensiveRAM-based FIFOs are as expensive

80% of router is memorySpecial hardware FIFOs are very useful

20% of router is memorySpeed of memories

registers are fast enoughRAMs may be too slowHardware FIFOs are fast enough

iqu iqu

iquiqu

switch

iqu

iqu

msu

stu

routers based onregister-file and hardware fifos

drawn to approximatelysame scale (1mm2, 0.26mm2)

Page 26: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Critical Analysis

Latency of delivery is guaranteedBut the upper bound in insertion waiting time is loose

Can achieve 100% utilizationOnly if communication requirements do not fluctuate rapidly

Re-configuration of scheduling tables is long and expensiveCan do it only once in a while

Hardware (area and power) overhead is significantLarger switches and slot tables

Page 27: Communication Issues on MPSoC Platforms: Performance ... WS at...ARTIST2 ARTIST2 Embedded Systems Design Wire delay vs. logic delay Taken from W.J. Dally presentation: Computer architecture

ARTIST2ARTIST2 Embedded Systems Design

Conclusion and some thoughts

Advanced dynamic arbitration protocols enhance predictabiliyBut they add significant hardware complexityPrice for under-utilization

Static slot allocation protocols give deliver latency and bandwidth guarantees

But they may increase average latencyAnd they have very significant HW cost

Simple protocols with many physical channels (low utilization per channel) may represent a good alternative All break down with large atomic data transfers

These require explicit holistic scheduling approachesCan we combine bursty and fragmented traffic?

Yes, but it’s not easy to analyze these systems