communication issues on mpsoc platforms: performance ... ws at...artist2 artist2 embedded systems...
TRANSCRIPT
ARTIST2ARTIST2 Embedded Systems Design
Communication Issues on Communication Issues on MPSoCMPSoCPlatforms: Performance, Power Platforms: Performance, Power
and Predictabilityand Predictability
Luca Benini:ARTIST2 Activity LeaderLuca Benini:ARTIST2 Activity LeaderDEIS DEIS –– UniversitUniversitàà didi BolognaBologna
http://www.artist-embedded.org/FP6/
ARTIST Workshop at DATE’06W4: “Design Issues in Distributed,
Communication-Centric Systems”
ARTIST2ARTIST2 Embedded Systems Design
On-chip interconnect delay trend
Relative delay is growing even for optimized interconnects
ARTIST2ARTIST2 Embedded Systems Design
Wire delay vs. logic delay
Taken from W.J. Dally presentation: Computer architecture is all about interconnect (it is now and it will be more so in 2010) HPCA Panel February 4, 2002
ARTIST2ARTIST2 Embedded Systems Design
Unpredicatability & on-chip communication
Global behaviordepends on a set of interacting local behaviorscommunication is key for interaction
Unpredictable communicationunpredictable global behaviorEVEN IF local behavior is predictable
… … ……
interconnectmemarb.
interconnect
interconnect
ext.mem
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
IP
cpu $ cpu$
Communication predictability is criticaleven in a NON-HRT context
ARTIST2ARTIST2 Embedded Systems Design
On-chip bus Architecture
Many alternativesLarge semiconductor firms (e.g. IBM Coreconnect, STMicroSTBus)Core vendors (e.g. ARM AMBA)Interconnect IP vendors (e.g. SiliconBackplane)
Same topology, different protocolsShared medium → contention → predictability losses
How do we support predictability in high-performance on-chip busses?
ARTIST2ARTIST2 Embedded Systems Design
AMBA bus
AHB: high-speed high-bandwidthmulti-master bus
APB: Simplified processor forgeneral purpose peripherals
System-PeripheralBusCPU
EU IO
EU MemMem
CPU
AMBA High-speed bus Bridge
Master portSlave port
ARTIST2ARTIST2 Embedded Systems Design
AMBA basic transfer
For a write
For a read
Pipelining increasesBus bandwidth
ARTIST2ARTIST2 Embedded Systems Design
Bus arbitraton
ARBITER
Dedicated wires
Shared address bus
HBREQ_M3
HBREQ_M2
HBREQ_M1
Arbitration Protocol is defined, but Arbitration Policy is not
Priority based protocols are the most common (round robin if priorities are not set)
ARTIST2ARTIST2 Embedded Systems Design
The price for arbitration
Time for arbitrationTime for handshaking
Wait state
ARTIST2ARTIST2 Embedded Systems Design
Critical analysis: bottlenecks
ProtocolLacks parallelism
In order completionNo multiple outstanding transactions: cannot hide slave wait states
High arbitration overhead (on single-transfers)Bus-centric vs. transaction-centric
Initiators and targets are exposed to bus architecture (e.g. arbiter)
Predictability losses come fromArbitrationBlocking behavior (in-order completion)
ARTIST2ARTIST2 Embedded Systems Design
STBUS
On-chip interconnect solution by STLevel 1-3: increasing complexity (and performance)
FeaturesHigher parallelism: 2 channels (M-S and S-M)Multiple outstanding transactions with out-of order completionSupports deep pipeliningSupports Packets (request and response) for multiple data transfersSupport for protection, caches, locking
Deployed in a number of large-scale SoCs in STM
ARTIST2ARTIST2 Embedded Systems Design
STBUS Protocol (Type 3)
Target
Initiator port Target port
Initiator
Request channel
Response channel
Transaction
Req Packet Resp Packet
Cell level
Packet level
Transaction level
Signal level
ARTIST2ARTIST2 Embedded Systems Design
STBUS bottlenecks
Protocol is not fully transaction-centricCannot connect initiator to target (e.g. initiator does not have control flow on the response channel)
Packets are atomic on the interconnectCannot initiate/receive multiple packets at the same timeLarge data transfers may starve other initiators
Predictability lossesCaused by congestion and by atomic blocking
ARTIST2ARTIST2 Embedded Systems Design
AMBA AXI
Latest (2003) evolution of AMBAAdvanced eXtensible Interface
FeaturesFully transaction centric: can connect M to S with nothing in betweenHigher parallelism: multiple channelsSupports bus-based power managementSupport for protection, caches, locking
DeploymentBecoming widespread
ARTIST2ARTIST2 Embedded Systems Design
Multi-channel M-S interface
Master
Slave
Address Channel
Write channel
Read channel
Write response ch.
VALID
DATA
READY
Channel hanshaking
4 parallel channels are available!
ARTIST2ARTIST2 Embedded Systems Design
Protocol scalability
2 wait states memory
AHB
STBUSlow buf
STBUShigh buf
AXI
Cannot hidearbitration and slave
responselatency
One new request processedwhile a response is in progress
More requests processedwhile a response is in progress
Interleaving of transfers onthe internal data lanes
More avanced protocols have less blocking…
ARTIST2ARTIST2 Embedded Systems Design
Execution time analysis
Advanced protocols provide better predictabilityBut: more area, higher latency, higher power consumption
AHB AXI STBus STBus (B)
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
2 Cores4 Cores6 Cores8 Cores
Rel
ativ
e ex
ecut
ion
time
AHB AXI STBus STBus (B)
0%10%20%30%40%50%60%70%80%90%
100%110%120%130%140%150%160%170%180%
2 Cores4 Cores6 Cores8 Cores
Rel
ativ
e ex
ecut
ion
time
1 kB cache (low bus traffic) 256 B cache (high bus traffic)
ARTIST2ARTIST2 Embedded Systems Design
The additive model
Unary model
Max bandwidthBus
time
t1 t2 t1 t2 t1 t2
Coarse-grain additive model
time
BusMax bandwidth
t1
t2Achievable bandwidth
Characterize the range of applicability of the modelForce the system to work under the additive regime.
ARTIST2ARTIST2 Embedded Systems Design
Tuning of the additive model
Requesting more than TH% of the theoretical maximum bandwidth causes the additive model to fail;
Generalized to a flow-based model for multi-hop networksBenefits of working in additive regime:
Task execution time almost independent of bus utilization;Performance predictability greatly enhanced.
Works well but imposes under-utilizationBreaks down for long, non-interruptible data bursts No deterministic guarantees on single transactions
Theoretical max BW
ARTIST2ARTIST2 Embedded Systems Design
Alternative approach: static slot allocation
Increase utilization – guaranteed delivery latencyCase study: Æthereal multi-hop interconnect (NoC)
Conceptually, two disjoint networksa network with slotted time contention-free arbitration (guaranteed services)a network with contention and buffering (best effort)
Several types of commitment in the networkcombine guaranteed worst-case behaviourwith good average resource usage
priority/arbitration
best-effortrouter
guaranteedrouter
programming
ARTIST2ARTIST2 Embedded Systems Design
Router architecture
Best-effort routerWorm-hole routingInput queueingSource routing
Guaranteed throughput routerContention-free routing
synchronous, using slot tablestime-division multiplexed circuits
Store-and-forward routingHeaderless packets
information is present in slot table
ARTIST2ARTIST2 Embedded Systems Design
Contention-free routing
Latency guarantees are easy in circuit switchingEmulate circuits with packet switchingSchedule packet injection in networksuch that they never contend for same link at same time
in space: disjoint pathsin time: time-division multiplexingor a combination
In a basic, shared bus context: static slot allocation protocol (e.g. SONICS)
t
ARTIST2ARTIST2 Embedded Systems Design
router 1
router 3
router 2networkinterface
networkinterface
networkinterface
1
1
2
1 2
1
3 3
input 2 for router 1 isoutput 1 for router 2
the input routed to the output at this slot
3 3
1
1
2
2
1
1
2
2
1
1
2
2
2
2
1
1
2
2
1
1
4 4
-
-
-
-
o1
-
i1
i1
-
o2
-
-
-
-
o3
-
-
-
-
o4
-
-
i4
-
o1
-
-
-
-
o2
-
-
-
-
o3
i1
-
-
-
o4 Use slots to• avoid contention• divide up bandwidth
-
-
-
-
o1
-
-
-
-
o3
-
i1
-
-
o4
i1
-
-
o2
i3
CFR by Example
ARTIST2ARTIST2 Embedded Systems Design
CFR setup
Use best-effort packets to set up connectionsset-up & tear-down packets like in ATM
Distributed, concurrent, pipelinedSafe: always consistentCompute slot assignment compile time, run time,or combinationConnection opening is guaranteed to complete(but without a latency guarantee)with commitment or rejection
ARTIST2ARTIST2 Embedded Systems Design
Router implementation
Memories (for packet storage)Register-based FIFOs are expensiveRAM-based FIFOs are as expensive
80% of router is memorySpecial hardware FIFOs are very useful
20% of router is memorySpeed of memories
registers are fast enoughRAMs may be too slowHardware FIFOs are fast enough
iqu iqu
iquiqu
switch
iqu
iqu
msu
stu
routers based onregister-file and hardware fifos
drawn to approximatelysame scale (1mm2, 0.26mm2)
ARTIST2ARTIST2 Embedded Systems Design
Critical Analysis
Latency of delivery is guaranteedBut the upper bound in insertion waiting time is loose
Can achieve 100% utilizationOnly if communication requirements do not fluctuate rapidly
Re-configuration of scheduling tables is long and expensiveCan do it only once in a while
Hardware (area and power) overhead is significantLarger switches and slot tables
ARTIST2ARTIST2 Embedded Systems Design
Conclusion and some thoughts
Advanced dynamic arbitration protocols enhance predictabiliyBut they add significant hardware complexityPrice for under-utilization
Static slot allocation protocols give deliver latency and bandwidth guarantees
But they may increase average latencyAnd they have very significant HW cost
Simple protocols with many physical channels (low utilization per channel) may represent a good alternative All break down with large atomic data transfers
These require explicit holistic scheduling approachesCan we combine bursty and fragmented traffic?
Yes, but it’s not easy to analyze these systems