communication-centric design
DESCRIPTION
Communication-Centric Design. Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems, 6-7 Dec. 2006, Stanford. Power Efficient - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/1.jpg)
Communication-Centric Design
Robert MullinsComputer Architecture GroupComputer Laboratory, University of CambridgeWorkshop on On- and Off-Chip Interconnection Networks for Multicore Systems, 6-7 Dec. 2006, Stanford.
![Page 2: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/2.jpg)
2
Convergence to flexible parallel architectures
• Power Efficient– Better match application
characteristics (streaming, coarse-grain parallelism…)
– Constraint-driven execution
• Simple– Increased regularity– S/W programmable– Limited core/tile set – Ease verification issues
• Flexible– Multi-use platform
FPGAsFPGAs
Embedded Embedded ProcessorsProcessorsGPUsGPUs
Multi-CoreMulti-CoreProcessorsProcessors
SoC PlatformsSoC Platforms
?
![Page 3: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/3.jpg)
3
Our Group’s Research
• Now: support evolution of existing platforms – Low-latency and low-power
on-chip networks– System-timing
considerations– Networking communications
within FPGAs– Flexible networked SoC
systems, virtual IP– On-chip serial interconnects– Multi-wavelength optical
communication (off-chip)– Fault tolerant design
• Future:– Networks of processors to
processing networks– Processing Fabrics
FPGAsFPGAs
Embedded Embedded ProcessorsProcessorsGPUsGPUs
Multi-CoreMulti-CoreProcessorsProcessors
SoC PlatformsSoC Platforms
?
![Page 4: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/4.jpg)
4
Low-Latency Virtual-Channel Packet-Switched Routers
• Goal was to develop a virtual-channel network for a tiled processor architecture
• Collaboration with Krste Asanović’s SCALE group at MIT
• Problem faced is rising interconnect costs• Networking communications can increase
communication latencies by an order of magnitude or more!
![Page 5: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/5.jpg)
5
The Lochside test chip (2004/5)• UMC 0.18um Process• 4x4 mesh network, 25mm2
• Single Cycle Routers (router + link = 1 clock)
• May be clocked by both traditional H-tree and DCG
• 4 virtual-channels/input• 80-bit links
– 64-bit data + 16-bit control • 250MHz (worst-case PVT)
16Gb/s/channel (~35 FO4)• Approx 5M transistors
TILE
TrafficGenerator, Debug &
Test
R
Mullins, West and Moore (ISCA’04, ASP-DAC’06)
![Page 6: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/6.jpg)
6
Virtual-Channel Flow Control
![Page 7: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/7.jpg)
7
Typical Router Pipeline
• Router pipeline depth limits minimum latency– Even under low traffic conditions– Can make packet buffers less effective– Incurs pipelining overheads
![Page 8: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/8.jpg)
8
Speculative Router Architecture
• VC and switch allocation may be performed concurrently:– Speculate that waiting packets will be successful in acquiring a VC– Prioritize non-speculative requests over speculative ones
Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers”, In Proceedings HPCA’01, 2001.
![Page 9: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/9.jpg)
9
Single Cycle Speculative Router
![Page 10: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/10.jpg)
10
![Page 11: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/11.jpg)
11
![Page 12: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/12.jpg)
12
Single Cycle Router Architecture
• Once speculation mechanism is in place a range of accuracy/cycle-time trade-offs can be made– Blocked VC, pipeline and speculate – use low
priority switch scheduler– Switch and VC next request calculation
• Don’t bother calculating next switch requests just use current set. Safe to be pessimistic about what has been granted.
• Need to be more accurate for VC allocation– Abort logic accuracy
![Page 13: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/13.jpg)
13
Single Cycle Router Architecture
• Decreasing accuracy often leads to poorer schedule and more aborts but reduces the router’s cycle time
• Impact of speculation on single cycle router:– 10% more cycles on average– clock period reduced by factor of 1.6– Network latency reduced by a factor of ~1.5
• Need to be careful about updating arbiter state correctly after speculation outcome is known
![Page 14: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/14.jpg)
14
Lochside Router Clock Period
100% standard cellFF/Clocking 23% (8.3 FO4)FIFOs/Control/Datapath 53% (19 FO4)Link 22% (7.9 FO4) range 4.6-7.9
• Could move to router/link pipeline• Option to pipeline control - maintaining single cycle best case• Impact of technology scaling• Scalability: doubling VCs to 8, only adds ~10% to cycle time
30-35 FO4 delays (~800MHz)
5-port router4 VCs per port64-bit links, ~1.5mm90nm technology
![Page 15: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/15.jpg)
15
Router Power Optimisation
• Local and global clock gating & signal gating– Global clock gating exploits early-request
signals from neighbouring routers– Slightly pessimistic (based on what is
requested not granted)– Factor 2-4 reduction power consumption
• Peak 0.15mW/Mhz (0.35 unopt.)• Low Random 0.06mW/Mhz (0.27 unopt.)
Mullins, SoC’06
![Page 16: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/16.jpg)
16
• 22% Static power• 11% Inter-Router Links• ~1% Global Clock tree• 65% Dynamic Power
– Power Breakdown• ~50% local clock tree and input FIFOs• ~30% on router datapath• ~20% on scheduling and arbitration
(Low random traffic case)
Analysis of Power Consumption
} Due to increase as % as technology scales
![Page 17: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/17.jpg)
17
Distributed Clock Generator (DCG)
• Exploits self-timed circuitry to generate and a clock in a distributed fashion
• Low-skew and low-power solution to providing global synchrony
• Mesh topology• Simple proof of concept
provided by Lochside test chip S. Fairbanks and S. Moore “Self-timed
circuitry for global clocking”, ASYNC’05
![Page 18: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/18.jpg)
18
Beyond global synchrony
• Clock distribution issues– Challenge as network is physically distributed
• Increasing process variation
• Synchronization– Core clock frequencies may vary, perhaps adaptively– Link and router DVS or other energy/perf. trade-offs
• Selecting a global network clock frequency– Run at maximum frequency continuously?– Use a multitude of network clock frequencies?– Select a global compromise?
![Page 19: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/19.jpg)
19
Beyond Global Synchrony
Synchronous Delay Insensitive
Global None
Timing Assumptions
Loca
l Rela
tive
Wire
Dela
y
Less Detection
Sub-S
ystem
Loca
l
Isoch
ronic
For
ks
Multipl
e cloc
ks
Data-D
riven
and
Pausib
le Cloc
ksBun
dled D
ata
Quasi-
Delay
Insen
sitive
Local Clocks, Interaction with data (becoming
aperiodic)
• A complete spectrum of approaches to system-timing exist
![Page 20: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/20.jpg)
20
Data-Driven and Pausible Clocks
Mullins/Moore, ASYNC’07
![Page 21: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/21.jpg)
21
Example: AsAP project (UC Davis, 2006)
Yu et al, ISSCC’06
![Page 22: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/22.jpg)
22
Example: MAIA chip (Berkeley, 2000)
• GALS architecture, data-flow driven processing elements (“satellites”)
Zhang et al, ISSCC’00
![Page 23: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/23.jpg)
23
Data-Driven Clocking for On-Chip Routers
• Router should be clocked when one or more inputs are valid (or flits are buffered)
• Free running (paternoster) elevator– Chain of open compartments – Must synchronise before you jump on!
• Traditional elevator– Wait for someone to arrive– Close doors, decide who is in and who is out– Metastability issue again (potentially painful!)
![Page 24: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/24.jpg)
24
Data-Driven Clock Implementation
Local Clock Generator TemplateSample inputs
when at least one input is ready (and clock is low)
Assert Lock
Either admitted or locked out
(Close Lift Doors)
Incoming data
![Page 25: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/25.jpg)
25
Data-driven clocking benefits
Self-timed power gating?
DI barrier synchronisation and scheduling extensions
NO GLOBAL CLOCK
![Page 26: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/26.jpg)
26
Networks of processors to processing networks
• Will a single universal parallel architecture be the eventual outcome of this convergence?
FPGAsFPGAs
Embedded Embedded ProcessorsProcessorsGPUsGPUs
Multi-CoreMulti-CoreProcessorsProcessors
SoC PlatformsSoC Platforms
?
![Page 27: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/27.jpg)
27
Current Focus– Network of Processors– Number of processors increase– Core architectures tailored to “many-core”
environment– Remove hard tile boundaries
• Why fix granularity of cores, communication and memory hierarchies?
– Move away from processor + router model• Everything is on the network• Richer interconnection of components, increased
flexibility – Add network-based services
• Network aids collaboration, focuses resources, supports dynamic optimisations, scheduling, …
• Tailor virtual architecture to application– Processing Network or Fabric
Increased flexibility to optim
ise computation
and comm
unication
![Page 28: Communication-Centric Design](https://reader036.vdocuments.net/reader036/viewer/2022062816/56813a97550346895da29046/html5/thumbnails/28.jpg)
28
Thank You.