Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
![Page 1: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/1.jpg)
Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
Department of Computer Science and Engineering
Texas A&M University
![Page 2: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/2.jpg)
Lei Wang - NOCS 2009 2
Multi-Core Wave & Networks-On-Chip
Uniprocessors hit the power wall. Multiprocessors provide high performance at a lower power budget.
Shared-bus architectures have limited scalability. Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.
MIT Raw (0.18 um, 300 MHz): 16-core chip, four 4x4 mesh networks
Intel Polaris (65 nm, 4 GHz): 80-core chip, 8x10 mesh network
![Page 3: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/3.jpg)
Challenges in On-Chip Communication
High performance: Low communication latency is critical for high system performance.
Bandwidth efficiency: Well-designed routing algorithms provide high network throughput.
Power and area constraints: Simple topologies and slim routers reduce communication power consumption and save chip area.
Efficient multicast support: Cache coherence protocols rely heavily on multicast or broadcast communication.
We propose a bandwidth-efficient routing scheme for multicast communication in NOCs with low latency and power consumption.
![Page 4: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/4.jpg)
Prior Work in Multicast Communication
Routing Evaluation Criteria for Multicast Communication [Ni93]: multicast in multicomputer systems
Tree-based Multicast Routing for DSM Multiprocessors [Torrellas96]: short-message multicast in DSM systems
Virtual Circuit Tree Multicasting for NOCs [Lipasti08]: demonstrates the necessity of on-chip multicasting; proposes table-based multicast routing
Region-based Multicast for CMPs [Duato08]: multicast routing for irregular topologies in CMPs
![Page 5: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/5.jpg)
Outline
Motivation
Multicast Router Design
  State-of-the-art Unicast Router Architecture
  Replication Schemes
  Destination List Management
Recursive Partitioning Multicast (RPM)
  Network Partitioning
  Routing Rules
  Example
  Deadlock Avoidance
Evaluation
Conclusion
![Page 6: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/6.jpg)
Different Bandwidth Usage Example
Left path: 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.
Right path: 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
[Figure: two 4x4 meshes (nodes 0 to 15) showing the left and right multicast paths; legend: Source, Destination]
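The difference between the two paths comes down to how many links the copies of a packet share. The sketch below is only an illustration under stated assumptions: it uses plain XY routing, a hypothetical source and destination set, and hypothetical helper names (`xy_path`, `separate_unicasts`, `shared_tree`). It does not reproduce the counts in the figure; it only shows why sharing links lowers the number of link traversals.

```python
def xy_path(src, dst):
    """Directed links used by XY routing (X first, then Y) from src to dst."""
    x, y = src
    tx, ty = dst
    links = []
    while x != tx:
        nx = x + (1 if tx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != ty:
        ny = y + (1 if ty > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def separate_unicasts(src, dsts):
    """Link traversals when every destination gets its own unicast packet."""
    return sum(len(xy_path(src, d)) for d in dsts)


def shared_tree(src, dsts):
    """Link traversals when copies share common links (each link counted once)."""
    return len({link for d in dsts for link in xy_path(src, d)})


# Hypothetical example: one source, four destinations in the same column.
source = (0, 0)
destinations = [(3, 0), (3, 1), (3, 2), (3, 3)]
print("separate unicasts:", separate_unicasts(source, destinations))
print("shared tree:      ", shared_tree(source, destinations))
```

Every link that is not shared also costs an extra buffer write and crossbar traversal at the next router, which is why the slide lists those counts next to the link traversals.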
![Page 7: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/7.jpg)
State-of-the-Art Wormhole Unicast Router
[Figure: wormhole router with Input 0 to Input 4, each with input buffers holding VC 1 to VC n, plus Route Computation, VC Allocator, and Switch Allocator blocks and a crossbar switch driving Output 0 to Output 4]
Pipeline stages: RC, VA, SA, ST (router) and LT (link)
RC: Route Computation; VA: VC Allocation; SA: Switch Allocation
ST: Switch Traversal; LT: Link Traversal
![Page 8: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/8.jpg)
What Do We Need in a Multicast Router?
Packet Replication
  Synchronous Replication
  Asynchronous Replication
Destination List Management
  All-destination Encoding
  Bit String Encoding
  Multiple-region Broadcast Encoding
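As an illustration of bit string encoding, the sketch below keeps one bit per node, so a 16-node mesh needs a 16-bit field. The function names and the choice of bit i for node i are assumptions made for the example; the slides do not specify the header layout.

```python
def encode_destinations(dest_ids, num_nodes):
    """Pack a destination list into an integer with one bit per node."""
    bits = 0
    for node in dest_ids:
        if not 0 <= node < num_nodes:
            raise ValueError(f"node {node} is outside 0..{num_nodes - 1}")
        bits |= 1 << node
    return bits


def decode_destinations(bits):
    """Recover the sorted list of destination node IDs from the bit string."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def clear_destination(bits, node):
    """Drop one destination, e.g. after the packet is ejected at that node."""
    return bits & ~(1 << node)


header = encode_destinations([1, 6, 11], num_nodes=16)
print(format(header, "016b"))
print(decode_destinations(clear_destination(header, 6)))
```

The field size is fixed by the network size rather than by the number of destinations, which keeps the header format simple at the cost of one bit per node.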
![Page 9: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/9.jpg)
Synchronous Replication
Packet replication happens at Switch Traversal Stage.
[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3 replicating a four-flit packet (H, M, M, T) over cycles 0 to 3; legend: H = head flit, M = middle flit, T = tail flit]
![Page 10: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/10.jpg)
Asynchronous Replication
[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3 replicating a four-flit packet (H, M, M, T) over cycles 0 to 3, with the output branches progressing independently; legend: H = head flit, M = middle flit, T = tail flit]
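The two replication schemes can be contrasted with a toy timing model. This is a sketch under assumptions the slides do not spell out: synchronous replication is taken to mean that a flit is copied only in a cycle where every requested output is free, and asynchronous replication that each output receives its copy as soon as that output alone is free. The `free` schedule and the function names are hypothetical.

```python
def synchronous(free, num_flits):
    """Cycle in which each flit is copied to all outputs at once."""
    sent = []
    for cycle, outputs in enumerate(free):
        if len(sent) < num_flits and all(outputs):
            sent.append(cycle)
    return sent


def asynchronous(free, num_flits):
    """Per output, the cycle in which each flit copy is sent."""
    sent = [[] for _ in free[0]]
    for cycle, outputs in enumerate(free):
        for out, is_free in enumerate(outputs):
            if is_free and len(sent[out]) < num_flits:
                sent[out].append(cycle)
    return sent


# free[c][o] is True when output o can accept a flit in cycle c.
# Output 1 is busy in cycle 1.
free = [
    [True, True],
    [True, False],
    [True, True],
    [True, True],
    [True, True],
]
print("synchronous: ", synchronous(free, num_flits=4))
print("asynchronous:", asynchronous(free, num_flits=4))
```

In this model the blocked output stalls both branches under synchronous replication, while under asynchronous replication only the blocked branch falls behind. The price is that a flit must stay buffered until its last copy has left.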
![Page 11: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/11.jpg)
Network Partitioning
[Figure: mesh partitioned around the source node into eight numbered parts, with compass directions N, S, E, W; a source in a corner sees only three parts: (5, 6, 7), (0, 1, 7), (3, 4, 5), or (1, 2, 3)]
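A minimal sketch of the partitioning step follows. It classifies each destination by its offset from the current node. The numbering used here (0 = northeast quadrant, then counterclockwise, with odd numbers for the straight N, W, S, E directions) is an assumption; it was chosen because it yields the same three-part sets for corner nodes that the slide lists. Coordinates are (x, y) with y growing northward, and the function names are hypothetical.

```python
def part_of(cur, dest):
    """Part number (0..7) of dest as seen from cur, or None if dest == cur."""
    dx = dest[0] - cur[0]
    dy = dest[1] - cur[1]
    if dx == 0 and dy == 0:
        return None
    if dy > 0:
        return 0 if dx > 0 else (1 if dx == 0 else 2)  # NE, N, NW
    if dy < 0:
        return 6 if dx > 0 else (5 if dx == 0 else 4)  # SE, S, SW
    return 7 if dx > 0 else 3                           # E, W


def group_by_part(cur, dests):
    """Split a destination list into per-part sublists."""
    groups = {}
    for d in dests:
        p = part_of(cur, d)
        if p is not None:
            groups.setdefault(p, []).append(d)
    return groups


def parts_seen_from(cur, size=4):
    """Set of non-empty parts when every other node of the mesh is a destination."""
    nodes = [(x, y) for x in range(size) for y in range(size)]
    return set(group_by_part(cur, nodes))


print(sorted(parts_seen_from((0, 0))))  # a corner node
print(sorted(parts_seen_from((1, 1))))  # an interior node
```

Because the classification depends only on the current node, the same function can be reapplied at every router the packet visits, which is what makes the scheme recursive.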
![Page 12: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/12.jpg)
Basic Routing Rules
[Figure: source and destination nodes in the NE and SW regions, with compass directions N, S, E, W]
North: top right corner. West: top left corner. South: bottom left corner. East: bottom right corner.
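One way to read the basic rules is that each output port serves one corner region: North carries destinations to the top right, West to the top left, South to the bottom left, and East to the bottom right, while destinations lying straight along an axis use that axis's port. The sketch below simulates recursive delivery under that reading. It is an interpretation, not the authors' routing logic; the optimized rules are not modeled, and all names are hypothetical. Coordinates are (x, y) with y growing northward.

```python
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def output_port(cur, dest):
    """Output port for dest under the assumed basic rules (dest != cur)."""
    dx = dest[0] - cur[0]
    dy = dest[1] - cur[1]
    if dy > 0 and dx >= 0:
        return "N"  # straight north, or top right corner
    if dx < 0 and dy >= 0:
        return "W"  # straight west, or top left corner
    if dy < 0 and dx <= 0:
        return "S"  # straight south, or bottom left corner
    return "E"      # straight east, or bottom right corner


def multicast(cur, dests, delivered):
    """Deliver to all dests from cur; return the number of link traversals."""
    groups = {}
    for d in dests:
        if d == cur:
            delivered.append(d)  # ejection at the local node
        else:
            groups.setdefault(output_port(cur, d), []).append(d)
    links = 0
    for port, group in groups.items():
        nxt = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        # One copy crosses the link, then the next router partitions again.
        links += 1 + multicast(nxt, group, delivered)
    return links


received = []
dests = [(3, 3), (2, 3), (0, 2), (1, 0), (3, 1)]
print("link traversals:", multicast((1, 1), dests, received))
print("delivered:", sorted(received))
```

Under this reading the packet may turn in any direction (for example north then east, or west then north), which is consistent with the later remark that RPM has no turn restrictions and needs a separate deadlock-avoidance mechanism.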
![Page 13: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/13.jpg)
Optimized Routing Rules
[Figure: example with source and destination nodes, annotated "Deadlock!!!"]
![Page 14: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/14.jpg)
RPM Example-step 1
[Figure: step 1 of the RPM example; legend: M = multicast packet, Source, Destination, Partitioning]
![Page 15: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/15.jpg)
RPM Example-step 2
[Figure: step 2 of the RPM example, including one ejection; legend: M = multicast packet, Source, Destination, Partitioning]
![Page 16: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/16.jpg)
RPM Example-step 3
[Figure: step 3 of the RPM example; legend: M = multicast packet, Source, Destination, Partitioning]
![Page 17: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/17.jpg)
RPM Example-step 4
[Figure: step 4 of the RPM example, including three ejections; legend: M = multicast packet, Source, Destination, Partitioning]
![Page 18: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/18.jpg)
RPM Example-step 5
[Figure: step 5 of the RPM example, including one ejection; legend: M = multicast packet, Source, Destination, Partitioning]
![Page 19: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/19.jpg)
Deadlock Avoidance
RPM has no turn restrictions, which can introduce deadlock. We use Virtual Networks (VNs) to avoid it.
Two VNs share the same physical network.
The virtual channels of each port are divided equally between the two virtual networks.
The virtual network ID (0 or 1) of each packet is decided at the source.
[Figure: two 4x4 meshes (nodes 0 to 15), labeled Virtual Network 0 and Virtual Network 1]
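The equal split of virtual channels can be sketched as below. The contiguous numbering (lower half of the VCs for virtual network 0, upper half for virtual network 1) and the function name are assumptions; the slides only state that the VCs of each port are divided equally. How the source chooses the virtual network ID is not described here, so the ID is simply taken as an input.

```python
def vcs_for_virtual_network(vn_id, vcs_per_port=4, num_vns=2):
    """VC indices at a port that a packet in virtual network vn_id may use."""
    if not 0 <= vn_id < num_vns:
        raise ValueError(f"vn_id must be in 0..{num_vns - 1}")
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs per port must divide evenly among the VNs")
    share = vcs_per_port // num_vns
    return list(range(vn_id * share, (vn_id + 1) * share))


# With 4 VCs per port, each virtual network gets 2 VCs.
for vn in (0, 1):
    print(f"VN {vn}: VCs {vcs_for_virtual_network(vn)}")
```

A packet keeps its virtual network ID for its whole journey, so the VC allocator at every router only ever considers the VCs returned for that ID.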
![Page 20: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/20.jpg)
Evaluation Methodology
Performance model: cycle-accurate network simulator
  Models all router pipeline stages in detail
  Highly parameterized
Power model: Orion, with both dynamic and leakage power models
Network configuration:
  Topology: 8×8 mesh (6×6 mesh, 10×10 mesh, 16×16 mesh)
  Routing: RPM
  VCs per port: 4
  VC depth: 4
  Packet length: 4 flits
  Unicast traffic pattern: uniform random (bit complement, transpose)
  Multicast packet portion: 10% (5%, 20%, 40%, 80%)
  Multicast destination number: 0-16 (uniformly distributed)
![Page 21: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/21.jpg)
Uniform Random Traffic
Latency improves by around 50% before network saturation. Network throughput is extended by 40%.
[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15) for RPM, Mul unicast, VCTM(20%), VCTM(40%), and VCTM(80%); annotations: 50%, 40%, 40%]
![Page 22: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/22.jpg)
Link Utilization
[Figure: link utilization (op/cycle, 0 to 0.45) vs. injection rate (flits/cycle/core, 0.01 to 0.45) for RPM, VCTM(20%), VCTM(40%), and VCTM(80%); annotations: 33%, 45%]
Under low workload, RPM reduces link utilization by 33%. Under high workload, RPM reduces link utilization by 45%.
![Page 23: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/23.jpg)
Dynamic Power Consumption
[Figure: dynamic power (W, 0 to 12) of RPM and VCTM vs. injection rate (flits/cycle/core, 0.01 to 0.5), broken down into Buffer, VC Arbiter, SW Arbiter, Xbar, and Link; annotations: 40%, 50%]
![Page 24: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/24.jpg)
Scalability Study-Network Size
[Figure: latency (cycles, 0 to 140) vs. network size (6×6, 8×8, 10×10, 16×16) for RPM and VCTM; annotation: over 50%]
![Page 25: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/25.jpg)
Scalability Study-Multicast Traffic Portion
[Figure: latency (cycles, 0 to 140) vs. portion of multicast traffic (5%, 10%, 20%, 40%, 80%, 100%) for RPM and VCTM]
![Page 26: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/26.jpg)
Scalability Study-Destination Number
[Figure: latency (cycles, 0 to 140) vs. maximum number of destinations (4, 8, 16, 32) for RPM and VCTM]
![Page 27: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/27.jpg)
Conclusion
We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM), that is bandwidth-efficient and scalable.
Performance improvement: up to 50% latency reduction; 33% link utilization reduction
Power savings: up to 40% total dynamic power savings; 25% crossbar and link power savings
![Page 28: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/28.jpg)
Thank you!
![Page 29: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/29.jpg)
Backup
![Page 30: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/30.jpg)
Hardware Implementation of Routing logic
![Page 31: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/31.jpg)
Bit Complement Traffic
[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15) under bit complement traffic for RPM, Mul unicast, VCTM (20%), VCTM (40%), and VCTM (80%)]
![Page 32: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip](https://reader035.vdocuments.net/reader035/viewer/2022062314/56813ae2550346895da334ff/html5/thumbnails/32.jpg)
Transpose Traffic
[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15) under transpose traffic for RPM, Mul unicast, VCTM (20%), VCTM (40%), and VCTM (80%)]