recursive partitioning multicast: a bandwidth-efficient routing for networks-on-chip

Post on 05-Jan-2016



Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

Department of Computer Science and Engineering, Texas A&M University

Lei Wang - NOCS 2009 2

Multi-Core Wave & Networks-On-Chip

Uniprocessors hit the power wall. Multi-processors provide high performance at a lower power budget.

Shared-bus architectures have limited scalability. Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.

MIT Raw (0.18um, 300MHz): 16-core chip, four 4x4 mesh networks

Intel Polaris (65nm, 4GHz): 80-core chip, 8x10 mesh network


Challenges in On-Chip Communication

High performance: Low communication latency is critical for high system performance.

Bandwidth efficiency: Well-designed routing algorithms provide high network throughput.

Power and area constraints: Simple topologies and slim routers reduce communication power consumption and save chip area.

Efficient multicast support: Cache coherence protocols rely heavily on multicast or broadcast communication.

We propose a bandwidth-efficient routing for multicast communication in NOCs with low latency and power consumption.


Prior Work in Multicast Communication

Routing Evaluation Criteria for Multicast Communication [Ni93]: Multicast in multicomputer systems

Tree-based Multicast Routing for DSM Multiprocessors [Torrellas96]: Short message multicast in DSM systems

Virtual Circuit Tree Multicasting for NOCs [Lipasti08]: Demonstrates the necessity of multicasting on-chip; proposes table-based multicast routing

Region-based Multicast for CMPs [Duato08]: Multicast routing for irregular topologies in CMPs


Outline

Motivation

Multicast Router Design: State-of-art Unicast Router Architecture, Replication Schemes, Destination List Management

Recursive Partitioning Multicast (RPM): Network Partitioning, Routing Rules, Example, Deadlock Avoidance

Evaluation

Conclusion


Different Bandwidth Usage Example

Left path: requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.

Right path: requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.

[Figure: two 4x4 meshes (nodes 0-15, numbered row by row) showing the left and right multicast paths; legend: Source, Destination]
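The effect of path choice on link traversals can be illustrated with a small sketch. It does not reproduce the slide's two paths, since the source and destination nodes of the figure are not recoverable here. Instead it assumes dimension-order (XY) routing, row-major node numbering, and a hypothetical source and destination set, and compares sending one unicast per destination against sharing common links in a single tree. The function names are illustrative.

```python
def xy_path(src, dst, width):
    """Directed links visited by XY routing from src to dst in a mesh.

    Nodes are numbered row-major (assumption): node = y * width + x.
    """
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    links = []
    while x != dx:                      # travel along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                      # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def link_traversals(src, dests, width):
    """Return (separate_unicasts, shared_tree) link traversal counts."""
    paths = [xy_path(src, d, width) for d in dests]
    separate = sum(len(p) for p in paths)           # every copy pays for every link
    shared = len({link for p in paths for link in p})  # common links are paid once
    return separate, shared


# Hypothetical example: source 9, destinations 3, 7 and 11 in a 4x4 mesh.
separate, shared = link_traversals(9, [3, 7, 11], 4)
print(separate, shared)
```

With these assumed nodes the three unicasts need 9 link traversals, while the shared tree needs 4, because all three destinations lie along the same XY route.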


State-of-Art Wormhole Unicast Router

[Figure: wormhole unicast router with Route Computation, VC Allocator, and Switch Allocator units; input buffers holding VC 1 to VC n at each of Input 0 to Input 4; a crossbar switch; and Output 0 to Output 4. Below it, pipeline timing diagrams across Router and Link: RC, VA, SA, ST, LT]

RC: Route Computation; VA: VC Allocation; SA: Switch Allocation

ST: Switch Traversal; LT: Link Traversal
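As a rough illustration of what the five stages listed above cost per hop, the following sketch estimates zero-load packet latency. It rests on assumptions that the slide does not state: each stage takes one cycle, there is no contention, body flits follow the head flit one per cycle, and injection and ejection overheads are ignored. The function name and defaults are illustrative.

```python
ROUTER_STAGES = ("RC", "VA", "SA", "ST")  # stages performed inside the router
LINK_CYCLES = 1                            # LT: one cycle per link (assumption)


def zero_load_latency(hops, packet_flits,
                      router_cycles=len(ROUTER_STAGES),
                      link_cycles=LINK_CYCLES):
    """Cycles until the tail flit arrives, with no contention."""
    head_latency = hops * (router_cycles + link_cycles)  # head flit pays the full pipeline per hop
    serialization = packet_flits - 1                     # remaining flits stream behind the head
    return head_latency + serialization


# A 4-flit packet crossing 3 hops: 3 * (4 + 1) + 3 cycles.
print(zero_load_latency(3, 4))  # 18
```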


What do we need in a multicast router?

Packet Replication: Synchronous Replication, Asynchronous Replication

Destination List Management: All-destination Encoding, Bit String Encoding, Multiple-region Broadcast Encoding
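Of the destination list options named above, bit string encoding is the simplest to sketch: one bit per node, set when that node is a destination. The following is a minimal illustration under that reading; the function names and the use of a Python integer as the bit string are assumptions, not the router's actual header format.

```python
def encode_destinations(dests, num_nodes):
    """Pack a set of destination node ids into a bit string (as an int)."""
    bits = 0
    for d in dests:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node {d} is outside 0..{num_nodes - 1}")
        bits |= 1 << d          # bit d marks node d as a destination
    return bits


def decode_destinations(bits):
    """Recover the sorted list of destination node ids from a bit string."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Destinations 1, 5 and 15 in a 16-node network.
header = encode_destinations([1, 5, 15], 16)
print(format(header, "016b"))        # 1000000000100010
print(decode_destinations(header))   # [1, 5, 15]
```

The header width grows with the number of nodes rather than with the number of destinations, which is the usual trade-off of this encoding.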


Synchronous Replication

Packet replication happens at the Switch Traversal stage.

[Figure: crossbar with Input 0 to Input 3 and Output 0 to Output 3; a packet of flits T, M, M, H is replicated over cycles 0 to 3. Legend: H = head flit, M = middle flit, T = tail flit]


Asynchronous Replication

[Figure: crossbar with Input 0 to Input 3 and Output 0 to Output 3; a packet of flits T, M, M, H is replicated over cycles 0 to 3. Legend: H = head flit, M = middle flit, T = tail flit]


Network Partitioning

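[Figure: the network is divided around the source node into eight numbered parts, with N, S, E, and W marking the directions. Four further panels show source positions for which only three parts exist: (5, 6, 7), (0, 1, 7), (3, 4, 5), and (1, 2, 3)]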
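The partitioning step can be sketched in a few lines. The numbering used here is an inference, not something the slide states: the three-part tuples above are consistent with parts numbered counter-clockwise from the north-east quadrant (0 = NE, 1 = N, 2 = NW, 3 = W, 4 = SW, 5 = S, 6 = SE, 7 = E), with the three-part cases arising at the mesh corners. Row-major node numbering with row 0 at the top is also assumed, and the function name is illustrative.

```python
def partition(src, dests, width):
    """Group destinations into eight parts relative to the source node.

    Assumed numbering: 0=NE, 1=N, 2=NW, 3=W, 4=SW, 5=S, 6=SE, 7=E.
    Nodes are row-major with row 0 at the top (north).
    """
    sx, sy = src % width, src // width
    parts = {p: [] for p in range(8)}
    for d in dests:
        if d == src:
            continue                    # the source itself needs no routing
        dx = (d % width) - sx           # > 0 means east of the source
        dy = sy - (d // width)          # > 0 means north of the source
        if dy > 0:
            part = 0 if dx > 0 else (1 if dx == 0 else 2)
        elif dy < 0:
            part = 6 if dx > 0 else (5 if dx == 0 else 4)
        else:
            part = 7 if dx > 0 else 3
        parts[part].append(d)
    return parts


# A source in the top-left corner of a 4x4 mesh only sees parts 5, 6 and 7.
parts = partition(0, range(16), 4)
print(sorted(p for p, nodes in parts.items() if nodes))  # [5, 6, 7]
```

Under these assumptions the four corner sources of a 4x4 mesh yield exactly the four three-part tuples listed above.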


Basic Routing Rules

[Figure: source and destination nodes with the N, S, E, and W output directions and the NE and SW regions marked]

North: top right corner. West: top left corner. South: bottom left corner. East: bottom right corner.


Optimized Routing Rules

[Figure: optimized routing rules shown for a source and its destinations; the illustrated case is marked "Deadlock!!!"]


RPM Example-step 1

[Figure: multicast packets (M) leaving the source. Legend: M = multicast packet, source, destination, partitioning]

RPM Example-step 2

[Figure: multicast packets (M) in flight, with one ejection at a destination. Legend: M = multicast packet, source, destination, partitioning]

RPM Example-step 3

[Figure: multicast packets (M) in flight. Legend: M = multicast packet, source, destination, partitioning]

RPM Example-step 4

[Figure: multicast packets (M) in flight, with three ejections at destinations. Legend: M = multicast packet, source, destination, partitioning]

RPM Example-step 5

[Figure: remaining multicast packets (M), with one ejection at a destination. Legend: M = multicast packet, source, destination, partitioning]

Deadlock Avoidance

RPM has no turn restrictions, potentially introducing deadlock. We use Virtual Networks (VNs) to avoid deadlock.

Two VNs lie in the same physical network. The virtual channels of each port are divided equally between the two virtual networks. The virtual network ID (0 or 1) of each packet is decided at the source.

[Figure: two copies of the 4x4 mesh (nodes 0-15), labeled Virtual Network 0 and Virtual Network 1]
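The equal division of virtual channels between the two virtual networks can be sketched as follows. The contiguous assignment (lower half to VN 0, upper half to VN 1) and the function name are assumptions made for illustration. The default of 4 VCs per port is taken from the evaluation configuration. How the source picks the VN ID is not covered here, so the ID is simply passed in.

```python
def vcs_for_vn(vn_id, vcs_per_port=4, num_vns=2):
    """Virtual channel indices at a port that a packet in vn_id may use.

    Assumes a contiguous split: VN 0 gets the lowest indices, VN 1 the next.
    """
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs cannot be divided equally among the VNs")
    if not 0 <= vn_id < num_vns:
        raise ValueError(f"virtual network id must be in 0..{num_vns - 1}")
    share = vcs_per_port // num_vns
    return list(range(vn_id * share, (vn_id + 1) * share))


print(vcs_for_vn(0))  # [0, 1]
print(vcs_for_vn(1))  # [2, 3]
```

Because a packet keeps the VN ID it was given at the source, it only ever competes for the channels returned for that ID.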


Evaluation Methodology

Performance Model: Cycle-accurate network simulator. Models all router pipeline stages in detail; highly parameterized.

Power Model: Orion with both dynamic and leakage power models

Network configuration

Parameter                       Value
Topology                        8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
Routing                         RPM
VC/Port                         4
VC Depth                        4
Packet Length (flits)           4
Unicast Traffic Pattern         Uniform Random (Bit Complement, Transpose)
Multicast Packet Portion        10% (5%, 20%, 40%, 80%)
Multicast Destination Number    0-16 (uniformly distributed)


Uniform Random Traffic

Latency improves by around 50% before network saturation. Network throughput is extended by 40%.

[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15); series: RPM, Mul unicast, VCTM(20%), VCTM(40%), VCTM(80%); annotations: 50%, 40%, 40%]

Link Utilization

[Figure: link utilization (op/cycle, 0 to 0.45) vs. injection rate (flits/cycle/core, 0.01 to 0.45); series: RPM, VCTM(20%), VCTM(40%), VCTM(80%); annotations: 33%, 45%]

At low load, RPM reduces link utilization by 33%. At high load, RPM reduces link utilization by 45%.

Dynamic Power Consumption

[Figure: dynamic power (W, 0 to 12) of RPM and VCTM at injection rates from 0.01 to 0.5 flits/cycle/core, broken down into Buffer, VC Arbiter, SW Arbiter, Xbar, and Link; annotations: 40%, 50%]

Scalability Study-Network Size

[Figure: latency (cycles, 0 to 140) vs. network size (6×6, 8×8, 10×10, 16×16); series: RPM, VCTM; annotation: over 50%]

Scalability Study-Multicast Traffic Portion

[Figure: latency (cycles, 0 to 140) vs. portion of multicast traffic (5%, 10%, 20%, 40%, 80%, 100%); series: RPM, VCTM]

Scalability Study-Destination Number

[Figure: latency (cycles, 0 to 140) vs. maximum number of destinations (4, 8, 16, 32); series: RPM, VCTM]

Conclusion

We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM), which is bandwidth-efficient and scalable.

Performance improvement: up to 50% latency reduction; 33% link utilization reduction

Power savings: up to 40% total dynamic power savings; 25% crossbar and link power savings


Thank you!


Backup


Hardware Implementation of Routing logic


Bit Complement Traffic

[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15) under bit complement traffic; series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%)]

Transpose Traffic

[Figure: latency (cycles, 0 to 120) vs. injection rate (flits/cycle/core, 0.01 to 0.15) under transpose traffic; series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%)]
