

A Low Latency Wormhole Router for Asynchronous On-chip Networks

Wei Song and Doug Edwards

Advanced Processor Technologies Group (APT)
School of Computer Science
University of Manchester, UK

2010-01-20

Outline

• Asynchronous On-chip Networks
– Globally Asynchronous and Locally Synchronous (GALS)
– Quasi Delay Insensitive (QDI) pipeline
– Target: general methods to improve speed

• Solution
– Channel Slicing
– Using Lookahead pipeline on critical cycles

• Outcome
– 32-bit wormhole router
– 41.4% latency reduction with 28.3% area overhead

QDI pipeline

• Low Power
– No clock tree

• Tolerance to Process Variation
– Using delay insensitive handshakes

Asynchronous Data Flow

• One-hot coding
– 01: 0
– 10: 1
– 00: idle, bubble

• Bubble propagation

• Critical cycle
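The one-hot code on this slide can be sketched in a few lines of Python. This is only an illustration: the wire order inside each pair, the function names, and the strict data/spacer alternation (a return-to-zero, 4-phase style) are assumptions, not details given on the slide.

```python
# Code words from the slide, written as a pair of wires.
SPACER = (0, 0)  # "00": idle / bubble
ZERO = (0, 1)    # "01": logic 0
ONE = (1, 0)     # "10": logic 1


def encode_bit(bit):
    """Return the wire pair that carries one data bit."""
    return ONE if bit else ZERO


def decode_pair(pair):
    """Return 0 or 1 for valid data, or None for the spacer (bubble)."""
    if pair == SPACER:
        return None
    if pair == ZERO:
        return 0
    if pair == ONE:
        return 1
    raise ValueError("11 is not a legal code word")


def return_to_zero_stream(bits):
    """Follow every data token with a spacer (assumed 4-phase behaviour)."""
    stream = []
    for bit in bits:
        stream.append(encode_bit(bit))
        stream.append(SPACER)
    return stream
```

Because a valid token always has exactly one wire high, a receiver can tell data from a bubble without a clock, which is what makes the handshake delay insensitive.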

Asynchronous On-chip Network

• NoC
– Network-on-Chip
– A scalable and distributed communication fabric

• GALS
– Synchronous IP Blocks
– Fully asynchronous routers

Data-path Abstraction

[Figure: Switch Allocator above a P×P Crossbar; Input Port 0 to Input Port P-1 on one side, Output Port 0 to Output Port P-1 on the other; ports labelled W]

Synchronized Pipeline Style

Extra latency overhead

1-bit sub-channel
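The figure for this slide did not survive the transcript. Assuming the synchronized style joins the handshakes of all 1-bit sub-channels with C-elements (the conclusion slide mentions a C-element tree), the behaviour of one such element can be sketched as below. The class name and interface are illustrative only.

```python
class CElement:
    """Muller C-element: the output changes only when all inputs agree."""

    def __init__(self, state=0):
        self.state = state

    def update(self, *inputs):
        if all(inputs):
            self.state = 1      # every input is high: output rises
        elif not any(inputs):
            self.state = 0      # every input is low: output falls
        # Mixed inputs: hold the previous output.
        return self.state
```

With one input per sub-channel, the joined signal rises only after the slowest sub-channel has risen, which is where the extra latency comes from.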

Using Independent Sub-channels

Channel Slicing
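One way to see why independent sub-channels help is a toy timing model. All numbers and names here are invented for illustration; the model ignores handshake coupling between stages and the C-element delay itself, so it is not the analysis used by the authors.

```python
def synchronized_time(delays):
    """delays[s][k] is the delay of sub-channel s for flit k.

    When sub-channels are synchronized, every flit waits for the
    slowest sub-channel before the next flit can start.
    """
    return sum(max(step) for step in zip(*delays))


def sliced_time(delays):
    """With independent sub-channels, each one runs at its own pace and
    the frame finishes when the slowest sub-channel finishes."""
    return max(sum(channel) for channel in delays)


# Two sub-channels, three flits, made-up delays.
example = [[1, 3, 1],
           [3, 1, 1]]
```

For `example`, the synchronized model takes 3 + 3 + 1 = 7 time units, while the sliced model takes max(5, 5) = 5, because a slow flit on one sub-channel no longer stalls the others.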

Problem in Flow Control

Re-synchronize sub-channels

Crossbar is shared by all sub-channels

Solution: Re-synchronization

• Re-synchronize once per frame

• Algorithm:
1. Wait for head flit
2. Routing
3. Data transmission (parallel)
4. Tail detected
5. Go to 1

A sub-channel controller for each sub-channel
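The five-step loop above can be read as a small state machine. The sketch below is a software analogy only: the flit representation as `(kind, payload)` tuples, the state names, and the log strings are assumptions made for illustration, and the real controllers are asynchronous circuits rather than sequential code.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_HEAD = auto()   # step 1: wait for head flit
    TRANSMIT = auto()    # step 3: data transmission until the tail


def run_controller(flits):
    """Process (kind, payload) flits, kind in {'head', 'body', 'tail'}."""
    state = State.WAIT_HEAD
    log = []
    for kind, payload in flits:
        if state is State.WAIT_HEAD:
            if kind != "head":
                raise ValueError("a frame must start with a head flit")
            log.append(f"route {payload}")   # step 2: routing
            log.append(f"send {payload}")
            state = State.TRANSMIT
        else:
            if kind == "head":
                raise ValueError("head flit inside an open frame")
            log.append(f"send {payload}")
            if kind == "tail":               # step 4: tail detected
                log.append("release")
                state = State.WAIT_HEAD      # step 5: go to 1
    return log
```

Routing happens once per frame, so the sub-channels only need to agree at frame boundaries and can run independently in between.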

Critical Cycle Analysis

• Long interconnect
– Buffer insertion
– More pipeline stages
– Wave-pipeline

• Crossbar
– High fan-out
– Routing control
– Inside the router
– Critical cycle

Lookahead Pipeline

[Figure: a normal QDI pipeline compared with a lookahead pipeline [Montek Singh, 2007]; phases labelled Initial States, Data forward, Reset || Data]

1. Early acknowledge
2. Don’t need an explicit bubble
3. Not strict QDI

Using Lookahead in Router

• Only utilized on the critical cycle.

• No significant area overhead.

• Timing assumptions are ensured.

A Wormhole Router Design

[Figure: router block diagram with 5 input ports (d_i_0 / ack_i_0 to d_i_4 / ack_i_4), 5 output ports (d_o_0 / ack_o_0 to d_o_4 / ack_o_4), arbiters and ctl blocks; bus widths labelled 80 and 16]

• 5-port router for the mesh topology

• 32-bit data-width
– 16 1-of-4 sub-channels

• 2-stage input buffer
– Control on the ack of the 2nd stage

• 2-stage output buffer
– Make lookahead inside
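The mapping from a 32-bit word to 16 1-of-4 sub-channels can be sketched as follows. Each sub-channel carries two bits as one hot wire out of four. The bit ordering (least significant pair first), the wire numbering, and the function names are assumptions for illustration; the slides do not specify them.

```python
def encode_1of4(word, width=32):
    """Split a word into 2-bit symbols and one-hot encode each on 4 wires."""
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the given width")
    symbols = []
    for i in range(width // 2):
        value = (word >> (2 * i)) & 0b11
        # Exactly one of the four wires is high for a valid symbol.
        symbols.append(tuple(1 if j == value else 0 for j in range(4)))
    return symbols


def decode_1of4(symbols):
    """Rebuild the word from a list of 1-of-4 symbols."""
    word = 0
    for i, wires in enumerate(symbols):
        if len(wires) != 4 or sum(wires) != 1:
            raise ValueError("not a valid 1-of-4 code word")
        word |= wires.index(1) << (2 * i)
    return word
```

A 32-bit word therefore yields 16 symbols, one per sub-channel, and each symbol switches a single wire per transfer.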

Data-path of a Sub-channel

Control signals from the sub-channel controller

Gates for the Lookahead pipeline

Latency Reduction Shown in STG

Implementation and Simulation

• Verilog HDL netlists
– Controllers are generated from STGs using Petrify
– Data-paths are manually designed

• Implementation
– Faraday Standard Cell Library using UMC 130nm technology
– Synopsys DC + ICC + StarRC

• Simulation
– Post-layout simulation with back-annotated latency from RC extraction
– Typical corner (25 °C, 1.2 V)

Speed Performance

• Channel Slicing and Lookahead (CS+LH)
– 590 MHz, 41.4% cycle period reduction

• Channel Slicing only (ChSlice)
– 450 MHz, 24.1% cycle period reduction

• Traditional (without ChSlice or LH)
– 345 MHz

                 CS + LH   ChSlice   Traditional
Cycle period     1.7 ns    2.2 ns    2.9 ns
Router latency   1.7 ns    2.1 ns    2.8 ns
Arbitration      0.8 ns    0.8 ns    0.8 ns
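The percentages on this slide follow from the cycle periods in the table. A short Python check (the names are illustrative, not from the slides):

```python
# Cycle periods in nanoseconds, taken from the table above.
PERIOD_NS = {"CS + LH": 1.7, "ChSlice": 2.2, "Traditional": 2.9}


def frequency_mhz(period_ns):
    """Equivalent frequency of a cycle period given in nanoseconds."""
    return 1000.0 / period_ns


def reduction_percent(period_ns, baseline_ns):
    """Cycle period reduction relative to the baseline design."""
    return 100.0 * (baseline_ns - period_ns) / baseline_ns


baseline = PERIOD_NS["Traditional"]
summary = {
    name: (round(frequency_mhz(p)), round(reduction_percent(p, baseline), 1))
    for name, p in PERIOD_NS.items()
}
```

The reductions come out as 41.4% and 24.1%, matching the bullets. The frequencies come out near 588, 455 and 345 MHz, so the 590 MHz and 450 MHz figures on the slide appear to be rounded.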

Area Consumption

• Area in units of NAND2X1 Gate
• Channel Slicing 23.0% overhead
• Lookahead 5.3% overhead
• Total 28.3% overhead

                CS + LH   ChSlice   Traditional
Input Buffer    6.2K      5.8K      4.3K
Output Buffer   4.5K      4.5K      4.4K
Crossbar        3.3K      3.2K      2.4K
Arbitration     14.5K     13.9K     11.3K
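The overhead bullets can be cross-checked against the bottom row of the table. That row is labelled Arbitration in this transcript, but its ratios reproduce the quoted 23.0% and 28.3% figures, so treating it as the router total is an assumption made here. Function and variable names are illustrative.

```python
# Bottom-row areas in thousands of NAND2X1 gates (assumed to be totals).
AREA_K = {"CS + LH": 14.5, "ChSlice": 13.9, "Traditional": 11.3}


def overhead_percent(area, baseline):
    """Area overhead relative to the baseline design."""
    return 100.0 * (area - baseline) / baseline


baseline = AREA_K["Traditional"]
total = overhead_percent(AREA_K["CS + LH"], baseline)
slicing = overhead_percent(AREA_K["ChSlice"], baseline)
lookahead = total - slicing  # extra overhead added on top of Channel Slicing
```

Rounded to one decimal place, `total`, `slicing` and `lookahead` give 28.3, 23.0 and 5.3.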

Data Width Effect

[Figure: C-element trees of increasing size]

Cycle period increases when sub-channels are synchronized.

Cycle period is fixed when Channel Slicing is in use.
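A rough model of this effect, assuming synchronization is done with a tree of 2-input C-elements: the tree depth grows with the logarithm of the number of sub-channels, and each level adds delay to the cycle. The per-level delay, the base period, and the function names below are hypothetical parameters, not measurements from the slides.

```python
def tree_depth(sub_channels, fan_in=2):
    """Levels of fan_in-input C-elements needed to join all sub-channels."""
    depth = 0
    nodes = sub_channels
    while nodes > 1:
        nodes = -(-nodes // fan_in)  # ceiling division
        depth += 1
    return depth


def modelled_period(base_ns, sub_channels, level_delay_ns, synchronized):
    """Cycle period under the toy model.

    Synchronized pipelines pay one level_delay_ns per tree level;
    with Channel Slicing there is no tree, so the period stays at base_ns.
    """
    if not synchronized:
        return base_ns
    return base_ns + tree_depth(sub_channels) * level_delay_ns
```

Under this model, going from 16 to 64 sub-channels raises the tree depth from 4 to 6 in the synchronized case, while the sliced case is unaffected by data width.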

Compare with Other Routers

• Full standard cell design
• Delay insensitive, tolerance to process variation

               Period    Tech      Special cell library   Pipeline style
MANGO [2005]   1.26 ns   0.12 um   Unknown                Bundled-data
ANoC [2005]    4 ns      0.13 um   Yes                    QDI
ASPIN [2008]   0.88 ns   90 nm     Custom                 Bundled-data
QNoC [2009]    4.8 ns    0.18 um   Std cell               Bundled-data
CS+LH [2010]   1.7 ns    0.13 um   Std cell               Lookahead

Conclusion

• QDI pipelines: low power and tolerant to process variation

• Channel Slicing: no C-element tree

• Lookahead: fast critical cycle

• The wormhole router
– 1.7 ns, 590 MHz
– 41.4% latency reduction with 28.3% area overhead

Thanks! Questions?

Contact info.

Wei Song

[email protected]