
On-chip Network for Manycore Architecture

Myong Hyon “Brandon” Cho

Multicore to Manycore?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

© Tilera Corporation

Intel Xeon E7-x8xx

10 cores

32nm

2011

Westmere-EX architecture

2.4GHz, 30MB L3, 130W (E7-8870)

© Intel Corporation

© Advanced Micro Devices, Inc.

AMD FX 8-core

8 cores

32nm

2012

Vishera (Bulldozer/Piledriver) architecture

4.0GHz, 8MB L3, 125W (FX-8350)

Tilera TILE-Gx72

72 cores

40nm

2013

TILE-Gx architecture

1.0GHz, 18MB L3, ~60W

Multicore as the only way out


[Figure: transistors (in thousands), frequency (MHz), performance, and number of cores, 1965 to 2010, on a log scale from 1.00E-01 to 1.00E+06]

Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic

vs. Other possibilities


© Wikipedia / Jurii

SiGe?

© Wikipedia / AlexanderAIUS

Graphene?

© iStockphoto / Andrey Volodin

Organic?

© The Economist

Quantum?

vs. Other possibilities


Nehalem, Tylersburg, Westmere

Sandy Bridge, Romley, Ivy Bridge

Haswell, Haswell, Rockwell

Skylake, Skylake, Skymont

2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

45nm 32nm 22nm 14nm 10nm

Intel Server Microarchitecture Roadmap, according to computerbase.de, 2011

NoC as the key to manycore success


realizes every communication between cores.

On-chip network

consumes energy proportionally to traffic size.

provides key mechanisms for parallel programming.

Outline

NoC for Manycore

• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)

• System-level Optimization: ENC (NOCS’11)

• Physical-level Design @ 45nm: EM2 Chip (’12/’13)

Network-level Optimization:

As simple as oblivious network, as efficient as adaptive network

PROM – path-based oblivious routing


Path-based, Randomized, Oblivious, Minimal Routing

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas

NoCArc’09

overcomes the limitation of oblivious routing by enhanced path diversity.

Oblivious routing vs Adaptive routing


Oblivious routing
• Local and simple
• Possibly poor resource utilization

Adaptive routing
• Possibly better resource utilization
• Global information required


For on-chip networks…

Because performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.

Poor utilization of oblivious routing


[Figure: flows from SA to DA and from SB to DB on a mesh under DOR (XY)]

Path diversity improves oblivious routing


[Figure: flows from SA to DA and from SB to DB on a mesh under O1TURN]

• Diversity helps improve utilization and reduce congestion.

[Figure: flows from SA to DA and from SB to DB routed through intermediate nodes IA and IB, under Valiant and under ROMM (2-phase)]

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.


Channel Dependency Graph (CDG)

[Figure: queues Q1 to Q4, each blocked (x) waiting on the next, and the cycle they form in the CDG]
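The deadlock condition above can be checked mechanically: build the channel dependency graph and look for a cycle. The following is a minimal sketch, not the author's tool. The dictionary encoding, the function name `has_cycle`, and the reading of an edge `Q1 -> Q2` as "a packet holding Q1 may wait for Q2" are assumptions made for illustration.

```python
from typing import Dict, List


def has_cycle(cdg: Dict[str, List[str]]) -> bool:
    """Return True if the directed graph `cdg` contains a cycle (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in cdg}
    for targets in cdg.values():
        for target in targets:
            color.setdefault(target, WHITE)

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in cdg.get(node, []):
            if color[nxt] == GRAY:  # back edge: a dependency cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(color))


# Hypothetical example: four queues waiting on each other in a ring.
ring = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": ["Q1"]}
# The same queues with the last dependency removed.
chain = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"]}

print(has_cycle(ring), has_cycle(chain))
```

An acyclic graph is the property that deadlock-free routing schemes such as DOR guarantee by construction.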


Deadlock prevention


DOR never creates dependency cycles.

XY and YX paths of O1TURN cause cycles.

O1TURN requires 2 networks to separate them.

Each phase of ROMM causes cycles.

n-phase ROMM uses n networks to separate them.

…which we found to be wrong! n-phase ROMM only requires 2 networks.

Each phase of Valiant causes cycles.

Valiant requires 2 networks to separate them.

Various oblivious routing schemes


 | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant
Path diversity | None | Minimum | Limited | Fair~Large | Large
# networks for deadlock prevention | 1 | 2 | 2 | n* | 2
# hops | minimal | minimal | minimal | minimal | non-minimal
Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt

* erroneously proposed

PROM Routing


Path-based

Oblivious Minimal

Randomized

Goal: Best minimal-path diversity

- Use ALL possible minimal routes
- Each minimal route has the SAME CHANCE to be taken.

PROM Routing

At each hop where there are multiple choices, compare the number of possible minimal paths after each choice.

[Figure: an example route from SA to DA; the first choice splits 75% / 25%, the second 67% / 33%, the third 50% / 50%, and the last hop is taken with 100%]

The chance of this path being taken is:

75% × 67% × 50% × 100% = 25%



Probability Calculation


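• The probability function is reduced to a simple ratio.

[Figure: minimal-path box from SA to DA, with x hops remaining along the X axis and y hops remaining along the Y axis]

$$N_X = \frac{(x+y-1)!}{(x-1)!\,y!}, \qquad N_Y = \frac{(x+y-1)!}{x!\,(y-1)!}$$

$$P_Y = \frac{N_Y}{N_X+N_Y} = \frac{\frac{1}{x!\,(y-1)!}}{\frac{1}{x!\,(y-1)!}+\frac{1}{(x-1)!\,y!}} = \frac{y}{x+y}$$

$$P_X = \frac{x}{x+y}, \qquad P_Y = \frac{y}{x+y} \qquad \text{when } x>0 \text{ and } y>0$$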
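The ratio can be checked by enumerating every minimal path and multiplying the per-hop probabilities. This is a small sketch rather than the router logic itself: the function names `prom_probs` and `path_probabilities` and the encoding of a path as a string of `X` and `Y` moves are assumptions for illustration.

```python
from fractions import Fraction
from math import comb
from typing import Dict, Tuple


def prom_probs(x: int, y: int) -> Tuple[Fraction, Fraction]:
    """Return (P_X, P_Y) when x hops remain along X and y hops remain along Y."""
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("need x >= 0, y >= 0 and at least one remaining hop")
    return Fraction(x, x + y), Fraction(y, x + y)


def path_probabilities(x: int, y: int) -> Dict[str, Fraction]:
    """Map every minimal path (a string of 'X'/'Y' moves) to its probability."""
    if x == 0 and y == 0:
        return {"": Fraction(1)}
    p_x, p_y = prom_probs(x, y)
    result: Dict[str, Fraction] = {}
    if x > 0:
        for tail, p in path_probabilities(x - 1, y).items():
            result["X" + tail] = p_x * p
    if y > 0:
        for tail, p in path_probabilities(x, y - 1).items():
            result["Y" + tail] = p_y * p
    return result


# The slide example: 3 hops along one axis and 1 along the other.
paths = path_probabilities(3, 1)
for path, prob in sorted(paths.items()):
    print(path, prob)
```

With 3 and 1 hops remaining there are `comb(4, 1) = 4` minimal paths, and each comes out at 1/4, matching the 75% × 67% × 50% × 100% product on the earlier slide.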

Large-box Problem


• Paths are equally taken, but links are not.

[Figure: link utilization on the minimal-path box (MPB) between src (SA) and dst (DA)]

When the MPB is large:
- edges are underutilized.
- inner links are congested, possibly with other flows inside.
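Because every minimal path is equally likely, the load on a link equals the fraction of minimal paths that cross it, which can be counted directly. The sketch below assumes a box with the source at (0, 0) and the destination at (X, Y); the function name `x_link_utilization` and the coordinate convention are illustrative, not taken from the slides.

```python
from fractions import Fraction
from math import comb


def x_link_utilization(X: int, Y: int, i: int, j: int) -> Fraction:
    """Fraction of minimal paths (0,0)->(X,Y) that use the link (i,j)->(i+1,j)."""
    if not (0 <= i < X and 0 <= j <= Y):
        raise ValueError("link lies outside the minimal-path box")
    paths_to_link = comb(i + j, i)                         # (0,0) -> (i,j)
    paths_from_link = comb((X - i - 1) + (Y - j), Y - j)   # (i+1,j) -> (X,Y)
    return Fraction(paths_to_link * paths_from_link, comb(X + Y, X))


# A 6-by-6 box: one link in the middle and one on the edge next to the far corner.
center = x_link_utilization(6, 6, 2, 3)
edge = x_link_utilization(6, 6, 5, 0)
print(float(center), float(edge))
```

In this 6-by-6 example the central link carries 200 of the 924 paths while the edge link next to the far corner carries 1, which illustrates the imbalance the slide describes.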

Uniform PROM


Immediate Upstream Router | PX | PY
Don’t care | x/(x+y) | y/(x+y)

Parameterized PROM


Immediate Upstream Router | PX | PY
On the X axis | (x+f)/(x+y+f) | y/(x+y+f)
On the Y axis | x/(x+y+f) | (y+f)/(x+y+f)
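The two rows of the table can be written as one function of the remaining hops, the parameter f, and the axis of the upstream router. This is a sketch under stated assumptions: the name `parameterized_prom` and the string argument `upstream_axis` are hypothetical, both x and y are taken to be positive, and the case of a packet still at its source (no upstream router) is not covered by the table and is left out.

```python
from fractions import Fraction
from typing import Tuple


def parameterized_prom(x: int, y: int, f: int, upstream_axis: str) -> Tuple[Fraction, Fraction]:
    """Return (P_X, P_Y) for x, y remaining hops and bias parameter f >= 0.

    upstream_axis is 'X' if the immediate upstream router lies on the X axis
    (the packet arrived travelling along X), and 'Y' otherwise.
    """
    if x <= 0 or y <= 0 or f < 0:
        raise ValueError("need x > 0, y > 0 and f >= 0")
    denom = x + y + f
    if upstream_axis == "X":
        return Fraction(x + f, denom), Fraction(y, denom)
    if upstream_axis == "Y":
        return Fraction(x, denom), Fraction(y + f, denom)
    raise ValueError("upstream_axis must be 'X' or 'Y'")


for f in (0, 10, 25):
    print(f, parameterized_prom(2, 1, f, "X"))
```

With f = 0 the function reduces to the uniform ratios x/(x+y) and y/(x+y); larger f makes a packet more likely to keep travelling along the axis it arrived on.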

Parameterized PROM


[Figure: link utilization on the minimal-path box under parameterized PROM with f=0, f=10, and f=25]

Deadlock prevention


• Turn Models [Glass et al./J.ACM’94]:
- Each turn model is a set of allowed turns.
- No deadlock if all routes conform to the same turn model.

West-First Turn Model North-Last Turn Model

Deadlock prevention


Any minimal routing on a 2D mesh network conforms to either one of two turn models.*

* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09

No north-east nor south-east turns: conforms to the West-First turn model

No north-west nor south-west turns: conforms to the North-Last turn model

Performance Evaluation


Various oblivious routing schemes


 | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant | PROM
Path diversity | None | Minimum | Limited | Fair~Large | Large | Fair~Large
# networks for deadlock prevention | 1 | 2 | 2 | n* | 2 | 2
# hops | minimal | minimal | minimal | minimal | non-minimal | minimal
Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt | None
Heavy-load performance | Fair | Good | Bad | Worst | Worst | Best

BAN – bandwidth adaptive network


achieves adaptivity with oblivious routing, using locally arbitrated bi-directional network links.

Oblivious Routing in On-Chip Bandwidth-Adaptive Networks

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas

PACT’09

Oblivious routing failure


[Figure: flows from SA to DA and from SB to DB, with one link marked congested]

Where can we do better?


Adaptive Network, not routing


[Figure: flows from SA to DA and from SB to DB, with one link marked as having increased bandwidth]

• A set of bidirectional links connects network nodes.
- The bandwidth of the link in one direction can be increased at the expense of the other direction.

Adaptive Network, not routing


[Figure: link allocation (a) when the yellow flow is dominant and (b) when the gray flow is dominant]

Routes do not change, and arbitration is all local.

BAN Hardware


Most hardware overhead in the crossbar

[Figure: BAN router datapath. A bandwidth allocator between two routers takes a pressure signal from each side and sets the link direction. On each side, 1-to-v DEMUXes and v-to-1 MUXes over virtual channels (1, …, v) connect the link to the crossbar switch, which also has connections from and to other nodes.]

Crossbar – 2 links, Unidirectional


• 4-input, 4-output, 4 Virtual Channels

Crossbar – 2 links, Bidirectional

Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size
Unidirectional | VC-to-Port (fully connected) | 16 | 4 | 64
Bidirectional | | | |

Crossbar Size – 2 links




[Table: # xBar outputs and relative xBar size for the unidirectional, bidirectional, and hybrid configurations]
Crossbar Size – 4 links


The hybrid configuration has a 1.5 times larger crossbar, which typically increases the node size by around 15%.

Bandwidth Allocation


• Local arbiters between any two adjacent routers

[Figure: a bandwidth arbiter between two routers that have 3 flits and 1 flit to send]

The arbitration follows demands from each router, always leaving at least one link in one direction if there is any flit that can move in that direction.
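One way to read the arbitration rule is as a demand-proportional split with a guaranteed minimum of one link per direction that has traffic. The sketch below is an assumption-laden illustration, not the arbiter from the paper: the function name `allocate_links`, the proportional rounding, and the even split when neither side has traffic are all choices made here.

```python
from typing import Tuple


def allocate_links(total_links: int, demand_ab: int, demand_ba: int) -> Tuple[int, int]:
    """Split `total_links` bidirectional links between directions A->B and B->A."""
    if total_links < 2 or demand_ab < 0 or demand_ba < 0:
        raise ValueError("need at least 2 links and non-negative demands")
    if demand_ab == 0 and demand_ba == 0:
        half = total_links // 2          # assumed idle default: even split
        return half, total_links - half
    if demand_ba == 0:
        return total_links, 0            # nothing can move B->A, give A->B everything
    if demand_ab == 0:
        return 0, total_links
    # Both directions have flits: share in proportion to demand, but never
    # leave a direction that has a flit without at least one link.
    share_ab = round(total_links * demand_ab / (demand_ab + demand_ba))
    share_ab = max(1, min(total_links - 1, share_ab))
    return share_ab, total_links - share_ab


# The slide's example demands: 3 flits on one side, 1 flit on the other.
print(allocate_links(4, 3, 1))
```

Because the decision uses only the two demands on either side of one link, it stays local, which matches the claim that routes do not change and arbitration is all local.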

Symmetry vs. Anti-symmetry


Bit-complement Transpose

*Both under dimension order routing

Anti-symmetric Traffic


Symmetric Traffic


Symmetric Traffic with Burstiness


Traffic Pattern | Non-bursty | Bursty
Bit-complement | 0% | 20%
Uniform Random | 8% | 26%

How about real application traffic…?


• The traffic patterns in many real applications are not symmetric as data is processed by a sequence of modules.

System-level Optimization:

autonomous & fine-grained thread migration protocol by NoC

ENC – exclusive native context


provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.

Deadlock-Free Fine-Grained Thread Migration

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas

NOCS’11 – Best Paper Award

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).


ThreadMotion [Rangan et al., ISCA09]

[Figure: a higher voltage/frequency core and a lower voltage/frequency core; threads migrate between them on cache misses and cache hits]

Architectural Core Salvaging [Powell et al., ISCA09]

[Figure: one core has no defects and the other has a defective floating-point unit; threads migrate on floating-point ops]

Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]

[Figure: each core has the only copy of its data on-chip; threads migrate on data misses]

Migration protocols aren’t catching up...

• …use a centralized scheduler (e.g., an OS).
- slow!

• …store contexts in extra buffers or in the memory hierarchy.
- expensive and inefficient!

• …bring restrictions on how threads can migrate.
- cannot exploit the full power of migration!


Need a fast migration protocol that...

• …provides functional correctness for arbitrary migrations.

• …supports autonomous migration scheduling.

• …has a simple & small implementation.


Protocol-level Deadlock


[Figure: protocol-level deadlock among Cores A to F and their Routers A to F]

If an autonomous migration protocol is careless…


• SWAP : A deadlock-prone autonomous migration protocol

• An eviction swaps the locations of two threads.

threads



• 100 random, synthetic migration patterns.
• 64 threads on 64 cores, migrating every 100 cycles.
• Network-level deadlock-free routing (DOR-XY)

[Figure: deadlock (%) against number of hotspots (1 to 4) for four configurations: 2 VCs / no buffer, 4 VCs / no buffer, 2 VCs / 4 contexts, and 2 VCs / 8 contexts]

Exclusive Native Context (ENC) protocols

• Key idea: always accept arrived packets.
- evict a running thread in an unblockable way!

[Figure: a migration arrives at a core that has a running thread A; thread A is evicted]

Migrating threads must not block evicted threads.

Separating virtual channel sets is a simple solution.

[Figure: the evicted thread A moves to its native core, which holds an exclusive space for it]

Each thread has its own native core.
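The accept-and-evict rule can be sketched as a small state update per arriving thread. This is an illustrative model, not the hardware protocol: the `Core` class, the `arrive` function, the single guest context per core, and the channel labels in the returned moves are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Core:
    native: Optional[int] = None  # context reserved for this core's native thread
    guest: Optional[int] = None   # context for visiting threads (assumed: one per core)


def arrive(cores: Dict[int, Core], native_of: Dict[int, int],
           thread: int, dest: int) -> List[Tuple[int, int, str]]:
    """Accept `thread` at core `dest`; return the (thread, core, channel) moves made."""
    moves = [(thread, dest, "migration")]
    core = cores[dest]
    if native_of[thread] == dest:
        core.native = thread          # the native context is never taken by others
        return moves
    # The arrival is always accepted: whoever holds the guest context is evicted.
    evicted, core.guest = core.guest, thread
    if evicted is not None:
        home = native_of[evicted]
        cores[home].native = evicted  # always succeeds, so evictions cannot block
        moves.append((evicted, home, "eviction"))
    return moves


# Hypothetical example: three cores, thread i is native to core i.
cores = {i: Core() for i in range(3)}
native_of = {0: 0, 1: 1, 2: 2}
first = arrive(cores, native_of, 0, 2)   # thread 0 visits core 2
second = arrive(cores, native_of, 1, 2)  # thread 1 arrives and evicts thread 0
print(first, second)
```

The eviction move is labelled separately from the migration move to reflect the slide's point that the two kinds of traffic travel on separate virtual channel sets.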


Application performance results


• Total migration distance: no overhead in real applications

[Figure: normalized total hop count of SWAP, SWAPinf, and ENC for RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]

• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: normalized completion time of SWAP, SWAPinf, and ENC for RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]

Physical-level Design:

NoC router implementation for EM2 (IBM SOI 45nm)

EM2 Implementation - Overview


110-core Shared Memory Processor

ISA | EM2 Stack ISA
Shared Memory Architecture | 1. EM2; 2. RA (Remote Access); 3. EM2+RA
Cache | 8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits
Technology | IBM SOI12SO 45nm
IP | ARM sc12 library (high voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc.

NoC router specification for EM2


Channels | Six independent 64-bit channels: Migration (EM2): Migration, Eviction; Remote Access: Request, Response; DRAM Access: Request, Response
Communication | Unicast, in-order
Architectural Performance | 1 cycle/hop
Scheduling Algorithm | Maximal scheduling
Routing | DOR
Network Buffer | Single 4-flit ingress buffer for each port

6 Independent Physical Networks


[Figure: 6-network router with maximal scheduling, 330um by 330um]

Metal Layers | Usage
m1, m2, m3 | Local logic
c1, c2 | Local routing
b1, b2, b3 | Remote routing / power grid
ua, ub | Global power grid
lb | Chip IO

Six 64-bit networks need a width of 222um.

Tile Floorplanning

[Figure: tile floorplan for the EM2 tile, 855um by 917um, showing the router, core, predictor, 32KB D$, and 8KB I$]

Placement Results

[Figure: placed tile showing ROUTER, CORE, PREDICTOR, four D$ blocks, two I$ blocks, D$ tags, and I$ tags]

EM2 tile

Width | 855um
Height | 917um
RC extracted STA (@typical) | Working frequency 105MHz; hold time slack 0.2ns
Power estimation (10% activity) | 50mW

Chip Floorplanning


Connecting Router Links


Chip-level Clock Tree

Tile-level Clock Tree

EM2 chip

Width | 10mm
Height | 10mm
Transistors | ~357 million
Tile array | 11-by-10 EM2 tile array
Effort | 18 man-months

[Figure: chip layout showing CLK, D-CAPs, and I/O; the EM2 tile array lies below the top 2 metal layers]

More Link Bandwidth?


Wires connecting to router pins


Thread Concentration on 64-core EM2*

EM2 only (no RA) | Barnes | LU-contiguous | Ocean-contiguous | Radix | Water-n-squared
Maximum | 5 | 18 | 15 | 64 | 5
Average | 2.2 | 1.6 | 6.8 | 4.1 | 2.1

* simulated for a 64-core version of EM2

Application Migration Patterns

Applications can saturate the resource cap


In YX routing, threads going into the ‘hot core’ are more congested on the horizontal links.

Applications can saturate the resource cap


In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.

BAN on EM2 (Simulation study)


[Figure: average migration latency, normalized, of UN and BAN for BARNES, LU, OCEAN, RADIX, and WATER, with EM2 only (no RA)]

* simulated for a 64-core version of EM2


Extra slides


Crossbar Size – 2 lanes

Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size
Unidirectional | Port-to-Port (w/ input VC mux) | 4 | 4 | 16
Bidirectional | Port-to-Port (w/ input VC mux) | 8 | 8 | 64
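In every row of the crossbar tables the relative size equals the number of inputs times the number of outputs, so the entries can be reproduced with a few lines. The sketch assumes that product rule and uses hypothetical helper names (`crossbar_size`, `vc_to_port`, `port_to_port`); it is a reading of the table, not a statement about how the area was measured.

```python
from typing import Tuple


def crossbar_size(inputs: int, outputs: int) -> int:
    """Relative crossbar size, assumed here to be inputs * outputs."""
    return inputs * outputs


def vc_to_port(ports: int, vcs: int) -> Tuple[int, int]:
    """Fully connected VC-to-port switch: every virtual channel is a crossbar input."""
    return ports * vcs, ports


def port_to_port(ports: int) -> Tuple[int, int]:
    """Port-to-port switch with an input VC mux: one crossbar input per port."""
    return ports, ports


print(crossbar_size(*vc_to_port(4, 4)))   # 4 ports, 4 virtual channels
print(crossbar_size(*port_to_port(4)))    # unidirectional, port-to-port
print(crossbar_size(*port_to_port(8)))    # bidirectional, 8 inputs and 8 outputs
```

Under this rule, the bidirectional port-to-port switch and the unidirectional VC-to-port switch have the same relative size of 64 from different input and output counts.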

Link Arbitration Frequency

• How frequently do directions need to change?

• Few links change their directions in 10~20 cycles.

Infrequent Link Arbitration

[Figure: curves for unidirectional links, N=1, and N=100]



[Figure: dependency graph over the channels Core C to Router C, Router C to Router D, Router D to Core D, Core D to Router D, Router D to Router C, and Router C to Core C, for traffic from C to D and from D to C]

Packets are assumed to be consumed at the destination.

Cyclic Resource Dependency Graph

[Figure: node1 (core1) and node2 (core2) joined by the network; the graph over resources C1N1, N1C1, N2C2, C2N2, NetN1, and NetN2 for migration traffic]

Acyclic Resource Dependency Graph

[Figure: the same nodes with added resources N1Net, N2Net, and NetNative, carrying migration and eviction traffic]

Exclusive Native Context Zero (ENC0)

• ENC0: A thread always visits its native core first!

[Figure: threads and their native cores]

Exclusive Native Context (ENC)

• ENC0: A thread always visits its native core first!

• ENC: A thread goes to its native core only if evicted by another thread.

[Figure: threads A and B and their native cores]

• ENC saved 10 network hops (52.6%) in this example.

• Moving a thread context onto the network must be atomic (extra logic cost).

Execution Migration Machine (EM2)

• In many parallel applications, each thread mostly works on its private data.

• In EM2, a migrating thread mostly returns to a specific core.


Memory accesses on home core

Round Robin Scheduling


[Figure: an RR counter (+1) drives a MUX over the input ports “N”, “E”, “W”, “S”, “C”; the selected port wins the output port]

“Bubble” cycles when no flit is available on an input port (non-maximal).

Maximal Scheduling

[Figure: five MUXes, each over a different rotation of “N”, “E”, “W”, “S”, “C”, feed fixed priority logic (left-to-right); the RR counter (+1) selects the rotation, and the chosen port wins the output port]

Maximal scheduling without bubbles

Area cost: 6.7% (Tile)
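The difference between the two arbiters can be shown in a few lines: plain round robin grants only the port the counter points at, while the maximal version rotates a fixed-priority order by the counter and grants the first port that has a flit. This is a behavioural sketch under assumptions: the function names, the use of a set of requesting ports, and the rotation direction are illustrative and not taken from the router RTL.

```python
from typing import Optional, Set

PORTS = ["N", "E", "W", "S", "C"]  # four mesh directions plus the core port


def round_robin_grant(counter: int, requests: Set[str]) -> Optional[str]:
    """Plain round robin: only the port selected by the counter may win."""
    port = PORTS[counter % len(PORTS)]
    return port if port in requests else None  # None models a bubble cycle


def maximal_grant(counter: int, requests: Set[str]) -> Optional[str]:
    """Rotate the priority order by the counter, then apply fixed priority."""
    start = counter % len(PORTS)
    for port in PORTS[start:] + PORTS[:start]:
        if port in requests:
            return port
    return None  # idle only when no input port has a flit


# Only the "W" input has a flit while the counter points at "N".
print(round_robin_grant(0, {"W"}), maximal_grant(0, {"W"}))
```

In the example the plain round-robin arbiter wastes the cycle, while the maximal arbiter still grants the output port to "W".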



• Total migration distance: no overhead in real applications

[Figure: normalized hop count of SWAP, SWAPinf, ENC0, and ENC; three bars are marked DEADLOCK]

• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: normalized completion time of SWAP, SWAPinf, ENC0, and ENC; three bars are marked DEADLOCK]
