On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Post on 28-Dec-2015


Page 1: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

On-chip Network for Manycore Architecture

Myong Hyon “Brandon” Cho

Page 2: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Multicore to Manycore?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Intel Xeon E7-x8xx (© Intel Corporation)
• 10 cores, 32nm, 2011
• Westmere-EX architecture
• 2.4GHz, 30MB L3, 130W (E7-8870)

AMD FX 8-core (© Advanced Micro Devices, Inc.)
• 8 cores, 32nm, 2012
• Vishera (Bulldozer/Piledriver) architecture
• 4.0GHz, 8MB L3, 125W (FX-8350)

Tilera TILE-Gx72 (© Tilera Corporation)
• 72 cores, 40nm, 2013
• TILE-Gx architecture
• 1.0GHz, 18MB L3, ~60W

Page 3: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Multicore as the only way out

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: log-scale plot, 1965 to 2010, of Transistors (in thousands), Frequency (MHz), Performance, and Number of cores; vertical axis from 1.00E-01 to 1.00E+06.]

Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic

Page 4: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

vs. Other possibilities

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

© Wikipedia / Jurii

SiGe?

© Wikipedia / AlexanderAIUS

Graphene?

© iStockphoto / Andrey Volodin

Organic?

© The Economist

Quantum?

Page 5: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

vs. Other possibilities

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Intel Server Microarchitecture Roadmap, according to computerbase.de, 2011

Timeline: 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

Process nodes: 45nm, 32nm, 22nm, 14nm, 10nm

Nehalem, Tylersburg, Westmere

Sandy Bridge, Romley, Ivy Bridge

Haswell, Haswell, Rockwell

Skylake, Skylake, Skymont

Page 6: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

NoC as the key to manycore success

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

On-chip network

• realizes every communication between cores.

• consumes energy in proportion to traffic size.

• provides key mechanisms for parallel programming.

Page 7: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Outline

NoC for Manycore

• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)

• System-level Optimization: ENC (NOCS’11)

• Physical-level Design @ 45nm: EM2 Chip (’12/’13)

Page 8: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Network-level Optimization:

As simple as oblivious network, as efficient as adaptive network

Page 9: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM – path-based oblivious routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Path-based, Randomized, Oblivious, Minimal Routing

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas

NoCArc’09

overcomes the limitation of oblivious routing by enhanced path diversity.

Page 10: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Oblivious routing vs Adaptive routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Local and Simple

Oblivious routing

Possibly poor resource utilization

Possibly better resource utilization

Adaptive routing

Global information required

Page 11: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Oblivious routing vs Adaptive routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Local and Simple

Oblivious routing

Possibly poor resource utilization

Possibly better resource utilization

Adaptive routing

Global information required

For on-chip networks…

Because performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.

Page 12: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Poor utilization of oblivious routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

SB

DB

DA

SA

DOR (XY)

Page 13: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Path diversity improves oblivious routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

SB

DB

DA

SA

O1TURN

• Diversity helps improve utilization and reduce congestion.

Page 14: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Path diversity improves oblivious routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Diversity helps improve utilization and reduce congestion.

[Figure: mesh diagrams of flows A and B routed from sources SA, SB to destinations DA, DB via intermediate nodes IA, IB. Panel titles: Valiant; ROMM (2-phase).]

Page 15: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Q1

Page 16: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

x

Q1

Q2

Q1

Q2

Channel Dependency Graph (CDG)

Page 17: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

x

x

Q1

Q3

Q2

Q1

Q2

Q3

Channel Dependency Graph (CDG)

Page 18: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

x

x

x

x

Q1

Q3

Q2Q4

Q1

Q2

Q3

Q4

Channel Dependency Graph (CDG)
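The deadlock condition above can be checked mechanically: a routing scheme is at risk when its channel dependency graph contains a cycle. The following is a minimal sketch, not taken from the slides. It assumes the four channels Q1..Q4 depend on each other in a ring (the edge directions are an assumption, since the figure did not survive), and the function name `has_dependency_cycle` is hypothetical.

```python
from typing import Dict, List


def has_dependency_cycle(cdg: Dict[str, List[str]]) -> bool:
    """Return True if the channel dependency graph (adjacency lists) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in cdg}
    for targets in cdg.values():
        for target in targets:
            color.setdefault(target, WHITE)

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in cdg.get(node, []):
            if color[nxt] == GRAY:  # back edge: a dependency cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(color))


# Assumed ring of dependencies: each channel waits on the next one.
ring = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": ["Q1"]}
# Same channels with the last dependency removed.
chain = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": []}

ring_deadlocks = has_dependency_cycle(ring)
chain_deadlocks = has_dependency_cycle(chain)
```

Under these assumptions the ring is reported as cyclic and the chain as acyclic, which is the distinction the deadlock-prevention slides rely on.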

Page 19: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Deadlock prevention

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DOR never creates dependency cycles.

XY and YX paths of O1TURN cause cycles.

O1TURN requires 2 networks to separate them.

Each phase of ROMM causes cycles.

n-phase ROMM uses n networks to separate them.

Each phase of Valiant causes cycles.

Valiant requires 2 networks to separate them.

…which we found to be wrong! n-phase ROMM only requires 2 networks.

Page 20: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Various oblivious routing schemes

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Scheme | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant
Path diversity | None | Minimum | Limited | Fair~Large | Large
# networks for deadlock prevention | 1 | 2 | 2 | n* | 2
# hops | minimal | minimal | minimal | minimal | non-minimal
Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt

* erroneously proposed

Page 21: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Path-based

Oblivious Minimal

Randomized

Goal: Best minimal-path diversity

- Use ALL possible minimal routes.
- Each minimal route has the SAME CHANCE to be taken.

Page 22: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

25%

75%

…compare the number of possible minimal paths after each choice

Page 23: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA 75%

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

Page 24: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA

33%

67%75%

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

Page 25: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA 67%75%

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

Page 26: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA 67%75%

50%

50%

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

Page 27: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

DA

SA 67%75%

50%

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

Page 28: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

PROM Routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Path-based

Oblivious Minimal

DA

SA 67%75%

50%100%

Randomized

The chance of this path to be taken is:

75%×67%×50%×100%= 25%

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice
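The per-hop choice walked through on these slides can be reproduced in a few lines. This is a sketch under stated assumptions: the example box is taken to be 3 hops in X by 1 hop in Y (inferred from the 75%/25% split, the axes could equally be swapped), the per-hop probability is the ratio of remaining hops, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def prom_hop_probabilities(x: int, y: int):
    """Return (P_X, P_Y) for a packet with x and y hops left to its destination."""
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("need non-negative hop counts with at least one hop left")
    return Fraction(x, x + y), Fraction(y, x + y)


def path_probability(path: str, x: int, y: int) -> Fraction:
    """Probability that PROM follows `path`, a string of 'X' and 'Y' hops."""
    prob = Fraction(1)
    for step in path:
        p_x, p_y = prom_hop_probabilities(x, y)
        chosen = p_x if step == "X" else p_y
        if chosen == 0:  # the path leaves the minimal-path box
            return Fraction(0)
        prob *= chosen
        if step == "X":
            x -= 1
        else:
            y -= 1
    return prob


# The path from the slide: 3/4 * 2/3 * 1/2 * 1
slide_path = path_probability("XXXY", 3, 1)

# Every distinct minimal path in the assumed 3-by-1 box.
all_paths = sorted(set("".join(p) for p in permutations("XXXY")))
all_probs = {p: path_probability(p, 3, 1) for p in all_paths}
```

In this box there are four minimal paths and each one comes out at 1/4, which matches the stated goal that every minimal route has the same chance of being taken.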

Page 29: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Probability Calculation

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• The probability function is reduced to a simple ratio.

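[Figure: minimal-path box from SA to DA, with x hops remaining along the X axis and y hops remaining along the Y axis.]

Number of possible minimal paths after choosing the X direction or the Y direction:

$N_X = \frac{(x+y-1)!}{(x-1)!\,y!}$, $N_Y = \frac{(x+y-1)!}{x!\,(y-1)!}$

$P_Y = \frac{N_Y}{N_X+N_Y} = \frac{\frac{1}{x!\,(y-1)!}}{\frac{1}{x!\,(y-1)!} + \frac{1}{(x-1)!\,y!}} = \frac{y}{x+y}$

$P_X = \frac{N_X}{N_X+N_Y} = \frac{x}{x+y}$

when $x>0$ and $y>0$.

P_X | P_Y
$\frac{x}{x+y}$ | $\frac{y}{x+y}$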
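The reduction of the path-count ratio to a simple ratio of remaining hops can be sanity-checked numerically. The sketch below is illustrative only: it counts the remaining minimal paths with binomial coefficients and compares the resulting probabilities with x/(x+y) and y/(x+y). The function names and the tested range of box sizes are assumptions.

```python
from fractions import Fraction
from math import comb


def remaining_paths(x: int, y: int):
    """Minimal paths left after taking an X hop (N_X) or a Y hop (N_Y)."""
    n_x = comb(x + y - 1, y)  # (x+y-1)! / ((x-1)! y!)
    n_y = comb(x + y - 1, x)  # (x+y-1)! / (x! (y-1)!)
    return n_x, n_y


def ratio_matches(x: int, y: int) -> bool:
    """Check N_X/(N_X+N_Y) == x/(x+y) and N_Y/(N_X+N_Y) == y/(x+y)."""
    n_x, n_y = remaining_paths(x, y)
    total = n_x + n_y
    return (Fraction(n_x, total) == Fraction(x, x + y)
            and Fraction(n_y, total) == Fraction(y, x + y))


# Check every box with 1 to 8 hops in each dimension (x > 0 and y > 0).
all_ok = all(ratio_matches(x, y) for x in range(1, 9) for y in range(1, 9))
```

The practical point is that a router never needs factorials: it only compares the two remaining hop counts.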

Page 30: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Large-box Problem

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Paths are equally taken, but links are not.

[Figure: link utilization on the minimal-path box (MPB) between src (SA) and dst (DA).]

When the MPB is large
- edges are underutilized.
- inner links are congested, possibly with other flows inside.

Page 31: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Uniform PROM

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Immediate Upstream Router | P_X | P_Y
Don’t care | $\frac{x}{x+y}$ | $\frac{y}{x+y}$

Page 32: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Parameterized PROM

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Immediate Upstream Router | P_X | P_Y
On the X axis | $\frac{x+f}{x+y+f}$ | $\frac{y}{x+y+f}$
On the Y axis | $\frac{x}{x+y+f}$ | $\frac{y+f}{x+y+f}$
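The table can be turned directly into a small routine. This is a sketch, not the routers' actual logic: it assumes x and y are the remaining hop counts, f is a non-negative tuning parameter, and the caller knows on which axis the immediate upstream router lies. The function name and argument names are hypothetical.

```python
from fractions import Fraction


def parameterized_prom(x: int, y: int, f: int, upstream_axis: str):
    """Return (P_X, P_Y) given remaining hops x, y, parameter f, and upstream axis."""
    if x <= 0 or y <= 0:
        raise ValueError("a real choice exists only when x > 0 and y > 0")
    if f < 0:
        raise ValueError("f is assumed to be non-negative")
    if upstream_axis not in ("X", "Y"):
        raise ValueError("upstream_axis must be 'X' or 'Y'")
    total = x + y + f
    if upstream_axis == "X":
        # Packet arrived along X: bias toward continuing in X.
        return Fraction(x + f, total), Fraction(y, total)
    # Packet arrived along Y: bias toward continuing in Y.
    return Fraction(x, total), Fraction(y + f, total)


uniform_like = parameterized_prom(3, 1, 0, "X")   # f = 0
biased_to_y = parameterized_prom(3, 1, 10, "Y")   # f = 10, arrived along Y
```

With f = 0 both rows collapse to x/(x+y) and y/(x+y), the same ratios as uniform PROM. Larger f makes a packet more likely to keep going straight, which pushes traffic toward the edges of the minimal-path box.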

Page 33: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Parameterized PROM

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: link utilization on the minimal-path box under parameterized PROM, for f=0, f=10, and f=25.]

Page 34: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Deadlock prevention

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Turn Models [Glass et al./J.ACM’94]:
- Each turn model is a set of allowed turns.
- No deadlock if all routes conform to the same turn model.

West-First Turn Model North-Last Turn Model

Page 35: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Deadlock prevention

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Any minimal routing on a 2D mesh network conforms to either one of two turn models.*

* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09

No north-east nor south-east turns: conforms to the West-First turn model

No north-west nor south-west turns: conforms to the North-Last turn model

Page 36: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Performance Evaluation

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 37: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Performance Evaluation

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 38: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Various oblivious routing schemes

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Scheme | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant | PROM
Path diversity | None | Minimum | Limited | Fair~Large | Large | Fair~Large
# networks for deadlock prevention | 1 | 2 | 2 | n* | 2 | 2
# hops | minimal | minimal | minimal | minimal | non-minimal | minimal
Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt | None
Heavy-load performance | Fair | Good | Bad | Worst | Worst | Best

* erroneously proposed

Page 39: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

BAN – bandwidth adaptive network

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

achieves adaptivity with oblivious routing, using locally arbitrated bi-directional network links.

Oblivious Routing in On-Chip Bandwidth-Adaptive Networks

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas

PACT’09

Page 40: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Oblivious routing failure

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

SA

SB

DB

DA

congested

Page 41: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Where can we do better?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 42: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Adaptive Network, not routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

SA

SB

DB

DA

Increased bandwidth

• A set of bidirectional links connects network nodes.
- The bandwidth of the link in one direction can be increased at the expense of the other direction.

Page 43: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Adaptive Network, not routing

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: two panels showing flows between nodes SA, SB, DA, and DB.]

(a) When yellow flow is dominant

(b) When gray flow is dominant

Routes do not change, and arbitration is all local.

Page 44: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

BAN Hardware

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Most hardware overhead in the crossbar

[Figure: two adjacent routers, each with a 1-to-v DEMUX, a v-to-1 MUX over (1, …, v), and an Xbar switch, with links from and to other nodes. A Bandwidth Allocator between them receives a pressure signal from each side and outputs a direction signal; unused inputs are marked nop.]

Page 45: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Crossbar – 2 links, Unidirectional

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• 4-input, 4-output, 4 Virtual Channels

Page 46: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Crossbar – 2 links, Bidirectional

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• 4-input, 4-output, 4 Virtual Channels

Page 47: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Crossbar Size – 2 links

• 4-input, 4-output, 4 Virtual Channels

Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size
Unidirectional | VC-to-Port (fully connected) | 16 | 4 | 64
Bidirectional | VC-to-Port (fully connected) | 16 | 8 | 128

Page 48: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Crossbar Size – 4 links

Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size
Unidirectional | VC-to-Port (fully connected) | 16 | 8 | 128
Bidirectional | VC-to-Port (fully connected) | 16 | 16 | 256
Hybrid | VC-to-Port (fully connected) | 16 | 12 | 192

The hybrid configuration has a 1.5 times larger crossbar, which typically increases the node size by around 15%.
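The relative crossbar sizes in the 4-link table are consistent with a simple inputs-times-outputs estimate. The short sketch below reproduces them under that assumption. The product model and the names used are illustrative, not a statement of how the area was actually measured.

```python
def relative_xbar_size(inputs: int, outputs: int) -> int:
    """Assumed model: crossbar size scales with the number of crosspoints."""
    return inputs * outputs


# (# xBar inputs, # xBar outputs) for the 4-link configurations in the table.
configs_4_links = {
    "Unidirectional": (16, 8),
    "Bidirectional": (16, 16),
    "Hybrid": (16, 12),
}

sizes = {name: relative_xbar_size(i, o) for name, (i, o) in configs_4_links.items()}

# Ratio behind the "1.5 times larger crossbar" remark.
hybrid_ratio = sizes["Hybrid"] / sizes["Unidirectional"]
```

The same product also matches the 2-link rows (16 × 4 = 64 and 16 × 8 = 128).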

Page 49: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Bandwidth Allocation

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Local arbiters between any two adjacent routers

[Figure: a Bandwidth Arbiter between two routers, one with 3 flits waiting and the other with 1 flit.]

The arbitration follows the demand from each router, always leaving at least one link in a given direction if there is any flit that can move in that direction.
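One way to picture the arbitration rule is a demand-proportional split with a one-link floor for any direction that has traffic. The sketch below is an assumption-heavy illustration: the slides do not say how demand is turned into a link count, so the proportional rounding, the even split when idle, and the function name `allocate_links` are all hypothetical.

```python
def allocate_links(total_links: int, demand_a_to_b: int, demand_b_to_a: int):
    """Return (links A->B, links B->A) for one pair of adjacent routers."""
    if total_links < 2:
        raise ValueError("need at least two links to serve both directions")
    if demand_a_to_b < 0 or demand_b_to_a < 0:
        raise ValueError("demands must be non-negative")

    total_demand = demand_a_to_b + demand_b_to_a
    if total_demand == 0:
        # Nothing to send: split evenly (an arbitrary choice for this sketch).
        a_links = total_links // 2
        return a_links, total_links - a_links

    # Proportional share, rounded to the nearest whole link.
    a_links = (total_links * demand_a_to_b + total_demand // 2) // total_demand

    # Always leave at least one link for a direction that has a flit to move.
    if demand_a_to_b > 0:
        a_links = max(a_links, 1)
    if demand_b_to_a > 0:
        a_links = min(a_links, total_links - 1)
    return a_links, total_links - a_links


slide_case = allocate_links(4, 3, 1)      # 3 flits one way, 1 flit the other
lopsided = allocate_links(4, 100, 1)      # heavy demand must still leave one link
one_way = allocate_links(4, 5, 0)         # no opposing traffic at all
```

The floor is what keeps the low-demand direction from being starved, and because each decision uses only the two neighbors' demands, the arbitration stays local.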

Page 50: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Symmetry vs. Anti-symmetry

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Bit-complement Transpose

*Both under dimension order routing

Page 51: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Anti-symmetric Traffic

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 52: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Symmetric Traffic

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 53: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Symmetric Traffic with Burstiness

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Traffic Pattern Non-bursty Bursty

Bit-complement 0% 20%

Uniform Random 8% 26%

Page 54: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

How about real application traffic…?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• The traffic patterns in many real applications are not symmetric as data is processed by a sequence of modules.

Page 55: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

System-level Optimization:

autonomous & fine-grained thread migration protocol by NoC

Page 56: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

ENC – exclusive native context

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.

Deadlock-Free Fine-Grained Thread Migration

Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas

NOCS’11 – Best Paper Award

Page 57: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

ThreadMotion [Rangan et al., ISCA09]

Higher Voltage/Frequency

Lower Voltage/Frequency

cache misses cache hits

Page 58: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Architectural Core Salvaging [Powell et al., ISCA09]

has no defects

floating-point ops

has a defective floating-point unit

Page 59: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]

Each has the only copy of data on-chip.

data misses

Page 60: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Migration protocols aren’t catching up...

• …use a centralized scheduler (e.g., an OS).
- slow!

• …store contexts in extra buffer or in the memory hierarchy.
- expensive and inefficient!

• …bring restrictions on how threads can migrate.
- cannot exploit the full power of migration!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 61: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Need a fast migration protocol that...

• …provides functional correctness for arbitrary migrations.

• …supports autonomous migration scheduling.

• …with a simple & small implementation.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 62: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Protocol-level Deadlock

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: six cores A to F, each attached to its own router (Router A to Router F), with thread contexts labeled A, B, C, C, D, D, E, F in transit between them.]

Page 63: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

If an autonomous migration protocol is careless…


• SWAP : A deadlock-prone autonomous migration protocol

• An eviction swaps the locations of two threads.

threads

Page 64: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Protocol-level Deadlock

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• 100 random, synthetic migration patterns.
• 64 threads on 64 cores, migrating every 100 cycles.
• Network-level deadlock-free routing (DOR-XY).

[Figure: Deadlock (%), 0 to 100, versus Number of Hotspots, 1 to 4, for four configurations: 2 VCs / No Buffer, 4 VCs / No Buffer, 2 VCs / 4 contexts, and 2 VCs / 8 contexts.]

Page 65: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Exclusive Native Context (ENC) protocols

• Key idea: always accept arrived packets.
- evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 66: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• Key idea: always accept arrived packets. - evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

core

a running thread A

Exclusive Native Context(ENC) protocols

Page 67: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• Key idea: always accept arrived packets. - evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

coremigration

a running thread A

eviction

Exclusive Native Context(ENC) protocol

Page 68: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• Key idea: always accept arrived packets. - evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

coremigration

eviction

a running thread A

migrating threads must not block evicted threads.

Exclusive Native Context(ENC) protocol

Page 69: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• Key idea: always accept arrived packets. - evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

coremigration

eviction

a running thread A

Separating virtual channel sets is a simple solution.

Exclusive Native Context(ENC) protocol

Page 70: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• Key idea: always accept arrived packets. - evict a running thread in an unblockable way!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

coremigration

eviction

a running thread A

native core

exclusivespace

Each thread has its own native core.

Exclusive Native Context(ENC) protocol
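The key idea, always accept an arriving thread and send any displaced thread to a place that is guaranteed to take it, can be modeled in a few lines. This is a simplified sketch rather than the protocol from the paper: it assumes one shared guest context per core in addition to the exclusive native context, ignores the separate virtual channel sets, and uses hypothetical names throughout.

```python
class EncChip:
    """Toy model: each thread owns an exclusive context on its native core."""

    def __init__(self, native_core_of):
        self.native_core_of = dict(native_core_of)  # thread -> native core id
        self.location = dict(native_core_of)        # every thread starts at home
        self.guest = {}                             # core id -> visiting thread

    def migrate(self, thread, dest):
        """Move `thread` to `dest`; return the (thread, src, dst) moves performed."""
        moves = []
        src = self.location[thread]
        if src == dest:
            return moves
        if self.guest.get(src) == thread:
            del self.guest[src]                     # free the guest context we held
        if dest != self.native_core_of[thread]:
            victim = self.guest.get(dest)
            if victim is not None:
                # Eviction: the victim's native context is reserved for it,
                # so its native core can always accept it.
                home = self.native_core_of[victim]
                self.location[victim] = home
                moves.append((victim, dest, home))
            self.guest[dest] = thread
        self.location[thread] = dest
        moves.append((thread, src, dest))
        return moves


chip = EncChip({"A": 0, "B": 1, "C": 2})
first = chip.migrate("A", 2)    # A visits core 2 as a guest
second = chip.migrate("B", 2)   # B arrives at core 2, so A is evicted home
third = chip.migrate("B", 1)    # B returns to its own native core
```

In this model an arriving thread is never refused: either the guest context is free, or its occupant leaves for a context that no other thread can occupy.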

Page 71: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Application performance results

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Total migration distance : no overhead in real applications

[Figure: Normalized Total Hop Count, 0 to 1.2, for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, and ENC; the label DEADLOCK appears three times in the chart.]

Page 72: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Application performance results

• Completion time : 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: Normalized Completion Time, 0 to 2, for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, and ENC; the label DEADLOCK appears three times in the chart.]

Page 73: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Physical-level Design:

NoC router implementation for EM2 (IBM SOI 45nm)

Page 74: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

EM2 Implementation - Overview

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

110-core Shared Memory Processor

ISA: EM2 Stack ISA

Shared Memory Architecture: 1. EM2, 2. RA (Remote Access), 3. EM2+RA

Cache: 8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits

Technology: IBM SOI12SO 45nm

IP: ARM sc12 library (High voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc.

Page 75: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

NoC router specification for EM2

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Communication: Unicast, in-order

Architectural Performance: 1 cycle/hop

Scheduling Algorithm: Maximal scheduling

Routing: DOR

Network Buffer: Single 4-flit ingress buffer for each port

Channels: Six independent 64-bit channels
- Remote Access: Request, Response
- Migration (EM2): Migration, Eviction
- DRAM Access: Request, Response

Page 76: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

6 Independent Physical Networks

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: 6-network router with maximal scheduling, 330um by 330um.]

Metal Layers | Usage
m1, m2, m3 | Local logic
c1, c2 | Local routing
b1, b2, b3 | Remote routing / power grid
ua, ub | Global power grid
lb | Chip IO

Six 64-bit networks need a width of 222um.

Page 77: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Tile Floorplanning

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: tile floorplan with blocks labeled Router, Core, Predictor, 32KB D$, and 8KB I$.]

Page 78: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Tile Floorplanning

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Tile floorplan for EM2 tile: 855um by 917um

Page 79: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Tile Floorplanning

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Placement Results

ROUTER

CORE

PREDICTOR

Page 80: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

EM2 tile

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Width: 855um

Height: 917um

RC extracted STA (@typical):
- Working Frequency: 105MHz
- Hold time Slack: 0.2ns

Power Estimation (10% activity): 50mW

[Figure: tile layout with four D$ blocks, two I$ blocks, D$ tags, and I$ tags.]

Page 81: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Chip Floorplanning

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Page 82: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Connecting Router Links

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Chip-level Clock Tree

B

Tile-level Clock Tree

A

Page 83: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

EM2 chip

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Width: 10mm

Height: 10mm

~357 Million Transistors

11-by-10 EM2 tile array

18 man-months

[Figure: chip layout with CLK, D-CAPs, and I/O labeled; the EM2 tile array lies below the top 2 metal layers.]

Page 84: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

More Link Bandwidth?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Wires connecting to router pins

Page 85: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Application Migration Patterns

Thread Concentration on 64-core EM2*

EM2 only (no RA) | Barnes | LU-contiguous | Ocean-contiguous | Radix | Water-n-squared
Maximum | 5 | 18 | 15 | 64 | 5
Average | 2.2 | 1.6 | 6.8 | 4.1 | 2.1

* simulated for a 64-core version EM2

Page 86: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Applications can saturate the resource cap

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

In YX routing, threads going into the ‘hot core’ are more congested on the horizontal links.

Page 87: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Applications can saturate the resource cap

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.

Page 88: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Applications can saturate the resource cap

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.

Page 89: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

BAN on EM2 (Simulation study)

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: Average Migration Latency, normalized, 0 to 1.2, for UN and BAN on BARNES, LU, OCEAN, RADIX, and WATER.]

EM2 only (no RA) | Barnes | LU-contiguous | Ocean-contiguous | Radix | Water-n-squared
Maximum | 5 | 18 | 15 | 64 | 5
Average | 2.2 | 1.6 | 6.8 | 4.1 | 2.1

* simulated for a 64-core version EM2

Page 90: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Outline

NoC for Manycore

• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)

• System-level Optimization: ENC (NOCS’11)

• Physical-level Design @ 45nm: EM2 Chip (’12/’13)

Page 91: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Extra slides

Page 92: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Crossbar Size – 2 lanes

• 4-input, 4-output, 4 Virtual Channels

Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size
Unidirectional | VC-to-Port (fully connected) | 16 | 4 | 64
Bidirectional | VC-to-Port (fully connected) | 16 | 8 | 128
Unidirectional | Port-to-Port (w/ input VC mux) | 4 | 4 | 16
Bidirectional | Port-to-Port (w/ input VC mux) | 8 | 8 | 64

Page 93: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Link Arbitration Frequency

93

• How frequently do directions need to change?

• Few links change their directions in 10~20 cycles.

Page 94: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Infrequent Link Arbitration

94

unidirectional

N=100

N=1

Page 95: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Infrequent Link Arbitration

95

unidirectional

N=100

N=1

Page 96: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Protocol-level Deadlock

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: dependency cycle between two migrations, C to D and D to C, over the resources Core C to Router C, Router C to Router D, Router D to Core D, Core D to Router D, Router D to Router C, and Router C to Core C.]

Packets are assumed to be consumed at the destination.

Page 97: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Cyclic Resource Dependency Graph

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

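[Figure: resource dependency graph for node1 (core1) and node2 (core2) connected through the Network, with resources C1N1, N1C1, NetN1, C2N2, N2C2, and NetN2; edges are labeled migration.]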

Page 98: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Acyclic Resource Dependency Graph

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

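[Figure: resource dependency graph for node1 (core1) and node2 (core2) connected through the Network, with resources C1N1, N1C1, NetN1, N1Net, C2N2, N2C2, NetN2, N2Net, and NetNative; edges are labeled migration and eviction.]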

Page 99: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

• ENC0 : A thread always visits its native core first!

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

threads

native cores

Exclusive Native Context Zero (ENC0)

Page 100: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Exclusive Native Context (ENC)

• ENC0 : A thread always visits its native core first!

• ENC : A thread goes to its native core only if evicted by another thread.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

threads

native cores

• ENC saved 10 network hops (52.6%) in this example.

• Moving out a thread context must be atomic (extra logic cost).

Page 101: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Exclusive Native Context (ENC)

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

threads

native cores

• ENC saved 10 network hops (52.6%) in this example.

• Moving a thread context onto the network must be atomic.

A B

Page 102: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Execution Migration Machine (EM2)

• In many parallel applications, each thread mostly works on its private data.

• In EM2, a migrating thread mostly returns to a specific core.

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Memory accesses on home core

Page 103: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Round Robin Scheduling

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: an RR counter (+1) drives a MUX over the input ports “N”, “E”, “W”, “S”, “C”; the selected port wins the output port.]

“Bubble” cycles occur when no flit is available on the selected input port (non-maximal).

Page 104: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Maximal Scheduling

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

[Figure: five MUXes over rotated port orders (“C” “N” “E” “W” “S”; “S” “C” “N” “E” “W”; “W” “S” “C” “N” “E”; “E” “W” “S” “C” “N”; “N” “E” “W” “S” “C”), an RR counter (+1), and Fixed Priority Logic (left-to-right); the chosen port wins the output port.]

Maximal scheduling without bubbles

Area cost: 6.7% (Tile)
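The difference between the two arbiters can be sketched as follows. This is an illustrative model, not the router's RTL: it assumes the round-robin counter selects the starting port, that plain round robin grants only that port, and that maximal scheduling scans the rotated order with fixed left-to-right priority. The port order and function names are assumptions.

```python
PORTS = ["N", "E", "W", "S", "C"]  # assumed order: four directions plus the core


def round_robin_winner(counter: int, has_flit: dict):
    """Grant only the port the counter points at; None models a bubble cycle."""
    port = PORTS[counter % len(PORTS)]
    return port if has_flit[port] else None


def maximal_winner(counter: int, has_flit: dict):
    """Scan the rotated port order and grant the first port that has a flit."""
    n = len(PORTS)
    for offset in range(n):
        port = PORTS[(counter + offset) % n]
        if has_flit[port]:
            return port
    return None  # no input has a flit, so nothing can be scheduled


# Only the south input has a flit waiting for this output port.
only_south = {"N": False, "E": False, "W": False, "S": True, "C": False}

rr_grants = [round_robin_winner(c, only_south) for c in range(5)]
max_grants = [maximal_winner(c, only_south) for c in range(5)]
```

In this example plain round robin wastes four of five cycles as bubbles, while the maximal arbiter grants the waiting port every cycle. The counter still rotates the starting point, so ports that are busy together keep taking turns.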

Page 105: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Application performance results

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Total migration distance : no overhead in real applications

[Figure: Normalized Hop Count, 0 to 1.6, for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, ENC0, and ENC; the label DEADLOCK appears three times in the chart.]

Page 106: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho

Application performance results

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

• Completion time : 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: Normalized Completion Time, 0 to 2, for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, ENC0, and ENC; the label DEADLOCK appears three times in the chart.]

CK