On-chip Network for Manycore Architecture
Myong Hyon “Brandon” Cho
Multicore to Manycore?
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY
© Tilera Corporation
Intel Xeon E7-x8xx
10 cores
32nm
2011
Westmere-EX architecture
2.4GHz, 30MB L3, 130W (E7-8870)
© Intel Corporation
© Advanced Micro Devices, Inc.
AMD FX 8-core
8 cores
32nm
2012
Vishera (Bulldozer/Piledriver) architecture
4.0GHz, 8MB L3, 125W (FX-8350)
Tilera TILE-Gx72
72 cores
40nm
2013
TILE-Gx architecture
1.0GHz, 18MB L3, ~60W
Multicore as the only way out
[Figure: log-scale plot (1.00E-01 to 1.00E+06) over 1965 to 2010 of Transistors (in thousands), Frequency (MHz), Performance, and Number of cores]
Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic
vs. Other possibilities
© Wikipedia / Jurii
SiGe?
© Wikipedia / AlexanderAIUS
Graphene?
© iStockphoto / Andrey Volodin
Organic?
© The Economist
Quantum?
vs. Other possibilities
Nehalem, Tylersburg, Westmere
Sandy Bridge, Romley, Ivy Bridge
Haswell, Haswell, Rockwell
Skylake, Skylake, Skymont
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
45nm 32nm 22nm 14nm 10nm
Intel Server Microarchitecture Roadmap, according to computerbase.de, 2011
NoC as the key to manycore success
The on-chip network
• realizes every communication between cores.
• consumes energy proportionally to traffic size.
• provides key mechanisms for parallel programming.
Outline
NoC for Manycore
• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)
• System-level Optimization: ENC (NOCS’11)
• Physical-level Design @ 45nm: EM2 Chip (’12/’13)
Network-level Optimization:
As simple as oblivious network, as efficient as adaptive network
PROM – path-based oblivious routing
Path-based, Randomized, Oblivious, Minimal Routing
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas
NoCArc’09
overcomes the limitations of oblivious routing through enhanced path diversity.
Oblivious routing vs Adaptive routing
Oblivious routing
• Local and simple
• Possibly poor resource utilization
Adaptive routing
• Possibly better resource utilization
• Global information required
For on-chip networks, oblivious routing is favored, because the performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.
Poor utilization of oblivious routing
[Figure: two flows, A (SA to DA) and B (SB to DB), routed with DOR (XY)]
Path diversity improves oblivious routing
• Diversity helps improve utilization and reduce congestion.
[Figure: flows A (SA to DA) and B (SB to DB) routed with O1TURN]
[Figure: the same flows routed with Valiant and with ROMM (2-phase), through the nodes labeled IA and IB]
Network-level deadlock
• A dependency cycle on network resources causes network-level deadlocks.
[Figure: four queues, Q1 to Q4, blocked on one another, and the corresponding Channel Dependency Graph (CDG) with a cycle through Q1, Q2, Q3, and Q4]
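The deadlock condition stated on this slide can be checked mechanically: build the channel dependency graph and look for a cycle. The sketch below is only an illustration. The name `has_cycle`, the dictionary encoding, and the assumption that each of Q1 to Q4 waits on the next one are not from the talk.

```python
def has_cycle(cdg):
    """Return True if the channel dependency graph contains a cycle.

    cdg maps each channel to the list of channels it may wait on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in cdg}

    def visit(node):
        color[node] = GRAY
        for nxt in cdg.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: a dependency cycle
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in cdg)


# Assumed encoding of the slide's example: every queue waits on the next one.
deadlocked = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": ["Q1"]}
# Removing one dependency breaks the cycle.
acyclic = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": []}
```

A routing scheme is deadlock-free at the network level when this check returns False for the graph induced by all of its routes.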
Deadlock prevention
DOR never creates dependency cycles.
XY and YX paths of O1TURN cause cycles.
O1TURN requires 2 networks to separate them.
Each phase of ROMM causes cycles.
n-phase ROMM uses n networks to separate them.
…which we found to be wrong! n-phase ROMM only requires 2 networks.
Each phase of Valiant causes cycles.
Valiant requires 2 networks to separate them.
Various oblivious routing schemes
                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant
Path diversity                       None     Minimum  Limited           Fair~Large              Large
# networks for deadlock prevention   1        2        2                 n*                      2
# hops                               minimal  minimal  minimal           minimal                 non-minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt
* erroneously proposed
PROM Routing
Path-based, Randomized, Oblivious, Minimal
Goal: Best minimal-path diversity
- Use ALL possible minimal routes
- Each minimal route has the SAME CHANCE to be taken.
At each hop where there are multiple choices, compare the number of possible minimal paths after each choice.
[Figure: a packet travels from SA to DA; the hops on the highlighted path are taken with probability 75% (vs. 25%), 67% (vs. 33%), 50% (vs. 50%), and 100%]
The chance of this path being taken is:
75% × 67% × 50% × 100% = 25%
Probability Calculation
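• The probability function is reduced to a simple ratio.
[Figure: a packet at SA with x hops left along X and y hops left along Y to reach DA]
\[ N_X = \frac{(x+y-1)!}{(x-1)!\,y!}, \qquad N_Y = \frac{(x+y-1)!}{x!\,(y-1)!} \]
\[ P_Y = \frac{N_Y}{N_X+N_Y} = \frac{\frac{1}{x!\,(y-1)!}}{\frac{1}{(x-1)!\,y!} + \frac{1}{x!\,(y-1)!}} = \frac{y}{x+y}, \qquad P_X = \frac{N_X}{N_X+N_Y} = \frac{x}{x+y} \qquad \text{when } x>0 \text{ and } y>0 \]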
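To see why the simple ratio gives every minimal path the same chance, the per-hop probabilities can be multiplied along each path. The sketch below is illustrative only: the function names are made up here, and x and y are taken to mean the hops still to go along each axis. It uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from math import comb


def prom_probabilities(x, y):
    """Return (P_X, P_Y), the chance of taking the next hop along X or Y,
    given x and y hops still to go along each axis."""
    if x == 0 and y == 0:
        raise ValueError("already at the destination")
    return Fraction(x, x + y), Fraction(y, x + y)


def path_probabilities(x, y):
    """Map every minimal path (a string of 'X'/'Y' hops) to the probability
    that the hop-by-hop choice above selects it."""
    if x == 0 and y == 0:
        return {"": Fraction(1)}
    p_x, p_y = prom_probabilities(x, y)
    paths = {}
    if x > 0:
        for tail, p in path_probabilities(x - 1, y).items():
            paths["X" + tail] = p_x * p
    if y > 0:
        for tail, p in path_probabilities(x, y - 1).items():
            paths["Y" + tail] = p_y * p
    return paths


# The example from the slides: 3 hops along one axis and 1 along the other.
paths = path_probabilities(3, 1)
expected = Fraction(1, comb(3 + 1, 1))  # one over the number of minimal paths
```

For this case there are four minimal paths, and each one comes out at 1/4, matching the 75% × 67% × 50% × 100% = 25% product on the slide.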
Large-box Problem
• Paths are equally taken, but links are not.
[Figure: link utilization on the minimal-path box (MPB) between src (SA) and dst (DA)]
When the MPB is large
- edges are underutilized.
- inner links are congested, possibly with other flows inside.
Uniform PROM
Immediate Upstream Router   P_X        P_Y
Don’t care                  x/(x+y)    y/(x+y)
Parameterized PROM
Immediate Upstream Router   P_X              P_Y
On the X axis               (x+f)/(x+y+f)    y/(x+y+f)
On the Y axis               x/(x+y+f)        (y+f)/(x+y+f)
[Figure: link utilization on the minimal-path box under parameterized PROM, for f=0, f=10, and f=25]
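The parameterized rule can be written as a small function. This sketch assumes the slide's table reads as: add f to the numerator of the axis the packet arrived on, and add f to both denominators. The function name and the `arrived_on` argument are illustrative, and the source node, which has no upstream router, is not handled.

```python
from fractions import Fraction


def parameterized_prom(x, y, f, arrived_on):
    """Return (P_X, P_Y) for parameterized PROM.

    x, y       -- hops still to go along each axis (both must be > 0)
    f          -- non-negative bias; f = 0 reduces to uniform PROM
    arrived_on -- 'X' or 'Y', the axis on which the immediate upstream
                  router lies (assumed reading of the slide's table)
    """
    if x <= 0 or y <= 0:
        raise ValueError("a choice exists only when x > 0 and y > 0")
    if f < 0:
        raise ValueError("f must be non-negative")
    total = x + y + f
    if arrived_on == "X":
        return Fraction(x + f, total), Fraction(y, total)
    if arrived_on == "Y":
        return Fraction(x, total), Fraction(y + f, total)
    raise ValueError("arrived_on must be 'X' or 'Y'")
```

A larger f makes a packet more likely to keep going along the axis it arrived on, which is one way to move traffic away from the congested inner links of a large minimal-path box.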
Deadlock prevention
• Turn Models [Glass et al./J.ACM’94]:
- Each turn model is a set of allowed turns.
- No deadlock if all routes conform to the same turn model.
[Figure: West-First Turn Model and North-Last Turn Model]
Deadlock prevention
Any minimal routing on a 2D mesh network conforms to either one of two turn models.*
* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09
Routes with no north-east or south-east turns conform to the West-First turn model.
Routes with no north-west or south-west turns conform to the North-Last turn model.
Performance Evaluation
Various oblivious routing schemes
                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant           PROM
Path diversity                       None     Minimum  Limited           Fair~Large              Large             Fair~Large
# networks for deadlock prevention   1        2        2                 n*                      2                 2
# hops                               minimal  minimal  minimal           minimal                 non-minimal       minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt  None
Heavy-load performance               Fair     Good     Bad               Worst                   Worst             Best
BAN – bandwidth adaptive network
achieves adaptivity with oblivious routing, using locally arbitrated bi-directional network links.
Oblivious Routing in On-Chip Bandwidth-Adaptive Networks
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas
PACT’09
Oblivious routing failure
[Figure: flows A (SA to DA) and B (SB to DB) with a congested link]
Where can we do better?
Adaptive Network, not routing
[Figure: flows A (SA to DA) and B (SB to DB) with increased bandwidth on a link]
• A set of bidirectional links connects network nodes.
- The bandwidth of the link in one direction can be increased at the expense of the other direction.
[Figure: link directions (a) when the yellow flow is dominant and (b) when the gray flow is dominant]
Routes do not change, and arbitration is all local.
BAN Hardware
Most hardware overhead is in the crossbar.
[Figure: two adjacent routers, each with a 1-to-v DEMUX, virtual channels (1, …, v), a v-to-1 MUX, and an Xbar switch; a Bandwidth Allocator between them receives pressure signals from both sides and sets the link direction]
Crossbar – 2 links, Unidirectional
• 4-input, 4-output, 4 Virtual Channels
Crossbar – 2 links, Bidirectional
• 4-input, 4-output, 4 Virtual Channels
Crossbar Size – 2 links
• 4-input, 4-output, 4 Virtual Channels
Links           Switch                         # xBar Inputs  # xBar Outputs  Relative xBar Size
Unidirectional  VC-to-Port (fully connected)   16             4               64
Bidirectional   VC-to-Port (fully connected)   16             8               128
Crossbar Size – 4 links
Links           Switch                         # xBar Inputs  # xBar Outputs  Relative xBar Size
Unidirectional  VC-to-Port (fully connected)   16             8               128
Bidirectional   VC-to-Port (fully connected)   16             16              256
Hybrid          VC-to-Port (fully connected)   16             12              192
The hybrid configuration has a crossbar 1.5 times as large as the unidirectional one (192 vs. 128), which typically increases the node size by around 15%.
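The relative sizes in these tables are consistent with counting crosspoints, that is, inputs times outputs, for a fully connected VC-to-port crossbar. That reading is an assumption rather than something the slides state, and the names below are illustrative.

```python
def relative_crossbar_size(inputs, outputs):
    """Crosspoint count of a fully connected crossbar (assumed size metric)."""
    return inputs * outputs


PORTS = 4             # 4-input, 4-output router
VIRTUAL_CHANNELS = 4  # per input port
xbar_inputs = PORTS * VIRTUAL_CHANNELS  # one crossbar input per virtual channel

# Crossbar outputs for the 4-link case, taken from the table.
sizes_4_links = {
    "unidirectional": relative_crossbar_size(xbar_inputs, 8),
    "bidirectional": relative_crossbar_size(xbar_inputs, 16),
    "hybrid": relative_crossbar_size(xbar_inputs, 12),
}

hybrid_ratio = sizes_4_links["hybrid"] / sizes_4_links["unidirectional"]
```

With these numbers the hybrid crossbar comes out 1.5 times the size of the unidirectional one, which is the ratio quoted on the slide.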
Bandwidth Allocation
• Local arbiters between any two adjacent routers
[Figure: a Bandwidth Arbiter between two routers, one with 3 flits to send and the other with 1 flit]
The arbitration follows the demand from each router, always leaving at least one link in a direction if any flit can move in that direction.
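One way to picture the arbiter is as a function from the two routers' demands to a split of the shared links. The sketch below assumes a proportional split, which the slides do not specify. Only the "leave at least one link for a direction that has a flit to send" rule comes from the text, and the function name is hypothetical.

```python
def allocate_links(demand_ab, demand_ba, links):
    """Split `links` bidirectional links between directions A->B and B->A.

    demand_ab, demand_ba -- flits waiting in each direction
    Returns (links for A->B, links for B->A).
    """
    if links < 2:
        raise ValueError("need at least two links to arbitrate")
    if demand_ab < 0 or demand_ba < 0:
        raise ValueError("demands must be non-negative")
    total = demand_ab + demand_ba
    if total == 0:
        half = links // 2  # idle: fall back to an even split (assumption)
        return half, links - half
    if demand_ba == 0:
        return links, 0    # nothing can move B->A, so A->B may take every link
    if demand_ab == 0:
        return 0, links
    # Proportional share (assumption), clamped so that each direction
    # with waiting flits keeps at least one link.
    share_ab = round(links * demand_ab / total)
    share_ab = max(1, min(links - 1, share_ab))
    return share_ab, links - share_ab
```

With 4 links and the 3-flit versus 1-flit demand shown in the figure, this sketch yields 3 links in one direction and 1 in the other.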
Symmetry vs. Anti-symmetry
[Figure panels: Bit-complement, Transpose]
* Both under dimension-order routing
Anti-symmetric Traffic
Symmetric Traffic
Symmetric Traffic with Burstiness
Traffic Pattern   Non-bursty  Bursty
Bit-complement    0%          20%
Uniform Random    8%          26%
How about real application traffic…?
• The traffic patterns in many real applications are not symmetric as data is processed by a sequence of modules.
System-level Optimization:
autonomous & fine-grained thread migration protocol by NoC
ENC – exclusive native context
provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.
Deadlock-Free Fine-Grained Thread Migration
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas
NOCS’11 – Best Paper Award
Why thread migrations again?
• For a simple reason: it’s cheaper on a single die (so we can do it more often).
ThreadMotion [Rangan et al., ISCA09]
[Figure labels: Higher Voltage/Frequency, Lower Voltage/Frequency, cache misses, cache hits]
Architectural Core Salvaging [Powell et al., ISCA09]
[Figure labels: has no defects, has a defective floating-point unit, floating-point ops]
Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]
Each core has the only on-chip copy of its data.
[Figure label: data misses]
Migration protocols aren’t catching up...
• …use a centralized scheduler (e.g., an OS).
- slow!
• …store contexts in extra buffers or in the memory hierarchy.
- expensive and inefficient!
• …bring restrictions on how threads can migrate.
- cannot exploit the full power of migration!
Need a fast migration protocol that...
• …provides functional correctness for arbitrary migrations.
• …supports autonomous migration scheduling.
• …has a simple & small implementation.
Protocol-level Deadlock
[Figure: six cores (Core A to Core F), each attached to its own router (Router A to Router F), with the thread contexts in flight between them labeled by destination]
If an autonomous migration protocol is careless…
• SWAP: A deadlock-prone autonomous migration protocol
• An eviction swaps the locations of two threads.
[Figure: threads swapping locations]
Protocol-level Deadlock
• 100 random, synthetic migration patterns.
• 64 threads on 64 cores, migrating every 100 cycles
• Network-level deadlock-free routing (DOR-XY)
[Figure: Deadlock (%), from 0 to 100, versus Number of Hotspots (1 to 4), for 2 VCs / No Buffer, 4 VCs / No Buffer, 2 VCs / 4 contexts, and 2 VCs / 8 contexts]
Exclusive Native Context (ENC) protocol
• Key idea: always accept arrived packets.
- evict a running thread in an unblockable way!
[Figure: a migration arrives at a core and causes the eviction of a running thread A, which has an exclusive space at its native core]
• Migrating threads must not block evicted threads.
• Separating virtual channel sets is a simple solution.
• Each thread has its own native core.
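The key idea, that an arriving thread is always accepted because the evicted thread has a guaranteed place to go, can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: one guest context per core, thread t native to core t, and no modeling of the network or of the separate virtual channel sets.

```python
class EncChip:
    """Toy model: each core has one guest context plus one context
    reserved exclusively for its native thread."""

    def __init__(self, num_cores):
        # Thread t is native to core t and starts there (assumption).
        self.guest = [None] * num_cores
        self.native = list(range(num_cores))

    def location(self, thread):
        """Return the core on which `thread` currently sits."""
        if self.native[thread] == thread:
            return thread
        return self.guest.index(thread)

    def _remove(self, thread):
        if self.native[thread] == thread:
            self.native[thread] = None
        else:
            self.guest[self.guest.index(thread)] = None

    def migrate(self, thread, dst):
        """Move `thread` to core `dst`; the arrival is never refused."""
        if self.location(thread) == dst:
            return
        self._remove(thread)
        if dst == thread:
            # The native context is exclusive, so it is always free.
            self.native[dst] = thread
            return
        evicted = self.guest[dst]
        if evicted is not None:
            # The evicted thread goes to its native core, which cannot block.
            self.native[evicted] = evicted
        self.guest[dst] = thread


chip = EncChip(4)
chip.migrate(0, 2)  # thread 0 becomes a guest on core 2
chip.migrate(1, 2)  # thread 1 arrives and evicts thread 0 to its native core
```

After these two migrations thread 1 occupies the guest context of core 2 and thread 0 is back on core 0.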
Application performance results
• Total migration distance: no overhead in real applications
[Figure: Normalized Total Hop Count (0 to 1.2) for SWAP, SWAPinf, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
Application performance results
• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)
[Figure: Normalized Completion Time (0 to 2) for SWAP, SWAPinf, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
Physical-level Design:
NoC router implementation for EM2 (IBM SOI 45nm)
EM2 Implementation - Overview
110-core Shared Memory Processor
ISA: EM2 Stack ISA
Shared Memory Architecture: 1. EM2, 2. RA (Remote Access), 3. EM2+RA
Cache: 8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits
Technology: IBM SOI12SO 45nm
IP: ARM sc12 library (high voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc.
NoC router specification for EM2
Channels: Six independent 64-bit channels
- Remote Access: Request, Response
- Migration (EM2): Migration, Eviction
- DRAM Access: Request, Response
Communication: Unicast, in-order
Architectural Performance: 1 cycle/hop
Scheduling Algorithm: Maximal scheduling
Routing: DOR
Network Buffer: Single 4-flit ingress buffer for each port
6 Independent Physical Networks
[Figure: 6-network router with maximal scheduling, 330um × 330um]
Metal Layers   Usage
m1, m2, m3     Local logic
c1, c2         Local routing
b1, b2, b3     Remote routing / power grid
ua, ub         Global power grid
lb             Chip IO
Six 64-bit networks need a width of 222um.
Tile Floorplanning
[Figure: tile floorplan for the EM2 tile, 855um × 917um, with Router, Core, Predictor, 32KB D$, and 8KB I$]
[Figure: placement results showing ROUTER, CORE, and PREDICTOR]
EM2 tile
Width: 855um
Height: 917um
RC-extracted STA (@typical)
- Working frequency: 105MHz
- Hold time slack: 0.2ns
Power estimation (10% activity): 50mW
[Figure: tile layout with D$ and I$ banks, D$ tags, and I$ tags]
Chip Floorplanning
Connecting Router Links
Chip-level Clock Tree
B
Tile-level Clock Tree
A
EM2 chip
Width: 10mm
Height: 10mm
~357 million transistors
11-by-10 EM2 tile array
18 man-months
[Figure: chip layout with CLK, D-CAPs, and I/O; the EM2 tile array lies below the top 2 metal layers]
More Link Bandwidth?
Wires connecting to router pins
Application Migration Patterns
Thread Concentration on 64-core EM2*
EM2 only (no RA)   Barnes  LU-contiguous  Ocean-contiguous  Radix  Water-n-squared
Maximum            5       18             15                64     5
Average            2.2     1.6            6.8               4.1    2.1
* simulated for a 64-core version of EM2
Applications can saturate the resource cap
In YX routing, threads going into the ‘hot core’ are more congested on the horizontal links.
In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.
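The asymmetry between inbound and evicted traffic follows from the hop order of YX routing. The sketch below counts, on an assumed 8×8 mesh with an arbitrarily chosen hot core, which axis carries the last hop into the hot core and the first hop out of it. The mesh size, the hot-core position, and all names are illustrative.

```python
from collections import Counter


def yx_route(src, dst):
    """Hops (from_node, to_node) of a YX route: travel along Y first, then X."""
    (x, y), (dst_x, dst_y) = src, dst
    hops = []
    while y != dst_y:
        next_y = y + (1 if dst_y > y else -1)
        hops.append(((x, y), (x, next_y)))
        y = next_y
    while x != dst_x:
        next_x = x + (1 if dst_x > x else -1)
        hops.append(((x, y), (next_x, y)))
        x = next_x
    return hops


def hot_core_link_axes(size, hot):
    """Count the axis of the final hop into `hot` (one inbound thread per
    core) and of the first hop out of `hot` (one evicted thread per core)."""
    inbound, outbound = Counter(), Counter()
    for x in range(size):
        for y in range(size):
            if (x, y) == hot:
                continue
            (_, from_y), _ = yx_route((x, y), hot)[-1]
            inbound["horizontal" if from_y == hot[1] else "vertical"] += 1
            _, (to_x, _) = yx_route(hot, (x, y))[0]
            outbound["vertical" if to_x == hot[0] else "horizontal"] += 1
    return inbound, outbound


inbound, outbound = hot_core_link_axes(8, (3, 3))
```

Only the 7 cores in the hot core's own column reach it over a vertical link; the other 56 arrive horizontally. Evictions mirror this: 56 leave vertically and 7 horizontally.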
BAN on EM2 (Simulation study)
[Figure: Average Migration Latency, normalized (0 to 1.2), for UN and BAN on BARNES, LU, OCEAN, RADIX, and WATER]
* simulated for a 64-core version of EM2
Outline
NoC for Manycore
• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)
• System-level Optimization: ENC (NOCS’11)
• Physical-level Design @ 45nm: EM2 Chip (’12/’13)
Extra slides
Crossbar Size – 2 lanes
• 4-input, 4-output, 4 Virtual Channels
Links           Switch                           # xBar Inputs  # xBar Outputs  Relative xBar Size
Unidirectional  VC-to-Port (fully connected)     16             4               64
Bidirectional   VC-to-Port (fully connected)     16             8               128
Unidirectional  Port-to-Port (w/ input VC mux)   4              4               16
Bidirectional   Port-to-Port (w/ input VC mux)   8              8               64
Link Arbitration Frequency
• How frequently do directions need to change?
• Few links change their directions in 10~20 cycles.
Infrequent Link Arbitration
[Figure: curves labeled unidirectional, N=1, and N=100]
Protocol-level Deadlock
[Figure: dependency cycle formed by the migrations C to D and D to C over the resources Core C to Router C, Router C to Router D, Router D to Core D, Core D to Router D, Router D to Router C, and Router C to Core C]
Packets are assumed to be consumed at the destination.
Cyclic Resource Dependency Graph
[Figure: resource dependency graph for migration between core1/node1 and core2/node2 through the network, with nodes C1N1, N1C1, C2N2, N2C2, NetN1, and NetN2]
Acyclic Resource Dependency Graph
[Figure: resource dependency graph with migration and eviction drawn separately, with nodes C1N1, N1C1, C2N2, N2C2, NetN1, NetN2, N1Net, N2Net, and NetNative]
Exclusive Native Context Zero (ENC0)
• ENC0: A thread always visits its native core first!
[Figure: threads and their native cores]
Exclusive Native Context (ENC)
• ENC0: A thread always visits its native core first!
• ENC: A thread goes to its native core only if evicted by another thread.
[Figure: threads and their native cores]
• ENC saved 10 network hops (52.6%) in this example.
• Moving a thread context onto the network must be atomic (extra logic cost).
Execution Migration Machine (EM2)
• In many parallel applications, each thread mostly works on its private data.
• In EM2, a migrating thread mostly returns to a specific core.
[Figure: memory accesses on home core]
Round Robin Scheduling
[Figure: an RR counter (+1) drives a MUX over the input ports “N”, “E”, “W”, “S”, “C”; the selected port wins the output port]
“Bubble” cycles occur when no flit is available on the selected input port (non-maximal).
Maximal Scheduling
[Figure: the RR counter (+1) selects, through MUXes, one of five rotated port orders (“N” “E” “W” “S” “C”; “C” “N” “E” “W” “S”; “S” “C” “N” “E” “W”; “W” “S” “C” “N” “E”; “E” “W” “S” “C” “N”); Fixed Priority Logic (left-to-right) then picks the port that wins the output port]
Maximal scheduling without bubbles
Area cost: 6.7% (Tile)
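The difference between the two arbiters can be sketched in a few lines. This is a behavioral illustration only, not the chip's RTL. The rotation order is read off the figure labels, and the function names and the `has_flit` dictionary are assumptions.

```python
PORTS = ["N", "E", "W", "S", "C"]


def round_robin_winner(counter, has_flit):
    """Plain round robin: only the port the counter points at may win.
    Returns None (a bubble) when that port has no flit."""
    port = PORTS[counter % len(PORTS)]
    return port if has_flit.get(port, False) else None


def maximal_winner(counter, has_flit):
    """Maximal scheduling: rotate the priority order by the counter, then
    apply fixed left-to-right priority so that some waiting port wins."""
    k = counter % len(PORTS)
    order = PORTS[-k:] + PORTS[:-k]  # k = 1 gives C N E W S, as in the figure
    for port in order:
        if has_flit.get(port, False):
            return port
    return None  # no input has a flit at all


only_south = {"S": True}
```

With a flit waiting only on “S” and the counter at 0, the plain round-robin arbiter produces a bubble while the maximal arbiter grants “S”.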
Application performance results
• Total migration distance: no overhead in real applications
[Figure: Normalized Hop Count (0 to 1.6) for SWAP, SWAPinf, ENC0, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
Application performance results
• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)
[Figure: Normalized Completion Time (0 to 2) for SWAP, SWAPinf, ENC0, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]