Manolis Katevenis
FORTH and University of Crete, Greece
Interprocessor Communication
seen as load/store instruction generalization
Interprocessor Communication - M. Katevenis 2
Summary
• Central importance of Interprocessor Communication
• Must go from low-speed I/O to high-speed p2p communication
• Data Transfer Primitives: Remote DMA, Remote Queues
• Cache operations on top of Network Interface primitives?
• Combined Translation/Routing Tables – Data Migration
Communication Primitives: Intra-Node, Inter-Node
• Processor-to-memory communication: Load/Store primitives
• Interprocessor communication: Read/Write (Send/Receive) data transfer primitives
[Figure: two nodes, Node 1 (P1, M1) and Node 2 (P2, M2). Labels: processor with PC, register file, and functional units; arithmetic, logic, and control instructions; memory instructions Load and Store; Remote DMA Read, Write, Send, and Receive.]
Remote DMA as generalization of single-word instructions
• block size is chosen so as to reduce overheads relative to data payload
[Figure, four panels: (a) Store instruction and (c) Load instruction move one data word between the processor registers and memory at address Addr; (b) Remote Write DMA and (d) Remote Read DMA move a data block of size sz from address A.src to address A.dst between memories Mem.1 and Mem.2.]
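The analogy on this slide can be sketched in a few lines of Python. This is an illustrative model only: memories are plain bytearrays, the 4-byte word size and the per-transfer `overhead` figure are assumptions, and the function names are hypothetical rather than taken from any real network interface.

```python
# Sketch only: memories are modelled as bytearrays, not real hardware.
WORD = 4  # assumed word size in bytes

def store(mem, addr, word):
    """Single-word store: processor register -> memory."""
    mem[addr:addr + WORD] = word

def load(mem, addr):
    """Single-word load: memory -> processor register."""
    return bytes(mem[addr:addr + WORD])

def remote_write_dma(local_mem, a_src, remote_mem, a_dst, sz):
    """Block generalization of store: copy sz bytes into the remote memory."""
    remote_mem[a_dst:a_dst + sz] = local_mem[a_src:a_src + sz]

def remote_read_dma(remote_mem, a_src, local_mem, a_dst, sz):
    """Block generalization of load: fetch sz bytes from the remote memory."""
    local_mem[a_dst:a_dst + sz] = remote_mem[a_src:a_src + sz]

def efficiency(sz, overhead):
    """Fraction of transferred bytes that is payload (overhead is assumed)."""
    return sz / (sz + overhead)

mem1, mem2 = bytearray(64), bytearray(64)
store(mem1, 0, b"\x01\x02\x03\x04")
store(mem1, 4, b"\x05\x06\x07\x08")
remote_write_dma(mem1, 0, mem2, 16, 8)  # one command moves two words
remote_read_dma(mem2, 16, mem1, 32, 8)  # fetch the block back
```

With a fixed per-transfer overhead, `efficiency` grows with `sz`, which is the reason given above for moving blocks rather than single words.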
Remote DMA is for One-to-One Communication
• Independent (unsynchronized) transfers have to occur into distinct memory regions
• Expensive when there are many potential senders but few actual ones: buffer reservation cost, polling-for-completion overhead
[Figure: Sender1 and Sender2 each use Remote DMA across the network to write into a dedicated buffer (buffer1, buffer2) in the receiver's memory. Labels: P1, P2, P3; memory 1, memory 2, memory 3.]
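The two costs named above can be made concrete with a small model. This is a sketch under assumptions: the `RdmaReceiver` class, its per-sender completion flags, and the buffer size are hypothetical and only illustrate why reservation and polling scale with the number of potential senders.

```python
class RdmaReceiver:
    """Receiver that reserves one dedicated buffer per potential sender."""

    def __init__(self, potential_senders, buf_size):
        self.buffers = {s: bytearray(buf_size) for s in potential_senders}
        self.done = {s: False for s in potential_senders}

    def remote_write(self, sender, data):
        # Each sender writes only into its own region, so no synchronization
        # between senders is needed.
        self.buffers[sender][:len(data)] = data
        self.done[sender] = True

    def poll(self):
        """Check every completion flag; return (arrived senders, flags checked)."""
        arrived, checked = [], 0
        for sender, flag in self.done.items():
            checked += 1
            if flag:
                arrived.append(sender)
                self.done[sender] = False
        return arrived, checked

    def reserved_bytes(self):
        return sum(len(b) for b in self.buffers.values())

rx = RdmaReceiver(potential_senders=range(100), buf_size=256)
rx.remote_write(7, b"hello")        # only one sender is actually active
arrived, checked = rx.poll()
```

Here a single active sender still costs 100 reserved buffers and 100 flag checks per poll.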
Multi-Party Synchronization: Remote Queues
• Atomic enqueue into shared space, unlike the dedicated space required with RDMA
• Space reserved only for the number of actual senders, not potential senders
• Speeds up polling / waiting on multiple receive channels
• Appropriate for synchronization (remote enqueue) & job dispatching (remote dequeue) – generalization of atomic operations
[Figure: Remote Queue in shared space with multiple senders (and multiple receivers) among P1 to P4; messages m1 and m2 are enqueued into the shared space.]
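A remote queue can be sketched as follows. This is a software stand-in only: a `threading.Lock` plays the role of the atomicity that the hardware would provide, and the `RemoteQueue` class and its `capacity` parameter are hypothetical names chosen for illustration.

```python
import threading
from collections import deque

class RemoteQueue:
    """Shared queue; the lock stands in for hardware-atomic enqueue/dequeue."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity  # sized for actual, not potential, senders
        self._lock = threading.Lock()

    def enqueue(self, sender, msg):
        with self._lock:
            if len(self._items) >= self._capacity:
                return False  # queue full, sender must retry
            self._items.append((sender, msg))
            return True

    def dequeue(self):
        with self._lock:
            return self._items.popleft() if self._items else None

q = RemoteQueue(capacity=4)
senders = [
    threading.Thread(target=q.enqueue, args=(name, "msg from " + name))
    for name in ("P1", "P2", "P3")
]
for t in senders:
    t.start()
for t in senders:
    t.join()

# The receiver polls one queue instead of one buffer per potential sender.
received = []
item = q.dequeue()
while item is not None:
    received.append(item)
    item = q.dequeue()
```

Several receivers calling `dequeue` on the same queue would give the job-dispatching use mentioned above.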
Cache Operations related to RDMA
• Network interface as close to the processor as (L1) cache
• Messages or RDMA commands composed via store instructions
• Cache line eviction is a case of Remote Write DMA
• Is there potential for the cache controller to use the primitives supplied by the network interface?
[Figure: the processor's stores of non-cacheable data pass through a write buffer with write-combine and leave for the network as a Remote Write DMA or message send; cache lines are evicted (write back, replace, or flush) toward local memory or the network.]
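Composing an RDMA command with ordinary store instructions might look like the following sketch. Everything here is assumed for illustration: the register offsets, the rule that the store to the size register launches the transfer, and the `NetworkInterface` class itself do not describe any particular device.

```python
# Hypothetical memory-mapped register offsets of the network interface (NI).
REG_SRC, REG_DST, REG_SIZE = 0x00, 0x08, 0x10

class NetworkInterface:
    """Toy NI: plain stores fill in a command; the last store launches it."""

    def __init__(self, local_mem, remote_mem):
        self.local = local_mem
        self.remote = remote_mem
        self.regs = {}

    def store(self, offset, value):
        self.regs[offset] = value
        if offset == REG_SIZE:  # assumed convention: size store is the trigger
            self._launch()

    def _launch(self):
        src, dst, sz = self.regs[REG_SRC], self.regs[REG_DST], self.regs[REG_SIZE]
        # Remote Write DMA: copy the block into the remote node's memory.
        self.remote[dst:dst + sz] = self.local[src:src + sz]

local = bytearray(b"payload!" + bytes(24))
remote = bytearray(32)
ni = NetworkInterface(local, remote)
ni.store(REG_SRC, 0)
ni.store(REG_DST, 8)
ni.store(REG_SIZE, 8)  # this store triggers the transfer
```

A cache line eviction fits the same pattern, with the line's address and size supplied by the cache controller instead of by software.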
Cache Read Misses are like Remote Read DMAs
• Network interface as close to the processor as (L1) cache
• Should the NI be combined with the cache controller?
• Should portions of cache coherence protocols be left to the software, with NI hardware assistance for the rest?
[Figure: on a read miss, the cache controller sends a fetch request; the NI turns it into a Remote Read DMA request to the network, and the RDMA fetches the data from the remote node or from local memory.]
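The correspondence in the slide title can be sketched as a cache whose miss handler is a one-line block fetch. This is a simplified model under assumptions: the 16-byte line size, the `Cache` class, and the `fetch_line` method are hypothetical, and the backing store is a local bytearray standing in for a remote memory.

```python
LINE = 16  # assumed cache line size in bytes

class Cache:
    """Toy cache: a read miss fetches a whole line, like a Remote Read DMA."""

    def __init__(self, backing):
        self.backing = backing  # stands in for a (possibly remote) memory
        self.lines = {}
        self.misses = 0

    def fetch_line(self, base):
        # Same shape as a Remote Read DMA: source address plus block size.
        return bytes(self.backing[base:base + LINE])

    def read(self, addr):
        base = addr - addr % LINE
        if base not in self.lines:
            self.misses += 1
            self.lines[base] = self.fetch_line(base)
        return self.lines[base][addr - base]

backing = bytearray(range(64))
cache = Cache(backing)
a = cache.read(17)  # miss: fetches bytes 16..31
b = cache.read(18)  # hit in the same line
```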
Network Routing as Generalization of Address Decoding
• (a) Physical address decoding in a uniprocessor
• (b) Geographical address routing in a multiprocessor
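[Figure: (a) the 2 most significant bits of the physical address select the device: 00: SRAM 0, 01: SRAM 1, 1x: I/O. (b) A packet (hdr, body) carries the physical destination address; bridges route it among Node0 to Node3 using tables with entries 00:, 01:, 10:, 11:, 1x:; each node holds a processor P and SRAM.]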
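The parallel between the two cases can be sketched as two lookups on the same address bits. The 8-bit address width and the assignment of bit patterns to nodes are assumptions made for the example; only the 00/01/1x split for SRAM 0, SRAM 1, and I/O comes from the slide.

```python
ADDR_BITS = 8  # assumed address width

def decode(addr):
    """(a) Uniprocessor: the 2 most significant bits select the device."""
    top = addr >> (ADDR_BITS - 2)
    if top == 0b00:
        return "SRAM0"
    if top == 0b01:
        return "SRAM1"
    return "I/O"  # pattern 1x

# (b) Multiprocessor: the same bits of the packet header select the next hop.
# This node assignment is illustrative only.
ROUTES = {0b00: "Node0", 0b01: "Node1", 0b10: "Node2", 0b11: "Node3"}

def route(packet):
    hdr, _body = packet
    return ROUTES[hdr >> (ADDR_BITS - 2)]

device = decode(0x40)               # top bits 01
next_hop = route((0x80, b"data"))   # top bits 10
```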
Address Translation for Transparent Data Migration
• Two methods to support data migration:
(a) Translation table: first consult an indirection/routing table/directory
(b) Cache style: search multiple places in parallel
[Figure: (a) Translation table lookup: a 3-bit virtual page number indexes a table (entries 000: to 111:) whose entries point to physical pages in memory (m00, m01, m10, m11 at 00: to 11:), to pages on disc (d_a, d_b), or are invalid. (b) Tag compare/search: the 3-bit virtual block number is compared against the tags (011, 100, 000, 001) of all physical blocks.]
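The two methods can be contrasted in a short sketch. The table contents and tag values below are illustrative assumptions, not a transcription of the slide, and the list comprehension only models sequentially what hardware would compare in parallel.

```python
# (a) Translation table: virtual page number -> (location kind, where).
# Contents are illustrative; missing entries are treated as invalid.
TABLE = {
    0b000: ("mem", 0b00),
    0b001: ("mem", 0b01),
    0b010: ("disc", "d_a"),
    0b011: ("mem", 0b11),
    0b100: ("mem", 0b10),
    0b101: ("disc", "d_b"),
}

def translate(vpn):
    return TABLE.get(vpn, ("invalid", None))

# (b) Cache style: each physical block holds the tag of the virtual block
# it currently contains; lookup compares the address against every tag.
TAGS = [0b011, 0b100, 0b000, 0b001]

def search(vbn):
    hits = [i for i, tag in enumerate(TAGS) if tag == vbn]
    return hits[0] if hits else None

def migrate(vpn, new_ppn):
    """With a table, migration is a single entry update."""
    TABLE[vpn] = ("mem", new_ppn)
```

Under (a), migrating data means rewriting one table entry; under (b), it means moving the tag along with the data to another searched location.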
Progressive Translation: Localize Migration Updates
• Packets carry global virtual addresses
• Tables provide physical route (address) for the next few steps
• When page 9 migrates within D, only tables in that domain need updating
• Variable-size-page translation tables look like internet routing tables (longest-prefix matches if we want small-page-within-big-region migration)
• Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls
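[Figure: routing tables Tbl_A, Tbl_B, Tbl_C and, inside Domain D, Tbl_D1, Tbl_D2, Tbl_D3, each holding address patterns with trailing wildcards (0100, 00xx, 1xxx, 011x; 0010, 000x; 1110, 1111; 110x, 10xx; 1000, 101x, 1001). Page 9 (address 1001) migrates from pg9 to pg9' within Domain D; P is a processor.]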
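The longest-prefix matching mentioned above, and the way it keeps migration updates local, can be sketched as follows. The table contents and next-hop names are hypothetical; patterns use `x` as a trailing wildcard, and page 9 is written as the 4-bit address `1001`.

```python
def matches(pattern, addr):
    """True if addr fits the pattern; 'x' matches either bit."""
    return len(pattern) == len(addr) and all(
        p in ("x", a) for p, a in zip(pattern, addr)
    )

def specificity(pattern):
    """Number of fixed (non-wildcard) bits, i.e. the prefix length."""
    return sum(c != "x" for c in pattern)

def lookup(table, addr):
    """Longest-prefix match: the most specific matching entry wins."""
    candidates = [(specificity(p), hop) for p, hop in table.items()
                  if matches(p, addr)]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None

# Hypothetical tables: one outside Domain D, one inside it.
tbl_outside = {"0xxx": "elsewhere", "10xx": "Domain D"}
tbl_inside = {"10xx": "node_1", "1001": "node_2"}  # page 9 has its own entry

before = (lookup(tbl_outside, "1001"), lookup(tbl_inside, "1001"))
tbl_inside["1001"] = "node_3"  # page 9 migrates within Domain D
after = (lookup(tbl_outside, "1001"), lookup(tbl_inside, "1001"))
```

Only `tbl_inside` changes when page 9 moves; the table outside the domain keeps routing the whole `10xx` region to Domain D.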
Conclusions
• Hardware should provide “Primitives” – not “solutions”
– few, simple, general-purpose, flexibly combinable primitives
• Data transport primitive: Remote DMA
• Synchronization primitive: Remote Enqueue
• Cache operation on top of Network Interface primitives?
• Translation/Routing Tables for Data Migration support