Manolis Katevenis FORTH and University of Crete, Greece Interprocessor Communication seen as load/store instruction generalization

Post on 28-Dec-2015

Page 1: Manolis Katevenis FORTH and University of Crete, Greece Interprocessor Communication seen as load/store instruction generalization

Manolis Katevenis

FORTH and University of Crete, Greece

Interprocessor Communication

seen as load/store instruction generalization

Page 2

Interprocessor Communication - M. Katevenis 2

Summary

• Central importance of Interprocessor Communication

• Must go from low-speed I/O to high-speed p2p communication

• Data Transfer Primitives: Remote DMA, Remote Queues

• Cache operations on top of Network Interface primitives?

• Combined Translation/Routing Tables – Data Migration

Page 3

Communication Primitives: Intra-Node, Inter-Node

• Processor to Memory communication:

– Load/Store primitives

• Interprocessor communication:

– Read/Write (send/receive) data transfer primitives

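[Figure: inside a processor (PC, register file, functional units), arithmetic, logic, and control instructions operate on registers, while memory instructions (Load, Store) move data between the processor and memory (Mem.). Between Node 1 (P1, M1) and Node 2 (P2, M2), Remote DMA (Read, Write) and Send/Receive move data across nodes.]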

Page 4

Remote DMA as generalization of single-word Instructions

• block size is chosen so as to reduce overheads relative to data payload

[Figure: four panels. (a) Store Instruction and (c) Load Instruction: a processor with registers exchanges an address (Addr) and one data word with memory. (b) Remote Write DMA and (d) Remote Read DMA: a data block of size sz moves from source address A.src to destination address A.dst between Mem.1 and Mem.2.]
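The analogy on this page can be sketched in a few lines, assuming each memory is modeled as a Python bytearray. The function names, the 4-byte word, and the fixed per-transfer overhead are illustrative assumptions, not something the slides specify.

```python
WORD = 4  # assumed word size in bytes

def store(mem, addr, word):
    """Store instruction: one word from a register into memory at addr."""
    mem[addr:addr + WORD] = word.to_bytes(WORD, "little")

def load(mem, addr):
    """Load instruction: one word from memory at addr into a register."""
    return int.from_bytes(mem[addr:addr + WORD], "little")

def rdma_write(local_mem, a_src, remote_mem, a_dst, sz):
    """Remote Write DMA: copy sz bytes from local A.src to remote A.dst."""
    remote_mem[a_dst:a_dst + sz] = local_mem[a_src:a_src + sz]

def rdma_read(remote_mem, a_src, local_mem, a_dst, sz):
    """Remote Read DMA: copy sz bytes from remote A.src to local A.dst."""
    local_mem[a_dst:a_dst + sz] = remote_mem[a_src:a_src + sz]

def payload_efficiency(sz, overhead):
    """Fraction of transferred bytes that is payload, for an assumed fixed overhead."""
    return sz / (sz + overhead)

mem1 = bytearray(64)  # memory of node 1
mem2 = bytearray(64)  # memory of node 2
store(mem1, 0, 0xCAFE)            # single-word store into local memory
rdma_write(mem1, 0, mem2, 16, 8)  # block transfer into the remote memory
```

With a fixed overhead per transfer, `payload_efficiency` grows with `sz`, which is the reason for choosing the block size relative to the overhead.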

Page 5

Remote DMA is for One-to-One Communication

• Independent (unsynchronized) transfers have to occur into distinct memory regions

• Expensive when many potential senders but few actual ones:

– buffer reservation cost, polling-for-completion overhead

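[Figure: two senders (Sender1, Sender2) and a Receiver, on processors P1 to P3 with memories 1 to 3, connected by a Network; each sender uses Remote DMA to write into its own buffer (buffer1, buffer2) in the receiver's memory.]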
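A rough sketch of the cost described above, assuming the receiver reserves one dedicated buffer and one completion flag per potential sender. The class and method names are hypothetical.

```python
class RdmaReceiver:
    """Toy receiver with one dedicated buffer per potential sender."""

    def __init__(self, potential_senders, buf_size):
        # Space is reserved up front for every sender that might ever write.
        self.buffers = {s: bytearray(buf_size) for s in potential_senders}
        self.done = {s: False for s in potential_senders}

    def remote_write(self, sender, data):
        # Each sender writes only into its own region, so no synchronization
        # between senders is needed.
        self.buffers[sender][:len(data)] = data
        self.done[sender] = True

    def poll(self):
        # The receiver must inspect every flag, even if few senders are active.
        arrived = [s for s, flag in self.done.items() if flag]
        return arrived, len(self.done)

rx = RdmaReceiver(potential_senders=range(8), buf_size=32)
rx.remote_write(3, b"hi")
arrived, flags_checked = rx.poll()
reserved_bytes = sum(len(b) for b in rx.buffers.values())
```

In this toy run one sender is active, yet eight buffers are reserved and eight flags are checked on every poll.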

Page 6

Multi-Party Synchronization: Remote Queues

• Atomic enqueue into shared space, unlike dedicated space with RDMA

• Space reserved only for # of actual senders – not potential senders

• Speeds up polling / waiting on multiple receive channels

• Appropriate for synchronization (remote enqueue) & job dispatching (remote dequeue) – generalization of atomic operations

[Figure: Remote Queue. Multiple senders (or multiple receivers) among processors P1 to P4 enqueue messages m1 and m2 into a queue held in Shared Space.]
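A minimal software sketch of the remote queue idea, assuming a lock stands in for the atomicity that the hardware would provide. `RemoteQueue`, `remote_enqueue`, and `remote_dequeue` are hypothetical names.

```python
import threading
from collections import deque

class RemoteQueue:
    """Toy shared queue: any sender may enqueue, space is sized for actual use."""

    def __init__(self, capacity):
        self._slots = deque()
        self._capacity = capacity      # sized for actual senders, not potential ones
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def remote_enqueue(self, sender, msg):
        with self._lock:
            if len(self._slots) >= self._capacity:
                return False           # queue full, sender must retry
            self._slots.append((sender, msg))
            return True

    def remote_dequeue(self):
        with self._lock:
            return self._slots.popleft() if self._slots else None

q = RemoteQueue(capacity=4)
threads = [threading.Thread(target=q.remote_enqueue, args=(s, f"m{s}")) for s in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The receiver polls a single queue instead of one buffer per potential sender.
received = sorted(filter(None, (q.remote_dequeue() for _ in range(4))))
```

The same `remote_dequeue` call, issued by several workers, would model job dispatching.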

Page 7

Cache Operations related to RDMA

• Network interface as close to the processor as (L1) cache

• Messages or RDMA commands composed via store instructions

• Cache line eviction is a case of Remote Write DMA

• Is there potential for the cache controller to use the primitives supplied by the network interface?

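[Figure: the processor (Proc) issues stores. Stores of non-cacheable data pass through a Write Buffer with write-combine and leave toward the network as a Remote Write DMA or message send; a cache line leaves on eviction (write back, replace, or flush) toward Local Memory or the network.]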

Page 8

Cache Read Misses are like Remote Read DMAs

• Network Interface as close to the processor as (L1) cache

• Should the NI be combined with the cache controller?

• Should portions of cache coherence protocols be left to the software, with NI hardware assistance for the rest?

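[Figure: on a read miss, the cache controller (Cache Ctrlr) of the processor (Proc) issues a fetch request; the network interface (NI) handles Remote Read DMA requests to and from the network and fetches the data from Local Memory.]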
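To illustrate the correspondence suggested on these two pages, here is a toy direct-mapped cache whose line fills are logged as Remote Read DMAs and whose dirty evictions are logged as Remote Write DMAs. The class, the 16-byte line, and the log format are assumptions made for this sketch; the slides only pose the question of whether a cache controller could be built this way.

```python
LINE = 16  # assumed cache-line size in bytes

class RdmaBackedCache:
    """Toy direct-mapped cache whose fills and evictions are RDMA transfers."""

    def __init__(self, remote_mem, num_lines):
        self.remote = remote_mem
        self.lines = [None] * num_lines  # each entry: [block number, data, dirty]
        self.log = []                    # records the RDMA operations issued

    def _ensure(self, addr):
        block = addr // LINE
        index = block % len(self.lines)
        line = self.lines[index]
        if line is None or line[0] != block:
            if line is not None and line[2]:
                # Evicting a dirty line is a Remote Write DMA of one line.
                base = line[0] * LINE
                self.remote[base:base + LINE] = line[1]
                self.log.append(("rdma_write", base, LINE))
            # A miss is a Remote Read DMA of one line.
            base = block * LINE
            self.log.append(("rdma_read", base, LINE))
            line = [block, bytearray(self.remote[base:base + LINE]), False]
            self.lines[index] = line
        return line

    def read(self, addr):
        return self._ensure(addr)[1][addr % LINE]

    def write(self, addr, value):
        line = self._ensure(addr)
        line[1][addr % LINE] = value
        line[2] = True  # mark dirty; written back on eviction

remote = bytearray(range(64))
cache = RdmaBackedCache(remote, num_lines=2)
first = cache.read(5)    # miss: fetch block 0
cache.write(5, 99)       # hit: line becomes dirty
second = cache.read(37)  # miss on block 2, which maps to the same line as block 0
```

The sketch ignores coherence entirely; it only shows that both miss handling and eviction reduce to block transfers of one line.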

Page 9

Network Routing as Generalization of Address Decoding

• (a) Physical address decoding in a uniprocessor

• (b) Geographical address routing in a multiprocessor

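[Figure: (a) a uniprocessor decodes the 2 most significant (MS) bits of the physical address: 00: SRAM 0, 01: SRAM 1, 1x: I/O. (b) a multiprocessor routes a packet (header plus body) by the physical address of the destination carried in its header, through bridges to Node0 through Node3, each with a processor P and SRAM; routing entries are keyed by address prefixes such as 00, 01, 1x, 10, 11.]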
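The decoding in panel (a) can be written out directly. The sketch below assumes an 8-bit physical address; the routing function for panel (b) further assumes that the same 2 MS bits simply name the destination node, which is a simplification made here and not taken from the figure.

```python
ADDR_BITS = 8  # assumed physical address width

def decode(addr):
    """Uniprocessor decoding on the 2 MS bits: 00: SRAM 0, 01: SRAM 1, 1x: I/O."""
    ms = addr >> (ADDR_BITS - 2)
    if ms == 0b00:
        return "SRAM0"
    if ms == 0b01:
        return "SRAM1"
    return "I/O"  # both 10 and 11 match the pattern 1x

def route(dest_addr):
    """Multiprocessor routing: assumed mapping of the 2 MS bits to a node number."""
    return f"Node{dest_addr >> (ADDR_BITS - 2)}"
```

Both functions inspect the same address bits; the difference is only whether the result selects a local device or a network destination.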

Page 10

Address Translation for Transparent Data Migration

• Two methods to support data migration:

(a) Translation table: first consult an indirection/routing table/directory

(b) Cache style: search multiple places in parallel

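[Figure: (a) Translation table lookup: a 3-bit virtual page number indexes a translation table (entries 000 to 111) whose entries point to physical pages 00 to 11 (m00, m01, m10, m11), to disc locations (d_a, d_b), or are invalid. (b) Tag compare/search: a 3-bit virtual block number is compared in parallel against the tags (011, 100, 000, 001) stored with the physical blocks.]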
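The two methods can be contrasted in a small sketch. The table contents below reuse the labels from the figure, but the exact assignment of labels to entries is an assumption, as are all function names.

```python
# (a) Translation table: one indirection, indexed by virtual page number.
translation_table = {
    0b000: "m00", 0b001: "m01", 0b010: "d_a", 0b011: "m11",
    0b100: "d_b", 0b101: "m10", 0b110: None, 0b111: None,  # None marks invalid
}

def translate(vpn):
    location = translation_table[vpn]
    if location is None:
        raise LookupError(f"virtual page {vpn:03b} is invalid")
    return location

def migrate_page(vpn, new_location):
    # Migration only rewrites the table entry; the virtual address is unchanged.
    translation_table[vpn] = new_location

# (b) Cache style: each physical block stores the tag of the virtual block it holds.
block_tags = [0b011, 0b100, 0b000, 0b001]

def search(vbn):
    # Hardware compares all tags in parallel; this loop models the same result.
    hits = [i for i, tag in enumerate(block_tags) if tag == vbn]
    return hits[0] if hits else None

def migrate_block(vbn, new_index):
    # Migration swaps tags (and, in a real system, the data) between two blocks.
    old_index = search(vbn)
    block_tags[old_index], block_tags[new_index] = block_tags[new_index], block_tags[old_index]
```

In both cases the virtual number seen by software stays the same after migration, which is what makes the move transparent.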

Page 11

Progressive Translation: Localize Migration Updates

• Packets carry global virtual addresses

• Tables provide physical route (address) for the next few steps

• When page 9 migrates within D, only tables in that domain need updating

• Variable-size-page translation tables look like internet routing tables (longest-prefix matches if we want small-page-within-big-region migration)

• Tables that partition the system, for protection against untrusted operating systems, look like internet firewalls

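[Figure: a processor P and routing/translation tables Tbl_A, Tbl_B, Tbl_C, and, inside Domain D, Tbl_D1, Tbl_D2, Tbl_D3. Table entries are address prefixes with wildcards (for example 0100, 00xx, 1xxx, 011x, 0010, 000x, 1110, 1111, 110x, 10xx, 1000, 101x, 1001). Page 9 (address 1001) migrates from pg9 to pg9' within Domain D.]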
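A small sketch of progressive translation with longest-prefix matching, assuming 4-bit global virtual addresses written as bit strings and table entries that use x as a wildcard. The table contents and next-hop names are hypothetical and only loosely follow the figure.

```python
def matches(pattern, addr_bits):
    """True if every non-wildcard bit of the pattern equals the address bit."""
    return len(pattern) == len(addr_bits) and all(
        p in ("x", b) for p, b in zip(pattern, addr_bits)
    )

def specificity(pattern):
    """Number of fixed (non-wildcard) bits, used as the prefix length."""
    return sum(c != "x" for c in pattern)

def lookup(table, addr_bits):
    """Return the next hop of the most specific matching entry, or None."""
    candidates = [(specificity(p), hop) for p, hop in table.items() if matches(p, addr_bits)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]

# A table outside Domain D only knows the coarse route toward the domain.
tbl_outside = {"1xxx": "toward Domain D", "0xxx": "elsewhere"}
# A table inside Domain D holds a big region plus a small-page exception.
tbl_domain = {"10xx": "Tbl_D1", "1001": "Tbl_D2"}

before = lookup(tbl_domain, "1001")
tbl_domain["1001"] = "Tbl_D3"  # page 9 migrates within D: only this table changes
after = lookup(tbl_domain, "1001")
```

The outside table is untouched by the migration, which is the locality property the bullets describe; the more specific entry for 1001 overriding 10xx is the small-page-within-big-region case.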

Page 12

Conclusions

• Hardware should provide “Primitives” – not “solutions”

– few, simple, general-purpose, flexibly combinable primitives

• Data transport primitive: Remote DMA

• Synchronization primitive: Remote Enqueue

• Cache operation on top of Network Interface primitives?

• Translation/Routing Tables for Data Migration support