
Playing Distributed Systems with Memory-to-Memory Communication

Liviu Iftode
Department of Computer Science
University of Maryland

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

Most of this work has been done in the Distributed Computing (Disco) Lab at Rutgers University, http://discolab.rutgers.edu

How it all started...

Cost-effective alternative to multicomputers
Commodity networks of high-volume uniprocessor or multiprocessor systems
  track technology best
  low cost/performance ratio
Networking became the headache of this approach
  large software overheads

Multicomputers -> Clusters of computers

Too much OS...

[Figure: on both the send and the receive side, data passes from the application through the OS (with a copy on each side) before reaching the network interface]

Applications interact with the network interface through the OS: exclusive access, protection, buffering, etc.
OS involvement increases latency and overhead
Multiple copies (App -> OS, OS -> App) reduce effective bandwidth

User-Level Protected Communication

[Figure: the application sends and receives directly through the NIC; the OS is bypassed on the data path]

Application has direct access to the network interface
OS involved only in connection setup, to ensure protection
Performance benefits: zero-copy, low overhead
Special support in the network interface

Two User-Level Communication Models

Active Messages: send(local_buffer, remote_handler)
[Figure: the arriving message invokes a handler on the receiving node]

Memory-to-Memory: send(local_buffer, remote_buffer)
[Figure: the arriving message is deposited directly into a buffer on the receiving node]

Memory-to-Memory Communication

Receive operation not required
Also called: (virtually) mapped communication, send-controlled communication, deliberate update, remote write, remote DMA, non-intrusive/silent communication
Application buffers must be (pre)registered with the NIC

send(local_buffer, remote_buffer)

Receiver: export(rem_buf)
Sender:   Rid = import(rem_buf)
          send(local_buf1, Rid)
          send(local_buf2, Rid)
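For concreteness, here is a minimal C sketch of the export/import/send sequence above. The m2m_* names are hypothetical stand-ins for an M2M library (in the spirit of VMMC or VIPL), not a real API; the point is that the receiver never posts a receive.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical M2M library interface (illustrative only): buffers must be
 * exported/registered before they can be the target of a remote write, and
 * the sender must import a handle to the remote buffer before sending. */
typedef uint64_t m2m_rid_t;                        /* handle to a remote buffer    */
int       m2m_export(void *buf, size_t len);       /* receiver: expose the buffer  */
m2m_rid_t m2m_import(int node, const char *name);  /* sender: obtain remote handle */
int       m2m_send(void *local_buf, size_t len, m2m_rid_t rid);

enum { BUF_SIZE = 4096 };

/* Receiver side: export the buffer once; no receive is ever posted --
 * incoming data lands in rem_buf "silently". */
static char rem_buf[BUF_SIZE];
void receiver_setup(void) {
    m2m_export(rem_buf, sizeof rem_buf);
}

/* Sender side: import the remote buffer, then issue one-sided sends. */
void sender(int receiver_node) {
    static char local_buf1[BUF_SIZE], local_buf2[BUF_SIZE];
    m2m_rid_t rid = m2m_import(receiver_node, "rem_buf");
    m2m_send(local_buf1, sizeof local_buf1, rid);   /* remote write #1 */
    m2m_send(local_buf2, sizeof local_buf2, rid);   /* remote write #2 */
}
```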

M2M Communication History

Started both in universities (SHRIMP-Princeton, U-Net-Cornell) and in industry (Hamlyn-HP, Memory Channel-DEC)
First application: High-Performance Computing
  Software DSM: HLRC (Princeton), Cashmere (Rochester)
  Lightweight message-passing libraries
Lightweight transport layer for cluster-based servers and storage
Industrial standards
  Virtual Interface Architecture (VIA)
  InfiniBand I/O Architecture
  Direct Access File System (DAFS) Protocol

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

What is VIA?

M2M communication architecture similar to U-Net and VMMC/SHRIMP
Standard initiated by Compaq, Intel, and Microsoft in 1997 as a cluster interconnect
Point-to-point, connection-oriented protocol
Two communication models
  send/receive: a pair of descriptor queues
  M2M: RDMA write and RDMA read
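To make the two models concrete, here is a hedged C sketch. The vi_* helpers are simplified, hypothetical wrappers loosely modeled on a VIA-style interface; the real VIPL descriptor structures and signatures are more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical wrappers around a VIA-like interface
 * (illustrative only -- not the actual VIPL signatures). */
typedef struct vi_handle  vi_t;
typedef struct mem_handle mh_t;
mh_t *vi_register(vi_t *vi, void *addr, size_t len);   /* pin + register buffer */
int   vi_post_recv(vi_t *vi, mh_t *buf, size_t len);   /* descriptor queue      */
int   vi_post_send(vi_t *vi, mh_t *buf, size_t len);
int   vi_rdma_write(vi_t *vi, mh_t *local, size_t len,
                    uint64_t remote_addr, uint32_t remote_handle);
int   vi_wait_completion(vi_t *vi);

/* Model 1: send/receive -- the receiver must have pre-posted a matching
 * receive descriptor for the message to land anywhere. */
void send_recv_example(vi_t *vi, void *msg, size_t len) {
    mh_t *m = vi_register(vi, msg, len);
    vi_post_send(vi, m, len);
    vi_wait_completion(vi);
}

/* Model 2: RDMA write -- data lands directly in a remote registered
 * buffer; the remote CPU posts nothing and is not interrupted. */
void rdma_write_example(vi_t *vi, void *msg, size_t len,
                        uint64_t raddr, uint32_t rhandle) {
    mh_t *m = vi_register(vi, msg, len);
    vi_rdma_write(vi, m, len, raddr, rhandle);
    vi_wait_completion(vi);          /* completion is local to the sender */
}
```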

Virtual Interface Architecture

[Figure: a Virtual Interface consists of a send queue, a receive queue, and a completion queue; the application posts descriptors through the VI User Library directly to the VI NIC, while the Kernel Agent is involved only in setup and memory registration]

Data transfer at user level
Polling or interrupt for completions
Setup and memory registration go through the kernel

InfiniBand: An I/O Architecture with M2M

Point-to-point, switch-based I/O interconnect to replace the bus-based I/O architecture of servers
  more bandwidth
  more protection
Trade association founded by Compaq, Dell, HP, IBM, Intel, Microsoft and Sun in 1999
M2M communication similar to VIA
  RDMA write, RDMA read
  Remote atomic operations

InfiniBand I/O Architecture

[Figure: processor and memory attach to a switched I/O fabric through a Host Channel Adapter (HCA); I/O modules attach through Target Channel Adapters (TCAs)]

Hardware protocols for message passing between devices, implemented in the channel adapters
A channel adapter (CA) is a programmable DMA engine with special protection features that allow DMA operations to be initiated both locally and remotely

M2M Communication in InfiniBand

Memory region: virtually contiguous area of memory registered with the channel adapter (L_key)
Memory window: protected remote access to a specified area of a memory region (R_key)
Remote DMA Read/Write {L_key, R_key}

[Figure: an RDMA operation moves data between a local memory region (addressed with the Local_key) and a remote memory window (addressed with the Remote_key), each backed by physical memory]
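As an illustration of how the {L_key, R_key} pair is used, here is a sketch of an RDMA write using the Linux ibverbs API, a later concrete realization of the InfiniBand verbs that postdates this talk. The queue-pair setup and the peer's exchange of (remote_addr, rkey) are assumed to happen elsewhere, and error handling is minimal.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Issue an RDMA write from a local buffer into a remote buffer identified
 * by (remote_addr, rkey). `qp` is assumed to be an already-connected
 * reliable-connection queue pair and `pd` its protection domain. */
int rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    /* Register the local buffer: this pins it and yields the local key. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,               /* L_key from registration */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;  /* target window address      */
    wr.wr.rdma.rkey        = rkey;         /* R_key exported by the peer */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```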

InfiniBand Work Queue Operations

InfiniBand Communication Stack

Direct Access File System

Lightweight remote file access protocol designed to take advantage of M2M interconnect technologies
The DAFS Collaborative, a group of 85 companies, proposed the standard in 2001
High performance
  Optimized for high throughput and low latency
  Transfers directly to/from user buffers
  Efficient file sharing using lock caching
Network-attached storage solution for data centers

DAFS Model

[Figure: on the client, the application uses a file access API over a user-level DAFS client and VIPL, with the VI NIC driver in the kernel; on the server, the DAFS file server runs over KVIPL and the VI NIC driver; data moves directly between client and server buffers through the NICs]

DAFS vs Traditional File Access Methods

M2M Product Market

VIA: Emulex (formerly Giganet)
InfiniBand: Mellanox
DAFS software distributions: Duke, Harvard, British Columbia, Rutgers (soon)

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

Software DSM over VIA

Execution model: one application process on each node of the cluster
Invalidation-based memory coherence at page granularity, using VM page protection
Data and synchronization traffic over VIA

[Figure: each node holds code and data; the nodes' memories form a shared virtual address space over the VIA interconnect]
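A minimal sketch of how page-granularity coherence can be driven by VM page protection, using the standard POSIX mprotect/SIGSEGV mechanism. The fetch_page_from_home hook is hypothetical; in a VIA-based DSM it would be implemented with M2M transfers.

```c
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define DSM_PAGE_SIZE 4096UL

/* Hypothetical DSM hook: bring an up-to-date copy of the page from its
 * home node (for example via a page request message or an RDMA read). */
extern void fetch_page_from_home(void *page_addr);

/* Fault handler: touching an invalidated (PROT_NONE) page traps here.
 * Fetch the page, restore access, and return so the faulting
 * instruction is retried. */
static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(DSM_PAGE_SIZE - 1));
    fetch_page_from_home(page);
    mprotect(page, DSM_PAGE_SIZE, PROT_READ | PROT_WRITE);
}

void dsm_install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}

/* Invalidation: revoke access so that the next access faults. */
void dsm_invalidate(void *page)
{
    mprotect(page, DSM_PAGE_SIZE, PROT_NONE);
}
```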

Home-based Data Coherency using VIA

[Figure: Node 1 writes page A and RDMA-writes a diff to the home of A, together with an RDMA page invalidation toward Node 2; when Node 2 later reads A, the home RDMA-writes the whole page to Node 2]
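The diff step above can be sketched as follows: compare the dirty page against the clean "twin" saved at the first write fault and push only the modified runs to the home copy. The rdma_write_to_home helper is a hypothetical stand-in for a VIA RDMA write into a buffer the home exported at registration time.

```c
#include <stdint.h>
#include <string.h>

#define DSM_PAGE_SIZE 4096UL

/* Hypothetical transport hook: one-sided write of `len` bytes into the
 * home node's copy of the page at byte offset `off`. */
extern void rdma_write_to_home(void *home_handle, size_t off,
                               const void *src, size_t len);

/* Home-based lazy release consistency, diff propagation: only the bytes
 * that differ from the twin are sent to the home. */
void propagate_diff(void *home_handle,
                    const uint8_t *page, const uint8_t *twin)
{
    size_t i = 0;
    while (i < DSM_PAGE_SIZE) {
        if (page[i] != twin[i]) {
            size_t start = i;
            while (i < DSM_PAGE_SIZE && page[i] != twin[i])
                i++;                              /* extend the modified run */
            rdma_write_to_home(home_handle, start,
                               page + start, i - start);
        } else {
            i++;
        }
    }
}
```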

Lessons about M2M from DSM

Silent communication: used for diff propagation
Low latency: 75% of messages are small
Copy avoidance: not always possible
Useful but not available:
  Scatter-gather support
  Remote read (RDMA Read)
  Broadcast support

Scatter-Gather Support

[Figure: what VIA supports (gathering multiple source buffers into one contiguous destination), what the DSM protocol does instead (two separate messages), and what is needed (one message scattering multiple source buffers into multiple destination buffers)]

"True" scatter-gather can avoid multiple message latencies
Potential gain of 5-10%

RDMA Read

Allows fetching of data without involving the processor of the remote node
Potential gain of 10-20%

[Figure: without RDMA Read, a page request must be handled by the remote processor, adding scheduling delay and handling time before the page is returned; with RDMA Read, the page is fetched directly]
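The hypothetical fetch_page_from_home hook from the earlier DSM sketch is one place an RDMA read would slot in. The sketch below mirrors the earlier RDMA-write example, again using the later Linux ibverbs API; a connected queue pair and a local registration are assumed, and the remote buffer must have been registered with remote-read access.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Fetch `len` bytes from (remote_addr, rkey) into a local registered
 * buffer using RDMA Read; the remote processor is not involved.
 * `mr` is the registration of `local_buf` (it needs LOCAL_WRITE access
 * because the adapter writes the fetched data into it). */
int rdma_read_page(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *local_buf, size_t len,
                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```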

Broadcast Support

Useful for the software DSM protocol
  Eager invalidation propagation
  Eager update of data
Previous research (Cashmere '00) speculates a gain of 10-15% from the use of broadcast

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

TCP is really bad for Intra-Server Communication

TCP can easily steal 30-50% of the host cycles:
  of a 1 GHz processor, only 500-700 MHz worth of cycles are left for the application
  the processor saturates before the NIC
A TCP Offload Engine (TOE) solves the problem only partially:
  without TOE: 90% of a dual 1 GHz processor is required to achieve 2 x 875 Mb/s of bandwidth
  with TOE: 52% of a dual 1 GHz processor is required to obtain 1.9 Gb/s of Ethernet bandwidth
  with M2M (Mellanox InfiniBand): 90% of the 3.8 Gb/s bandwidth using only 7% of an 800 MHz processor

Distributed Intra-Cluster Protocols using M2M

Direct Access File System (DAFS): network-attached storage over VIA/IB
Sockets Direct Protocol (SDP): lightweight transport protocol over VIA/IB
SCSI RDMA Protocol (SRP): connects servers to storage area networks over VIA/IB
Ongoing industry debate
  "TCP or not TCP?" = "IP or M2M network?"

Distributed Intra-Cluster Server Applications using M2M

Cluster-based web servers
Storage servers
Distributed file systems

Cluster-based Web Server: Press

Location-aware web server with request forwarding and load balancing (Bianchini et al., Rutgers)

[Figure: each cluster node runs the server components (main, send, recv, fs) with a local disk; clients reach the cluster over Ethernet/TCP (eth0), while intra-cluster traffic uses VIA over cLAN]

Performance of VIA-based Press Web Server

[Figure: throughput (replies/sec) on the Clarknet, Forth, Nasa, and Rutgers traces for the TCP/FE, TCP/cLAN, and VIA/cLAN configurations]

[Carrera et al., HPCA'02]

Lessons about M2M from Web Servers

M2M/VIA used both for small messages (requests, cache summaries, load) and for large messages (files)
Low overhead is the most beneficial feature
Trading off transparency for performance is necessary
Zero copy is sometimes traded off against the number of messages (in the absence of scatter-gather)

VI-Attached Storage Server

M2M for the database-storage interconnect (Zhou et al.)

[Figure: a database server connects over a VI network to storage servers, each with a VI interface and local disks]

Database Performance with VI-Attached Storage Server

FC driver highly optimized by the vendor
cDSA outperforms it by 18%

[Figure: normalized TPC-C transaction rate for Fibre Channel, kDSA, wDSA, and cDSA]

[Zhou et al., ISCA'02]

Lessons about M2M from Storage Servers

Zero copy and low overhead are the most beneficial features
Trade off transparency for performance
  extend the I/O API (asynchronous I/O, buffer registration) and/or relax I/O semantics (I/O completion)
  requires application modifications
Missed VIA features
  no flow control
  no buffer management
Serious competition: iSCSI

Federated File System (FedFS)

Global file namespace for distributed applications, built on top of autonomous local file systems

[Figure: applications A1-A3 run across cluster nodes; FedFS spans the nodes' local file systems over the M2M interconnect]

Location-Independent Global Naming

Virtual Directory (VD): union of local directories, created on demand (dirmerge) and volatile
Directory Table: local cache of VDs (analogous to a TLB)

[Figure: local directories /usr containing file1 and file2 on different nodes are merged into a virtual directory /usr containing both file1 and file2]

Role of M2M in FedFS

Directory Table - Virtual Directory coherency
Cooperative caching
File migration
DAFS + VIA/IP = FedFS over the Internet

[Figure: FedFS instances at multiple sites connected through DAFS over VIA/IP]

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

M2M for Fault Tolerance and Availability

Use RDMA write to efficiently mirror an application's virtual address space in remote memory: Fast Cluster Failover [Zhou et al.]
  fast checkpointing
  fast failover
Use RDMA read for "silent" state migration: Migratory TCP
  extract checkpoints from overloaded servers with zero overhead

TCP-based Internet Services

Adverse conditions affect service availability
  internetwork congestion or failure
  servers overloaded, failed, or under DoS attack
TCP has one response
  network delays => packet loss => retransmission
TCP limitations
  early binding of a service to a server
  the client cannot dynamically switch to another server for sustained service

The Migratory TCP Model

[Figure: a client's connection to Server 1 migrates to Server 2]

Migratory TCP: At a Glance

Migratory TCP's answer to network delays: migrate the connection to a "better" server
The migration mechanism is generic (not application specific), lightweight (fine-grained migration of per-connection state), and low latency
Requires changes to the server application, but is totally transparent to the client application
Interoperates with existing TCP

Per-connection State Transfer

[Figure: per-connection application state and M-TCP protocol state are transferred from Server 1 to Server 2, potentially using RDMA]

Application - M-TCP "Contract"

Server application
  Define per-connection application state
  During connection service, export snapshots of the per-connection application state when it is consistent
  Upon acceptance of a migrated connection, import the per-connection state and resume service
Migratory TCP
  Transfer per-connection application and protocol state from the old to the new server and synchronize (here is where VIA/IP can help!)
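A minimal sketch of the server side of this contract. The mtcp_* calls and the conn_state layout are hypothetical illustrations of the export/import idea, not the actual M-TCP API.

```c
#include <stddef.h>

/* Hypothetical M-TCP hooks (illustrative only): the server defines a
 * per-connection state blob, exports a snapshot whenever that state is
 * consistent, and imports it when it accepts a migrated connection. */
struct conn_state { long bytes_served; long app_cursor; };
void mtcp_export_state(int sock, const void *state, size_t len);
int  mtcp_import_state(int sock, void *state, size_t len);

/* Old server: after each fully processed request the per-connection
 * state is consistent, so snapshot it for a possible migration. */
void serve_request(int sock, struct conn_state *st) {
    /* ... process one request, update *st ... */
    mtcp_export_state(sock, st, sizeof *st);
}

/* New server: a migrated connection arrives with its snapshot; import
 * it and continue exactly where the old server left off. */
void resume_connection(int sock) {
    struct conn_state st;
    if (mtcp_import_state(sock, &st, sizeof st) == 0) {
        /* ... resume service using st.app_cursor ... */
    }
}
```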

Lazy Connection Migration

[Figure: (0) client C holds a connection to Server 1; (1) the client sends <SYN C,...> to Server 2; (2) Server 2 sends a <State Request> to Server 1; (3) Server 1 returns a <State Reply> with the connection state C'; (4) Server 2 replies <SYN + ACK> to the client. In the future, the state transfer in steps (2)-(3) can be an RDMA read.]

Future work: Connection Migration using M2M

[Figure: the same migration, with the per-connection state C' transferred either by an RDMA read from Server 2 (lazy) or by an RDMA write from Server 1 (eager)]

Stream Server Experiment

Effective throughput stays close to the average rate seen before the server's performance degrades (without VIA)

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

TCP Servers: TCP Offloading for Cluster-Based Servers

[Figure: the application on the host issues socket API calls that are shipped over VIA to a TCP Server node, which performs the network processing]

Implementation Details

[Figure: the host tunnels socket calls over VIA (through VIA NICs) to the TCP Server, which executes them against its OS and BSD sockets stack and a conventional NIC attached to the WAN]

Sockets, VI Channels, and Buffers

[Figure: each socket is backed by a VI channel; data moves between socket buffers using SEND/RECEIVE descriptors or RDMA]

Extended API

Standard API
  tcps_socket, tcps_send, tcps_recv, ...
Extended API
  Memory registration: tcps_register_memory, tcps_deregister_memory
  Asynchronous send: tcps_send_async_registered, tcps_io_done, tcps_io_wait
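A hedged usage sketch of the extended API. The tcps_* names come from the slide, but the signatures and the REPLY_SIZE buffer below are assumptions made for illustration, not the actual TCP Server interface.

```c
#include <stddef.h>

/* Assumed signatures (illustrative only). */
int tcps_socket(int domain, int type, int protocol);
int tcps_register_memory(void *buf, size_t len);
int tcps_send_async_registered(int sock, void *buf, size_t len);
int tcps_io_done(int sock, int io_id);           /* poll for completion  */
int tcps_io_wait(int sock, int io_id);           /* block until complete */

enum { REPLY_SIZE = 64 * 1024 };
static char reply[REPLY_SIZE];

/* Register a reply buffer once, then overlap request processing with an
 * asynchronous, zero-copy send out of that registered buffer. */
void send_reply(int sock, size_t len) {
    static int registered;
    if (!registered) {
        tcps_register_memory(reply, sizeof reply);  /* pin once, reuse */
        registered = 1;
    }
    int io = tcps_send_async_registered(sock, reply, len);
    /* ... build the next reply while the TCP Server pulls this one ... */
    tcps_io_wait(sock, io);         /* reclaim the buffer before reuse */
}
```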

Asynchronous Send Processing

[Figure: the host pipelines send requests to the TCP Server over a VI; as it processes each request, the TCP Server RDMA-writes back to the host the results of previous sends, flow-control information, and exported buffers, then returns]

Send in TCP Server Architecture

[Figure: timeline (in us) of a send across the host application, VIA, and the TCP Server: tcps_send() at 0, pre-processing at 3, Post Send at 5, Async Send returns at 14, request received by the TCP Server at 32, return from the TCP Server's send() at 61, Recv Wait at 89, Sync Send returns at 90]

HTTP/1.0 Static Workloads

[Figure: throughput (replies/sec) vs. offered load (400-1000 requests/sec) for the Regular, Sync, AsyncSend, AsyncSend+EAccept, ERecv, and AsyncSend+ERecv+EAccept configurations]

Mixed Loads - Throughput

[Figure: throughput (replies/sec) vs. offered load (200-700 requests/sec) for the Regular, Sync, AsyncSend, AsyncSend+EAccept, ERecv, and AsyncSend+ERecv+EAccept configurations]

HTTP/1.1 Throughput

[Figure: throughput (replies/sec) vs. offered load (800-1600 requests/sec) for the Regular, Standard API, Sync, AsyncSend, and AsyncSend+EAccept configurations]

Traditional Computer System

Intelligent host and passive I/O devices
OS executed exclusively on the host, along with the applications
I/O devices communicate only through the host memory

[Figure: applications, the file system, and the network protocols all run on the host processor and memory; the storage controller and network adapter are passive devices on the I/O bus]

The Cost of OS-Application Co-habitation

OS "steals" compute cycles and memory from applications
Two protection modes: switching overhead
OS executed asynchronously
  interrupt processing overhead
  internal synchronization on multiprocessor servers
Cache pollution
Host involved in "service work"
  TCP packet retransmission
  TCP ACK processing
  ARP request service
Extreme cases are even worse
  receive livelock
  denial-of-service (DoS) attacks

Host mediates data transfer between devices

[Figure: data moving between the disk and the network interface passes through host memory: file buffer, application buffer, and network buffer, managed by the OS and the application]

Server = Cluster of Intelligent Devices

[Figure: the host (CPUs and memory), intelligent storage (I-STORAGE: CPU, memory, disk), and intelligent NIC (I-NIC: CPU, memory, NIC) are peers connected by an InfiniBand fabric]

Split-OS Idea

[Figure: the OS is split across devices: the application and core OS run on the host, the file system runs on I-STORAGE, and TCP/IP runs on the I-NIC, connected by remote DMA]

Networking in Conventional OS

[Figure: the application and OS run on the host; send/receive buffers live in host memory; the network interface DMAs packets to and from host memory and interrupts the host, which also processes acknowledgements]

Split-Networking

[Figure: with an intelligent NIC, send/receive buffers live on the I-NIC; the host keeps only backup buffers and talks to the I-NIC over InfiniBand; network packets terminate at the I-NIC]

Split-Networking

Minimum overhead on the host: only the communication between the application and the network interface
Retransmission and ACK processing handled in the intelligent network interface
Interrupts are eliminated
Receive livelock is avoided (no interrupts)
DoS attacks can be absorbed in the network interface
Send and receive buffers kept in the network interface as long as possible
  an optimal replacement policy can be implemented
  retransmission buffers may be evicted and written back to the host (non-intrusively, using RDMA)
  receive buffers can be eagerly transferred to the host, or discarded on overflow

Direct Device-to-Device Communication

[Figure: transfer(file, socket, size): (1) the host binds the socket to a VI channel, (2) binds the file to the same VI channel, (3) I-Storage RDMA-writes the file data directly to the I-NIC]

Transfer a file to a socket, bypassing the host
The OS creates a channel and binds the socket and the file to it
Direct device-to-device transfer conflicts with caching in host memory
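A control-path sketch of the transfer(file, socket, size) primitive from the figure. The d2d_* helpers are hypothetical stand-ins mirroring the three steps on the slide; the host only sets up the bindings, after which data moves device to device.

```c
#include <stddef.h>

/* Hypothetical device-to-device setup calls (illustrative only). */
typedef struct vi_channel vi_channel_t;
vi_channel_t *d2d_create_channel(void);
int d2d_bind_socket(vi_channel_t *ch, int sock);       /* step 1: socket -> VI */
int d2d_bind_file(vi_channel_t *ch, int fd);           /* step 2: file   -> VI */
int d2d_start_transfer(vi_channel_t *ch, size_t len);  /* step 3: I-Storage
                                                           RDMA-writes to I-NIC */

int transfer(int fd, int sock, size_t size)
{
    vi_channel_t *ch = d2d_create_channel();    /* OS creates the channel */
    if (!ch)
        return -1;
    if (d2d_bind_socket(ch, sock) != 0 || d2d_bind_file(ch, fd) != 0)
        return -1;
    /* From here on, file blocks flow from the intelligent storage device
     * straight to the intelligent NIC, bypassing host memory entirely. */
    return d2d_start_transfer(ch, size);
}
```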

Outline

The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions

Why is M2M Good?

Low overhead for the sender
Zero overhead for the receiver (with RDMA)
Low latency: especially good for small messages
Zero copy from/to registered buffers

M2M Pitfalls

No flow control
  not necessary for round-trip or overwrite-type messages
Registration is expensive
Zero copy
  limited by registration capacity
  copying is better when several small messages must be sent
Best performance requires an M2M-aware application/protocol

What would be good to have

Remote read, to silently fetch remote data
Remote atomics, to allow buffer sharing
Scatter-gather
Hardware flow control

Open Questions

Blocking vs. spinning on I/O completion
User-level vs. kernel implementation
VIA/IP & IB/IPv6 vs. IP with TOE
  storage networking: IB vs. iSCSI