
Network Function Virtualization and Messaging for Non-Coherent Shared Memory Multiprocessors

Mike Schlansker, Jean Tourrilhes, Sujata Banerjee, Puneet Sharma
Hewlett Packard Labs
HPE-2016-44

Keyword(s): NFV; Cluster; Datacenter; Non-Coherent Shared Memory

Abstract: This technical report describes a datacenter-scale processing platform for Network Function Virtualization. The platform implements high performance messaging on a non-coherent shared memory fabric. This includes both fast messaging features and get and put operations that implement high performance remote memory access at datacenter scale. The report explores NFV as an important application that can be accelerated on this platform.

External Posting Date: April 28, 2016 [Fulltext] Internal Posting Date: April 28, 2016 [Fulltext]

Copyright 2016 Hewlett Packard Enterprise Development LP


Network Function Virtualization and Messaging for Non-Coherent Shared Memory Multiprocessors

Mike Schlansker, Jean Tourrilhes, Sujata Banerjee, Puneet Sharma

Hewlett Packard Enterprise Labs

1. Non-Coherent Shared Memory Multiprocessors

There are a number of advantages to multiprocessor hardware architectures that share memory. In these architectures, a large number of processors share memory to support efficient and flexible communication within and between processes running on one or more operating systems. At small scale, this established and mature shared memory multiprocessor (SMP) technology is used in multi-core processor chips from multiple hardware vendors.

When shared memory is deployed over a very large number of processors, significant benefits are possible. Large-scale shared memory machines offer the potential for fine-grained non-volatile data sharing across large systems that is not possible with traditional cluster computers using fast networks. These systems exploit principles of Memory Driven Computing (MDC) [Bresniker], which uses large-scale, persistent, and word-addressable storage to support important big data processing applications. MDC architectures allow high-performance word-level access to a large persistent data store that is shared across many compute nodes. These shared memory architectures exploit the benefits of emerging word-addressable non-volatile storage devices such as the memristor.

However, the potential benefits of large-scale shared memory come with significant obstacles. Previous coherent shared memory architectures such as the SGI Origin [Laudon] and Stanford DASH [Hennesy] established principles for scaling coherent caches across large multiprocessor systems. But, years of experience have shown that preserving coherence across large memory fabrics introduces performance limitations. Cache coherence using snooping or distributed cache directories requires complex cache management protocols that send messages to invalidate or update copies of cached data whenever new data is written. The performance of such large shared memory systems is hard to predict and often disappointing, as programmers fail to recognize hidden bottlenecks that arise when they write parallel code that generates significant cache coherence traffic. Programs run slowly while they wait for hidden coherence signals needed to preserve a common view of shared data across the distributed hardware.

Prior work has investigated the design of non-coherent multiprocessor systems [Yang]. We are investigating this style of architecture as we explore non-coherent systems similar to those explored in the Hewlett Packard Enterprise “The Machine” project [Packard]. These architectures have the potential to provide high performance at much larger communication scale than traditional shared memory machines. In this report, we explore communication and networking software for similar large-scale non-coherent shared-memory architectures.

Figure 1 illustrates a non-coherent shared memory multiprocessor system. The system includes multiple shared memory multiprocessor nodes, where each node is a coherent shared memory multiprocessor along with its local RAM. Each node is a conventional multicore computer system that runs an operating system, possibly with virtualization software for guest VMs. In addition, large-scale non-volatile RAM storage is attached to the fabric.

Each node provides a node-to-fabric interface which connects all nodes in a single load-store domain through a Non-Coherent Memory Fabric (NCMF). When loads and stores are executed on the same SMP node, conventional memory coherence is preserved between these operations. A load operation on any processor within the node can address any given local or non-local memory address. The load will always see any value resulting from a prior store to the same memory location. This coherence property arises because SMP hardware ensures that all caches within the SMP preserve a common, up-to-date view of memory. But, memory behavior is more complex when loads and stores are executed on distinct nodes, because the non-coherent memory fabric design avoids the high hardware cost and performance loss needed to provide fabric-wide coherence.

Figure 1 – Non-Coherent Shared Memory Multiprocessor. (The figure shows SMP nodes, each containing a coherent SMP, local RAM, a fabric interface, and an Ethernet interface. The fabric interfaces connect through fabric switches to form the Non-Coherent Memory Fabric (NCMF), which also attaches non-volatile RAM through memory interfaces. The Ethernet interfaces connect through Ethernet switches to an Ethernet fabric, and the load-store domain connects to other LSDs.)

Non-coherent shared memory allows simplified hardware. As a result, inter-node communication requires that complex memory coherence operations be explicitly programmed in software. Store operations executed on a source node may not be visible to load operations executed on a destination node unless explicit cache operations are executed to enforce the flow of data from the source node’s cache through the fabric and to the desired memory location.

Non-coherent shared memory makes expensive cross-node coherence transactions visible to programmers. If programmers fail to understand the flow of data between the distributed caches within the compute nodes, subtle and difficult-to-detect program bugs will arise. But, with deep knowledge of data flow, the programmer is positioned to modify programs to reduce the number of explicit cache flush, cache invalidate, or other operations needed to force the exchange of data between processors. This can improve program performance and scalability.
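To make the explicit cache management concrete, the following C sketch shows a producer on one node publishing data to a consumer on another node through a non-coherent shared region. The primitives ncm_flush_range() and ncm_invalidate_range() are hypothetical placeholders for whatever flush and invalidate operations the actual platform exposes; this is a minimal illustration, not the platform's API.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical low-level primitives for a non-coherent memory fabric:
 * flush dirty cache lines out to fabric memory / discard stale cached copies. */
void ncm_flush_range(const void *addr, size_t len);
void ncm_invalidate_range(const void *addr, size_t len);

struct shared_buf {
    volatile uint64_t ready;     /* set by the producer after data is globally visible */
    uint8_t payload[4096];
};

/* Producer node: write the payload, push it out of local caches, then publish. */
void producer(struct shared_buf *b, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        b->payload[i] = src[i];
    ncm_flush_range(b->payload, len);                        /* data first ...      */
    b->ready = 1;
    ncm_flush_range((const void *)&b->ready, sizeof(b->ready)); /* ... then the flag */
}

/* Consumer node: poll the flag, then discard stale cached lines before reading. */
void consumer(struct shared_buf *b, uint8_t *dst, size_t len)
{
    do {
        ncm_invalidate_range((const void *)&b->ready, sizeof(b->ready));
    } while (b->ready == 0);
    ncm_invalidate_range(b->payload, len);
    for (size_t i = 0; i < len; i++)
        dst[i] = b->payload[i];
}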


Each node provides an Ethernet interface to support global scale communications. Ethernet may be required for communication across large datacenters that exceed the appropriate scope for a non-coherent shared memory architecture. We plan to integrate communications software for shared memory with communications software for traditional network communications in order to provide an architecture that exploits non-coherent shared memory benefits and yet still achieves arbitrarily large scale. This document currently focuses on shared memory communications.

2. NFV for Large Non-Coherent Multiprocessors

Future non-coherent multiprocessor architectures provide a number of important features that may prove useful for future NFV processing needs. First, they are based on a coherent shared memory multiprocessor (SMP) single-node building block. These SMPs run multiple hardware threads of execution which share local DRAM and on-chip caches. The single-node SMP provides a high-performance, parallel, single-chip processing platform for NFV processing. Threads of execution share memory efficiently through on-chip processor caches and local DRAM. Techniques such as avoiding packet copying and passing packet streams by pointer can be used to design efficient single-node NFV processing systems. Any efficient NFV system must be built on a powerful and efficient single-node NFV processor.

But, conventional SMPs do not scale. For NFV applications that require complex packet processing at very high data rates, NFV systems will require the use of many nodes (or SMP processors). The traditional way to scale processing beyond a single node is to create packet processing pipelines that stream across Ethernet between nodes. This requires packet copying and expensive processing between nodes and results in wasted CPU cycles and wasted memory space needed to transfer and copy packets. An alternative approach is to use the enhanced scalability of non-coherent shared memory systems to define a large-scale system for NFV processing.

2.1. NFV Processing System Architecture

Figure 2 below shows an abstract system that is similar to Figure 1, but reorganized to identify important sub-systems. For simplicity, Ethernet connectivity is not shown. We define our NFV machine (SMPM) as a conventional shared memory multiprocessor, as previously used by the NetVM project [Hwang] to process chained NFV services. The SMPEXM extended machine is a larger non-coherent system that we use to explore function chaining over a large-scale non-coherent shared memory. We often refer to the SMPMs as nodes within a larger SMPEXM system. We use these definitions to distinguish between coherent, modest-scale NFV parallel processing and non-coherent, large-scale NFV processing techniques.

Modern high performance computing systems use InfiniBand or Converged Ethernet to implement lossless messaging transport across large datacenter network fabrics [Vienne]. These lossless fabrics support RDMA-style programming models that we adopt in this work. We hope to improve on this style of remote memory access using closer integration of memory access capabilities into computer system hardware and software. Prior research also explored supporting messaging and RDMA-style communications using Ethernet NICs that are attached directly to the coherent memory of a multiprocessor chip along with a conventional Ethernet network [Schlansker].

Our SMPEXM system is similar to the “Scale-Out NUMA” design presented at ASPLOS 2014 [Novakovic]. However, at the lowest hardware level, our design uses native hardware support for non-coherent remote memory operations. This includes direct hardware support for processor-initiated load and store operations that are mapped to remote physical addresses and traverse the SMPEXM non-coherent memory fabric, as supported in The Machine [Packard].

Figure 2: Extended Machine - SMPEXM. (The figure shows several SMPM nodes, each a multi-CPU processor with local RAM, its own OS instance, and an Ethernet interface, connected through the non-coherent memory fabric to each other and to fabric-attached NVM, all forming the SMPEXM extended machine.)

Sections 3-8 below describe the ZMSG software architecture, which is designed to implement messaging for large non-coherent shared-memory multiprocessors. ZMSG is intended as a general tool for multiple applications. Section 9 describes techniques to accelerate NFV service chaining using ZMSG with a large non-coherent shared memory hardware system.

The remainder of this section outlines some of the research opportunities for optimized NFV processing that can be explored on this architecture.

2.2. Eliminating Copies in Shared Memory Systems

Memory-based inter-node communication has the potential to enhance overall processing efficiency. We are developing communication techniques, built on shared memory, that use fast messaging or other approaches for inter-process communication among nodes in a non-coherent memory system. This eliminates software overheads needed to pass data through an Ethernet device. Memory-based transport is lossless, eliminating the need for software to retransmit control messages or any other messages that cannot be dropped.

Node-to-node copying of packets is particularly wasteful when most of the packet data is simply carried along and is only needed in rare cases that require deeper inspection, or when the packet reaches a final output and is retransmitted back onto a network. Network Functions (NFs) sometimes process the packet header while packet data is copied without inspection. Often, a few packets that start a new flow receive special inspection to determine subsequent processing for all packets in the flow. In these cases, there are good opportunities to use shared memory to reduce unnecessary copying of packet data.

Non-coherent-fabric architectures enable alternative NFV solutions. NFV architects can make careful choices about when to copy full packet data and when to copy much shorter packet metadata along with a packet handle. The packet handle allows subsequent access to the full packet when needed. A packet handle can be passed from node to node within a non-coherent load-store domain and used by any node in the packet processing pipeline. The handle can be translated to a local address and used to directly copy detailed packet data across the memory interconnect. Cross-fabric access is performed using software library functions that use low-level non-coherent memory operations to bypass local caches and reach directly across the non-coherent memory fabric to get data from its remote memory location. While not as efficient as local cache or local DRAM access, non-coherent access can be more efficient than copying packet data using high performance networking techniques, such as RDMA, to remotely access data.

2.3. Using Non-volatile Storage

Traditional NFV systems use volatile DRAM to temporarily store high bandwidth packet streams. This transient data cannot be stored for a substantial length of time due to the limited availability of DRAM storage. However, future NFV goals may include rapid write or read access to large amounts of persistent data. Persistent data could be sample packets, sample flow traces, derived statistics, rule databases, or other packet processing measurements or instructions that are large, must be quickly accessed, and cannot be easily reconstructed if lost due to a power outage.

New NVRAM technologies provide a combination of benefits not possible with prior memory and disk technologies. When the amount of data that needs to be stored becomes excessive, DRAM becomes very expensive and consumes too much power. When extracted data is randomly stored or accessed in a database, disk seek latency imposes severe bottlenecks. Word-addressable NVRAM devices circumvent these key limitations and may provide new persistent data storage possibilities for packet stream processing.

Often, packet processing is about finding rare events within a large volume of data much like searching for the proverbial needle in a haystack. When a packet stream enters the NFV system, little is known regarding the long-term need for the packet. But, as processing proceeds, a flow might be determined as malicious, it might be identified as a target of “lawful intercept”, or other triggers may identify a flow or set of packets as one that requires rigorous analysis and long term storage. NVRAM augments the storage hierarchy with new high-bandwidth and low-latency capabilities that can be considered for use in NFV systems. This allows rapid data storage and access that survives power outages or system restarts and may assist in some types of NFV processing.

2.4. Security in Shared Memory Systems

Security is a critically important design feature for future large-scale computer systems. The consolidation of computing needs onto shared hardware infrastructure requires strong access control and partitioning, as clients with conflicting business interests often share hardware. We hope to better understand the secure processing needs of NFV systems and to explore how these needs relate to secure communication APIs, secure system-wide memory allocation, and other security issues for large-scale multiprocessor systems.

3. Messaging is important

A key component of our architecture work focuses on implementing high performance messaging for large non-coherent shared memory multiprocessors. Efficient communication across the extended machine is essential to most tasks. Our low-level messaging architecture is designed to support both modern high performance messaging interfaces and traditional software interfaces such as TCP/IP sockets. A goal of this effort is to design and evaluate non-coherent memory architectures, along with appropriate messaging interfaces, as a high performance platform for important NFV applications, and to tune and refine software and hardware prototypes for these applications.

3.1. Messaging simplifies programs

Shared memory has been studied as a complex and powerful parallel programming abstraction for about half a century. Even after all this work, highly parallel programs built on shared memory are generally recognized as hard to scale, verify, and debug. Shared memory parallel programs often exhibit problems such as decreasing performance with increasing scale, race conditions, deadlocks, starvation, and other hard-to-debug program errors.

Techniques are needed that exploit shared memory hardware without unleashing the full complexity of shared memory programming across all applications. The construction of parallel programs using messaging is a trusted technology, in both scientific and commercial applications, that limits program complexity and helps us understand and predict program correctness and performance. The use of messaging does not preclude more complex shared memory programming techniques, which can still be used in critical program sections that are carefully written to exploit special shared memory benefits while avoiding shared memory pitfalls.

3.2. Messaging limits fault dependences

When a multi-threaded program is developed using shared memory, the program is typically understood as a single fault domain. When any participating execution thread crashes during the execution of this parallel program, shared memory is left in an unknown state, which adversely affects the execution of all participating threads. In one common situation, a thread needs exclusive access to modify an important object and locks access to that object in shared memory. The thread then crashes before it completes the modification and unlocks the object. A critical resource is now permanently locked and unavailable for use by any other thread. Thus, the entire parallel program may deadlock after a single thread crashes.

Distributed parallel systems are routinely constructed using cooperating processes that exchange data through messaging. When one of the processes crashes, the effect of this failure on other processes is easier to understand. Messages can no longer be sent to or received from the faulty process, but normal execution continues for all of the other processes. Thus, messaging is often used to simplify process interactions within the design of fault tolerant systems. When messaging is layered over shared memory, we still face difficulties in fault isolation among processes that share memory through their use of messaging. But now, problems of shared memory fault detection and fault recovery are restricted to a smaller amount of carefully written code within a messaging library.

The construction of a fault tolerant messaging library is a critical goal that is not yet adequately addressed within our work. This is an area where traditional Ethernet hardware based messaging has significant advantages.

4. The ZMSG Messaging Architecture

We are developing the ZMSG messaging software architecture, which implements high performance lossless messaging for large-scale non-coherent shared-memory multiprocessors. The central role of ZMSG is to provide high performance inter-node communications that can transport messages across non-coherent shared memory and can be extended to incorporate node-to-node transport across Ethernet or other networks.

ZMSG defines a low-level API on which a number of higher-level communication services will be layered. Goals include high performance, scalability, and security. The ZMSG design is evolutionary in nature, and its early goals pursue a software-only approach for message delivery, using operations such as load, store, cache flush, and memory atomics that are supported by today’s multicore processors from multiple vendors.

4.1. Global Shared Memory Overview

ZMSG is designed for large non-coherent shared memory systems. For scalability, local communication within an SMPM and global communication across SMPMs are treated differently. For local communication, the improved performance and simplicity of a fully-coherent shared memory can be used. For communication among nodes, care must be taken to accommodate non-coherent shared memory limitations.

Figure 3 – Global Shared Memory. (The figure shows two SMPM nodes, each with a node manager, a kernel service, and user processes; user and kernel address spaces on each node map into a shared physical address aperture obtained from a global memory allocation service, and a cluster manager coordinates the nodes.)


Since no specialized NIC or other hardware is currently planned to assist ZMSG in its communication tasks, ZMSG runs using kernel processes that provide autonomous capabilities that can be overlapped with user execution. ZMSG dedicates one or more CPU cores within each multiprocessor node to kernel processes that provide the autonomous network capabilities that NIC hardware provides in a conventional system.

Figure 3 provides an overview of shared memory as used by the ZMSG system. A cluster manager controls interactions between multiple SMPM nodes. We define an aperture as a contiguous range of memory addresses that each process can use to access shared memory locations that other processes access through their own apertures. Software controls memory management hardware to create these apertures specifically for cross-node communication.

The cluster manager performs all tasks needed to coordinate the use of apertures among the ZMSG nodes. A shared physical address aperture can be mapped into each of the nodes to support communication between all nodes. One function of the cluster manager is to acquire a physical memory aperture from a global memory allocation service and to make the same aperture available to kernel software running on each of the ZMSG nodes. This provides non-coherent access to physical memory that can be shared among all of the nodes.

It should be possible to build a parallel and distributed cluster manager for very large clusters. To achieve adequate management performance, this may involve the introduction of hierarchical management or some other means to decompose cluster management into independent parallel tasks.

Each node runs a node manager, a ZMSG kernel service, and a number of user processes. The node manager is responsible for processing local management requests. This may involve correctness or security checks needed before setting up a connection. The kernel service provides shared services that are autonomous and not synchronized with user process execution. User processes are the typical endpoints for ZMSG communications.

The node manager is also responsible for opening up user and kernel virtual address apertures into the shared physical address aperture. A kernel address aperture allows protected access into data shared by many other systems. The user address aperture is created to support direct user-to-user communications that bypass all operating system interaction.
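As an illustration of what opening a user address aperture might look like, the sketch below maps a shared physical range into a user process with mmap(). The device name /dev/zmsg and the fixed offset are hypothetical assumptions; in the real system the node manager would choose and protect the mapping.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of how a user address aperture might be opened onto the shared
 * physical aperture.  "/dev/zmsg" and the zero offset are placeholders. */
int main(void)
{
    const size_t aperture_len = 2UL << 20;          /* e.g. a 2 MB window */

    int fd = open("/dev/zmsg", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void *aperture = mmap(NULL, aperture_len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0 /* offset chosen by node manager */);
    if (aperture == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* aperture now gives this process load/store access to memory that other
     * nodes reach through their own apertures onto the same physical range. */

    munmap(aperture, aperture_len);
    close(fd);
    return 0;
}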

4.2. Four quadrants of communication performance

The ZMSG library is designed for good performance in four communication extremes. We can then refine this architecture for additional performance improvements.

Figure 4 illustrates extreme behaviors in communication performance. Two columns separate simple two-party communication from more complex all-to-all communication. In two-party communication, a stream of messages is sent from a single transmitter to a single receiver. We call our solution architecture for this problem Bicomm. A simplified view of Bicomm is shown in Figure 5.


Figure 4 – Communication Extremes:

                 Two-Party           All-to-all
Short Message    Bicomm immediate    Datagram immediate
Long Message     Bicomm indirect     Datagram indirect

Each ZMSG interface supports both send and receive operations. For each forward communication path, we provide a return path, which is needed to support message acknowledgement and buffer release management. A client desiring a unidirectional port can simply send and not receive on that port, or receive and not send. ZMSG communications are in-order and lossless.

Figure 5 – Simplified View of ZMSG Bicomm. (Users 1 and 2 each expose a send interface and a receive interface; a receive queue in each direction connects the pair.)

In more complex all-to-all communications, each receiver must be prepared to receive a message from one of many senders. If each receiver were limited to using a separate Bicomm interface to connect to each of many senders, then receivers would need to poll many interfaces while looking for each inbound message. Instead, we provide a lossless Datagram interface which allows each receive interface to be used as a common destination for many senders. Figure 6 presents a simplified view of Datagram.

Figure 6 – Simplified View of ZMSG Datagram. (Users 1 through n each have send and receive interfaces with receive queues; a single receive interface serves as a common destination for many senders.)

The two rows in Figure 4 separate short-message from long-message communication. Short messages may be start or completion signals or other short command or data strings. Performance limits for short messages are usually measured as achieved low latency or achieved high message rate. Long messages are used to move a large volume of data between program threads. Here, performance is often measured as an achieved high data rate. ZMSG uses the immediate mode to move short messages and the indirect mode to move long messages. While the immediate mode passes the actual data contents through send and receive interfaces, the indirect mode passes a handle or pointer to the data instead of the data itself. The movement of data in the indirect mode is performed by a user copy loop, by DMA hardware, or by system software that mimics desired but missing DMA hardware.
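A rough way to picture the two modes is a message descriptor that either carries its payload inline or carries a handle to it. The layout below is purely illustrative and is not the actual ZMSG message format.

#include <stdint.h>

#define ZMSG_IMMEDIATE_MAX 256   /* illustrative limit for inline payloads (fabric MTU) */

enum zmsg_mode { ZMSG_IMMEDIATE, ZMSG_INDIRECT };

/* Hypothetical message descriptor: immediate messages carry their bytes,
 * indirect messages carry only a handle plus offset into shared memory. */
struct zmsg_desc {
    uint32_t mode;               /* enum zmsg_mode */
    uint32_t length;             /* payload length in bytes */
    union {
        uint8_t  inline_data[ZMSG_IMMEDIATE_MAX];  /* immediate mode */
        struct {
            uint64_t handle;     /* location-independent reference to the data */
            uint64_t offset;     /* starting offset within the referenced region */
        } indirect;              /* indirect mode: data is copied later by the
                                    receiver, by DMA hardware, or by a kernel
                                    thread standing in for missing DMA */
    } u;
};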

4.3. ZMSG security overview

We need an architecture that can ensure secure communications. Fine-grained access controls are needed to ensure the authenticity of communicating parties as well as the privacy of communications. Our goal is to engineer a system where security features may penalize the performance of the connection setup process but do not penalize message transfer performance after connection setup is complete.

Our architecture uses a software approach which assumes that while user software cannot be trusted, kernel software can be trusted. We assume that secure procedures will be developed to authenticate kernel software and ensure that kernel code is trusted. While future hardware may assist in providing additional security, it may not allow complex fine-grained security policies for a large number of logical communication endpoints. At this time, ZMSG relies on kernel-level protection for shared memory access to limit the scope of data access to authorized clients. ZMSG does not currently use encryption. Of course, any ZMSG client can encrypt data before sending it over ZMSG.

5. ZMSG’s Bicomm Two-Party Communications

Bicomm provides high performance two-party communication across ZMSG. This is done using OS-bypass interfaces which allow direct user-to-user communication without operating system overhead on each message send and receive. A secure protocol interacts with the ZMSG manager to provide access to a shared region that is mapped as an aperture into the address spaces of two users. Since trusted ZMSG kernel software provides two-party access to page-protected user memory, we can ensure that only authorized processes can access data through the shared user aperture provided by the ZMSG manager.

Each of the endpoint processes accesses the shared region with library software that implements lock-free queues using appropriate cache flush operations. A simplified Bicomm API provides easy-to-use send and receive message constructs that insulate the user from the low-level complexity of efficiently transferring cache lines across non-coherent shared memory.
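The sketch below shows roughly what such library code could look like: a single-producer, single-consumer ring living inside the shared aperture, with explicit flushes ordering the data ahead of the index update. The structure and the ncm_flush_range()/ncm_invalidate_range() primitives are assumptions, not the actual Bicomm implementation.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 256
#define NSLOTS    64            /* power of two keeps wrap-around cheap */

void ncm_flush_range(const void *addr, size_t len);
void ncm_invalidate_range(const void *addr, size_t len);

/* One direction of a Bicomm channel living inside the shared user aperture. */
struct bicomm_ring {
    volatile uint64_t head;              /* written by the sender only   */
    volatile uint64_t tail;              /* written by the receiver only */
    uint8_t slots[NSLOTS][SLOT_SIZE];
};

int bicomm_send(struct bicomm_ring *r, const void *msg, size_t len)
{
    ncm_invalidate_range((const void *)&r->tail, sizeof(r->tail));
    if (r->head - r->tail == NSLOTS || len > SLOT_SIZE)
        return -1;                               /* queue full or message too big */
    uint8_t *slot = r->slots[r->head % NSLOTS];
    memcpy(slot, msg, len);
    ncm_flush_range(slot, len);                  /* data first ...               */
    r->head++;
    ncm_flush_range((const void *)&r->head, sizeof(r->head));  /* ... then index */
    return 0;
}

/* Fixed-size slots are read here; real code would carry a length header. */
int bicomm_recv(struct bicomm_ring *r, void *msg, size_t len)
{
    ncm_invalidate_range((const void *)&r->head, sizeof(r->head));
    if (r->tail == r->head)
        return -1;                               /* queue empty */
    uint8_t *slot = r->slots[r->tail % NSLOTS];
    ncm_invalidate_range(slot, len);
    memcpy(msg, slot, len);
    r->tail++;
    ncm_flush_range((const void *)&r->tail, sizeof(r->tail));  /* release the slot */
    return 0;
}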

5.1. Bicomm Security

The protocol used to set up two-party Bicomm connections provides a first example of how all low-level ZMSG security mechanisms operate. Similar principles are used to secure ZMSG Datagrams and ZMSG RDMA apertures.

Figure 7 illustrates the procedure to set up a secure Bicomm connection. In step 1, user 1 creates a unique secret name for the new Bicomm connection. This can be done using a secure naming service or by generating a unique random name of sufficient length. User 1 then submits a “Create Bicomm” command, along with the secret name, to the ZMSG cluster manager, requesting a new Bicomm channel. A secure communication path to the cluster manager is required for this transaction. The ZMSG manager verifies that the secret name is not already in use and, if all is well, replies with a port handle that can be used for high performance messaging. In step 2, user 1 uses a secure key distribution mechanism (shown as “ESP” in Figure 7) to provide the secret name for user 1’s Bicomm channel to user 2 (and only to user 2). In step 3, user 2 submits a “Join Bicomm” command using the same name to the ZMSG manager. After checking that the referenced Bicomm already exists, the ZMSG manager creates the second Bicomm port and returns a port handle to user 2.

Setup is now complete, and either user is free to issue send and receive commands on their respective Bicomm ports. Either user can be confident that the parties sending or receiving messages on this Bicomm connection are the two parties that exchanged a secret name in step 2.
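In API terms, the three-step setup might look like the calls below. The function names (zmsg_create_bicomm, zmsg_join_bicomm, esp_send_secret, and so on) are hypothetical; the report does not define the exact programming interface.

#include <stddef.h>

/* Hypothetical ZMSG handles and calls standing in for the protocol above. */
typedef int zmsg_port_t;

zmsg_port_t zmsg_create_bicomm(const char *secret_name);   /* step 1 */
zmsg_port_t zmsg_join_bicomm(const char *secret_name);     /* step 3 */
int zmsg_send(zmsg_port_t port, const void *msg, size_t len);
int zmsg_recv(zmsg_port_t port, void *msg, size_t len);
void esp_send_secret(const char *peer, const char *secret); /* step 2 */
void esp_recv_secret(char *secret, size_t len);

/* User 1 side: create the Bicomm under a fresh secret name, hand the name
 * to user 2 over a secure channel, then message freely. */
void user1(void)
{
    const char *secret = "bicomm-7f3a9c";     /* in practice: unique and random */
    zmsg_port_t port = zmsg_create_bicomm(secret);
    esp_send_secret("user2", secret);
    zmsg_send(port, "hello", 5);
}

/* User 2 side: learn the secret out of band, join, and receive. */
void user2(void)
{
    char secret[64];
    char buf[64];
    esp_recv_secret(secret, sizeof(secret));
    zmsg_port_t port = zmsg_join_bicomm(secret);
    zmsg_recv(port, buf, sizeof(buf));
}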

Each Bicomm connection uses a memory region that is mapped into the physical address space of the SOCs for both communicating endpoints. Some subset of that memory region is mapped as a Bicomm port into the user address space for both communicating processes.

Figure 7 – Bicomm secure connection setup. (Step 1: user 1 code sends “Create Bicomm” with a unique secret name to the ZMSG manager and receives a port handle for the first Bicomm port. Step 2: the secret name passes to user 2 via ESP. Step 3: user 2 sends “Join Bicomm” with the same name and receives a port handle for the second Bicomm port. Both ports share lock-free queues in a 2-user page within the all-kernel shared memory region obtained from the global memory manager, and the users then exchange messages with Send()/Rcv().)

6. ZMSG’s Datagram Service for Multi-Way Communications

6.1. Datagram overview

The ZMSG kernel Datagram service provides the most flexible messaging (along with the largest overhead) among our ZMSG communication interfaces. Its flexibility arises from a user-to-kernel-to-user architecture. By passing all messages through trusted kernel software, we gain two powerful capabilities. First, shared queues are accessed by trusted kernel code. This contributes to both the reliability and the security of the shared Datagram service. Second, the kernel can provide autonomous hardware DMA acceleration that is not synchronized with the sending or receiving processes. This is needed for autonomous transport such as RDMA.

Figure 8 – ZMSG Datagram Overview. (On each node, user processes reach logical send/receive queues through Lports in a user library. A kernel send service uses a remote Pport map to push outgoing messages across the non-coherent shared memory fabric to other Pports, while a kernel receive service drains the node’s physical receive queue, using a local Lport map, into Lport receive queues held in coherent shared memory.)

The kernel Datagram service implements lossless Datagram-style messaging. A single physical port (Pport) and multiple logical ports (Lports) can be deployed on each node. Figure 8 provides an overview of a node with its Pport and its Lports. Each physical port provides a receive queue that can be used as a destination for message insertion by many remote Pport senders. Physical receive queues are accessible only by the kernel and provide a foundation for message transport across the non-coherent memory fabric.

When a user places a message into an Lport send queue, a kernel send service observes the arrival of the message and begins processing. The identity of the destination physical port is determined, and a remote physical port map is used to identify the address location of the remote Pport. The send service on this node enqueues the data into the physical receive queue of the remote Pport. This enqueue transaction is carefully designed to support multi-threaded insertion over non-coherent shared memory. The message is now processed by the remote receive service. When a message is deposited into a physical receive queue, the receive service observes the message arrival and begins to process the message. The destination Lport identifier is extracted from the message, a lookup is performed to find the address location of the Lport’s receive queue, and the message is delivered. The message is now available in a receive queue within the receiving user’s virtual memory.
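The two kernel services can be pictured roughly as the polling loops below. All types and helper functions here are illustrative stand-ins for the real ZMSG kernel structures, not their actual definitions.

#include <stdint.h>

/* Illustrative types: the real queues live in (non-)coherent shared memory. */
struct msg       { uint32_t dst_pport, dst_lport; uint8_t data[256]; };
struct lport     { struct msg *send_q, *recv_q; };
struct pport_rxq { struct msg *slots; };

struct msg  *lport_poll_send(struct lport *lp);                      /* next outgoing msg */
void         pport_enqueue(struct pport_rxq *remote, struct msg *m); /* cross-fabric push */
struct msg  *pport_poll_recv(struct pport_rxq *local);               /* next arrived msg  */
void         lport_deliver(struct lport *lp, struct msg *m);         /* into user memory  */
struct pport_rxq *remote_pport_lookup(uint32_t pport_id);
struct lport     *local_lport_lookup(uint32_t lport_id);

/* Kernel send service: drain local Lport send queues into remote Pports. */
void kernel_send_service(struct lport *lports, int nlports)
{
    for (;;) {
        for (int i = 0; i < nlports; i++) {
            struct msg *m = lport_poll_send(&lports[i]);
            if (m)
                pport_enqueue(remote_pport_lookup(m->dst_pport), m);
        }
    }
}

/* Kernel receive service: drain the physical receive queue into local Lports. */
void kernel_receive_service(struct pport_rxq *local)
{
    for (;;) {
        struct msg *m = pport_poll_recv(local);
        if (m)
            lport_deliver(local_lport_lookup(m->dst_lport), m);
    }
}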

6.2. Protocol to add secure Pports

A ZMSG ring is defined as a collection of Pports that communicate with each other. The ring is used to establish kernel-level communication between multiple nodes. Each Pport on a ring can communicate with any other Pport on that ring. More than one ring can be defined, as might occur when a datacenter is partitioned among multiple tenants. Shared memory hardware capabilities may limit each node’s access to specific rings. When ZMSG is initialized, a secure protocol ensures that kernel access to each ZMSG ring is authorized.


Figure 9 – Pport Creation Protocol. (Step 1: the cluster manager sends a “Create Ring” command with a ring name to the ZMSG manager (ZMGR). Step 2: the ring name is distributed via ESP to the kernel module on each authorized node. Step 3: each kernel module submits a “Create Port” command with the ring name and its Pport name and receives a port handle; the resulting Pports form a physical clique over the non-coherent memory.)

Figure 9 illustrates the steps used to create a ZMSG ring and ensure secure access to it. In step 1, a cluster manager creates a unique (possibly secret) ring name. A “Create Ring” command is sent to the ZMSG manager along with the suggested ring name, and a reply indicates success. Step 2 uses a key exchange mechanism (shown as “ESP”) to distribute the ring name over a secure channel to each kernel entity that is authorized to access the ring.

In step 3, each of the kernel entities independently creates its own Pport. Each kernel entity chooses a Pport name that is unique among Pports on the ring and submits a “Create Pport” command with ring and Pport name parameters to the ZMSG manager. The ZMSG manager creates the Pport and returns a Pport handle to the requesting kernel entity.

6.3. Protocol to add secure Lports

After each kernel entity has created a Pport, any client process that needs ZMSG communication services can create an Lport, which serves as an endpoint for Datagram communications. Figure 10 shows the Lport creation protocol. A client wishing to create an Lport submits a unique (possibly secret) Lport name, along with a resource request, to its Pport. The Pport replies with an Lport handle which can be used for subsequent high performance communication. The resource request specifies the size of the requested receive buffer. Large receive buffers are needed for large messages and for high fan-in communication when many senders may send simultaneously to the same receive port.

While this completes the creation of an Lport, no communication with any other Lport has been authorized. Establishing an actual communication capability is shown in the next section.


Figure 10 – Lport Creation Protocol. (User code submits a “Create Logical Port” request with an Lport name and a resource request to its Pport and receives an Lport handle; many Lports, polled through coherent memory, can share one Pport on the physical clique.)

6.4. Adding an Lport-to-Lport Connection

A connection can be added between any two Lports that are attached to Pports on the same ring. Only a single connection can be added between a pair of Lports. Establishing a connection provides both an authorization to communicate and the resources (credits) needed to communicate without potential data loss.

Figure 11 – Adding an Lport connection. (Step 1: the two clients exchange Pport and Lport names via ESP. Step 2: one client submits an “Open Remote Conn” request to its Lport, naming the remote Pport and Lport along with a resource request, and receives a local remote ID (RemID). Step 3: the other client performs the mirrored request on its side of the physical ring.)

Figure 11 illustrates the steps needed to create a connection between a pair of Lports. Connection creation begins in step 1, when both clients use a key exchange mechanism (shown as “ESP”) to exchange Pport and Lport names as a means of authorizing remote access. In step 2, a client submits an open remote connection request to its Lport. The request provides the name of the remote Pport, the name of the remote Lport, and a resource request to obtain the buffer credits needed to send on this connection. The open connection request returns a local ID for the secure remote port. This local ID serves as a destination address for a send operation or as a trusted source address for a receive operation. The other client duplicates the step 2 procedure (on the other side of the connection) in step 3. After all steps are complete, bidirectional communication is established between the clients that exchanged keys in step 1.
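A hypothetical client-side view of this three-step setup is sketched below; the call names and types are assumptions used only to make the sequence concrete.

#include <stddef.h>

/* Hypothetical handles and calls for the three-step connection setup above. */
typedef int lport_t;
typedef int conn_id_t;     /* local ID naming the authorized remote endpoint */

conn_id_t zmsg_open_remote_conn(lport_t lp,
                                const char *remote_pport_name,
                                const char *remote_lport_name,
                                size_t send_credits);
void esp_exchange_names(const char *my_pport, const char *my_lport,
                        char *peer_pport, char *peer_lport, size_t len);

void client_side(lport_t my_lport, const char *my_pport_name,
                 const char *my_lport_name)
{
    char peer_pport[64], peer_lport[64];

    /* Step 1: both clients exchange Pport/Lport names over a secure channel. */
    esp_exchange_names(my_pport_name, my_lport_name,
                       peer_pport, peer_lport, sizeof(peer_pport));

    /* Step 2 (and, mirrored on the peer, step 3): open the remote connection,
     * requesting enough buffer credits to send without loss.  The returned
     * local ID is the destination for sends and the trusted source on receives. */
    conn_id_t peer = zmsg_open_remote_conn(my_lport, peer_pport, peer_lport, 16);
    (void)peer;
}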


6.5. Credit-based flow control

ZMSG supports lossless Datagram messaging, and when a stream of messages is sent, we must guarantee that each message will be delivered. When many sending Lports transmit messages to the same receiving Lport, a mechanism is needed to delay senders that compete for access to a shared receiver which has limited receive bandwidth and buffer memory. Such lossless transport requires end-to-end flow control.

It is critical that any messages that have been deposited into the Pport ring can be drained into target Lport buffers to prevent physical port congestion from causing a loss of service among users who compete for physical Pport access. Credit-based flow control limits the rate at which new messages can be submitted to each sending Lport so that the physical ring can always be drained and every message in a Pport receive queue can be moved into its destination logical port without dropping messages. Credits are managed in kernel software to ensure congestion-free transmission through the shared physical ring.

Each Lport has a receive queue that provides a fixed number of message slots that are allocated when the Lport is created. Each slot can contain a single message whose maximal size (in immediate mode) is limited to the number of bytes specified by the fabric MTU. When a receive buffer is shared among many senders, each sender needs a credit guaranteeing an empty buffer slot at the remote target receiver before the message is accepted by the ZMSG message server and sent to that receiver. When the message is sent, the sender’s pool of available credits is decremented by one. When the message is delivered, processed, and de-allocated, the credit is returned to the sender from which the message came. This refreshes the sender’s pool of credits and allows a continuous stream of messages from senders to receivers.
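A minimal sketch of the credit accounting, assuming a simple per-connection counter maintained in kernel software, is shown below; it is illustrative only.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative per-connection credit state: one credit per guaranteed empty
 * slot at the remote Lport's receive queue. */
struct zmsg_conn {
    uint32_t credits;
};

/* A send is accepted only if a credit (a free remote slot) is available. */
bool try_send(struct zmsg_conn *c)
{
    if (c->credits == 0)
        return false;        /* sender must wait: no guaranteed buffer space */
    c->credits--;            /* slot is now reserved for the in-flight message */
    /* ... enqueue the message toward the remote Pport here ... */
    return true;
}

/* When the receiver consumes and frees the slot, the credit flows back. */
void on_credit_return(struct zmsg_conn *c)
{
    c->credits++;
}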

Dynamic adjustments to the number of credits that each sender controls are possible as each sender’s need changes, but this feature is not yet implemented.

In the future, receive buffer allocation and credit management could consider the size of each message. Currently, all messages are considered as maximal in size.

6.6. Cross-node signaling

The ZMSG kernel Datagram service can support signaling across a non-coherent memory fabric among multiple SOCs. A client can send a signaling message with a destination address that specifies an Lport. The ZMSG kernel receive service processes the received message from the Pport and deposits a message into the Lport receive queue to provide metadata associated with the signal. The kernel service thread can also send a Linux signal to a client thread which implements a signal handler and may be suspended waiting for that signal.

While our current ZMSG API is not faithful to the InfiniBand Verbs specification, our signaling API tries to follow the InfiniBand Verbs model.

Current software supports a completion queue as a component of each Lport. Each completion action associated with an Lport inserts a completion queue entry into the Lport’s completion queue. A completion channel can be associated with the Lport. A process can poll the completion queue and, after polling an empty queue for some period of time, the process can wait for a signal (block) on the associated notification channel. This reduces the CPU cycles wasted by a user process polling an empty input channel. After a signal notifies the user process of a newly received message, the process resumes polling to receive additional messages that are now present.
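The poll-then-block pattern might look like the sketch below, written against a hypothetical verbs-like completion-queue API; the polling budget and call names are assumptions.

#include <stdbool.h>

/* Hypothetical completion-queue API in the spirit of InfiniBand Verbs. */
struct completion { int opcode; int status; };

bool cq_poll(struct completion *out);   /* non-blocking: true if an entry was read */
void cq_wait_for_event(void);           /* block until the channel is signaled     */

/* Poll for a while to catch bursts cheaply, then block to avoid burning cycles. */
void receive_loop(void)
{
    struct completion c;
    for (;;) {
        int spins = 0;
        while (!cq_poll(&c)) {
            if (++spins > 10000) {       /* illustrative polling budget */
                cq_wait_for_event();     /* sleep until a signal arrives */
                spins = 0;
            }
        }
        /* ... handle the completed message described by c ... */
    }
}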

6.7. Autonomous Transport (RDMA)

The ZMSG kernel Datagram service is an appropriate platform for the deployment of autonomous data transport such as RDMA. A major obstacle arises for platforms that do not have memory mover (DMA) hardware. While a kernel software thread of execution can be substituted for missing DMA hardware, a single software thread must provide copy support for multiple Lport clients. The performance of a shared copy service implemented using a single thread of execution may be disappointing.

We advocate, and look forward to, the incorporation of powerful DMA hardware into architectures supporting ZMSG.

7. ZMSG User-to-User Datagram

A faster and simpler version of the Datagram service is also planned but not yet implemented. This service is called the user-to-user Datagram interface. Since this service eliminates kernel intervention, it can provide lower latency between its client ports. However, without kernel intervention, the service cannot implement autonomous RDMA transport and, since all code is untrusted user code, the interface cannot enforce selective security among its ports.

A single shared memory region is mapped into the user virtual address space of a set of user processes. Each client invokes user library code to manipulate non-coherent data structures within this shared region. The invocation of a send command identifies the address of the destination receive queue, checks the receive queue for a full condition and, if there is sufficient space, pushes the new message into the queue. A receive command returns the message at the head of the receive queue.

Here, the untrusted user code running at each SOC can perform arbitrary modifications on data structures in the shared region. Clients that share a memory region must trust each other to behave properly. The architecture cannot prevent denial of service or selectively allow transport among pairs of ports, and messages are subject to falsification of their source.

The simplicity of the user-to-user Datagram service eliminates costly messaging overhead. Unlike the kernel Datagram service, each message is sent directly from a sending client thread to a receiving client thread without processing by any intermediate thread.

A similar kernel-to-kernel Datagram service can also be defined. This is designed exactly as the user-to-user Datagram, except that all clients are kernel execution threads. This can be used to provide a general kernel-to-kernel messaging facility.

8. Indirect messaging modes

The ZMSG indirect mode supports the transport of one or more object handles within a message from a sending process to a receiving process. The sending and receiving processes potentially run on distinct nodes running separate OSs. A handle is a reference to a contiguous region of shared memory that can be translated into a starting virtual address by each process that references that region. Any word within the region can be addressed using an offset from the beginning virtual address identified by the handle.

When a handle is initialized, a means is provided (e.g., a function or table entry) to translate the handle into a local virtual address suitable for accessing the object. When a region is shared across multiple nodes, a location-independent handle can be sent between nodes and translated into a local virtual address that is used to load or store data inside the region.
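A minimal sketch of such a handle and its translation, assuming a per-node table of mapped regions, is shown below; the exact representation used by ZMSG may differ.

#include <stdint.h>
#include <stddef.h>

/* Illustrative location-independent handle: a region identifier plus offset.
 * Each node keeps its own table of where it mapped each shared region. */
struct zhandle {
    uint32_t region_id;
    uint64_t offset;        /* byte offset from the start of the region */
};

struct region_map {
    void  *local_base;      /* this node's virtual mapping of the region */
    size_t length;
};

extern struct region_map region_table[];   /* populated when regions are mapped */

/* Translate a handle received in a message into a local pointer. */
static inline void *zhandle_to_ptr(struct zhandle h)
{
    struct region_map *r = &region_table[h.region_id];
    if (h.offset >= r->length)
        return NULL;                       /* offset outside the shared region */
    return (uint8_t *)r->local_base + h.offset;
}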

The rest of this section focuses on the use of handles for indirect-mode communication between concurrent threads that communicate across multiple nodes in a load-store domain. When using handles, the actual copying of data (if necessary) can be performed by each user process rather than by a messaging service. Since each node runs many threads in parallel, copying can be performed with more parallelism than would be possible if a single kernel thread were used to implement a shared (software) DMA copy service.

ZMSG messages can be used to support either a send-side or receive-side update model described in more detail below. Each of these communication models communicates across a load-store domain by making a change to shared data and informing one or more remote nodes about the completion of this data update.

8.1. ZDMA – RDMA for ZMSG

Figure 12 illustrates pseudocode for an RDMA-style interface, which we call ZDMA, built on ZMSG to provide system-wide communication. This augments messaging as implemented by IP sockets, MPI, or some other messaging library. Unlike messaging, ZDMA provides address-based remote memory access that can be used across large multiprocessor systems. Our ZDMA interface can be hosted on a variety of platforms, since low-level RDMA transport can be implemented using coherent shared memory, non-coherent shared memory, or a networked cluster using RDMA hardware (such as RoCE or InfiniBand).

The pseudocode shows the concepts of local buffer registration, key exchange, and remote buffer registration that are needed to enable a connection which allows fast get and put access across a non-coherent shared memory cluster. Two SMPM machines lie within a larger SMPEXM cluster. The left-hand machine holds a B1 buffer that is local to its memory. In a first step, the B1 buffer is allocated by a function that returns its address (B1A). A local registration then defines the B1 buffer as a buffer that can be remotely manipulated using get and put operations. This registration gives a cluster-wide memory manager information about the buffer’s global location within the cluster. Registration ensures that the physical location of the B1 buffer remains invariant so that cross-OS access (which is unaware of any paging operations) can safely reference the data. The local registration returns a unique handle (B1H) that globally identifies the buffer. The handle is sent through a message communication channel (e.g., using a TCP socket) to a process running on the SMPM2 machine.


Figure 12 (ZDMA pseudocode)

SMPM1 (OS1), which holds the B1 buffer:

B1A = Alloc(B1len)
B1H = L_Register(B1A, B1len)
MSG_Send(B1H)
Store(B1A+offsetx, data)
Load(B1A+offsety, data)

SMPM2 (OS2), which holds the B2 buffer:

B2A = Alloc(B2len)
B2H = L_Register(B2A, B2len)
B1H = MSG_Rcv()
(B1RA, B1len) = R_Register(B1H)
Put(B2A, offset1, B1RA, offset2, len)
Get(B1RA, offset3, B2A, offset4, len)

Both machines register their buffers with the SMPEXM Memory Manager.

The pseudocode for the right-hand machine illustrates remote access into SMPM1’s B1 buffer. The right-hand machine allocates and registers a B2 buffer for use in subsequent get and put operations. In order for SMPM2 to access SMPM1’s B1 buffer, a handle exchange is performed: SMPM2 reads the handle value using conventional messaging. The B1H handle value can then be used in a remote registration call to acquire the remote address (B1RA) and length. While such remote addresses can be manipulated directly with non-coherent load and store operations, these low-level operations are too complex for most users. Instead, Get and Put library functions are provided to simplify this process.

After both buffers are allocated and registered, get and put operations use handles to provide fast user-mode access to remote memory over the non-coherent fabric. Our pseudocode shows this remote access using a local update loop, which updates buffer B1 on the left, and a remote access loop, which copies data between B1 and B2 on the right. The local update loop on SMPM1 accesses local memory using conventional coherent load and store operations through a pointer to B1 (B1A) in local memory. SMPM2 accesses remote memory using get and put operations that retrieve data from or insert data into remote memory.

After all buffer registrations have been performed, fast user-mode get and put operations can be used to randomly access data within a remote buffer. Local and remote nodes are running asynchronously and additional tools are used to synchronize the exchange of data when needed.
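Under the hood, Get and Put can be little more than copy loops over fabric-reachable addresses plus the required cache management. The sketch below is one possible implementation, again using the hypothetical ncm_flush_range()/ncm_invalidate_range() primitives; it is not the actual ZDMA code.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

void ncm_flush_range(const void *addr, size_t len);
void ncm_invalidate_range(const void *addr, size_t len);

/* A registered buffer as a library might describe it: remote registration
 * yields a base address (e.g. B1RA) that is load/store reachable across the
 * non-coherent fabric.  These wrappers are a sketch only. */
struct zdma_buf {
    uint8_t *base;      /* locally usable address for the registered buffer */
    size_t   length;
};

/* Put: copy from a local buffer into a remote registered buffer, then flush
 * so the data leaves the local caches and lands in fabric memory. */
int zdma_put(struct zdma_buf *local, size_t loff,
             struct zdma_buf *remote, size_t roff, size_t len)
{
    if (loff + len > local->length || roff + len > remote->length)
        return -1;
    memcpy(remote->base + roff, local->base + loff, len);
    ncm_flush_range(remote->base + roff, len);
    return 0;
}

/* Get: invalidate any stale cached copy of the remote range, then copy it in. */
int zdma_get(struct zdma_buf *remote, size_t roff,
             struct zdma_buf *local, size_t loff, size_t len)
{
    if (loff + len > local->length || roff + len > remote->length)
        return -1;
    ncm_invalidate_range(remote->base + roff, len);
    memcpy(local->base + loff, remote->base + roff, len);
    return 0;
}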


8.2. Data Ownership

A non-coherent update occurs when distinct nodes run execution threads that concurrently update a shared object stored within a non-coherent shared memory. These distributed updates lead to two serious problems. First, performance is lost when mutually-exclusive updates are enforced across non-coherent shared memory: slow cross-fabric atomics are used to lock data objects for exclusive modification. Second, when any thread of execution crashes, any shared object that the crashed thread may have modified could be permanently left in a corrupt state. This makes it difficult to contain faults and provide program fault tolerance.

Per-node data-write ownership is used to eliminate distributed updates across a non-coherent shared memory. We say that a data object is owned by a single node if that node, and only that node, modifies data within the object. If each data object has a unique owner, then distributed updates are no longer needed. All requests to update owned data are executed within a single (multiprocessor) node, allowing the use of more efficient local and coherent atomic operations. When a process within a node fails, data owned by that node may be left in a corrupted state, but data owned by all other nodes has not been corrupted. Developing techniques to promote data ownership can simplify fault containment and produce systems that are more tolerant of failures.

8.3. Send-side Update (no ownership)

In a send-side update, a sender updates globally shared data using store and flush operations executed on the sending node before an immediate-mode message or signal is sent from the sender to one or more receivers to inform them that the update is complete. A receiver can poll for update messages at a receive port, or the receiver can block while waiting for a signal indicating update completion. For example, much like an RDMA put, a sending thread could copy data to update a remote object and then use ZMSG messaging to signal the completion of the update.

Without clear rules for data write ownership, fault tolerance may suffer. When node hardware or software fails, any object that can be modified by that node may be left in a corrupt state. Without clear ownership rules this may include almost any object in the system.

8.4. Receive-side Update (owned by receiver)

In a receive-side update, a sender decides to update shared data that it does not own. The sender sends a message to the unique owner of that data. The message requests that the receiver perform an update action on an object described by its object handle. The update is completed at the receiver and then acknowledged as complete by returning a message to the sender.

Any node can update any object: it sends a remote-update request to the object's owner, and it self-updates objects that it owns. Under this protocol, each data object is owned by a single node that is responsible for all updates to that object. As a result, when any node's software or hardware fails, all data owned by other nodes can be assumed to remain in a correct state.
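The sketch below outlines the owner-performed update. The request format and the handle-as-index encoding are assumptions, and message delivery is collapsed into a direct function call so that the example stands alone.

/* Receive-side (owner-performed) update sketch.  The request format and
 * handle encoding are assumptions; message transport is collapsed into a
 * function call so the example is self-contained. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t object_handle; uint64_t new_value; } update_req;

static uint64_t owned_objects[16];   /* written only by the owning node */

/* Owner side: apply the requested update using local, coherent operations,
 * then (in a real system) return an acknowledgment message to the sender. */
static void owner_handle_request(const update_req *r) {
    owned_objects[r->object_handle] = r->new_value;
}

int main(void) {
    /* Non-owner side: never writes the object directly; it builds a request
     * that would be sent over the fabric to the object's unique owner. */
    update_req req = { .object_handle = 3, .new_value = 99 };
    owner_handle_request(&req);      /* emulates delivery to the owner node */
    printf("object 3 now holds %llu\n", (unsigned long long)owned_objects[3]);
    return 0;
}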


9. Appendix 1: NFV Use Case

In this section, we provide an example NFV use case for Figure 2's SMPEXM. The use case explains how NFV service chaining can be accelerated across a non-coherent shared-memory cluster. Our service chaining example uses a software library to insulate higher-layer applications from system-dependent details of low-level communications and memory management.

Shared memory is especially useful for manipulating data structures that rarely change, and network function virtualization (NFV) provides a good example. Conventional systems can use pointer-based packet processing for NFV within the limited scale of a single coherent shared-memory multiprocessor node. We are interested in performing NFV processing at a much larger, multi-node scale across a non-coherent memory interconnect.

In our simplified example, an NFV processing system is built by replicating a set of communicating NFV functions. Each function is implemented as software running on general-purpose hardware and is able to process a stream of packets by performing actions such as dropping the packet, replicating the packet, sending the packet to a specific successor function, or (if necessary) inspecting data inside the packet to make detailed decisions.

In our example, a stream of packets is represented by a metadata list whose elements contain the information needed to process the corresponding packets. The metadata within an element includes important information extracted from the packet as well as a packet handle that can be used to access the full packet data from any processing node within an SMPEXM system. Using this representation, a packet stream can be manipulated by copying packet metadata without copying actual packet data. In addition, if any virtual function needs to inspect actual packet data, that data is available through a get operation using the packet handle. Packet metadata is smaller than full packet data and is often copied and manipulated within processor caches, greatly accelerating packet processing.
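A possible C layout of such a metadata element is sketched below. The report does not fix the exact fields, so everything other than the packet handle and the list linkage is an illustrative assumption.

/* Illustrative metadata element for the packet-stream representation.
 * The exact fields are not specified in the report; this layout is an
 * assumption.  The packet handle is the fabric-wide name used with get
 * and put to reach the full packet data from any node in the SMPEXM. */
#include <stdint.h>

typedef struct packet_metadata {
    uint64_t packet_handle;          /* names the registered buffer holding the packet  */
    uint16_t packet_len;             /* original frame length in bytes                   */
    uint32_t src_ip, dst_ip;         /* extracted header fields used by filter/redirect  */
    uint16_t src_port, dst_port;
    uint8_t  action_flags;           /* drop / replicate / forward decisions so far      */
    struct packet_metadata *next;    /* linkage within the metadata list                 */
} packet_metadata;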

To process a large volume of packets, the processing power of multiple virtual functions running on many nodes is applied to the stream. Each function acquires one or more input packet streams and can send processed output on one or more streams to successor NFV functions, while primarily inspecting and modifying metadata rather than detailed packet data.

Figure 13 shows a toy NFV example that illustrates the advantages of non-coherent memory-based communications. A number of NFV functions are chained across four SMPMs within a single SMP extended machine. At the perimeter of each extended machine, the system interfaces to the outside world using traditional Ethernet packet streams that connect to other extended machines or to other networking hardware. When more than one extended machine is used in an NFV solution, traditional Ethernet networking (network-based NFV) is used to connect the systems.

Within the scope of each SMPEXM we can use get and put operations (supported by our ZDMA library) to quickly access packet data across the cluster (non-coherent memory-based NFV). This extends the utility of address-based memory access beyond the conventional SMP. Within each SMP machine, coherent communications can access data even more quickly and can exploit the additional efficiency offered by short-distance memory access and the SMP cache (coherent memory-based NFV).

Figure 13: A toy NFV service chain mapped onto four SMP machines (SMPM1-SMPM4) within one SMPEXM. Packet input, filter, and redirection processors on SMPM1 exchange metadata command streams with SPI and DPI service processors on SMPM2 and SMPM3 and with the filter, output-redirection, and packet-output blocks on SMPM4. The figure labels contrast network-based NFV at the Ethernet perimeter with non-coherent memory-based NFV across the SMPEXM and coherent memory-based NFV within each SMPM, and it annotates the cache footprint estimate F ≈ PR × PF × ∆T (PR = packet rate), noting that keeping ∆T short has big benefits.

Consider the left-hand processing block, SMPM1, which processes a stream of packets through four pipelined NFV functions. The input block interfaces to a standard Ethernet device and reads a stream of actual packets from an external input. Packet data is stored in a packet buffer so that it can be remotely accessed when needed. The input block also produces a metadata stream that carries the critical information about each packet needed for downstream processing. This metadata stream represents a packet-processing request stream that is forwarded throughout the processing pipeline. Each local block in SMPM1 can freely access packet data directly using ordinary processor loads from memory.

Each SMP machine provides hardware resources used to process packet streams, including a limited number of CPU cores, limited main-memory bandwidth, and a limited cache size. Each of these resources imposes a critical performance bottleneck and limits the device's parallel processing performance.

With conventional packet processing, the packet footprint, or the amount of packet data that is actually processed, includes the entire packet. As packet data is copied from one NFV function to the next, it moves through the CPU cache and displaces previously cached data while consuming cache resources. If we view the packet footprint as a resource, we can estimate the total cache footprint as F ≈ PR × PF × ∆T, where PR is the packet rate in packets per second, PF is the average per-packet footprint in bytes, and ∆T is the time delay between a packet's entry into and departure from the NFV pipeline. If both the time delay and the packet size are large, the total cache footprint exceeds the size of the cache and attempts to cache the packet stream fail. This results in slow cache-miss reads and CPU stalls as the stream is processed in each step of the NFV pipeline.
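As a rough worked example (the numbers are illustrative assumptions, not measurements from this report): at PR = 10 million packets per second, PF = 1500 bytes of full packet data, and ∆T = 1 millisecond of pipeline residency, F ≈ 10^7 × 1500 × 10^-3 = 15 MB, which exceeds a typical last-level cache. With metadata-only processing at PF = 64 bytes and the same PR and ∆T, F ≈ 640 KB, which fits comfortably in cache.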


When processing is optimized so that only packet metadata is copied through the pipeline, the per-packet footprint can be much smaller. This improves efficiency and performance because metadata is accessed from the cache rather than whole packets from main memory.

Two additional NFV blocks shown in SMPM1 are the filter-processing and redirection-processing blocks. A filter block consults a filter table to decide whether a packet is forwarded or dropped. A redirection block consults its table to decide where packet metadata should be forwarded and whether multiple copies of the metadata should be sent to multiple locations.
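A minimal sketch of these two steps, operating on metadata only, is shown below. The table formats, match fields, and queue model are assumptions, since the report does not define them.

/* Filter and redirection over metadata only; table formats and match
 * fields are assumptions.  Full packet data is never touched here. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t dst_ip; uint64_t packet_handle; } meta;                  /* minimal metadata   */
typedef struct { uint32_t dst_ip; int drop; } filter_entry;                        /* assumed filter row */
typedef struct { uint32_t dst_ip; int out_queue[2]; int copies; } redirect_entry;  /* copies <= 2 here   */

/* Filter block: consult the filter table; return 1 to forward, 0 to drop. */
static int filter(const meta *m, const filter_entry *tbl, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (tbl[i].dst_ip == m->dst_ip)
            return !tbl[i].drop;
    return 1;                                    /* default: forward */
}

/* Redirection block: enqueue one metadata copy per configured destination. */
static void redirect(const meta *m, const redirect_entry *tbl, size_t n,
                     void (*enqueue)(int queue, const meta *m)) {
    for (size_t i = 0; i < n; i++)
        if (tbl[i].dst_ip == m->dst_ip)
            for (int c = 0; c < tbl[i].copies; c++)
                enqueue(tbl[i].out_queue[c], m);
}

static void print_enqueue(int queue, const meta *m) {
    printf("metadata for handle %llu -> queue %d\n",
           (unsigned long long)m->packet_handle, queue);
}

int main(void) {
    filter_entry   ftab[] = { { 0x0a000001u, 1 } };           /* drop traffic to 10.0.0.1      */
    redirect_entry rtab[] = { { 0x0a000002u, { 0, 1 }, 2 } }; /* replicate traffic to 10.0.0.2 */
    meta m = { 0x0a000002u, 42 };                             /* dst_ip and packet handle      */
    if (filter(&m, ftab, 1))
        redirect(&m, rtab, 1, print_enqueue);
    return 0;
}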

The SMPM2 and SMPM3 blocks show shallow and deep packet inspection (SPI and DPI) services that use multiple CPU cores to perform CPU-intensive tasks in parallel. Note that packet-inspection tasks can use global shared memory to modify the filter and redirection tables of upstream or downstream packet-processing blocks. This uses shared memory as a tool to implement global control over packet flows.

The SMPM4 block implements an output processing service. After filtering and redirection, packets are reassembled into an explicit full-packet format before each packet is output through an Ethernet NIC for network-based NFV processing. Note that this block can use get operations to directly access raw packet data across the high-performance non-coherent shared memory, as needed, to produce the required output packet frames.

10. References

[Bresniker] Kirk M. Bresniker, Sharad Singhal, and R. Stanley Williams, "Adapting to Thrive in a New Economy of Memory Abundance," IEEE Computer, Dec. 2015.

[Hennesy] John Hennessy, Mark Heinrich, and Anoop Gupta, "Cache-Coherent Distributed Shared Memory: Perspectives on Its Development and Future Challenges," Proceedings of the IEEE, Vol. 87, No. 3, March 1999.

[Hwang] Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood, "NetVM: High Performance and Flexible Networking Using Virtualization on Commodity Platforms," Proc. USENIX NSDI, 2014.

[Laudon] James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," ISCA, 1997.

[Novakovic] Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, and Boris Grot, "Scale-Out NUMA," ASPLOS, 2014.

[Packard] Keith Packard, "A look at The Machine," https://lwn.net/Articles/655437/

[Schlansker] Michael Schlansker, Nagabhushan Chitlur, Erwin Oertli, Paul Stilwell, Linda Rankin, Dennis Bradford, Richard Carter, Jayaram Mudigonda, Nathan Binkert, and Norm Jouppi, "High-performance Ethernet-based communications for future multi-core processors," Proc. SC '07, 2007.

[Vienne] Jerome Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, Hari Subramoni, and Dhabaleswar K. (DK) Panda, "Performance Analysis and Evaluation of InfiniBand FDR and 40 GigE RoCE on HPC and Cloud Computing Systems," Hot Interconnects (HOTI) 20.


[Yang] Xiaojun Yang, Fei Chen, Hailiang Cheng, and Ninghui Sun, "A HyperTransport-Based Personal Parallel Computer," IEEE International Conference on Cluster Computing, 2008.