
RZ 3845 (#ZUR1607-030) 07/13/2016 Computer Sciences 11 pages

Research Report

jVerbs: RDMA support for Java®

Patrick Stuedi, Bernhard Metzler, Animesh Kumar Trivedi

IBM Research – Zurich

8803 Rüschlikon

Switzerland

Research Almaden – Austin – Beijing – Brazil – Cambridge – Dublin – Haifa – India – Kenya – Melbourne – T.J. Watson – Tokyo – Zurich


jVerbs: RDMA support for Java

Patrick Stuedi, IBM Research, [email protected]

Bernhard Metzler, IBM Research, [email protected]

Animesh Trivedi, IBM Research, [email protected]

Abstract

We present jVerbs, an API and library offering full RDMA semantics in Java. jVerbs achieves bare-metal latencies in the order of single-digit microseconds for reading small byte buffers from a remote host's memory. In this paper, we discuss the design and implementation of jVerbs and demonstrate its advantages with several applications.

1 Introduction

There has been increasing interest in low latency for data center applications. Remote Direct Memory Access (RDMA) is a standard for efficient and low-latency memory-to-memory data transfer between two networked hosts. RDMA provides zero copy, kernel bypass, as well as enhanced networking semantics such as one-sided operations and a clean separation of paths for data and control operations. Traditionally, RDMA techniques have been employed in high-performance computing [3]. Consequently, the interfaces to the RDMA network stack are in native languages like C. In contrast, today many data center applications and even large parts of the software infrastructure for data centers are written in Java. Therefore, proper RDMA support in Java will be necessary to avoid giving away the latency advantages of modern network interconnects.

One obstacle towards achieving this goal is that RDMA requires direct access to the networking hardware from user space. Java provides the Java Native Interface (JNI) to give Java applications access to low-level tasks that are best implemented in C. Crossing the Java-native boundary, however, is known to be costly. Even though JNI performance has improved, the performance penalties are still significant compared to the latencies of modern high-speed interconnects.

Luckily, Java as a language has evolved, offering more options for implementing efficient RDMA support in Java. We present jVerbs, an API and library for Java offering full RDMA semantics and bare-metal latencies (as low as 4-7 microseconds) for reading small byte buffers from a remote host's memory. jVerbs operates in concert with the rest of the Linux RDMA ecosystem and works with existing RDMA devices (both software- and hardware-based). In this paper we discuss the design and implementation of jVerbs and demonstrate how it can be used to boost the performance of existing data center applications such as a distributed cache and web services.

2 The Java Latency Gap

Network latencies seen by Java applications on modern high-speed interconnects lag far behind the lowest latencies that can be achieved with applications written in C. This is partially due to the additional overhead imposed by Java, but also due to Java's limited capabilities for leveraging the hardware features of modern interconnects. To illustrate this situation we performed a set of experiments comparing the latencies seen by applications written in Java and C, respectively.

In the first set of experiments we measure the round-trip latency of a 64-byte message exchanged between two hosts connected by a 10 Gbit/s Ethernet network. In the experiment we use Chelsio T4 NICs, which can be used both in full RDMA and Ethernet-only mode. As can be seen from Figure 1, when implemented in Java using regular TCP-based sockets this benchmark achieves a latency of 47µs. The same benchmark implemented in C achieves a latency of 35µs. The performance difference can mainly be attributed to the longer code path of Java.

We set these results in relation to the latency achievable if RDMA is deployed in the same network setup. Here, we further distinguish between (a) using the RDMA send/recv model, which resembles the socket behavior, and (b) taking full advantage of RDMA semantics through one-sided read operations (see Section 3 for details).
[Figure 1: bar chart of request/response latency [us] over Ethernet (ETH) and InfiniBand (IB) for Java/NIO, C/sockets, C/RDMA send/recv, and C/RDMA read.]

Figure 1: Request/response latencies on modern interconnects. There exists a significant latency gap between Java and other implementations.

By applying the send/recv model, RDMA protocol offloading and zero-copy data movement already reduce the latency to 30µs. Using RDMA read operations, the round-trip latency further decreases to only 7µs.

In a second set of experiments we exchanged Ethernet for another RDMA interconnect, namely InfiniBand. InfiniBand uses a different network link technology and is mainly deployed in high-performance computing. The measurements show an even more dramatic latency gap between the C application running over InfiniBand and Java running on the same technology. This increased performance discrepancy is mainly caused by the InfiniBand stack being highly optimized for direct RDMA operations but not for being accessed via sockets/TCP. With that, the plain Java-based implementation ends up lagging behind by up to 70µs when compared to an RDMA-based implementation written in C.

Those measurements yield two main conclusions. First, better RDMA support in Java is necessary if we want Java applications to benefit from the latency advantages of modern interconnects. Second, only making the whole set of RDMA semantics directly available within Java has the potential to unleash the full RDMA performance for Java applications. jVerbs accommodates both goals. To motivate its design, in the next two sections we describe the core elements of the RDMA technology and discuss the challenges of providing RDMA in Java.

3 Background

We briefly provide the necessary background information about RDMA, its API, and its implementation in Linux. A more detailed introduction to RDMA can be found in [7].

RDMA Semantics: RDMA provides both send/receive type communication and RDMA operations. With RDMA send/recv, also known as two-sided operations, the sender sends a message, while the receiver pre-posts an application buffer, indicating where it wants to receive data. This is similar to the traditional socket-based communication semantics. RDMA operations comprise read, write, and atomics, commonly referred to as one-sided operations. These operations require only one peer to actively read, write, or atomically manipulate remote application buffers.

In contrast to the socket model, RDMA fully separates data transfer operations from control operations. This facilitates pre-allocation of communication resources, e.g., application buffer registration or connection endpoint and work queue allocation. Separating a fast data path from a control path enables very fast dispatching of data transfer operations without operating system involvement and is key to achieving ultra-low latencies.

API: Applications interact with the RDMA subsystem through a "verbs" interface, a loose definition of API calls providing the aforementioned RDMA semantics [8]. By avoiding a concrete syntax, the verbs definition allows for different platform-specific implementations.

As an example of a control type verb, create_qp() creates a queue pair of send and receive queues for holding application requests for data transfer. Data operations such as post_send() or post_recv() allow applications to asynchronously post data transfer requests into the send/receive queues. Completed requests are placed into an associated completion queue, allocated via the create_cq() verb. The completion queue can be queried by applications either in polling mode or in blocking mode.

The verbs API also defines scatter/gather data access to minimize the number of interactions between applications and RDMA devices. Furthermore, multiple requests can be submitted with a single post_send() or post_recv() call.

Implementation: Today, concrete RDMA implementations are available for multiple interconnect technologies, such as InfiniBand, iWARP, or RoCE. OpenFabrics is a widely accepted effort to integrate all these technologies from different vendors and to provide a common RDMA application programming interface implementing the RDMA verbs.

As a software stack, the OpenFabrics Enterprise Distribution (OFED) spans both the operating system kernel and user space. At kernel level, OFED provides an umbrella framework for hardware-specific RDMA drivers. These drivers can be software emulations of RDMA devices, or regular drivers managing RDMA network interfaces such as the Chelsio T4. At user level, OFED provides an application library implementing the verbs interface.


              RTT     JNI costs   Overhead
2000/VIA      70µs    20µs        29%
2012/RDMA     3-7µs   2-4µs       28-133%

Table 1: Network latencies and accumulated JNI costs per data transfer in 2000 and 2012, assuming four VIA operations with base-type parameters per network operation [11], and two RDMA operations with complex parameters per network operation (post_send()/poll_cq()).

Control operations involve both the OFED kernel and the userspace verbs library. In contrast, operations for sending and receiving data are implemented by directly accessing the networking hardware from user space.

4 Challenges

Java provides the Java Native Interface (JNI) to give applications access to low-level functionality that is best implemented in C. Unfortunately, the overhead of crossing the boundary between Java and C can be quite high, especially for function calls with complex parameters. This is bad news, since the key data operations in RDMA like post_send(), post_recv() and poll_cq() all take arrays of complex data types as parameters.

Direct userland network access for Java was investigated in the late 90s in the context of the Virtual Interface Architecture (VIA). Jaguar [11] is the most sophisticated approach of that time and also the closest to jVerbs. Notably, Jaguar discusses the performance overhead of using JNI to access networking hardware from Java. The numbers reported are in the order of 1µs for a simple JNI call without parameters, and 10s of µs for more complex JNI calls. Copying data across the JNI boundary, though, is the most expensive operation of all, at 270µs per kB of data. These numbers are best viewed in relation to the round-trip latency of ~70µs measured natively on VIA. Today, we know that ping/pong latencies of single-digit microseconds are possible with modern RDMA interconnects. Thus, we ran several experiments to see how a potential JNI solution would fit into the RDMA latency landscape. We found improved JNI overheads of about 100ns for simple JNI calls without parameters, but 1-2µs for more complex JNI calls similar to the post_send() verb call.

Considering that multiple verb calls are necessary for a single message round-trip, these numbers add a significant fraction to the RDMA latency. This is an optimistic calculation, looking at RDMA operations referencing a single buffer only. The JNI overhead for scatter/gather operations can easily reach multiples of microseconds.

jVerb call         Description

Control:
  jv_create_qp()   create send/recv queue
  jv_create_cq()   create completion queue
  jv_reg_mr()      register memory with device
  jv_connect()     set up an RDMA connection

Data:
  jv_post_send()   post operation(s) to send queue
  jv_post_recv()   post operation(s) to recv queue
  jv_poll_cq()     poll completion queue
  jv_get_cq_evt()  wait for completion event

Table 2: Most prominent API calls in jVerbs.

The key takeaway here is that the JNI overhead for complex function calls (involving arrays and complex data types) is unacceptable when implementing a low-latency network stack. The overhead of simple function calls (with just base-type parameters), however, does not add significant costs to the overall latency.

5 Design of jVerbs

In this section we describe some of the design decisions we made in jVerbs.

5.1 jVerbs API

The verbs interface is not the only RDMA API, but it represents the "native" API for interacting with RDMA devices. Other APIs like uDAPL, Rsockets, SDP or OpenMPI/RDMA are built on top of the verbs, and typically offer higher levels of abstraction at the penalty of restricted semantics and lower performance. With jVerbs as a native RDMA API we decided not to compromise on available communication semantics nor on the minimum possible networking latency. jVerbs provides access to all the exclusive RDMA features such as one-sided operations and the separation of data and control, while maintaining a completely asynchronous, event-driven interface. Other higher-level abstractions may be built on top of jVerbs at a later stage if the need arises. Table 2 lists some of the most prominent verb calls available in jVerbs.

5.2 Zero-copy Data Movement

Regular Java heap memory used for storing Java objects cannot be used as a source or sink in an RDMA operation, since this would interfere with the activities of the garbage collector. Fortunately, support for off-heap memory has been part of Java since version 1.4.2 of the Java Development Kit (JDK). Off-heap memory is allocated in a separate region outside the control of the garbage collector, yet it can be accessed through the regular Java memory API (ByteBuffer). jVerbs enforces the use of off-heap memory in all of its data operations. Any data to be transmitted or buffer space for receiving must reside in off-heap memory. In contrast to regular heap memory, off-heap memory can be accessed cleanly via DMA. As a result, jVerbs enables true zero-copy data transmission and reception for all application data stored in off-heap memory. This eliminates the need for data copying across a JNI interface, which we know can easily cost multiple 10s of microseconds. In practice, at least one copy will typically be necessary to move data between its on-heap location and a network buffer residing off-heap. In many situations, however, this copy can be avoided by either making sure that network data resides off-heap from the beginning, or by merging the copy with object serialization. As an example of the first, one can imagine a key/value store with the data store being held in off-heap memory. A good example of the second is an RPC call, where the parameters and result values are marshalled into off-heap memory instead of into regular heap memory. Both cases are described in more detail in Section 7.
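As a minimal sketch, allocating and filling such an off-heap buffer uses the standard JDK API; the registration verb is only referenced in a comment, since this paper does not show its full signature:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OffHeapExample {
    public static void main(String[] args) {
        // Allocate 4 KB outside the Java heap: the garbage collector
        // does not move this memory, so a NIC can safely DMA to/from it.
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);
        buf.order(ByteOrder.nativeOrder());
        buf.putLong(42L);   // accessed through the regular ByteBuffer API
        buf.flip();
        // Before use as an RDMA source/sink, the buffer would be
        // registered with the device, e.g. via jv_reg_mr() (Table 2).
        System.out.println("off-heap buffer, direct=" + buf.isDirect());
    }
}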

5.3 Stateful Verb Calls

At its core, any RDMA user library has to translate data structures exported by the API into a memory layout for the device. For instance, the post_send verb call requires work descriptors provided by the application to be written into a mapped device queue in a device-specific format. Such serialization, when done in Java, can easily take several microseconds; too much given the single-digit network latencies of modern interconnects. To mitigate this problem, jVerbs employs a mechanism called stateful verb calls (SVCs). With SVCs, any state that is created as part of a jVerbs API call is passed back to the application and can be reused on subsequent calls. This mechanism manifests itself directly at the API level: instead of returning a plain value, each verb call returns a stateful object that represents the verb call for a given set of parameter values. Each SVC object has the following four operations associated with it:

• exec(): executes the corresponding jVerbs API call for the given parameter values;

• result(): returns the result of the last jVerbs call;

• valid(): returns true if the object can be executed, false otherwise;

• free(): releases any state associated with this SVC object.
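A minimal Java rendering of this interface, assuming illustrative type names (the paper does not spell out the exact signatures):

/* Hypothetical shape of a stateful verb call; the four methods
   correspond to the operations listed above. */
public interface StatefulVerbCall<R> {
    StatefulVerbCall<R> exec();  // execute the call for the current parameter state
    R result();                  // result of the most recent execution
    boolean valid();             // false once the object may no longer be executed
    void free();                 // release any serialization state held by this object
}

Having exec() return the object itself permits the chained style used in Figure 2, e.g. poll.exec().result().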

One key advantage of SVCs is that they can be cached and re-executed as long as they remain valid. Semantically, each execution of an SVC object is identical to a jVerbs call evaluated against the current parameter state of the SVC object. Any serialization state that is necessary while executing the SVC object, however, only has to be created when executing the object for the first time. Subsequent calls use the already established serialization state, and will therefore execute much faster.

Some SVC objects allow the parameter state to be changed after object creation. For instance, the addresses and offsets of SVC objects returned by jv_post_send() and jv_post_recv() can be changed by the application if needed. Internally, those objects update their serialization state incrementally. Modifications to SVC objects are only permitted as long as they do not extend the serialization state. Consequently, adding new work requests or scatter/gather elements to an SVC jv_post_send() object is not allowed.

Stateful verb calls give applications a handle to mitigate the serialization cost. In many situations, applications may only have to create a small number of SVC objects matching the different types of verb calls they intend to use. Re-using those objects effectively reduces the serialization cost to almost zero, as we will show in Section 8. Figure 2 illustrates programming with jVerbs and SVCs and compares it to native RDMA programming in C.

6 Implementation

The implementation of jVerbs follows the abstract factory pattern. The API is defined as an abstract class and applications use a factory to create a specific instance of the jVerbs interface. Currently there exist two separate implementations of the jVerbs interface: jVerbs/mem, which is implemented entirely in Java, and jVerbs/nat, which makes use of JNI in a way that avoids JNI performance overheads almost completely. From a performance standpoint, jVerbs/mem has a slight edge over jVerbs/nat. On the other hand, jVerbs/nat can run with any RDMA device currently supported by OFED, which is not true for jVerbs/mem.

Which implementation of the jVerbs library is instantiated is decided by a system property that can be set by the application. Other than that, the implementation-specific details are completely shielded from the application. In the following we discuss each of the two implementations of jVerbs in more detail.
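A minimal sketch of this factory shape; the property name and the concrete class names are illustrative, only RdmaVerbs.open() itself appears in the paper (Figure 2):

/* Hypothetical sketch of the abstract factory described above. */
public abstract class RdmaVerbs {
    public static RdmaVerbs open() {
        // the concrete implementation is selected through a system property
        String impl = System.getProperty("jverbs.provider", "mem");
        return impl.equals("nat") ? new NatVerbs() : new MemVerbs();
    }
    /* abstract verb methods such as jv_post_send() would be declared here */
}

class MemVerbs extends RdmaVerbs { /* pure-Java implementation (jVerbs/mem) */ }
class NatVerbs extends RdmaVerbs { /* JNI-backed implementation (jVerbs/nat) */ }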


/* assumptions: send queue (sq), completion queue (cq),
   work requests (wrlist), output parameter with
   polling events (plist) */

/* post the work requests */
post_send(sq, wrlist);
/* check if operation has completed */
while (poll_cq(cq, plist) == 0);

(a)

RdmaVerbs v = RdmaVerbs.open();
/* post the work requests */
v.jv_post_send(sq, wrlist).exec().free();
/* check if operation has completed */
while (v.jv_poll_cq(cq, plist).exec().result() == 0);

(b)

RdmaVerbs v = RdmaVerbs.open();
/* create SVCs */
RdmaSVC post = v.jv_post_send(sq, wrlist);
RdmaSVC poll = v.jv_poll_cq(cq, plist);
post.exec();
while (poll.exec().result() == 0);
/* modify the work requests */
post.getWR(0).getSG(0).setOffset(32);
/* post again */
post.exec();
while (poll.exec().result() == 0);

(c)

Figure 2: RDMA programming (a) using native C verbs, (b) using jVerbs, (c) using jVerbs with SVCs.

6.1 jVerbs/mem

jVerbs/mem is an implementation of the jVerbs API written entirely in Java. It is based on two key aspects: memory-mapped hardware access and direct kernel interaction. Figure 3 serves as a reference throughout this section, illustrating the various aspects of the jVerbs/mem implementation architecture.

6.1.1 Memory-mapped Hardware Access

As mentioned earlier, for all performance-critical operations the native C verbs library interacts with RDMA network devices via three queues: a send queue, a receive queue, and a completion queue. Those queues represent hardware resources but are mapped into user space to avoid kernel involvement when accessing them. jVerbs/mem leverages Java's off-heap memory to map these hardware resources into the Java address space. Fast-path operations like jv_post_send() or jv_post_recv() are implemented by directly serializing work requests into the mapped queues. All operations are implemented entirely in Java, avoiding expensive JNI calls or modifications to the JVM.

[Figure 3: architecture diagram showing the application and the jVerbs core/driver inside the JVM (with SVCs and off-heap memory), the OFED kernel below on the control path, and the NIC reached on the fast path via memory-mapped send/recv/completion queues, with work requests pointing to user memory accessed directly via DMA.]

Figure 3: Architecture of the jVerbs/mem implementation.

Access to hardware resources is device-specific. Hence, the native C verbs user library contains hooks for a device-specific driver that implements hardware access from user space. In jVerbs/mem we follow the same model, encapsulating the device-specific internals into a separate module (user driver) which interacts with the core jVerbs library through a well-defined user driver interface. A concrete implementation of this interface knows the layout of the hardware queues and makes sure work requests or polling requests are serialized into the right format. As of now, we have implemented user drivers for the Chelsio T4, Mellanox ConnectX-2 and SoftiWARP [10].

User drivers typically make use of certain low-level operations when interacting with hardware devices. For instance, efficient locking is required to protect hardware queues from concurrent access. Other parts require atomic access to device memory or guaranteed ordering of instructions. While such low-level language features are available in C, they were long unsupported in Java. With the rise of multicore, however, Java has extended its concurrency API to include many of these features. For instance, Java has recently added support for atomic operations and fine-grained locking. Based on those language features it was possible to implement RDMA user drivers entirely in Java using regular Java APIs in most cases. The only place where Java reflection was required was to obtain an off-heap mapping of device memory; this is because Java's mmap() does not work properly with device files. Looking at the Java road map, more low-level features will be added in the coming releases, making the development of jVerbs user drivers even more convenient.
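For illustration, here is a sketch of how a user driver might guard serialization into a mapped queue with the java.util.concurrent locking API; the queue layout and the doorbell handling are hypothetical simplifications, not the actual driver code:

import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

/* Hypothetical sketch: serialize a work request into a memory-mapped
   send queue under a fine-grained lock. A real user driver follows
   the exact hardware queue layout. */
public class MappedSendQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final ByteBuffer sq;  // mapped device send queue
    private int tail;             // next free slot, in bytes

    public MappedSendQueue(ByteBuffer mappedQueue) {
        this.sq = mappedQueue;
    }

    public void post(byte[] serializedWorkRequest) {
        lock.lock();              // protect the queue from concurrent posters
        try {
            sq.position(tail);
            sq.put(serializedWorkRequest);
            tail = sq.position();
            // a real driver would now ring the device doorbell,
            // typically a store to a mapped device register
        } finally {
            lock.unlock();
        }
    }
}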

6.1.2 Direct Kernel Interaction

The native C verbs library implements control operations in concert with the RDMA kernel modules. Verbs library and kernel communicate over a shared file descriptor using a well-defined binary protocol. Control operations are often referred to as the slow path, since they typically do not implement performance-critical operations. This is, however, only partially true. Certain operations like memory registration may very well be in the critical path of applications. To make sure no additional overhead is imposed on control operations, jVerbs/mem directly communicates with the RDMA kernel via the same file-based binary protocol. Again, one alternative would have been to implement the binary protocol in a C library and interface with it through JNI. But this comes at a loss of performance which is unacceptable even in the control path.
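To sketch the idea of this file-based control path: the device path and the command layout below are placeholders, not the actual OFED wire format, which defines its own command structures.

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/* Hypothetical sketch: issue a control command by writing a binary
   record to an RDMA device file shared with the kernel. */
public class KernelControlPath {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile dev = new RandomAccessFile("/dev/hypothetical_rdma", "rw")) {
            ByteBuffer cmd = ByteBuffer.allocate(16).order(ByteOrder.nativeOrder());
            cmd.putInt(1);    // placeholder command id, e.g. "create CQ"
            cmd.putInt(64);   // placeholder argument, e.g. queue depth
            cmd.putLong(0L);  // placeholder response address
            dev.write(cmd.array());  // the kernel parses and executes the command
        }
    }
}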

Some RDMA devices require extra parameters to be passed to the kernel driver during control operations. To accommodate those needs, jVerbs/mem invokes the user driver interface before interacting with the kernel.

6.1.3 Stateful Verb Calls in jVerbs/mem

In jVerbs/mem we use SVCs to cache all the serialization state that is created within the user driver when writing to the memory-mapped queues. For instance, work requests are cached as a byte array in a form serialized for the target device. Upon executing the given verb call, the serialized byte sequence is unfolded into the right position within the memory-mapped device queue.

6.2 jVerbs/nat

jVerbs/nat is an alternative implementation of the jVerbs interface working directly with the OFED RDMA user libraries. This has the advantage that jVerbs/nat can be used with any RDMA-capable network interface supported by OFED. The challenge was to integrate with the native OFED user libraries without paying the JNI latency penalty.

6.2.1 Native Integration

The key insight from Section 4 was that the cost of calling a JNI function is significant when passing complex parameter types, but negligible when passing base-type parameters. In jVerbs/nat (see Figure 4) we use JNI to bridge (jverbs.so) between Java and the native libraries (ibverbs.so), but we make sure all the JNI function calls only take a small number of base-type parameters as input.

[Figure 4: architecture diagram of jVerbs/nat: the JVM hosts the application, off-heap memory, and SVCs; jverbs.so bridges via JNI to ibverbs.so and the user driver, which reach the OFED kernel on the control path and the NIC's send/recv/completion queues on the fast path; parameters are passed as serialized off-heap structures for ibverbs.so.]

Figure 4: Architecture of the jVerbs/nat implementation.

To illustrate the approach, let us walk through the call stack of jVerbs/nat for the jv_post_send verb call. An application calls jv_post_send with a list of work requests referencing memory regions to be transmitted to a remote host. In jVerbs/nat we want to call the native post_send API function available in the OFED user library. Calling post_send via a regular JNI interface would, however, impose a significant latency overhead due to marshalling and de-marshalling of the work requests at the JNI border. Instead, in jVerbs/nat we serialize the work requests directly into off-heap memory using the native C memory layout. This memory layout exactly matches the memory layout of the C parameters that are used for calling the native post_send API function. With the work requests properly serialized in off-heap memory, passing them to the native OFED function (post_send) simply involves a JNI call containing a base-type parameter holding the address of the off-heap representation of the parameters. The advantage is that the Java parameters only have to be serialized once, avoiding the intermediate JNI serialization.
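The following sketch illustrates the serialization step; the field names, offsets, and sizes are placeholders and do not reflect the real ibverbs struct layout:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/* Hypothetical sketch: a work request serialized once into off-heap
   memory in the native layout expected by the OFED library. */
public class WorkRequestImage {
    public static ByteBuffer serialize(long bufAddr, int length, int lkey) {
        ByteBuffer wr = ByteBuffer.allocateDirect(32).order(ByteOrder.nativeOrder());
        wr.putLong(bufAddr); // placeholder field: address of the registered buffer
        wr.putInt(length);   // placeholder field: number of bytes to transfer
        wr.putInt(lkey);     // placeholder field: memory registration key
        wr.flip();
        // jVerbs/nat would now pass the off-heap address of 'wr' to the
        // native post_send() through a JNI call taking base types only.
        return wr;
    }
}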

6.2.2 Stateful Verb Calls in jVerbs/nat

The concept of stateful verb calls is used in jVerbs/nat to cache the aforementioned serialized parameters, prepared and ready to use in off-heap memory. That way, repeated jVerbs API calls do not have to undergo the parameter translation process again (see Figure 4).

7 Applications

RDMA programming is different from socket programming and often requires a major rethinking of application structures. One key characteristic of RDMA is that it cleanly separates data from control operations. This is done for good reason, namely because it allows applications to push most of their control operations out of the critical path. With the control state set up properly, applications can benefit from the latency advantages of fast RDMA data operations. In the following, we describe the implementation of two systems explicitly designed to achieve low latencies using jVerbs.

[Figure 5: diagram of a client stub with session state and off-heap memory, its jVerbs SVC for the RPC, and the NIC transmitting to the server.]

Figure 5: Zero-copy RPC using jVerbs (skipping details at the server): 1) marshalling parameters into off-heap memory, 2) zero-copy transmission of parameters and RPC header using a scatter/gather RDMA send.

7.1 Low-latency RPC

Remote Procedure Call (RPC) is a popular mechanism for invoking a function call in a remote address space [5]. Data center software infrastructure and applications heavily rely on RPC as a key mechanism for accessing their services. Remote Method Invocation (RMI) is Java's API and toolset for performing the object-oriented equivalent of RPC. RMI is based on TCP/IP sockets and is therefore not optimized for high-performance interconnects with user-level networking interfaces. We have developed jvRPC, a simple prototype of an RDMA-based RPC system for Java. jvRPC makes use of RDMA send/recv operations and scatter/gather support. The steps involved in an RPC call are illustrated in Figure 5.

Initialization: First, both client and server set up a session for saving the RPC state across multiple calls of the same method. The session state is held in off-heap memory to make sure it can be used with jVerbs operations.

Client: During an RPC call, the client stub first marshalls the parameter objects into the session memory. It further creates an SVC object for the given RPC call using jVerbs and caches it as part of the session state. The SVC object represents a jv_post_send() call of type send with two scatter/gather elements. The first element points to a memory area of the session state that holds the RPC header for this call. The second element points to the marshalled RPC parameters. Executing the RPC call then comes down to executing the SVC object. Since the SVC object is cached, subsequent RPC calls of the same session only require marshalling of the parameters, but not re-creation of the serialization state of the verb call.

The synchronous nature of RPC calls allows jvRPC to optimize for latency using RDMA polling on the client side. Polling is CPU-expensive, but it leads to significant latency improvements. jvRPC uses polling for lightweight RPC calls and falls back to blocking mode for compute-heavy function calls.
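A condensed sketch of this client-side call path; all type and method names besides exec()/result() are illustrative, not the actual jvRPC code:

interface RdmaSVC {
    RdmaSVC exec();
    int result();
}

public class JvRpcSession {
    private final java.nio.ByteBuffer paramBuf; // off-heap session memory for marshalled parameters
    private final RdmaSVC post;                 // cached jv_post_send() SVC: header + parameters
    private final RdmaSVC poll;                 // cached jv_poll_cq() SVC

    public JvRpcSession(java.nio.ByteBuffer paramBuf, RdmaSVC post, RdmaSVC poll) {
        this.paramBuf = paramBuf;
        this.post = post;
        this.poll = poll;
    }

    public void call(byte[] marshalledParams) {
        paramBuf.clear();
        paramBuf.put(marshalledParams); // marshal into off-heap session memory
        post.exec();                    // scatter/gather send of header and parameters
        while (poll.exec().result() == 0) {
            // client-side polling for the response keeps latency low
        }
    }
}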

Server: At the server side, the incoming header and RPC parameters are placed into off-heap memory by the RDMA NIC, from where they are de-marshalled into on-heap objects. The actual RPC call may produce return values residing on the Java heap. These objects, together with the RPC header, are again marshalled into off-heap session memory provided by the RPC stub. Further, an SVC object is created and cached, representing a send operation back to the client. The send operation transmits both header and return values in a zero-copy fashion. Again, RDMA's scatter/gather support allows the transmission of header and data with one single RDMA operation.

RPC based on user-level networking is not a new idea; similar approaches have been proposed in [4, 6]. jvRPC, however, is specifically designed to leverage the semantic advantages of RDMA and jVerbs (e.g., scatter/gather, polling, SVCs). In Section 8 we show that jvRPC achieves latencies that are very close to the raw network latencies of RDMA interconnects.

7.2 Low-latency Memcached

Memcached is a prominent in-memory key/value store often used by web applications to store the results of database calls or page renderings. The memcached access latency directly affects the overall performance of such web applications. Memcached supports both TCP- and UDP-based protocols between client and server. Recently, RDMA-based access to memcached has been proposed [9]. That work focuses on using software-based RDMA to lower the CPU footprint of the memcached server. The same architecture, when used with hardware-supported RDMA, can also be used to reduce the access latency of memcached clients. We therefore used jVerbs to implement a Java client accessing memcached through an RDMA-based protocol as described in [9].

Basic idea: The interaction between the Java client and the memcached server is captured in Figure 6. First, the memcached server makes sure it stores the key/value pairs in RDMA registered memory. Second, clients learn about the remote memory identifiers (in RDMA terminology also called stags) of the keys when accessing a key for the first time. And third, clients fetch key/value pairs through RDMA read operations (using the previously learned memory references) on subsequent accesses to the same key.

[Figure 6: architecture diagram showing a web app client with jVerbs and an SVC inside the JVM, and the memcached server with libibverbs, a stag table, memory registration, and registered key-value pairs, connected through their NICs via RDMA read.]

Figure 6: Low-latency memcached access for Java clients using jVerbs.

Get/Multiget: Using RDMA read operations for accessing key/value pairs reduces the load at the server due to less memory copying and fewer context switches. More important with regard to this work is that RDMA read operations provide clients with ultra-low-latency access to key/value pairs. To achieve the lowest possible access latencies, the memcached client library makes use of jVerbs SVC objects. A set of SVC objects representing RDMA read operations is created at loading time. Each time a memcached GET operation is called, the client library looks up the stag for the corresponding key and modifies the cached SVC object accordingly. Executing the SVC object triggers the fetching of the key/value pair from the memcached server.

Memcached also has the ability to fetch multiple key/value pairs at once via the multiget API call. The call semantics of multiget match up well with the scatter/gather semantics of RDMA, allowing the client library to fetch multiple key/value pairs with one single RDMA call.
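A sketch of how a multiget might map onto a cached SVC under these semantics; all names are illustrative, and, in line with the SVC rules from Section 5.3, the SVC is created with its maximum number of work requests and only re-pointed afterwards:

public class RdmaMultiget {
    /* minimal illustrative view of a cached SVC holding one RDMA read
       work request per key, all posted with a single call */
    interface ReadSVC {
        ReadSVC setRemote(int wrIndex, long remoteAddr, int stag, int length);
        ReadSVC exec();
    }

    /* illustrative stag table, filled on first access to each key */
    interface StagTable {
        long addrOf(String key);
        int stagOf(String key);
        int lenOf(String key);
    }

    private final ReadSVC read;   // created once at loading time
    private final StagTable stags;

    public RdmaMultiget(ReadSVC read, StagTable stags) {
        this.read = read;
        this.stags = stags;
    }

    public void multiget(String[] keys) {
        for (int i = 0; i < keys.length; i++) {
            // re-point work request i at the remote location of key i
            read.setRemote(i, stags.addrOf(keys[i]), stags.stagOf(keys[i]), stags.lenOf(keys[i]));
        }
        read.exec();  // fetch all key/value pairs with a single call
    }
}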

Set: Unlike in the GET operation, the server needs to be involved during a memcached SET to insert the key/value pairs properly into the hash table. Consequently, a client cannot use one-sided RDMA write operations for adding new elements to the store. Instead, adding new elements is implemented via send/recv operations. Nevertheless, objects can be transmitted without intermediate copies at the client side if they are marshalled into off-heap memory beforehand.

8 Evaluation

In this section we discuss the performance of jVerbs, first analyzing basic RDMA operations and then looking at more complex applications running on top of jVerbs. Before presenting our results, we describe the hardware setup used for the evaluation in detail.

8.1 Test Equipment

Experiments are executed on two sets of machines. The first set comprises two machines connected directly to each other. These machines are equipped with an 8-core Intel Xeon E5-2690 CPU and a Chelsio T4 10 Gbit/s adapter with RDMA support. The second set comprises two machines connected through a switched InfiniBand network. These machines are equipped with a 4-core Intel Xeon L5609 CPU and a Mellanox ConnectX-2 40 Gbit/s adapter with RDMA support. We have used the Ethernet-based setup for all our experiments except the one discussed in Section 8.4. We have further used an unmodified IBM JVM version 1.7 to perform all the experiments shown in this paper. As a sanity check, however, we have repeated several experiments using an unmodified Oracle JVM version 1.7 without seeing any major performance differences. We use the jVerbs/mem implementation throughout the evaluation. However, we repeated most of the experiments with jVerbs/nat as well, without noticeable performance difference.

8.2 Basic Operations

In this section we show the latency performance of the different RDMA operations available in jVerbs and demonstrate the gain they provide compared to traditional Java/sockets. The measured latency numbers discussed in this section are captured in Figure 7. The benchmark measures the round-trip latency for messages of varying sizes. Five bars are shown for each of the measured data sizes, representing five different experiments. Each data point represents the average value over 1 million runs. In all our experiments, the standard deviation across the different runs was small enough to be negligible; thus, we decided to omit error bars. We used Java/NIO to implement the socket benchmarks. For a fair comparison, we use data buffers residing in off-heap memory for both the jVerbs and the socket benchmarks.

Sockets: The Java socket latencies are shown as the first bar among the five bars shown per data size in Figure 7. We measured socket latencies of 59µs for 4-byte data buffers, and 95µs for buffers of size 16K. Those numbers allow us to put the upcoming jVerbs latencies into perspective. A detailed performance analysis of the Java/NIO stack, however, is outside the scope of this work.

[Figure 7: grouped bar chart of ping/pong latency [us] versus buffer size (4B to 16K), with bars for NIO, jVerbs send/recv, jVerbs send/recv+poll, jVerbs read, and jVerbs read+poll.]

Figure 7: Round-trip latencies of basic jVerbs operations compared to traditional socket networking in Java.

Two-sided operations: Two-sided operations are the RDMA counterpart of traditional socket operations like send() and recv(). As can be observed from Figure 7, two-sided operations in jVerbs achieve a round-trip latency of 30µs for small buffers and 55µs for larger buffers. This is about 50% faster than Java/sockets. Much of this gain can be attributed to the zero-copy transmission of RDMA send/recv operations and the offloaded transport stack.

Polling: One key feature offered by RDMA and jVerbs is the ability to poll a user-mapped queue to determine when an operation has completed. By using polling together with two-sided operations we can bring down the latencies by an additional 65% (see the third bar in Figure 7).

One-sided operations: One-sided RDMA operations provide a semantic advantage over traditional rendezvous-based socket operations. The performance advantage of one-sided operations is demonstrated by the last two bars shown in Figure 7. These bars represent the latency numbers of a one-sided read operation, once used in blocking mode and once in polling mode. With polling, latencies of 7µs can be achieved for small data buffers, whereas latencies of 31µs are achieved for buffers of size 16K. This is another substantial improvement over regular Java sockets. Much of this gain comes from the fact that no server process needs to be scheduled for sending the response message.

8.3 Comparison with Native Verbs

One important question regarding the performance of jVerbs is how it compares with the performance of the native verbs interface in C. For this reason, we compared all the benchmarks of Section 8.2 with their C-based counterparts. Throughout those experiments, the performance difference never exceeded 5%. In Figure 8 we compare the latencies of native C verbs with jVerbs for both send/recv and read operations. In both experiments polling is used, which puts maximum demands on the verbs interface. The fact that jVerbs performance is on par with the performance of native verbs confirms the design decisions made in jVerbs.

[Figure 8: two panels of latency [us] versus buffer size (4B to 16K) comparing C/verbs and Java/jVerbs.]

Figure 8: Comparing latencies of the native C verbs interface with jVerbs; top: send/recv operation with polling, bottom: read operation with polling. jVerbs adds negligible performance overhead.

8.4 Different Transports

RDMA is a networking principle which is independent of the actual transport that is used for transmitting the bits. Each transport has its own performance characteristics, and we have already shown the latency performance of jVerbs on Chelsio T4 NICs. Those cards provide an RDMA interface on top of an offloaded TCP/IP/Ethernet stack, also known as iWARP. In Figure 9, we show latencies of RDMA operations in jVerbs for two alternative transports: InfiniBand and SoftiWARP.

In the case of InfiniBand, the latency numbers for small 4-byte buffers outperform the socket latencies by far. This discrepancy, described in Section 2, is due to InfiniBand being optimized for RDMA but not at all for sockets. The gap increases further for larger buffer sizes of 16K. This is because RDMA operations in jVerbs are able to make use of the 40 Gbit/s InfiniBand transport, whereas socket operations need to put up with the Ethernet emulation on top of InfiniBand. The latencies we see on InfiniBand are a good indication of the performance landscape we are going to see for RDMA/Ethernet in the near future. In fact, the latest iWARP NICs today provide 40 Gbit/s bandwidth and latencies below 3µs.

[Figure 9: latency [us] for 4B and 16K buffers on InfiniBand (left) and SoftiWARP/Ethernet (right), comparing NIO, jVerbs send/recv, jVerbs send/recv+poll, jVerbs read, and jVerbs read+poll.]

Figure 9: Comparing the latency performance of jVerbs on InfiniBand (left) and SoftiWARP/Ethernet (right) to Java/NIO operations.

RPC call          RMI     jvRPC    Speedup
foo()             101µs   16.6µs   6
foo(class)        149µs   19.3µs   7.7
foo(byte[1K])     193µs   22.9µs   8.4
foo(byte[16K])    591µs   61.5µs   9.6

Table 3: RPC latencies for Java/RMI and jvRPC.

The right side of Figure 9 compares the latencies of using jVerbs on SoftiWARP [1] to those of standard Java sockets. SoftiWARP is a software-based RDMA device implemented on top of kernel TCP/IP. In the experiment we run SoftiWARP on top of a standard Intel 10 Gbit/s NIC. SoftiWARP does not achieve latencies as low as those seen on InfiniBand. Using jVerbs together with SoftiWARP, however, still provides a significant performance improvement compared to standard sockets (70% in Figure 9). This is appealing since one big advantage of SoftiWARP is that it runs on any commodity NIC.

8.5 Applications

We test jVerbs in the context of the two applications described earlier.

jvRPC: We have implemented jvRPC by manually writing RPC stubs for a set of function calls. In Table 3 we compare the latency performance of jvRPC with Java RMI. As can be seen, jvRPC achieves RPC latencies of 16µs for simple void functions, and around 19-61µs for calls taking objects and large byte arrays as parameters. The RMI latencies, on the other hand, range from 100 to almost 600µs for the same function calls. Overall, jvRPC is a factor of 6-10 faster than RMI.

[Figure 10: latency [us] versus number of keys per multiget (1 to 9), comparing spymemcached, jVerbs first access, and jVerbs repeated access.]

Figure 10: Memcached access latencies for different Java clients.

Interestingly, the gap increases with the size of the RPC parameters. Here, the data copying inside the RMI stack, including the copying inside the NIO subsystem, may account for some of these latency overheads. Again, a detailed performance analysis of Java/RMI is outside the scope of this work. One important observation, however, is that the RPC latencies of jvRPC are only marginally higher than the fastest send/recv latencies that can be achieved (see Figure 7). The small latency penalty comes from marshalling parameters into off-heap memory and from the fact that polling can only be used at the client side while waiting for the RPC response. Using polling at the server side would decrease the latency further, but is inefficient since the server does not know when the next RPC call will arrive.

Memcached/jVerbs: Figure 10 shows the latencies of accessing memcached from a Java client. We compare two setups: (a) accessing unmodified memcached using a standard Java client library [2], and (b) accessing a modified memcached via RDMA using the jVerbs-based client library as discussed in Section 7.2. The RDMA-based setup is further subdivided into two cases: (a) accessing keys for the first time, and (b) repeated access to keys. In the benchmark we measure the latency of a multiget operation with increasing numbers of keys per operation. As can be observed from the figure, standard memcached access from Java takes between 55 and 72µs depending on the number of keys in the multiget request. With RDMA-based memcached access this number reduces to 21-32µs if keys are accessed for the first time, and 7-18µs for repeated access to keys. These latencies are very close to the raw network latencies of the send/recv and read operations that are used to implement key access in each of the cases.

One important observation from Figure 10 is that the benefits of RDMA increase with the number of keys per multiget operation. This shows that RDMA scatter/gather semantics are a very good fit for implementing the multiget operation. Overall, using RDMA reduces the latency of Java-based access to memcached substantially, by a factor of up to 10 compared to a standard client.

9 Conclusion

In this paper we have presented jVerbs, a library framework offering ultra-low network latencies for Java applications. Our measurements show that jVerbs provides bare-metal network latencies in the range of single-digit microseconds for small messages, a factor of 2-8 better than the traditional Java socket interface on the same hardware.

References

[1] SoftiWARP. http://www.gitorious.org/softiwarp.

[2] Spymemcached. http://code.google.com/p/spymemcached/.

[3] Adiga, N., et al. An overview of the BlueGene/L supercomputer. In Supercomputing, ACM/IEEE 2002 Conference (Nov. 2002), p. 60.

[4] Bilas, A., and Felten, E. W. Fast RPC on the SHRIMP virtual memory mapped network interface. J. Parallel Distrib. Comput. 40, 1 (Jan. 1997), 138-146.

[5] Birrell, A. D., and Nelson, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.

[6] Chang, C., and von Eicken, T. A software architecture for zero-copy RPC in Java. Tech. rep., Ithaca, NY, USA, 1998.

[7] Frey, P. Zero-Copy Network Communication: An Applicability Study of iWARP beyond Micro Benchmarks. Dissertation, ETH Zurich.

[8] Hilland, J., Culley, P., Pinkerton, J., and Recio, R. RDMA protocol verbs specification (version 1.0). http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf.

[9] Stuedi, P., Trivedi, A., and Metzler, B. Wimpy nodes with 10GbE: leveraging one-sided operations in soft-RDMA to boost memcached. In Proceedings of the 2012 USENIX Annual Technical Conference (Berkeley, CA, USA, 2012), USENIX ATC'12, USENIX Association.

[10] Trivedi, A., Metzler, B., and Stuedi, P. A case for RDMA in clouds: turning supercomputer networking into commodity. In Proceedings of the Second Asia-Pacific Workshop on Systems (New York, NY, USA, 2011), APSys '11, ACM, pp. 17:1-17:5.

[11] Welsh, M., and Culler, D. Jaguar: Enabling efficient communication and I/O in Java. Concurrency: Practice and Experience, Special Issue on Java for High-Performance Applications (1999).


IBM is a trademark of International Business Machines Corporation, registered in many jurisdictions worldwide. Java is a registered trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.