Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O

Tim Brecht‡ G. (John) Janakiraman§ Brian Lynn§ Vikram Saletore∗∗ Yoshio Turner§

§Hewlett Packard Laboratories ∗∗ Intel Corporation ‡University of Waterloo

ABSTRACT
Applications requiring high-speed TCP/IP processing can easily saturate a modern server. We and others have previously suggested alleviating this problem in multiprocessor environments by dedicating a subset of the processors to perform network packet processing. The remaining processors perform only application computation, thus eliminating contention between these functions for processor resources. Applications interact with packet processing engines (PPEs) using an asynchronous I/O (AIO) programming interface which bypasses the operating system. A key attraction of this overall approach is that it exploits the architectural trend toward greater thread-level parallelism in future systems based on multi-core processors. In this paper, we conduct a detailed experimental performance analysis comparing this approach to a best-practice configured Linux baseline system.

We have built a prototype system implementing this architecture, ETA+AIO (Embedded Transport Acceleration with Asynchronous I/O), and ported a high-performance web server to the AIO interface. Although the prototype uses modern single-core CPUs instead of future multi-core CPUs, an analysis of its performance can reveal important properties of this approach. Our experiments show that the ETA+AIO prototype has a modest advantage over the baseline Linux system in packet processing efficiency, consuming fewer CPU cycles to sustain the same throughput. This efficiency advantage enables the ETA+AIO prototype to achieve higher peak throughput than the baseline system, but only for workloads where the mix of packet processing and application processing approximately matches the allocation of CPUs in the ETA+AIO system, thereby enabling high utilization of all the CPUs. Detailed analysis shows that the efficiency advantage of the ETA+AIO prototype, which uses one PPE CPU, comes from avoiding multiprocessing overheads in packet processing, lower overhead of our AIO interface compared to standard sockets, and reduced cache misses due to processor partitioning.

EuroSys’06, April 18–21, 2006, Leuven, Belgium.

Categories and Subject Descriptors
D.4.4 [Operating Systems]: Communications Management—Network communication; C.4 [Performance of Systems]: Design Studies; C.5.5 [Computer System Implementation]: Servers

General Terms
Performance, Design, Experimentation

Keywords
Network Processing, Asynchronous I/O, TCP/IP

1. INTRODUCTION
Network protocol processing can saturate servers when running high-speed TCP/IP-based applications. High overhead for packet processing can leave few CPU cycles available for actual application processing. This problem is worsening as servers deploy higher-bandwidth 10 GigE interfaces. One proposed solution is to offload TCP/IP processing from the host CPUs to the network interface, accelerating network processing and freeing host cycles for application processing. However, prior TCP offload investigations [9, 46, 47, 33, 21] have not demonstrated such improvements and point to the limited capabilities of the TCP Offload Engine (TOE) hardware. While future TOE designs will likely show improvements, the processing elements in TOEs are likely to always be behind the performance curve of mainstream processors. Furthermore, memory limitations on TOE devices can force them to rely on host support in high-load scenarios, degrading performance.

Several evolutionary trends in server hardware and networking software stacks have renewed focus on improving the packet processing efficiency of servers without resorting to wholesale TCP offload [50, 42, 43, 41, 12]. The higher degree of parallelism enabled by multi-core, multi-threaded processors is a good fit for concurrently processing multiple network streams. Increasing cache sizes and shared caches on multi-core chips provide greater potential to hide memory latency by transferring data between processors through the caches (e.g., between the application and the network stack). Memory bandwidth and I/O bandwidth will scale with the number of processing elements as server designs transition to high-bandwidth point-to-point links (from multi-drop shared buses) for memory and I/O [11]. Furthermore, memory controller interfaces and network interfaces will be integrated much closer to processor cores, reducing memory and I/O overheads and bottlenecks. Concurrent with these hardware developments, the socket/TCP/IP network stack is evolving [4] to adopt concepts such as OS-bypass and asynchronous I/O from APIs such as VIA [17] and InfiniBand [26] to reduce operating system overhead and increase I/O concurrency. Our research is aimed at leveraging these trends to support scalable network performance on standard server platforms.

In this direction, we and others [50, 42, 43, 41] have suggested that network packet processing efficiency can be improved by dedicating a subset of the server's processors to network processing and by using asynchronous I/O for communication between the network processing engine and the application. In our architecture, the Embedded Transport Acceleration (ETA) [42, 50] component (developed at Intel Labs) dedicates (“sequesters”) one or more processors or hardware threads in the server to perform all network processing. These dedicated Packet Processing Engines (PPEs) avoid context switches and cache conflicts with application processes, and they avoid costly interrupts by polling applications and network interfaces. PPEs can poll efficiently since they are fully integrated in the system's cache coherence domain. ETA also exposes VIA-style [17] user-level communication to applications, reducing data copies and bypassing the operating system for most I/O operations [26, 3]. The second component of our architecture exposes Asynchronous I/O (AIO) based socket semantics to applications [7, 6, 4, 23]. The APIs support asynchronous versions of traditional socket operations. Applications issue operations without being blocked and without first verifying socket descriptor status. Applications receive asynchronous events that signify the completion of previously issued I/O operations. The AIO model allows applications to sustain concurrent I/O operations without the overhead of multiple software threads. The queues of operations and events between the application and the packet processing allow these two functions to independently speed up or slow down with respect to each other, enabling greater concurrency between them. The combined ETA+AIO architecture enables the development of highly concurrent, low-overhead, event-based networking applications. Event-based applications are structured as state machines in which a state transition occurs when the application processes an event (e.g., when the OS or an I/O device notifies the application that an outstanding operation has completed) [51].

This paper makes two primary contributions. First, it provides the first rigorous analysis of this approach based on processor partitioning and asynchronous I/O, and identifies its strengths, weaknesses, and opportunities. We have ported an open source web server, userver [14], to a prototype implementation of the ETA+AIO architecture and compared its performance against performance on a baseline Linux system configured using best practices. Our experimental results show that processor partitioning combined with AIO can improve network processing efficiency over the best Linux configuration, but these benefits are small in magnitude for the prototype system. Our analysis shows that the efficiency advantages of the prototype system result primarily from instruction path length improvements through the use of AIO and a polling architecture, and to a small degree from reduced cache misses. The AIO API allows the application to interface with the PPE directly through shared memory, bypassing operating system overheads associated with issuing socket operations and receiving event notifications. ETA's polling-based dedicated PPE architecture avoids interrupt-related overheads, for example by substituting softirq mechanisms with lighter-weight in-line function calls, and it results in reduced cache misses by eliminating cache conflicts between the application code and the network stack.

Our second main contribution is to provide a detailed description of the changes required and issues that arose when modifying the event-driven userver to take full advantage of the asynchronous socket I/O model. Our experience with extending the userver to use the AIO API illustrates several new aspects that should be factored in when developing AIO-based applications. In particular, the use of an AIO model necessitates additional steps for registering and deregistering (and, therefore, managing) buffer memory, simplifies the application by eliminating the need to handle partially completed operations, exposes new considerations in the sequencing of the application's socket operations, and enables applications to process events and issue operations using application-specific policies to improve resource management and performance. In addition, given the similarity of our AIO API to the Sockets Extension API [4] being developed in the industry, our implementation of AIO over ETA is potentially useful as an example implementation for the Sockets Extension API.

The remainder of this paper is organized as follows. We discuss related work in Section 2. Section 3 provides an overview of the ETA+AIO architecture, followed by a detailed discussion of the ETA implementation in Section 4 and the AIO API implementation in Section 5. Section 6 describes the architecture of our userver application and its modification to use the AIO API. Section 7 discusses the results from our performance evaluation of ETA+AIO. Potential architectural extensions are discussed in Section 8, and we conclude in Section 9.

2. BACKGROUND AND RELATED WORK
A network server application can be implemented using a multithreading programming model, an event-driven programming model, or a combination thereof [39]. The ETA+AIO architecture is well suited to the more efficient event-driven model, which is free of the thread scheduling and context switching overheads inherent to multithreading. With the event-driven model an application consists of a single execution thread per CPU. To avoid pausing the single thread for extended durations, the thread must use either non-blocking or asynchronous I/O (AIO) to perform I/O operations. The standard network sockets interface supports non-blocking I/O operations which execute synchronously with the application and return prematurely if they cannot complete without blocking [49]. Applications use an event notification mechanism, such as Unix select or poll or Linux epoll [32, 38], to detect changes in socket state that indicate which non-blocking operations could make progress if issued.
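For concreteness, the following minimal sketch shows the shape of such a single-threaded event loop using Linux epoll with non-blocking sockets; handle_readable is a hypothetical application callback, not part of any API discussed in this paper.

    /* Minimal sketch of a single-threaded event loop using epoll with
     * non-blocking sockets; handle_readable() is a hypothetical callback. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    extern void handle_readable(int fd);      /* hypothetical request handler */

    static void set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create(MAX_EVENTS);             /* size hint */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[MAX_EVENTS];
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* wait for readiness */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    set_nonblocking(conn);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {
                    /* read()/write() on fd now return EAGAIN instead of blocking,
                     * so the application must track partially completed operations. */
                    handle_readable(fd);
                }
            }
        }
    }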

In contrast to non-blocking I/O, with AIO an application issues non-blocking calls which post operations that execute to completion asynchronously to the application. When the operations eventually complete, the application is notified, usually through an event queue mechanism. AIO thus eliminates the partially completed operations which can burden applications that use non-blocking I/O. There is a growing trend toward enabling applications to use AIO in addition to non-blocking I/O. For example, Lazy AIO has recently been proposed as a general OS mechanism for automatically converting any system call that blocks into an asynchronous call [18]. With Lazy AIO, calls that do not have to block execute to completion synchronously to the application (similarly to non-blocking I/O), whereas calls that would have to block are transformed automatically into asynchronous operations, which require completion event generation. The more specialized POSIX AIO API supports a set of asynchronous read and write operations and uses POSIX real-time signals to provide an event queue for notifying applications of operation completion [7]. Microsoft Windows provides extensive support for AIO, with notification provided through OS event queues called completion ports [23]. The Open Group, which defines Unix standards, has recently approved the Extended Sockets API, which provides extensive AIO capabilities for network sockets [4]. Extended Sockets differs from prior asynchronous sockets mechanisms by providing OS bypass primitives inspired by previous work such as VIA [17]. Specifically, primitives for memory registration and for shared memory work queues and event queues enable applications to post operations and receive completion notification without going through the OS kernel. In this paper, we use an AIO API that provides socket-like primitives and OS bypass capabilities very similar to Extended Sockets. In Section 6 we present our experiences with using this API to implement an open source web server application.
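As a point of reference for the POSIX AIO interface mentioned above, the sketch below posts one asynchronous read using the standard <aio.h> calls and later harvests its completion; it is independent of the ETA+AIO prototype and omits the real-time-signal notification path.

    /* Post one POSIX AIO read and collect its completion status. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    int read_async(int fd, void *buf, size_t len)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0)                /* enqueue; returns immediately */
            return -1;

        while (aio_error(&cb) == EINPROGRESS)
            ;                                 /* the application may do other work here,
                                                 or request a real-time signal instead */

        return (int)aio_return(&cb);          /* bytes read, or -1 on failure */
    }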

To support applications, the underlying system performs TCP/IP packet processing. The high cost of this processing has been a long-standing concern. In 1989, Clark et al. [15] argued that the TCP protocol itself can be implemented efficiently in software, but operations like buffer management, connection structure lookup, and interrupt handling, which would all be needed with any transport layer protocol, are often expensive and limit performance. Kay and Pasquale [29] later reported the high cost of data touching operations (e.g., copies and checksum computations), and Mogul and Ramakrishnan [34] showed that under high traffic loads interrupt processing could easily overwhelm a server. These and related studies [30, 20, 19] have proposed various techniques to improve packet processing efficiency.

Recent OS kernels provide features that can reduce network processing overhead. The “zero-copy” sendfile system call enables data transfer between file descriptors without requiring the data to be copied through user space buffers. Processor affinity for interrupts reduces device driver contention for network hardware and kernel data structures [19]. Software-based interrupt moderation (e.g., NAPI [45]) can help reduce the overhead of interrupt processing under high loads. Many of these features are not useful in the ETA+AIO architecture, which bypasses the kernel for most data transfer operations. However, the ETA+AIO architecture provides similar benefits, avoiding copies on the transmit path like sendfile, and eliminating the overhead of network device interrupts by using a polling interface.

To reduce heavy loads on host CPUs, recent network interface card (NIC) designs provide some degree of offloading, the ability to execute some packet processing operations that previously had to execute on the CPU. In particular, modern mainstream NICs (e.g., Intel PRO/1000 gigabit Ethernet) can perform TCP/IP checksums and segmentation, and provide programmable timers used for interrupt moderation [27]. Because offloading changes the software interface to the NIC, OS device drivers and the protocol stack software must be designed or modified to take advantage of these capabilities. The ETA architecture can benefit as much as conventional systems from NIC offloading capabilities since the ETA architecture makes only slight modifications to a standard TCP/IP stack implementation.

Although offloading selected functions has proved successful, so far there is only limited use of TCP Offload Engine (TOE) NICs, which offload the entire TCP/IP processing for some or all packets to the NIC. TOEs typically rely on firmware and specialized hardware implementations, which are more difficult to upgrade and customize than host software-based implementations. TOEs have not yet demonstrated significant performance benefits [9, 46, 47, 33], possibly because the current scarcity of NIC hardware resources limits scalability, rendering a TOE better suited for supporting a few long-lived TCP connections than for workloads with a large number of short-lived connections (e.g., web server workloads) [33]. In recent work, Freimuth et al. [21] propose a TOE design to minimize bus crossings, data copies, cache misses, and interrupts, which they argue limit the scalability of current systems. Using a purely software-based emulation environment, they demonstrate that their TOE design reduces I/O bus utilization (they plan to evaluate other performance aspects in future work). Hence, they cannot yet assess the overall performance benefits of this approach. In this paper, we take a different evaluation approach. We evaluate a real multiprocessor server implementation of the ETA+AIO architecture and measure network throughput and CPU usage. We present a detailed analysis of these results which quantifies the impact of different elements of the ETA+AIO architecture on microarchitecture performance attributes including instruction execution rates, cache misses, and interrupt processing overhead.

At the microprocessor level, the growing challenges of power density and interconnect delay are limiting further increases in single-core processor performance. Thus, the emerging design trend is to use increasing transistor density to increase the integration of functions on a chip and to implement multi-core processors. As multi-core processors decrease the relative cost of individual cores and provide mechanisms for fast inter-core communication, it becomes more attractive to dedicate some cores to packet processing, as in the ETA+AIO architecture. In addition, this approach is complementary to proposals for placing network interface hardware close to or even integrated onto the processor chip [12], and techniques for intelligently transferring data directly between network hardware and CPU caches [12, 25].

Previous studies of an earlier version of our ETA+AIO prototype indicate that it improves performance for simple microbenchmarks [42, 44]. In contrast, our evaluation vehicle is a full-fledged event-driven web server application driven by a SPECweb99-like workload [48]. The web server application makes full use of sendfile, non-blocking I/O and epoll, and an aggressive connection acceptance strategy.

An earlier (1998) approach that also dedicated CPUs to device processing was the AsyMOS [36] and Piglet [37] project. The mechanisms proposed in that project appear to provide more generality than ETA, allowing for example the ability to download user code to the dedicated device processors. However, the published evaluations include only device driver code on the device processors, while TCP is left for the application CPUs. In addition, only a very preliminary performance analysis is presented, and of course the analysis is based on the relatively unsophisticated hardware and OS features of that time period.

TCP Servers [41], a more recent project similar to ETA, has examined the impact of sequestering CPUs for dedicated TCP processing. The TCP Servers project provides an AIO API that supports zero-copy data transfers [40], but this AIO API is implemented only for transfers between different nodes of a cluster interconnected by a system-area network. On a multiprocessor platform, the implementation of TCP Servers provides neither zero-copy transfers nor AIO for interactions between applications and the dedicated packet processing CPUs. In contrast, our prototype provides these features in a multiprocessor implementation. The TCP Servers study finds that offloading TCP/IP processing can improve server throughput by up to 30%, that noticeable benefits are obtained by using network interface polling rather than interrupts, and that substantial computing resources are required for complete offloading. In addition, the authors note that server performance would further benefit from a scheme for dynamically adjusting the allocation of CPUs for application and network processing.

Unlike both the TCP Servers work and the previous ETA+AIO investigations, in this paper we explicitly ensure that we compare the ETA+AIO prototype against a best-practice configured standard Linux system with interrupt and application processor affinity, network interface interrupt moderation, and NIC checksum offloading. Our findings are significantly different from these previous studies. We observe smaller benefits from partitioning since we compare against a strong baseline system that benefits, for example, from transmit-side copy-avoidance and interrupt moderation. We characterize the remaining benefits of the ETA+AIO architecture, which include higher efficiency of the AIO API compared to sockets and improved caching behavior.

Finally, Jacobson and Felderman [28] very recently proposed techniques for speeding up network processing by using lock-free circular buffers in the network stack instead of traditional linked lists of socket buffer structures, and also by performing TCP processing in a user-level library linked with the application. Since ETA uses the standard TCP stack, the benefits of changing the stack implementation to use circular buffers would likely accrue to both ETA and the standard system. However, running the stack in a user-level library on the same CPU as the application is contrary to ETA's use of processor partitioning, and further study is needed to evaluate the tradeoffs between these opposing approaches. For example, while partitioning reduces contention for CPU resources between network processing and application processing, some functions (e.g., copying received packet data payloads from kernel memory to user-level memory) are probably more cache-friendly when performed on the same CPU as the application.

Figure 1: DUSI/ETA PPE Software Architecture

3. ARCHITECTURE OVERVIEW
Our architecture dedicates one or more processors or hardware threads of a server to perform all TCP/IP protocol processing. Applications interface to the ETA TCP/IP stack through an API called the Direct User Sockets Interface (DUSI). DUSI provides an asynchronous interface similar to the Sockets Extension API [4]. It presents familiar socket semantics to applications with asynchronous versions of listen, accept, send, receive, etc.

Figure 1 depicts the architectural components and partitioning between the “sequestered” CPUs and the “host” CPUs. The Packet Processing Engine (PPE) is a kernel thread running on the sequestered CPU, and takes no part in user applications. The Direct User Adaptation Layer Provider Library (DUSI Library) implements the AIO functions and is linked with applications executing on the host CPUs. Applications and the PPE interact through queues created in cache-coherent shared system memory. Queuing and dequeuing operations and events are extremely fast lock-free operations requiring only a few instructions and no kernel intervention. Finally, a Kernel Agent (KA) is used for those operations for which the PPE requires Linux kernel assistance (for example, pinning data buffers in memory).
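These shared-memory queues behave like single-producer/single-consumer rings, which is why posting and reaping entries takes only a few instructions. The following sketch is our own illustration of that idea, not the prototype's actual queue layout.

    /* Illustrative lock-free single-producer/single-consumer ring in shared
     * memory; a sketch of the idea only, not the prototype's data structure. */
    #include <stdint.h>
    #include <stdbool.h>

    #define RING_SLOTS 256                     /* power of two */

    struct ring_entry { uint64_t opcode, arg; };

    struct spsc_ring {
        volatile uint32_t head;                /* advanced only by the consumer (PPE) */
        volatile uint32_t tail;                /* advanced only by the producer (DUSI library) */
        struct ring_entry slot[RING_SLOTS];
    };

    /* Producer (DUSI library): post an entry with no locks and no system call. */
    static bool ring_post(struct spsc_ring *r, struct ring_entry e)
    {
        uint32_t t = r->tail;
        if (t - r->head == RING_SLOTS)         /* full */
            return false;
        r->slot[t % RING_SLOTS] = e;
        r->tail = t + 1;                       /* publish; a store fence would be needed
                                                  on weakly ordered processors */
        return true;
    }

    /* Consumer (PPE polling loop): reap one entry if present. */
    static bool ring_poll(struct spsc_ring *r, struct ring_entry *out)
    {
        uint32_t h = r->head;
        if (h == r->tail)                      /* empty: the check touches only cached data */
            return false;
        *out = r->slot[h % RING_SLOTS];
        r->head = h + 1;
        return true;
    }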

Applications establish communication with the ETA PPE by creating a User Adaptation Layer (UAL). The UAL houses all queuing structures shared between an application and the PPE. To initiate TCP/IP communication, an application creates a Direct Transport Interface (DTI) within the UAL, analogous to creating a traditional BSD socket. A DTI contains or references all queuing structures associated with a TCP socket [50, 44]. The UAL and DTIs are created using synchronous operations. Subsequently, the application can issue socket operations through the DTI queues and receive completion events asynchronously.

The PPE executes a polling loop that continuously queries NIC descriptors and ETA queues. Since all NIC descriptors and ETA queues are accessed through shared cache-coherent system memory, polling is performed directly from cache and has no effect on the memory subsystem or front-side bus. A memory access occurs only when a memory location has been updated by either a NIC or the DUSI Library [42]. Another benefit of polling is that the PPE avoids incurring interrupts. The PPE uses a modified Ethernet driver that allows it to perform NIC polling, and it incorporates a TCP/IP stack that is a slightly modified version of the mainstream Linux TCP/IP stack.

4. ETA ARCHITECTURE
In the following subsections we describe the queue structures used in the Embedded Transport Acceleration (ETA) architecture and the operation of the PPE thread.

4.1 ETA Queues
Each application has one instance of a UAL which contains (see Figure 1) one Doorbell Queue (DbQ), one or more Event Queues (EvQs), and, for each DTI, a pair of Transmit and Receive descriptor queues (TX and RX). Memory for these queuing structures is pinned and mapped into both the user and kernel virtual address spaces to permit access by both the DUSI Library and the PPE. The DUSI Library posts to the Doorbell Queue to alert the PPE whenever the application initiates an operation on any of its DTIs. This enables the PPE to discover application-initiated operations by polling only the DbQ instead of polling each of the queues of several DTIs. The application creates Event Queues in the UAL for use by the PPE to post events notifying the application of the completion of previously issued operations. To wait for events, applications can either poll the EvQs or request notification through the OS signal mechanism.

A DTI is associated with each instance of a socket (whether it is an unbound or listening socket, or the endpoint of a TCP stream). When an application initiates a send (receive) operation on a DTI, pointers to the application's send (read) buffers are posted by the DUSI Library to the DTI's TX (RX) descriptor queue (the DUSI Library also posts to the DbQ as mentioned above). When an application creates a DTI, it can bind the TX and RX queues of the DTI to any EvQ within the UAL. This flexibility allows the application to combine event notifications for operations associated with several DTIs into a smaller number of EvQs, reducing the number of EvQs it must poll.

In addition to the UAL queues, ETA includes the Administrative Queue (AQ) and the Interrupt Queue (IQ) through which the Kernel Agent and the PPE communicate. To perform synchronous operations, the DUSI library invokes an ioctl system call to schedule the Kernel Agent, which then queues a request to the AQ. The PPE reads requests from the AQ and uses the IQ to notify the Kernel Agent of the completion of AQ requests and to trigger the Kernel Agent to deliver signals requested by the application. The PPE posts an event to the IQ and then generates an Interprocessor Interrupt (IPI) for the host CPU to cause the Kernel Agent to be scheduled.

Figure 2: PPE Polling Loop. In each iteration the PPE performs five steps: (1) Select NIC: select a NIC to service in this loop iteration; (2) Poll AQ: process at most 1 operation from the Admin Queue; (3) Poll DbQ: process at most 1 operation from each Doorbell Queue; (4) Poll RX: poll the NIC driver and process at most N valid RX packets; (5) Poll TX: poll the NIC driver to reclaim TX descriptors.

4.2 PPE Polling Loop
The PPE's goals include ensuring that: NICs do not drop packets, NICs are never idle when there is data to be transmitted, and latency is minimized for all application operations. This requires a balance between how aggressively the PPE takes on new operations posted by applications and how often it services NICs.

Our prototype PPE uses the algorithm shown in Figure 2. Within this loop the TCP/IP stack can be invoked in several places. The PPE contains a Linux 2.4.16 TCP/IP protocol stack that includes minor modifications for it to interface to the PPE queuing mechanisms. Whenever the PPE thread finds that a NIC has received a packet (Step 4), it invokes the inbound TCP/IP stack to service the packet. When the PPE dequeues an application's send operation from the DbQ (Step 3), it invokes the outbound TCP/IP stack to generate and send packets to the network. The PPE can also invoke the outbound TCP/IP stack to resume a blocked transmission that was initiated in some previous execution of Step 3. This occurs in Step 4 when an ACK is received that opens the TCP window, allowing the blocked transmission to resume. A simple scheduling policy that we have found effective is to poll the NICs in a round-robin fashion, one NIC per loop iteration, and to dequeue at most one operation from each DbQ in each iteration.
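Schematically, the loop has the following shape; every identifier below is a hypothetical stand-in for the prototype's internals, which the paper does not show as code.

    /* Schematic of the PPE polling loop of Figure 2; all names are hypothetical. */
    #define NRX 64                              /* assumed RX batch limit per iteration */
    extern int num_nics, num_apps;
    extern void poll_admin_queue(int max_ops);
    extern void poll_doorbell_queue(int app, int max_ops);
    extern void poll_nic_rx(int nic, int max_pkts);
    extern void poll_nic_tx(int nic);

    void ppe_main_loop(void)
    {
        int nic = 0;
        for (;;) {
            nic = (nic + 1) % num_nics;         /* 1. round-robin NIC selection */
            poll_admin_queue(1);                /* 2. at most one Admin Queue operation */
            for (int a = 0; a < num_apps; a++)
                poll_doorbell_queue(a, 1);      /* 3. at most one op per DbQ; sends invoke
                                                      the outbound TCP/IP stack */
            poll_nic_rx(nic, NRX);              /* 4. inbound TCP/IP stack for received
                                                      packets; may resume blocked sends */
            poll_nic_tx(nic);                   /* 5. reclaim TX descriptors */
        }
    }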

When using a single NIC and running one DUSI application on a 2.8 GHz Xeon server, an unproductive iteration through our prototype's PPE loop – where all queues are empty and the NIC has no packets – on average takes only 152 clock cycles (∼55 ns). We have found that when the PPE has ample CPU cycles available to handle the load, the choice of scheduling policy has almost no effect on performance; the PPE achieves a fair balance between application and NIC processing since the queues tend not to have persistent backlogs and the PPE is work-conserving (the cost of polling an empty DbQ or an empty NIC is minimal). Only when the PPE starts to saturate the CPU does it matter how effectively the scheduling policy allocates PPE CPU cycles for polling the NICs versus polling the application DbQs. In this operating region, queues are longer and should be drained at appropriate rates. Allocating too few cycles for handling DbQ entries would cause the DbQs to become unacceptably long, delaying the processing of application requests. Conversely, devoting more cycles to handling DbQ entries would leave fewer cycles for processing network traffic. Thus, it may be beneficial to dynamically adjust resource allocations to optimize the handling of different workloads.

5. ASYNCHRONOUS I/O (AIO)
Our prototype implements an asynchronous interface, DUSI, which allows applications to execute unimpeded while networking operations are being performed. Before a DUSI application can send and receive data over TCP/IP, it must first call the synchronous function Dusi Open Ual to create a User Adaptation Layer and initialize application-specific data structures. In creating a UAL, an application must specify the depth of the Doorbell Queue and the maximum number of DTIs that will be created. As with all synchronous DUSI functions, creating a UAL interacts with the ETA Kernel Agent thread rather than directly with the PPE. This provides access to kernel functions, such as for pinning memory and translating user virtual addresses to kernel virtual addresses, and allows slower operations to be offloaded from the PPE.

An application is responsible for creating the Event Queues on which it receives notification of completed operations. When an application creates a DTI socket, it associates two EvQs with the DTI: a receive EvQ (RX EvQ) and a transmission EvQ (TX EvQ). The RX EvQ is used by the application for receiving notification of when inbound operations have completed (e.g., receives, accepts). The TX EvQ is associated with outbound operations (e.g., sends, shutdowns, application-initiated connections). DUSI does not put constraints on which event queues may be associated with DTIs. For example, it is permissible for a DTI to use a single EvQ as both the RX and TX EvQs, and EvQs may be shared amongst multiple DTIs.

To establish a TCP connection an application must first create the DTI socket to be associated with the stream. If the application is the listener, then the application must also create the listen DTI socket and bind it to a TCP port address. In contrast to BSD sockets, where an accept operation returns the TCP stream socket descriptor upon completion, a DUSI application first creates the DTI stream (child) socket and then passes it, along with the DTI listen socket, as input parameters to the Dusi Accept function. This permits the application to initiate a receive on the TCP stream before an accept operation has been posted. The sequence ensures that an application data buffer will be available to the PPE for any data immediately following the TCP SYN.
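The resulting connection-setup ordering is sketched below. The paper names the DUSI calls but does not give their prototypes, so all types and signatures here are hypothetical illustrations.

    /* Hypothetical DUSI declarations, provided only so the sketch is self-contained. */
    typedef struct dusi_dti dusi_dti_t;
    typedef struct dusi_evq dusi_evq_t;
    extern dusi_dti_t *dusi_create_dti(dusi_evq_t *rx_evq, dusi_evq_t *tx_evq);
    extern int dusi_bind_listen(dusi_dti_t *listen_dti, unsigned short tcp_port);
    extern int dusi_recv(dusi_dti_t *dti, void *buf, unsigned long len);
    extern int dusi_accept(dusi_dti_t *listen_dti, dusi_dti_t *child);

    /* Pre-posting the receive on the child socket before posting the accept
     * guarantees a buffer is already available for data arriving right after
     * the TCP SYN. */
    void setup_listener(dusi_evq_t *rx_evq, dusi_evq_t *tx_evq,
                        void *req_buf, unsigned long buf_len)
    {
        dusi_dti_t *listen_dti = dusi_create_dti(rx_evq, tx_evq);
        dusi_bind_listen(listen_dti, 80);

        dusi_dti_t *child = dusi_create_dti(rx_evq, tx_evq);
        dusi_recv(child, req_buf, buf_len);      /* post the read first ... */
        dusi_accept(listen_dti, child);          /* ... then the accept, naming the
                                                     pre-created child DTI */
    }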

To send data an application calls Dusi Send with a pointer to the source data and the data's length. The DUSI library writes the information to the DTI's TX Descriptor Queue, and then notifies the PPE by posting to the DbQ. Similarly, the application initiates a read by calling Dusi Recv, which results in an event being queued to an RxQ followed by the PPE being notified via the DbQ. Vectored sends and receives, specifying multiple data locations, are supported. By writing to an asynchronous interface an application can have multiple send and receive operations outstanding for a single TCP stream, as well as across multiple TCP streams. This can lead to higher network bandwidth and lower latency for a single application.

Since an application executes in user space and the PPE executes as a kernel thread, they do not share virtual addresses. When an application references a buffer, the application's virtual address needs to be mapped to a virtual address accessible by the PPE. To perform this mapping we require an application to register memory prior to referencing it in send and receive operations. Registering memory returns a handle to be included in calls to DUSI I/O functions. When memory is registered ETA makes certain that all user memory, including anonymous memory, is backed by physical pages. ETA also takes advantage of registration to pin the pages in memory, preparing the pages to be accessed by the NIC.
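Putting registration and the asynchronous send path together, a caller might proceed as in the sketch below; as before, the DUSI names and signatures are our own hypothetical rendering of the calls described in the text.

    /* Hypothetical DUSI declarations for registration and send. */
    typedef struct dusi_dti dusi_dti_t;
    typedef struct dusi_mr  dusi_mr_t;            /* memory-registration handle */
    extern dusi_mr_t *dusi_register_memory(void *addr, unsigned long len);
    extern int dusi_send(dusi_dti_t *dti, dusi_mr_t *mr,
                         void *addr, unsigned long len);

    void send_reply(dusi_dti_t *conn, void *reply, unsigned long len)
    {
        /* Registration pins the pages and maps them for the PPE; a real
         * application would register long-lived buffers once and reuse the handle. */
        dusi_mr_t *mr = dusi_register_memory(reply, len);

        /* Returns immediately; the completion event is posted to the TX EvQ
         * only after the remote TCP has acknowledged all of the data. */
        dusi_send(conn, mr, reply, len);
    }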

Prior research [22, 31, 16] has shown the benefits of copy avoidance in network performance. When sending data, ETA minimizes copies by writing directly from the user buffer to the NIC. This is easily and efficiently done because the application buffer is already mapped to a kernel virtual address and pinned in memory. Since retransmissions may need to occur from that application buffer, ETA does not generate a send completion event until the remote TCP has acknowledged all of the packets. An AIO interface lends itself nicely to sending from an application buffer since the application is never impeded on the send and, in contrast to BSD sockets, there is never a need to indicate partial success.

For inbound data, the PPE must copy data from the host NIC buffers into the application's buffer. If a receive is posted at the time that the PPE polls the NIC, then the PPE is able to immediately copy the data to the application's buffer, which can result in reduced latency for receives. We allow an application to have multiple receives posted to a DTI so that packets can be passed to the application as soon as the PPE detects them.

DUSI provides several mechanisms for applications to determine when an operation has completed. There are very efficient functions for polling EvQs, dequeuing events, and determining the EvQ event depth. At times, an application may have operations outstanding but can do no further work until one of the operations is completed. DUSI provides functions that allow an application to block on a single queue, but DUSI currently relies on the Linux user signal mechanism to block on multiple EvQs. A function is provided for associating a signal with an EvQ (the same signal can be assigned to multiple EvQs). If the application wants to sleep until an event arrives, it arms the corresponding signal(s) and notifies DUSI to raise a signal when any of the specified EvQs transitions from being empty to not empty.
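The resulting wait-for-events pattern, polling first and falling back to a signal-armed sleep only when every EvQ is empty, is sketched below with hypothetical names.

    /* Hypothetical declarations for EvQ polling and signal arming. */
    typedef struct dusi_evq dusi_evq_t;
    typedef struct dusi_event { int type; void *cookie; } dusi_event_t;
    extern int  dusi_evq_dequeue(dusi_evq_t *evq, dusi_event_t *ev);   /* 1 if an event was returned */
    extern void dusi_evq_arm_signal(dusi_evq_t *evq, int signum);
    extern void wait_for_signal(int signum);       /* e.g., built on sigwaitinfo() */

    int next_event(dusi_evq_t **evqs, int nevqs, int signum, dusi_event_t *ev)
    {
        for (;;) {
            for (int i = 0; i < nevqs; i++)
                if (dusi_evq_dequeue(evqs[i], ev))
                    return i;                      /* cheap user-level dequeue, no syscall */
            for (int i = 0; i < nevqs; i++)
                dusi_evq_arm_signal(evqs[i], signum);
            wait_for_signal(signum);               /* kernel assistance is needed only here */
        }
    }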

6. WEB SERVER APPLICATION
To investigate the benefits of using ETA+AIO for network-intensive workloads, we have adapted a web server application to use the DUSI AIO API. A web server is a widely used and representative example of a networking application that exerts high demands on system resources as a result of handling large numbers of simultaneous connections.


Nearly all network server applications follow a similar series of steps to handle each client connection. The server receives a client request on the connection, processes the request, and sends a reply. This continues until the server replies to the final client request and then closes the connection. Several of these steps can block on interaction with a remote host, the network, some other subsystem (e.g., a database), and potentially a disk. Consequently, for high performance the server must handle up to several thousand simultaneous connections [10] and quickly multiplex connections that are ready to be serviced. One approach is to assign each connection to a separate thread of execution which can block [1, 39]. However, the cost of frequently switching execution contexts can limit the application's performance.

An alternative approach which avoids this problem is to use a Single Process Event Driven (SPED) application architecture (terminology from [39]). A SPED architecture uses a single process (per CPU) running an infinite event processing loop. In each loop iteration an event notification mechanism identifies a connection that is ready for additional processing, and then the processing for this connection is performed in a non-blocking fashion [13, 5]. With this approach, a process can service a large number of connections concurrently without being blocked on an I/O operation and without the overheads of multithreading [52]. However, it may still be necessary to use multiple processes per processor to mask unavoidable blocking (e.g., for disk I/O on a page fault [39]).

The userver [14, 24] is an open source web server implemented with the SPED architecture, originally using APIs provided by standard Linux. We have extended the userver implementation with the option to use AIO APIs. The following subsections describe the primary aspects of the userver implementation for these two environments. We focus on how the userver handles requests for static content (file data) as opposed to dynamic content.

6.1 The userver for standard Linux
To realize the SPED model on standard Linux, the userver performs socket I/O operations in non-blocking mode as provided by the traditional BSD sockets interface. The userver provides an option to transmit web page data using the sendfile system call. Linux provides a “zero-copy” sendfile implementation by transmitting data directly from the shared file system buffer cache in kernel memory to the NIC, thus eliminating the copy from user memory to kernel memory incurred with basic socket send operations. The use of sendfile improves performance in the baseline system and allows us to investigate the benefits of the ETA architecture aside from its elimination of memory-to-memory copying on the transmit path. Figure 3(a) shows a state machine diagram illustrating how the userver handles a single connection in the baseline system.
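For comparison with the AIO path described later, the baseline transmit path reduces to the standard Linux zero-copy call shown below.

    /* Baseline transmit: zero-copy sendfile() on a non-blocking socket.  A short
     * count or EAGAIN leaves a partially completed operation that the event loop
     * must resume when epoll later reports the socket writable. */
    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <errno.h>

    ssize_t send_file_chunk(int sock_fd, int file_fd, off_t *offset, size_t remaining)
    {
        ssize_t sent = sendfile(sock_fd, file_fd, offset, remaining);
        if (sent < 0 && errno == EAGAIN)
            return 0;              /* socket buffer full: retry on the next EPOLLOUT */
        return sent;               /* bytes queued from the page cache, no user-space copy */
    }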

Figure 3: State machine for handling a single connection: (a) Baseline, (b) AIO.

6.2 The userver with AIO: Overview
We next describe in some detail our extensions to the userver to use an AIO API. Our description serves as a case study illustrating the use of AIO for a real application. To enable the possible use of a variety of AIO APIs, we modified the userver to use an application-specific AIO API. We then implemented a library that maps the application-specific API to DUSI. The application-specific API is a relatively straightforward adaptation of standard socket calls except that they are asynchronous. For example, some of the calls it supports are aio accept, aio read, aio write, and aio close. Since this mapping is straightforward we do not elaborate on it here but instead focus on how the userver, through the mapping layer, ultimately interacts with the DUSI API.

6.3 The userver with AIO: Startup
At startup the userver calls DUSI to create the user adaptation layer (UAL), EvQs, and multiple child sockets, and to create and bind listen sockets. The userver also registers buffer memory for receiving client HTTP requests. Before entering its event processing loop, the userver posts an accept operation for each child socket to the listen socket's DTI RX descriptor queue in anticipation of client connection attempts. As discussed in Section 5, the userver is able to post a read operation to the child socket before posting the accept operation, ensuring immediate delivery of the client's initial HTTP request by the PPE∗.

6.4 The userver with AIO: Event Processing
To use AIO with the SPED architecture, the userver executes an infinite event processing loop where each iteration performs the following steps: 1) dequeue completion events, 2) use the event information provided in the completion event to determine what processing will be done next and on which connection, and 3) issue the appropriate AIO operations on the connection. Figure 3(b) shows a state machine diagram illustrating how the userver responds to events for a single connection.
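In outline, this event processing loop has roughly the following shape; the event structure and helper names are our own illustration, not the actual userver source.

    /* Sketch of the userver's AIO event loop (illustrative names and types). */
    #include <stddef.h>

    struct completion { int type; int conn_id; size_t nbytes; };
    enum { EV_ACCEPT, EV_READ, EV_WRITE, EV_SHUTDOWN };

    extern int  dequeue_event(struct completion *ev);   /* drains the EvQs by priority */
    extern void post_read(int conn_id);
    extern void post_write(int conn_id, const void *hdr, const void *body);
    extern void post_shutdown(int conn_id);
    extern void replace_connection(int conn_id);        /* re-post accept and read */
    extern void build_reply(int conn_id, size_t req_len,
                            const void **hdr, const void **body);

    void userver_event_loop(void)
    {
        struct completion ev;
        for (;;) {
            if (!dequeue_event(&ev))
                continue;                      /* or block until an EvQ signal fires */
            switch (ev.type) {
            case EV_READ:
                if (ev.nbytes == 0) {          /* zero-byte receive: the peer sent a FIN */
                    post_shutdown(ev.conn_id);
                } else {
                    const void *hdr, *body;
                    post_read(ev.conn_id);     /* pre-post for the next HTTP/1.1 request */
                    build_reply(ev.conn_id, ev.nbytes, &hdr, &body);
                    post_write(ev.conn_id, hdr, body);
                }
                break;
            case EV_SHUTDOWN:
                replace_connection(ev.conn_id);
                break;
            case EV_ACCEPT:                    /* implicit via the first read completion */
            case EV_WRITE:                     /* used mainly for cache bookkeeping */
            default:
                break;
            }
        }
    }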

The userver provides runtime options to define the number of EvQs used and to select the policy for servicing the EvQs. For our experiments we chose to use three EvQs: one for all accept completions, one for receive completions across all DTIs, and one for send and shutdown completions across all DTIs. Using this segmentation enables the userver event processing loop to give highest priority to dequeuing receive completion events. This is beneficial because, as illustrated in Figure 3(b), it is primarily the receive completions that drive the userver's state machine, and other events can often be ignored. For example, the first receive completion dequeued for a DTI can be used as an implicit accept, rendering accept completions superfluous. When the userver does dequeue an accept completion, it can be discarded.

∗Our current implementation relies on the simplifying assumption that each HTTP request fits in one packet and therefore posts only a single receive at a time.

When the userver dequeues a successful receive completion event (for HTTP 1.1 connections), the userver immediately posts a new receive operation on the connection's RX descriptor queue in anticipation of receiving the next request from the client. The userver parses the current HTTP request and generates a reply which consists of an HTTP header and the reply data (in the case of a static request, the contents of the file). To enable the PPE to transmit the requested file data directly from memory, the userver mmaps requested files and registers the mmapped memory regions with DUSI. As part of memory registration, DUSI invokes the kernel agent to pin the memory region into physical memory until it is later unregistered by the application. The userver provides a further optimization by maintaining a cache of recently requested mmapped file handles and addresses, along with a cache of corresponding HTTP response headers†. Registering an mmapped file can cause the application to block because of disk accesses as file data is brought into memory. To avoid this blocking, we would need to extend the ETA+AIO implementation by adding a new, asynchronous version of the DUSI memory registration operation. To send a reply, the userver posts to the connection's TX descriptor queue a send operation that includes a list of pointers to the HTTP response header and the requested file data in the registered memory region. The PPE generates a completion event for this send operation once all of the data is ACKed by the TCP peer. Similar to accept completions, send completions (for non-pipelined HTTP requests) are largely superfluous because subsequent receive completions on a DTI can double as implicit send completion events. (However, send completions are useful for managing the userver file cache use count.)

When using HTTP 1.1, a client sends a FIN on a connection to indicate it will not send any more requests to the server. After the server's PPE receives the FIN, any outstanding or subsequent userver receives will be successfully completed with a length of zero. The userver reacts to a zero-byte receive completion event by posting a shutdown operation to the connection's TX descriptor queue. The PPE processes the shutdown by trying to complete pending transmits for the connection and then sending a FIN to the client. When the shutdown is complete, the PPE queues a shutdown completion event to the userver. The userver responds by posting an accept to the listening socket's DTI RX descriptor queue to allow the server to accept a new connection, thus replacing the connection that was just closed. As described earlier, the userver also posts a receive operation to the child DTI. As a result, the sum of the number of active connections plus the number of new connections that can be accepted remains constant. This is similar to a listen queue in the case of standard BSD sockets.

†The ETA+AIO prototype does not currently coalesce data in a vectored Dusi Send call. Furthermore, the prototype does not implement getsockopt and setsockopt functionality, most notably TCP_CORK and TCP_UNCORK. As a result there is currently no mechanism to force the coalescence of an HTTP header and HTML data into a single network packet. To avoid this artifact in our analysis, we embedded the HTTP header into the URI files. This has a very small positive effect on the baseline experiments in that the header is sent via sendfile rather than from a user buffer.

6.5 The userver with AIO: Summary
In adapting the userver to use asynchronous I/O, we identified several important considerations. One is that some events can implicitly indicate the completion of other events. For example, the successful reception of data could be an indication that an accept or prior send has also completed. With this in mind, different event types can be assigned to separate queues, thus permitting the userver to make application-driven decisions about which events to prioritize. Another important factor is having the AIO interface allow reads to be posted on future connections before they are accepted. This can be used to avoid race conditions where the application is notified of a connection being accepted but is unable to post a read request prior to the data arriving. This is important because if a read is posted the PPE can immediately deliver a notification to the application; otherwise it must delay delivering the event until the read is eventually posted. Finally, an asynchronous I/O interface can be very efficient. However, if there are no events to be processed the application must wait for events, and kernel assistance is required to block the application and later unblock it when an event does arrive. As we discuss in Section 7.3, the overhead of using signals as an event notification system can be costly relative to other costs associated with AIO.

7. PERFORMANCE EVALUATION
In this section, we compare the performance of the userver application on the ETA+AIO prototype system against its performance on well-configured standard Linux systems. We also analyze how various aspects of the ETA+AIO architecture affect its network processing efficiency.

7.1 Experimental Environment
Our experiments study the performance of the userver running on a multiprocessor server, under a SPECweb99-like workload generated by a cluster of client machines running httperf [35]. To stress computational and networking capabilities on the server, we choose a workload that minimizes disk I/O. Clients issue requests to static files with no think time. We use a small SPECweb99 fileset (51 MB, ten directories) that fits entirely in memory and is completely pre-loaded into memory by the userver at startup. For ETA+AIO the userver also registers the mapped files with the PPE, causing them to be pinned in memory. It is important to note that with this workload choice, our results for the ETA+AIO system characterize its performance behavior without factoring in the cost of memory registration or deregistration operations, which could be significant sources of overhead for other workloads.

The server hardware platform is an HP Proliant DL580 with two 2.8 GHz Xeon processors (512 KB L2, 2 MB L3, hyperthreading disabled), 32 GB RAM, and four Intel PRO/1000 gigabit network interfaces (NICs). Eight HP Integrity rx2600 machines act as clients. A switch interconnects all the machines, and the network is configured to support two clients on each server NIC's subnet. Each of these machines is also connected to a control machine on a distinct control network, which is used to control experiments and collect results.

Figure 4: Processing Efficiency and Capacity. (CPU utilization across the two CPUs versus throughput in Mbps, for ETA, Baseline w/ sendfile, and Baseline w/o sendfile.)

Our ETA+AIO experiments dedicate one CPU to the PPE and the second (host) CPU runs one instance of the userver. For our experiments with the standard Linux stack (baseline), we use best-practice configurations to maximize baseline system performance. Specifically, the baseline configurations bind a userver process and the interrupts from the interfaces it listens on to the same CPU [8]. For our two-CPU server, each CPU runs a single userver process which listens on the IP addresses of two of the four network interfaces. We use epoll in the baseline, which we found to be more efficient than poll or select. We configure each NIC to use checksum offloading, and for baseline configurations we also enable the dynamic interrupt throttling capability of each NIC to minimize interrupt overhead. We do not use TCP segmentation offload (TSO) because it is not supported by ETA+AIO's 2.4.16-based networking stack. We would expect TSO to provide comparable benefits for ETA+AIO and the baseline.

7.2 Overall Network Processing Performance

To obtain a high-level characterization of the performance benefits of ETA+AIO, we evaluate the ETA+AIO prototype and baseline configurations in terms of two key performance attributes: network processing efficiency, which we define as the fraction of CPU cycles that are available for additional application processing, and peak throughput. For any fixed level of delivered throughput, configurations with lower CPU utilization have higher network processing efficiency, since they leave a larger fraction of CPU cycles available for additional application processing.
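Stated as a simple relation (the notation is ours), with U(T) the aggregate CPU utilization measured while sustaining throughput T:

    efficiency(T) = 1 - U(T)

so at any fixed T, the configuration with lower utilization leaves more cycles for additional application work.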

Figure 4 plots the CPU utilization (aggregate of the two CPUs, measured with vmstat) required to achieve each level of throughput for ETA+AIO and two baseline configurations, using the userver and the SPECweb99-like workload. For each configuration, the rightmost data point corresponds to the peak throughput achieved with that configuration. This value represents the actual system capacity, free of artifacts like AIO error cases, which never occur in these experiments. We compare ETA+AIO against baseline configurations that use the Linux 2.4.18 kernel with a processor affinity patch, since the 2.4.18 kernel is the basis for the ETA+AIO prototype (we compared a 2.6 kernel's networking to the 2.4.18 kernel and saw only a small performance increase). In one baseline configuration the userver application uses the write system call to transmit data, as is commonly the case when sending dynamic content instead of static content. The write system call incurs the overhead of a memory-to-memory copy from the user data buffer to kernel socket buffers. In the other baseline configuration, the userver application uses the "zero-copy" sendfile system call to transmit data without memory-to-memory copies.

Figure 4 shows that, as expected, network processing efficiency and peak throughput in the baseline substantially improve with the use of sendfile. The baseline without sendfile achieves a peak throughput of 1916 Mbps, while the baseline with sendfile achieves 2832 Mbps. These results reflect the performance capabilities of the kernel and network stack implementations since the peak throughput is limited by the saturated CPU rather than bottlenecks in the memory or I/O subsystems.

The results for ETA+AIO in Figure 4 show that its aggregate CPU utilization is always at least 50%, because in our server one of the two CPUs runs the continuously polling PPE. Thus, at low network throughputs when the PPE CPU is not well-utilized, the network processing efficiency of the ETA+AIO system is considerably worse than the baseline configurations. In contrast, at high network loads the PPE CPU is effectively utilized, enabling ETA+AIO to have higher network processing efficiency than the baseline configurations. In addition, for future systems with a larger number of processor cores, a single PPE will comprise a lower percentage of total CPU cycles than 50%, likely narrowing the region of light load levels where ETA+AIO has worse network processing efficiency.
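As a worked illustration of this floor (the eight-core figure is ours, not a measured configuration): with n_PPE of n cores dedicated to continuously polling PPEs, aggregate utilization can never fall below n_PPE / n, which is 1/2 = 50% on our two-CPU server but only 1/8 = 12.5% for one PPE core on an eight-core system.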

Compared to the baseline without sendfile, ETA+AIO achieves higher network processing efficiency over a wide range of throughput (higher than 1300 Mbps) and also supports a higher peak throughput (2266 Mbps versus 1917 Mbps). These improvements can be largely attributed to the elimination of transmit-side copies in ETA+AIO. This result suggests that applications that are not able to use copy-avoidance primitives such as sendfile can see significant performance benefits with ETA+AIO.

ETA+AIO also achieves higher network processing efficiency than the baseline with sendfile, but over a narrower range of high throughput starting at 2000 Mbps. For example, at 2266 Mbps throughput, ETA+AIO has 9% more cycles available for application processing than the baseline with sendfile (62% versus 71% utilization at the same throughput). The efficiency improvement shows that additional architectural aspects of ETA+AIO besides transmit copy elimination benefit network processing. Despite these improvements, the ETA+AIO prototype achieves lower peak throughput than the baseline with sendfile (2266 Mbps versus 2832 Mbps). Although the ETA+AIO system has a substantial fraction of idle CPU cycles at its peak throughput, the PPE CPU is saturated, preventing additional network processing. However, these cycles can be utilized for other application processing, potentially allowing ETA+AIO to support higher throughputs for an application with a larger component of application computation (e.g., a web server generating dynamic content).

[Figure 5: Peak Throughput and CPU Utilization with an Application Workload. CPU utilization and throughput (Mbps) versus the percentage of files of at most 7500 bytes that are encrypted, for ETA and the baseline.]

To study this effect, we next evaluate performance on a workload that includes the encryption of a subset of the files. We extend the userver to encrypt a configurable percentage of the files that are at most 7500 bytes in size. In these experiments, to maximize efficiency the baseline system uses sendfile to transmit all data, including the dynamically encrypted files.
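The sketch below illustrates the configurable encryption workload; encrypt_buffer and the simple modulo-based selection are our own simplifications rather than the userver's actual code.

    /* Encrypt roughly encrypt_percent% of responses for files of at most
     * 7500 bytes; larger files are never encrypted.  Returns 1 if the caller
     * should send enc_buf instead of the original file data. */
    #include <stddef.h>

    #define ENCRYPT_SIZE_LIMIT 7500

    extern void encrypt_buffer(const char *src, char *dst, size_t len);

    static unsigned long small_file_requests;   /* count of requests for small files */

    int maybe_encrypt(const char *file_data, size_t file_len,
                      char *enc_buf, unsigned encrypt_percent)
    {
        if (file_len > ENCRYPT_SIZE_LIMIT)
            return 0;

        /* Encrypt the first encrypt_percent out of every 100 small-file
         * requests, i.e. roughly encrypt_percent% of them over time. */
        if (small_file_requests++ % 100 < encrypt_percent) {
            encrypt_buffer(file_data, enc_buf, file_len);
            return 1;
        }
        return 0;
    }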

Figure 5 shows the peak throughput and CPU utilization for various levels of encryption. Since our ETA+AIO prototype statically allocates the application and PPE CPUs, it does not compare favorably to the baseline when the CPU demand is biased toward either networking or the userver application. However, the ETA+AIO prototype achieves higher throughput than the baseline for encryption percentages in the range between 25% and 55%. Furthermore, within this range the ETA+AIO prototype often has lower CPU utilization than the baseline.

7.3 Detailed Performance Analysis

Section 7.2 identified workload regions in which the ETA+AIO prototype has a modest advantage over the best baseline configuration in throughput and efficiency. We next present a detailed analysis to identify and quantify the reasons for the observed efficiency advantages.

Whereas our prototype assigns a single CPU for use by the networking stack (PPE), with the 2-CPU baseline system there may be an instance of TCP/IP executing on each processor. Due to cache conflicts and semaphore contention, TCP/IP tends to scale less than linearly on baseline systems with multiple CPUs. Therefore, it is important to determine what portion of the ETA+AIO prototype's benefit is a result of running only a single instance of the networking stack. We conduct experiments to understand this effect, and so that we may factor it out when comparing the baseline to the ETA+AIO architecture.

[Figure 6: Processing Efficiency with Normalized 1-CPU Data. CPU utilization versus throughput (Mbps) for ETA, the 2-CPU baseline with sendfile, the 2-CPU baseline without sendfile, and the normalized 1-CPU baseline with sendfile.]

To evaluate the benefits of performing packet processing on only one CPU, we measure a baseline configuration executing with only one CPU enabled. For this configuration, we use only two NICs rather than four and execute at half the HTTP request rate of the 2-CPU systems. We normalize the 1-CPU results to the ETA+AIO and 2-CPU results by doubling the 1-CPU system's observed throughput. This normalized 1-CPU baseline provides a best-case representation of an idealized 2-CPU baseline which would be free of the effects of multi-processor networking stack contention.
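The normalization is simply (notation ours):

    T_normalized = 2 × T_1-CPU        U_normalized = U_1-CPU

that is, the single CPU serves half the request rate of the two-CPU configurations, so its observed throughput is doubled for comparison while its utilization is compared directly.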

In Figure 6 we compare the normalized 1-CPU baseline (utilizing sendfile) against the real 2-CPU baselines and the ETA+AIO prototype. At the peak throughput for ETA+AIO, CPU utilization roughly matches the 1-CPU baseline. Hence it appears that much of the benefit of our prototype versus the 2-CPU baseline does come from avoiding multiprocessing overheads for packet processing.

Though for this workload the ETA+AIO prototype is no more efficient than the 2-CPU baseline after discounting the multiprocessing overhead, we wish to determine if the two architectures are exhibiting equivalent behavior. For example, features of the ETA+AIO architecture may be providing efficiency benefits that are offset elsewhere. To understand how CPU cycles are distributed in the ETA+AIO architecture versus the baseline architecture, we collect data from the hardware performance counters using oprofile [2]. We choose a target rate of 18800 requests per second, which results in a throughput of 2218 Mbps. This is slightly less than the peak ETA+AIO throughput because peak throughput is slightly degraded when oprofile sampling is enabled. We group the oprofile results into the following bins:


Table 1: ETA+AIO vs Baseline CPU Utilization

    Bin            ETA+AIO    Baseline 2-CPUs    Baseline 1-CPU
    Idle           39.24%     29.30%             40.08%
    Userver         6.89%      5.48%              5.32%
    TCP/IP         25.02%     29.51%             25.54%
    Ethernet       15.01%     19.19%             16.82%
    Kernel other    8.31%     16.40%             12.09%
    PPE other       5.43%     -                  -

Userver: Includes the userver code plus time spent in libraries called by the userver, including libc, the DUSI library, and the AIO mapping layer.

TCP/IP: Contains functions we are able to attribute to TCP/IP processing. In order to ensure that the ETA+AIO and baseline TCP/IP bins include only similar functionality, the baseline TCP/IP bin excludes socket system calls.

Ethernet: Contains the NIC device driver and general device processing that we directly attribute to the NICs (or is dominated by the NICs such that other devices are insignificant). For the baseline, this includes the cost of NIC interrupts; for ETA+AIO, it includes the cost of NIC polling.

Kernel other: A catchall bin for kernel processing that does not fall into the other bins. This includes functions that cannot be attributed to the other bins (e.g., memory allocation, where we cannot reliably determine which calls are a result of TCP/IP, the Ethernet driver, or elsewhere). Socket and file-related functions are included in this bin.

PPE other: Consists of oprofile samples consumed by the PPE that are not directly attributable to TCP/IP or Ethernet processing. This bin includes the polling of the application queues.

Table 1 shows the measured distribution of CPU cycles for the workload without encryption. These results show that the ETA+AIO system has a nearly identical percentage of idle cycles compared to the idealized and normalized 1-CPU system (39.24% versus 40.08%), and a significantly greater percentage of idle cycles than the actual 2-CPU baseline system (29.30%). The ETA+AIO idle cycles are all on the host CPU since the PPE CPU is continuously polling. In addition to the polling of NICs attributed in the Ethernet bin, the PPE is also continuously polling the application's DbQ and the Administrative Queue (Figure 2). The cost of the application polling manifests itself in the PPE other bin. Our examination of the oprofile details indicates that approximately half of the PPE other cycles in Table 1 are consumed by unproductive polling.

The results for the TCP/IP bin in Table 1 show that the ETA+AIO prototype executes TCP/IP more efficiently than the 2-CPU baseline with sendfile. However, the ETA+AIO prototype and the normalized 1-CPU system have very similar TCP/IP performance. As discussed earlier, the ETA+AIO prototype's TCP/IP stack benefits from executing on just a single processor. It is not surprising that the ETA+AIO TCP/IP bin is not noticeably more efficient than that of the 1-CPU system when there is no encryption workload. Since we have not changed the TCP/IP stack, the only expected improvement would be due to better cache behavior. In our measurements of encryption workloads we do observe small but consistent improvements in ETA+AIO TCP/IP efficiency over a 1-CPU system. For example, when we sampled a 15400 request per second workload encrypting 40% of files not greater than 7500 bytes, the ETA+AIO system spent 20.05% of its time in TCP/IP, while the 1-CPU system consumed 21.63% of the CPU cycles.

Table 1 also shows that the baseline spends more time in the kernel (16.40% for the 2-CPU system and 12.09% for the 1-CPU system) than the ETA+AIO system (8.31%). An analysis of the detailed oprofile data shows that most of the baseline's kernel overhead is attributed to sockets and epoll. In addition, this analysis shows that signals and synchronous DUSI events (e.g., creating and destroying DTIs) are a substantial part of the ETA+AIO architecture's kernel overhead. Recall that the application uses signals only when it must wait for a new event to arrive on empty event queues. This requires the userver to make several signal-related system calls. In addition, any DUSI synchronous calls invoke an ioctl system call to schedule the ETA Kernel Agent. When the PPE raises a signal or completes a synchronous event, the Kernel Agent is scheduled via an Interprocessor Interrupt (IPI). Though the AIO interface provides significant benefits over sockets and epoll, the gains are partially offset by the overhead of using signals to wait for events. Our analysis of the oprofile data shows the combined cost of signal-specific kernel functions, the ETA kernel agent, IPIs, and system calls to be 2.70% of the system's CPU cycles, or approximately a third of the ETA+AIO kernel overhead. In Figure 4 we note that as ETA+AIO begins to reach its peak throughput it becomes more efficient. This is because at high loads the userver is more likely to find an event on its queues, and thus less likely to incur the overhead of signals.

Table 1 shows that the ETA+AIO Ethernet component consumes fewer CPU cycles (15.01%) than both the 2-CPU and 1-CPU baseline Ethernet components (19.19% and 16.82%, respectively). This is despite the PPE's continuous polling of NICs, which consumes CPU cycles even when no packets are available. A more detailed analysis of the oprofile data reveals that the ETA+AIO architecture benefits from not having to incur the overhead of hardware interrupts, and from not having to go through the Linux SoftIRQ path. However, these operations are quite efficient on the baseline system, accounting for only slightly more than 2% of the overall system overhead. The low cost of interrupts on the baseline system is likely due to the effective use of NIC interrupt moderation timers.

Since ETA+AIO partitions the networking stack from the application with a goal of reducing resource contention, we use oprofile to evaluate the impact of partitioning on cache behavior. We note that the unified caches of the Xeon processor preclude measurement of separate cache behavior for instructions and data. Table 2 shows the relative number of all Level-3 (L3) and Level-2 (L2) cache misses for ETA+AIO, a 2-CPU baseline system, and a normalized (half-rate) 1-CPU baseline system when the userver application has no additional computation to perform (i.e., executing without encryption enabled). The values are reported relative to the normalized 1-CPU baseline cache miss counts. For example, in Table 2 the 2-CPU baseline has 34% more L3 cache misses per CPU than the 1-CPU baseline. Table 3 presents the relative L3 and L2 cache misses for a workload where the userver does have additional computation to perform (i.e., encrypting 40% of the files not greater than 7500 bytes).

Table 2: ETA+AIO vs Baseline Cache Misses (No Encryption). Equal normalized throughputs.

    Cache        ETA+AIO    Baseline 2-CPUs    Baseline 1-CPU
    L3 Misses    1.43       1.34               1.00
    L2 Misses    1.01       1.16               1.00

Table 3: ETA+AIO vs Baseline Cache Misses (With Encryption). Equal normalized throughputs.

    Cache        ETA+AIO    Baseline 2-CPUs    Baseline 1-CPU
    L3 Misses    1.00       1.38               1.00
    L2 Misses    0.71       1.13               1.00

Table 2 shows that without additional application computation (no encryption) the ETA+AIO system has a higher L3 cache miss rate than both baseline configurations. This can be attributed to the cache-to-cache transfers between the host CPU and the PPE for the DTI interactions. However, Table 3 shows that with an encryption workload the rate of L3 cache misses between ETA+AIO and the 1-CPU baseline system is comparable, and much better than the 2-CPU baseline. An analysis of the oprofile data indicates that without encryption ETA+AIO incurs approximately 1273 L3 cache misses per megabit of data served, while the 1-CPU baseline incurs roughly 890 L3 misses per megabit. For the encryption workload, the ETA+AIO L3 cache miss rate increases slightly to about 1310 L3 misses per megabit, while the 1-CPU baseline system's L3 cache miss rate increases by over 45% to approximately 1310 misses per megabit. The additional encryption workload increases the cache contention between the application and the network stack on baseline systems. When a file is encrypted, the userver application reads the original file and writes encrypted data to an encryption buffer, potentially causing evictions from the L3 cache. For a 1-CPU baseline system with an encryption workload, encrypting the data accounts for almost 17% of the L3 cache misses. For the ETA+AIO system an eviction in the application processor's L3 cache does not affect PPE packet processing, and thus processor partitioning reduces the impact of the additional application computation on the L3 cache miss rate. In contrast to the 1-CPU baseline, ETA+AIO encryption accounts for only roughly 10% of the total L3 cache misses.

Table 2 shows that ETA+AIO's L2 cache behavior is similar to the 1-CPU baseline when there is no encryption workload. However, as we add the encryption application workload (Table 3), the ETA+AIO system's L2 cache behavior becomes substantially better than the baseline. We attribute this to the network stack and application partitioning, as we expect the ETA+AIO architecture to have better locality and a smaller per-CPU instruction working set compared to the baseline.

7.4 Summary of Results

Using a web server workload with a configurable computational component (encryption), our performance evaluation has characterized workload conditions where either the ETA+AIO prototype or the baseline provides superior performance or efficiency. Our detailed performance analysis provides insight into the reasons for this behavior. Our results show that the transmit-side copy avoidance in the prototype results in significantly better processing efficiency and higher peak throughput than a baseline system that uses the standard write system call to transmit replies to web clients. After factoring out the impact of copy avoidance by using a baseline system that uses the sendfile system call, the prototype retains a (smaller) efficiency advantage over the baseline. This efficiency advantage only partially offsets the fact that for packet processing, the baseline system uses two CPUs whereas the prototype uses only one (the PPE CPU). Hence, for workloads with lightweight application processing requirements, the prototype achieves a lower peak throughput than the sendfile baseline. For workloads in which application processing and packet processing better match the CPU allocation, the prototype can achieve marginally higher peak throughput than the baseline.

By comparing a one-CPU version of the sendfile baseline to the prototype, we show that most of the efficiency benefits of the prototype come from avoiding the overheads of multiprocessing for packet processing (since the prototype uses only a single PPE CPU). The detailed breakdown of CPU execution time shows that additional efficiency benefits come from the use of the AIO interface, which has lower overhead than the standard sockets interface. For memory-intensive application workloads, processor partitioning eliminates contention between the application and TCP processing for a shared processor cache, and hence reduces cache misses compared to baseline systems. Finally, the signal mechanism used by the application to wait for events when all the event queues are empty can introduce significant overhead and should be replaced by some lighter-weight mechanism. However, this problem occurs only for intermediate traffic loads, since at low loads the CPU usage is negligible, and at high loads event queues are empty less often.

8. ARCHITECTURAL EXTENSIONS

Our analysis points to many interesting directions to extend the implementation and investigation of the ETA+AIO architecture. To support higher throughputs, the ETA prototype should be extended to scale to multiple PPE CPUs. An important question to answer is how PPE scaling affects the performance benefits of processor partitioning. For example, in our study the bottleneck was the CPU cycles available in the PPE (or the host CPU). By adding PPEs, the bottleneck resource can shift from CPU cycles to memory bandwidth, at which point the cache benefits of partitioning will likely become more important than we have observed in our investigation. With multiple PPEs, additional issues arise such as finding the best assignment of responsibility for the NICs in a system to the various PPEs. In order to keep up with the link speed of a single high-speed (e.g., 10 GigE) NIC, it may be necessary for multiple PPEs to share a single NIC, perhaps by spreading TCP connections for the NIC across the PPEs as proposed by Microsoft's receive-side scaling (RSS). For implementations that support multiple PPEs, the number of PPEs should be dynamically varied in proportion to the workload.
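The RSS-style connection spreading mentioned above can be illustrated with a short sketch; the hash and the PPE count are illustrative (real RSS uses a Toeplitz hash with a configurable key), and none of the names below come from the prototype.

    /* Steer each TCP connection to one of several PPEs by hashing its 4-tuple,
     * so that multiple PPEs can share the load of a single fast NIC. */
    #include <stdint.h>

    #define NUM_PPES 4                   /* example value */

    static uint32_t mix32(uint32_t h, uint32_t v)
    {
        h ^= v;
        h *= 0x9e3779b1u;                /* simple multiplicative mixing step */
        return h ^ (h >> 16);
    }

    /* Returns the index of the PPE responsible for this connection. */
    unsigned pick_ppe(uint32_t src_ip, uint32_t dst_ip,
                      uint16_t src_port, uint16_t dst_port)
    {
        uint32_t h = 0;
        h = mix32(h, src_ip);
        h = mix32(h, dst_ip);
        h = mix32(h, ((uint32_t)src_port << 16) | dst_port);
        return h % NUM_PPES;             /* the same 4-tuple always maps to the same PPE */
    }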

Scheduling issues are also interesting to investigate. AIO enables applications to present large numbers of concurrent operations to the PPE. In our prototype, some scheduling parameters in the PPE loop are chosen through trial-and-error, such as the number of entries to dequeue from a NIC in each loop iteration. Different choices may be optimal for different workloads, and a method for dynamically tuning the scheduling of PPE operations may yield higher throughput. In addition, application performance may depend on the order in which the application processes events.
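A minimal sketch of the kind of PPE loop parameter discussed above is shown below; nic_poll, process_app_queues, and the constants are placeholders for the prototype's internals, and the batch size is an example rather than the value used in ETA.

    /* PPE polling loop with a tunable per-iteration batch size. */
    #define NUM_NICS  4
    #define RX_BATCH  16                 /* max entries taken from one NIC per pass */

    extern int  nic_poll(int nic, int max_entries);  /* returns entries processed */
    extern void process_app_queues(void);            /* doorbell/admin queue work */

    void ppe_main_loop(void)
    {
        for (;;) {
            /* Visit each NIC in turn, taking at most RX_BATCH entries so that
             * one busy interface cannot starve the others or the app queues. */
            for (int nic = 0; nic < NUM_NICS; nic++)
                (void)nic_poll(nic, RX_BATCH);

            /* Then poll the application's descriptor and administrative queues. */
            process_app_queues();
        }
    }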

The benefits of processor partitioning may be greater for emerging multi-core processor architectures than in our prototype, for example due to on-chip mechanisms for fast communication between processor cores. In addition, network interface logic may become integrated onto the same chip as the processor cores, avoiding the costs of traversing off-chip I/O links such as PCI. It is important to understand how this integration would affect the performance behavior of the ETA+AIO and baseline architectures.

9. CONCLUSIONS

The high cost of TCP/IP packet processing and the emergence of high-speed network links such as 10 gigabit Ethernet pose an important challenge for next-generation systems. In this paper we presented the design and evaluation of a prototype implementation of the ETA+AIO architecture, which addresses this problem by dedicating processors to packet processing and by exposing asynchronous I/O (AIO) semantics to applications. We also presented a case study showing how the use of asynchronous network socket primitives affects the design of a real server application.

Compared to previous work, the evaluation presented in this paper provides a more accurate and comprehensive assessment of the potential benefits of approaches that dedicate CPU cores to networking, particularly in light of recent advances in OS and TCP stack implementations. Prior investigations of ETA and a similar architecture (TCP Servers) have the disadvantage that they only compared their approaches against baseline systems that did not use best-practice configurations and did not use OS features that have recently become available to improve performance. In contrast, our evaluation makes full use of current best-practice techniques, including use of the sendfile system call, processor affinity for both processes and interrupts, and NIC hardware support for interrupt moderation and TCP checksumming. In our experience, use of these features for the baseline system significantly improves performance, as reported in recent literature (e.g., [19]). Therefore, the results of our study differ substantially from previous results, showing more nuanced benefits for processor partitioning than have prior investigations, which have claimed improvements in peak throughput on the order of 30% [42, 44, 41].

Our results (summarized in Section 7.4) show that the ETA+AIO prototype provides marginally greater processing efficiency than standard systems by avoiding multiprocessing overheads for packet processing and by using the lower-overhead AIO API instead of sockets. Also, processor partitioning reduces cache misses for memory-intensive applications. For workloads in which application processing and packet processing approximately match the CPU allocations, the ETA+AIO prototype effectively utilizes both the host and PPE CPUs, leading to higher peak throughput using fewer CPU cycles.

In addition to presenting the performance evaluation, we also described in detail the issues involved in using asynchronous socket I/O to implement an event-driven web server application. We described the construction of the finite-state machine which handles each connection. We determined that only a subset of the event types is needed to drive the state machine operation, while other event types can be ignored. We also discussed the challenges of managing memory registration and de-registration, and the scheduling opportunities provided by the ability to use multiple event queues. The userver implementation using the AIO API is open source software.

Although our prototype demonstrated only modest performance advantages over the well-configured baseline system, our investigation motivates a number of additional investigations, as outlined in Section 8, which may lead to greater performance improvements in the future. These investigations include scaling to multiple PPEs, changing PPE and application scheduling policies, and exploiting features of multi-core processors like fast inter-processor communication which may improve the benefits of partitioning.

Acknowledgements

The authors gratefully acknowledge the contributions of Paul Stillwell, Jr., John Wiegert, Phil Cayton, Jeff Gray, and Greg Regnier of Intel Corporation on the design and development of the ETA software prototype. In addition, Tim Brecht gratefully acknowledges the financial support of Hewlett-Packard, the Ontario Research and Development Challenge Fund, and the Natural Sciences and Engineering Research Council of Canada. Furthermore, the authors would like to acknowledge the efforts of the reviewers and our shepherd Alan Cox for their insightful comments that have improved the clarity of the paper.

10. REFERENCES

[1] Apache. URL www.apache.org.

[2] OProfile. URL oprofile.sourceforge.net/news/.

[3] RDMA Consortium. URL www.rdmaconsortium.org.

[4] Sockets API Extensions. URL www.opengroup.org.

[5] Zeus Technology. URL www.zeus.co.uk.

[6] Design notes on asynchronous I/O (aio) for Linux, 2002. URL lse.sourceforge.net/io/aionotes.txt.

[7] The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2003 Edition.

[8] V. Anand and B. Hartner. TCP/IP network stack performance in Linux kernel 2.4 and 2.5. In Proceedings of the Linux Symposium, pages 8–30. Ottawa, Ontario, Canada, July 2003.

[9] B. S. Ang. An evaluation of an attempt at offloading TCP/IP processing onto an i960rn-based NIC. Technical Report HPL-2001-8, HP Labs, Palo Alto, CA, Jan 2001.

[10] G. Banga, J. Mogul, and P. Druschel. A scalable and explicit event delivery mechanism for UNIX. In Proceedings of the 1999 USENIX Annual Technical Conference. Monterey, CA, June 1999.

[11] A. V. Bhatt. Creating a PCI Express interconnect. URL www.pcisig.com/specifications/pciexpress/technical_library/pciexpress_whitepaper.pdf.

[12] N. L. Binkert, L. R. Hsu, A. G. Saidi, R. G. Dreslinski, A. L. Schultz, and S. K. Reinhardt. Performance analysis of system overheads in TCP/IP workloads. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. St. Louis, September 2005.

[13] T. Brecht and M. Ostrowski. Exploring the performance of select-based internet servers. Technical Report HPL-2001-314, HP Labs, November 2001.

[14] T. Brecht, D. Pariag, and L. Gammo. accept()able strategies for improving web server performance. In Proceedings of the 2004 USENIX Annual Technical Conference. Boston, June 2004.

[15] D. Clark, V. Jacobson, J. Romkey, and H. Salwen. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6):23–29, June 1989.

[16] Z. Ditta, G. Parulkar, and J. Cox Jr. The APIC approach to high performance network interface design: Protected and other techniques. In Proceedings of IEEE INFOCOM '97, volume 2, pages 7–11, April 1997.

[17] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, 18(2):66–76, March-April 1998.

[18] K. Elmeleegy, A. Chanda, A. L. Cox, and W. Zwaenepoel. Lazy asynchronous I/O for event-driven servers. In Proceedings of the 2004 USENIX Annual Technical Conference. Boston, June 2004.

[19] A. Foong, J. Fung, and D. Newell. An in-depth analysis of the impact of processor affinity on network performance. In IEEE International Conference on Networks, November 2004.

[20] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier. TCP performance re-visited. In IEEE International Symposium on Performance of Systems and Software, March 2003.

[21] D. Freimuth, E. Hu, J. LaVoie, R. Mraz, E. Nahum, P. Pradhan, and J. Tracey. Server network scalability and TCP offload. In Proceedings of the 2005 USENIX Annual Technical Conference, pages 209–222. Anaheim, April 2005.

[22] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP: TCP/IP at near-gigabit speeds. In Proceedings of the 1999 USENIX Technical Conference (Freenix Track), pages 109–120, June 1999.

[23] J. M. Hart. Win32 System Programming. Addison Wesley, 2nd edition, 2001.

[24] HP Labs. The userver home page, 2005. URL www.hpl.hp.com/research/linux/userver.

[25] R. Huggahalli, R. Iyer, and S. Tetrick. Direct cache access for high bandwidth network I/O. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA'05). Madison, WI, June 2005.

[26] InfiniBand Trade Association. InfiniBand Architecture Specification Volume 1, Release 1.0. October 2000. URL www.infinibandta.org.

[27] Intel Corporation. PCI/PCI-X Family of Gigabit Ethernet Controllers Software Developer's Manual, Revision 2.5. July 2005.

[28] V. Jacobson and B. Felderman. A modest proposal to help speed up and scale up the Linux networking stack. In linux.conf.au, January 2006.

[29] J. Kay and J. Pasquale. The importance of non-data touching processing overheads in TCP/IP. In SIGCOMM, pages 259–268, 1993.

[30] J. Kay and J. Pasquale. Profiling and reducing processing overheads in TCP/IP. IEEE/ACM Transactions on Networking, 4(6):817–828, 1996.

[31] Y. Khalidi and M. Thadani. An efficient zero-copy I/O framework for UNIX. Technical Report SMLI TR95-39, Sun Microsystems Lab, May 1995.

[32] D. Libenzi. Improving (network) I/O performance. URL http://www.xmailserver.org/linux-patches/nio-improve.html.

[33] J. C. Mogul. TCP offload is a dumb idea whose time has come. In 9th Workshop on Hot Topics in Operating Systems (HotOS IX). USENIX, May 2003.

[34] J. C. Mogul and K. K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems, 15(3):217–252, 1997.

[35] D. Mosberger and T. Jin. httperf: A tool for measuring web server performance. In First Workshop on Internet Server Performance, pages 59–67. Madison, WI, June 1998.

[36] S. Muir and J. Smith. AsyMOS – an asymmetric multiprocessor operating system. In IEEE Conference on Open Architectures and Network Programming (OPENARCH), April 1998.

[37] S. Muir and J. Smith. Functional divisions in the Piglet multiprocessor operating system. In ACM SIGOPS European Workshop, September 1998.

[38] S. Nagar, P. Larson, H. Linder, and D. Stevens. epoll scalability web page. URL http://lse.sourceforge.net/epoll/index.html.

[39] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable Web server. In Proceedings of the USENIX 1999 Annual Technical Conference, 1999.

[40] M. Rangarajan, K. Banerjee, J. Yeo, and L. Iftode. MemNet: Efficient offloading of TCP/IP processing using memory-mapped communication. Technical Report DCS-TR-485, Rutgers University, 2002.

[41] M. Rangarajan, A. Bohra, K. Banerjee, E. Carrera, R. Bianchini, L. Iftode, and W. Zwaenepoel. TCP Servers: Offloading TCP processing in Internet servers. Technical Report DCS-TR-481, Rutgers University, Mar 2002.

[42] G. Regnier, D. Minturn, G. McAlpine, V. Saletore, and A. Foong. ETA: Experience with an Intel Xeon processor as a packet processing engine. In Hot Interconnects, August 2003.

[43] G. J. Regnier, S. Makineni, R. Illikkal, R. R. Iyer, D. B. Minturn, R. Huggahalli, D. Newell, L. S. Cline, and A. Foong. TCP onloading for data center servers. IEEE Computer, 37(11):48–58, 2004.

[44] V. A. Saletore, P. M. Stillwell, J. A. Wiegert, P. Cayton, J. Gray, and G. J. Regnier. Efficient direct user level sockets for an Intel Xeon processor based TCP on-load engine. In The Workshop on Communication Architecture for Clusters. Denver, CO, April 2005.

[45] J. H. Salim, R. Olsson, and A. Kuznetsov. Beyond Softnet. In 5th Annual Linux Showcase and Conference, pages 165–172, November 2001.

[46] P. Sarkar, S. Uttamchandani, and K. Voruganti. Storage over IP: when does hardware support help? In 2nd USENIX Conference on File and Storage Technologies (FAST), Mar 2003.

[47] P. Shivam and J. S. Chase. On the elusive benefits of protocol offload. In ACM SIGCOMM Workshop on Network-IO Convergence (NICELI). Germany, August 2003.

[48] Standard Performance Evaluation Corporation. SPECweb99 Benchmark, 1999. URL www.spec.org/osg/web99.

[49] W. Stevens. Unix Network Programming, Volume 1. Addison Wesley, third edition, 2003.

[50] Y. Turner, T. Brecht, G. Regnier, V. Saletore, G. J. Janakiraman, and B. Lynn. Scalable networking for next-generation computing platforms. In Third Annual Workshop on System Area Networks (SAN-3). Madrid, Spain, February 2004.

[51] M. Welsh, D. Culler, and E. Brewer. SEDA: An architecture for well-conditioned, scalable Internet services. In 18th Symposium on Operating Systems Principles (SOSP-18), Oct 2001.

[52] N. Zeldovich, A. Yip, F. Dabek, R. T. Morris, D. Mazieres, and F. Kaashoek. Multiprocessor support for event-driven programs. In Proceedings of the USENIX 2003 Annual Technical Conference, June 2003.