
Computer Science and Artificial Intelligence Laboratory

Technical Report

Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu

MIT-CSAIL-TR-2014-002 January 28, 2014


PIKA: A Network Service for Multikernel Operating Systems

Nathan Z. Beckmann, Charles Gruenwald III, Christopher R. Johnson, Harshad Kasture, Filippo Sironi, Anant Agarwal, M. Frans Kaashoek, and Nickolai Zeldovich

MIT CSAIL

ABSTRACT

PIKA is a network stack designed for multikernel operating systems that target potential future architectures lacking cache-coherent shared memory but supporting message passing. PIKA splits the network stack into several servers that communicate using a low-overhead message passing layer. A key challenge faced by PIKA is the maintenance of shared state, such as a single accept queue and load balance information. PIKA addresses this challenge using a speculative 3-way handshake for connection acceptance, and a new distributed load balancing scheme for spreading connections. A PIKA prototype achieves competitive performance, excellent scalability, and low service times under load imbalance on commodity hardware. Finally, we demonstrate that splitting network stack processing by function across separate cores is a net loss on commodity hardware, and we describe conditions under which it may be advantageous.

1 INTRODUCTION

Recent research has proposed several distributed kernel (i.e., multikernel) architectures for multicore processor operating systems [6, 24]. These multikernel architectures do not share data structures among cores, in order to avoid the scalability bottlenecks that have plagued monolithic kernels [7]. Multikernels are particularly suitable for multicore processors that do not support cache-coherent shared memory [15]. Although the jury is still out on whether future multicore and manycore processors will support cache-coherent shared memory [17], an interesting research question to explore is how to design and implement system services that require shared state but cannot rely on cache-coherent shared memory. This paper explores the design and implementation of one such system service, the network stack, and proposes a novel design that achieves good performance and scalability under a wide range of workloads.

A simple design for a multikernel network stack is to have one instance on one core, shared among all applications (e.g., web servers) [20], but this design does not scale well as the number of cores running application code increases. The other extreme is to have a replica of the network stack on several cores without sharing, but this design may result in low utilization and makes it challenging to share, for instance, a single TCP port for incoming connections. This paper investigates a split design for the network stack [6, 24], called PIKA, consisting of a collection of servers (i.e., user-space processes) that collaborate to provide high-performance, scalable networking. PIKA's instance of a split design has one or more Link Servers that manage the network interface cards (NICs), one or more Transport Servers that provide transport services such as TCP, and one or more Connection Managers that are in charge of connection establishment. Our implementation allows each of these components either to run as separate servers, each on its own core, or to be combined into hybrid servers on fewer cores. We evaluate the performance implications of each of these design alternatives and the conditions under which splitting network stack processing may be advantageous.

The biggest challenge in PIKA's design is the management of shared state among the servers without the convenience of cache-coherent shared memory. Specifically, connection management requires the abstraction of a single accept queue to distribute connections among applications.1 Additionally, maintaining high utilization in the presence of applications with high and low service times requires load balancing of connections among them. The challenges of shared state are most apparent for short-lived connections, which stress the accept queues and the load balancer. PIKA addresses these challenges with a speculative 3-way handshake protocol to accept connections and a novel distributed load balancing scheme.

Moreover, as computation has moved into the data center and multicore processors have encouraged parallel computing, response times have come to be dominated by the long tail of service time distributions [5, 9]. It is increasingly important to design network services not only for throughput, but also for worst-case latency. Load balancing is thus a first-order challenge for PIKA. Our evaluation shows that PIKA maintains low service times across a variety of load imbalances.

The main contributions of this paper are as follows. First, we show that a network stack for a multikernel can achieve good performance and scalability without reliance on cache-coherent shared memory. Second, we present a novel load balancing scheme that allows the network stack to achieve high throughput and low latency in the presence of application delays. Third, we evaluate various design choices for a network stack on a multikernel via a thin message passing layer built on top of the Linux kernel to emulate a multikernel architecture. Using this setup we investigate which design performs best on commodity hardware. Combined, these contributions serve as a template for the design and implementation of network services for multikernels.

1 Throughout this paper, we use application to mean a single instance of an application (i.e., a process).

The rest of the paper is organized as follows: Section 2 provides an overview of related work. Section 3 discusses the design of the system and the various components of the networking service. Section 4 provides implementation details, while Section 5 presents experimental results. Finally, Section 6 concludes the paper.

2 RELATED WORK

PIKA is the first network stack design we are aware of that achieves good performance, scalability, and load balance without reliance on cache-coherent shared memory. In particular, PIKA tackles challenges similar to those addressed in previous research on multikernels and microkernels with split network stack designs, load balancing of connections, new application programming interfaces (APIs) for monolithic kernels, and techniques to avoid scaling problems due to data sharing (e.g., data structures and locks) in monolithic kernels.

Split Network Stack Designs. PIKA adopts a split network stack design from the Barrelfish [6] and fos [24] multikernels. PIKA extends the design with support for load balancing to distribute connections across the different applications, providing fast emulation of a shared accept queue. Although our evaluation uses a modern 10 Gigabit Ethernet (GbE) NIC instead of a 1 GbE NIC (yielding 10× higher peak throughput), we achieve approximately 100× the throughput of Barrelfish and fos. PIKA achieves these gains despite the fact that higher throughput stresses PIKA's design much more than previous research did. Our evaluation further explores the best configuration of components on commodity hardware and how architectural changes could affect this decision.

Hruby et al. [16] also proposed splitting the network stack into several user-space processes that communicate via message passing. However, they focused primarily on reliability, and their design factors the network stack into components that manage different layers and protocols (e.g., network – IP; transport – TCP and UDP; etc.) to provide fault isolation. PIKA, by contrast, focuses on performance and scalability by minimizing the shared state among components.

Load Balancing of Connections. PIKA employs a dynamic, distributed load balancing algorithm to distribute connections to applications: it determines how to distribute connections based on the current state of the system, in a decentralized fashion (i.e., it does not rely on a centralized server for making load balancing decisions [10, 21]).

Under PIKA, it is assumed that connections accepted by an application are processed by that same application until completion, so established connections are never re-balanced between applications. Our load balancing algorithm is therefore similar to web server load balancing, which distributes connections across several web server instances capable of satisfying the request [8]. In those systems, the decision of which web server receives a connection is left to the client, a DNS server, or a separate dispatcher system, which may take factors such as server load and network congestion into account [13]. PIKA's load balancer, by contrast, is integrated with the network stack on the system hosting the application rather than being a discrete component, and therefore does not need to take outside network state into account.

New APIs for Monolithic Kernels. A number of researchers have proposed changes to the interfaces of monolithic kernels to address scaling problems. Soares and Stumm [22] proposed FlexSC, an exception-less system call mechanism for applications to request system services. FlexSC batches synchronous system calls via asynchronous channels (i.e., system call pages); system call kernel threads running on dedicated cores provide system services, thus avoiding trapping from user to kernel mode and addressing cache pollution problems. The synchronous execution model is preserved by a user-space threading library that is binary-compatible with the POSIX threading API and also makes FlexSC transparent to legacy applications, while the asynchronous execution model is supported by libflexsc [23]. PIKA also dedicates multiple cores to network stack processing but assumes hardware supporting message passing instead of cache-coherent shared memory.

Han et al. [12] presented MegaPipe, a new API for scalable network input/output (I/O). MegaPipe employs thread-local accept queues, which are responsible for connection establishment on a particular subset of cores, thus avoiding remote cache accesses. MegaPipe also batches asynchronous I/O requests and responses via synchronous channels with traditional exception-based system calls to favor data cache locality. PIKA attains good performance and scalability while implementing the POSIX socket API, thus supporting unmodified applications. Moreover, PIKA achieves good load balance for both uniform and skewed workloads thanks to connection load balancing, which is not supported by MegaPipe.

Avoiding Scaling Problems in Monolithic Kernels. Recent work by Boyd-Wickizer et al. [7] and Pesterev et al. [18] has focused on fixing scaling problems in the Linux kernel's network stack due to data sharing through cache-coherent shared memory. In contrast, PIKA assumes hardware lacking cache-coherent shared memory and solves these problems by other means; hence, these techniques are not applicable. This paper demonstrates that the benefits of Affinity-Accept [18] are attainable in a multikernel design on hardware lacking cache-coherent shared memory but supporting message passing.

Shalev et al. [20] describe IsoStack, a network stack design that eliminates unnecessary data sharing by offloading network stack processing to a dedicated core, as in the multikernel architecture [6, 24]. A prototype built inside the AIX kernel with support from a user-space library exploits message queues, event-driven operation, and batching to achieve data throughput matching line speed. PIKA harnesses similar techniques but exploits multiple dedicated cores, exploring various parallel configurations, to achieve both high connection throughput and low response latency under uniform and skewed workloads. With the same number of concurrent connections, PIKA achieves comparable performance at low core counts and up to an order of magnitude higher performance at high core counts, which IsoStack cannot exploit.

Figure 1: Split and combined system architectures. In the split configuration (a) the components are divided apart and run separately; the number of instances of each component is configurable. For the combined configuration (b) the key consideration is the number of instances of the network stack to run.

3 DESIGN

The goal of PIKA is to provide a high-performance, scalable network stack for a multikernel across a wide range of workloads. The main challenge is managing shared state to implement POSIX semantics without reliance on cache-coherent shared memory. PIKA load balances connections among servers to maximize throughput and minimize latency. Since multikernels cannot rely on cache-coherent shared memory to share state, PIKA uses message passing to share information and employs a number of techniques to minimize the number of messages sent while maintaining correctness and achieving good performance. This section describes the elements of PIKA's design and the techniques it uses to achieve its goals.

3.1 PIKA Components and Design Choices

Like FlexSC [22], PIKA employs dedicated cores for network stack processing, distinct from application cores. Furthermore, PIKA splits the network stack into several conceptual components based on functionality. The design of PIKA allows each of these components either to run as a stand-alone server or to be combined with other components into a composite server. An instance of a PIKA system consists of one or more servers, each of which may encapsulate one or more of these components (Figures 1a and 1b).

The Link Server (LS) is the component that interacts directly with the NIC. Its responsibilities include configuring the NIC (e.g., managing the flow director), sending packets on the hardware interface in response to requests from other components of PIKA, and receiving inbound packets and transferring them to the appropriate PIKA component. In particular, it performs packet inspection to forward SYN packets to Connection Managers (CMs), as described below. The number of LSs desired depends on the hardware configuration. For instance, it may not be desirable to have more LSs than the number of hardware direct memory access (DMA) rings that the NIC can support, since having multiple LSs share a single ring may affect scalability.

The Transport Server (TS) component is responsible for managing TCP/IP flows, including packet encapsulation/de-encapsulation, Transmission Control Block (TCB) state management, out-of-order packet processing, and re-transmissions. There is no shared state among separate flows, and thus different flows can be managed by different TSs entirely in parallel. The number of TSs can therefore be scaled up trivially to meet the system requirements. Note that other network protocols such as UDP, ICMP, DHCP, DNS, and ARP are also handled by the TS. However, since these protocols either do not demand high throughput or are stateless, they are not considered in any further detail in this paper.

The CM is responsible for TCP/IP connection establishment. The CM encapsulates all the shared state necessary to implement TCP/IP sockets that adhere to the POSIX socket specification. In particular, it maintains the listen queues where incoming connection requests are enqueued and then distributed to listening applications. The CM also decides which TS should manage each established connection. The CM shares state with the application (Subsection 3.2), and also with other CMs in order to correctly implement POSIX sharing semantics (i.e., a single listen socket shared among multiple application processes) and to effectively load balance among the various application processes listening on the same port (Subsection 3.3). Scaling up the number of CMs has associated challenges and trade-offs. A single CM for the entire system can become a bottleneck when high connection throughput is required. On the other hand, deploying multiple CMs presents challenges in sharing state to ensure correctness and high performance (discussed in Subsections 3.2 and 3.3), since the overhead of maintaining shared state between CMs increases as the number of CMs grows.

In addition to the parallelism within each component, another design consideration is when to combine various components into a single composite server. Combining components (Figure 1b) has the benefit of using fewer cores and reducing context switch overhead relative to a split configuration (Figure 1a). Additionally, communication between components within a composite server has much lower overhead (function calls vs. messages). Combining unfortunately leads to a loss of control over the level of parallelism within each component. For example, one might want many TSs in the system, but not as many CMs (because of the high cost of sharing) or LSs (if the NIC only has a few hardware DMA rings). Splitting components into stand-alone servers may have advantageous cache effects (eliminating cache conflicts between components and reducing the working set size), but this potential benefit is offset by the need for a higher number of cores and higher communication costs. Traditional microkernels combine all components into a single server. However, a number of recent multikernels have argued for a split design [6, 24]. The design of PIKA allows us to evaluate the trade-offs involved in each of these choices; we present the results in Subsection 5.4.

3.2 Speculative 3-Way Handshake

In monolithic kernels, the library function accept() is implemented using a shared accept queue within the kernel from which all applications can dequeue incoming connections; these queues are an example of shared state within the network stack that can become a scalability bottleneck [7, 18]. In PIKA, by contrast, each CM maintains its own accept queue, without cache-coherent shared memory to keep the queues coordinated. This presents a challenge in the implementation of accept().

A naïve solution would have the CM enqueue all incoming connections; the application would then need to send a message to the CM on every invocation of accept(). In response, the CM would choose one of the enqueued connections and notify the owning TS to offer the connection to the application. This implementation, while straightforward, performs poorly. First, it adds the latency of up to three messages to every invocation of accept() (App → CM, CM → TS, TS → App). An invocation of select() on a listen socket would similarly incur a round-trip messaging cost to the CM. Each hit in a select loop would thus require five messages, while a miss would need two, leading to prohibitively high communication costs. More importantly, it can lead to very high message traffic as applications poll select().

Our solution is to have the CM speculatively assign incoming connections to applications. Both select() and accept() become purely local operations: the application simply checks for new connection messages asynchronously forwarded by the CM. This scheme presents a challenge, as applications that either do not subsequently call accept() or happen to be busy lead to lost connections (above the TCP layer) or high latency. The CM must therefore keep an accurate list of ready applications so that most assignments are serviced quickly, and additionally provide a recovery mechanism in case the application does not claim the connection within a short time frame.

PIKA solves this challenge by employing a 3-way handshake between the application and the CM/TS. On the first invocation of accept(), or periodically on invocations of select(), the application adds itself to the CM's list of acceptors by sending an ACCEPT-CONNECTION message. An incoming connection is then assigned to one of the acceptors and the TS is notified as such, causing the TS to send a CONNECTION-AVAILABLE message to the application. On the next invocation of accept() or select(), the application claims this new connection by sending a CLAIM-CONNECTION message to the TS, at which point the TS finishes connection establishment and confirms to the application that it successfully claimed the connection. If, however, the application does not respond to the CONNECTION-AVAILABLE message within a set period of time, the TS notifies the CM that the application timed out. In response, the CM removes the application from its list of acceptors and assigns the connection to another acceptor. If the original acceptor subsequently tries to claim the connection, it is notified that it timed out, at which point it must add itself to the CM's list of acceptors again by sending another ACCEPT-CONNECTION message. By using a small timeout, new connections can be efficiently reassigned from busy applications to ready applications.

The speculative 3-way handshake thus allows the CM to maintain an approximate list of acceptors that is lazily updated, greatly reducing the number of messages and improving latency relative to the naïve approach.
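The exchange is easiest to see as a small state machine. The following C sketch (ours, not PIKA's code) traces the application side of the handshake; the message names follow the text above, while the types, states, and the trace in main() are illustrative assumptions.

    #include <stdio.h>

    /* Hedged sketch of the application side of PIKA's speculative
     * 3-way handshake.  Message names follow the paper; everything
     * else is hypothetical and only illustrates the ordering. */

    typedef enum {
        ACCEPT_CONNECTION,      /* app -> CM: join the acceptor list        */
        CONNECTION_AVAILABLE,   /* TS  -> app: a connection was assigned    */
        CLAIM_CONNECTION,       /* app -> TS: claim the assigned connection */
        CLAIM_OK,               /* TS  -> app: establishment finished       */
        CLAIM_TIMED_OUT,        /* TS  -> app: claimed too late, must rejoin */
    } msg_type_t;

    typedef enum { APP_IDLE, APP_ACCEPTOR, APP_OFFERED, APP_CONNECTED } app_state_t;

    /* Application-side transition on each message it sends or receives. */
    static app_state_t app_step(app_state_t s, msg_type_t m)
    {
        switch (m) {
        case ACCEPT_CONNECTION:    return APP_ACCEPTOR;  /* sent on accept()/select() */
        case CONNECTION_AVAILABLE: return APP_OFFERED;   /* CM speculatively assigned */
        case CLAIM_CONNECTION:     return s;             /* sent on next accept()     */
        case CLAIM_OK:             return APP_CONNECTED;
        case CLAIM_TIMED_OUT:      return APP_IDLE;      /* re-send ACCEPT_CONNECTION */
        }
        return s;
    }

    int main(void)
    {
        /* Happy path: the application claims the connection in time. */
        msg_type_t trace[] = { ACCEPT_CONNECTION, CONNECTION_AVAILABLE,
                               CLAIM_CONNECTION, CLAIM_OK };
        app_state_t s = APP_IDLE;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            s = app_step(s, trace[i]);
        printf("final state: %s\n", s == APP_CONNECTED ? "connected" : "not connected");
        return 0;
    }

In the timeout path, the TS would instead deliver CLAIM_TIMED_OUT, dropping the application back to APP_IDLE until it rejoins the acceptor list.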

3.3 Load Balancing

PIKA CMs must distribute incoming connections among application processes, prioritizing those with the lowest service times. Ideally, CMs would know the instantaneous service times of all processes listening on a port and use these to select the best destination. This is unfortunately impossible given the lack of cache-coherent shared memory. Instead, each CM maintains a private accept queue and attempts to model a globally shared queue via message passing. This presents several challenges: (i) maintaining accurate, shared state in a volatile environment; (ii) responding to load imbalance effectively without disrupting the service time of other PIKA servers or applications; and (iii) doing so with minimal communication and synchronization overhead.

PIKA's load balancing scheme, described below, effectively balances incoming connections across all applications while giving priority to local applications (i.e., processes on the private accept queue). It automatically handles the special case in which a CM has no listening applications, acting as a connection distribution mechanism by allowing connections to be "stolen" by other CMs.

The general approach PIKA takes to balancing load is to have CMs periodically (every 50 µs in the current implementation) update each other on the service times of their applications. These updates are called offers and include the service time of the best application to receive connections (i.e., the fastest) along with its address (i.e., a handle), so it can be contacted directly by other CMs without any further coordination.

Figure 2: PIKA load balancing scheme. Local applications compete against offers from other CMs. Offers are only considered when significantly better than local applications. Competition occurs in rounds to scale gracefully.

These service times are measured by keeping a window of the last several connections' service times. Service times are reported to the CMs by the TSs when a connection is closed. Service time is defined as the time between when a connection is offered and when it is closed. This design captures end-to-end performance, which includes a variety of load imbalances – such as an application co-scheduled with another process, applications servicing long-lived connections, applications on temperature-throttled cores, and so on. This metric may need to be changed to match the workload (e.g., for workloads with long-lived, low-utilization connections). Note, however, that load estimation is orthogonal to the remainder of the design.

Armed with these service times, CMs then compare their local applications with offers from other CMs to determine where to send incoming connections (Figure 2). Because service times are updated only periodically, PIKA selects between applications probabilistically to prevent flooding. Each application is weighted by its inverse service time, favoring faster applications. This approach also lets the load balancer sample slow applications to see if they have recovered.
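As an illustration, the C sketch below implements inverse-service-time weighted random selection as just described; the struct layout, units, and use of rand() are our assumptions rather than PIKA's actual code.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hedged sketch: pick an application with probability proportional
     * to 1 / service_time. */

    struct candidate {
        int    id;
        double service_time_us;   /* windowed estimate reported by the TSs */
    };

    static int pick_weighted(const struct candidate *c, int n)
    {
        double weights[n], total = 0.0;
        for (int i = 0; i < n; i++) {
            weights[i] = 1.0 / c[i].service_time_us;  /* faster => heavier */
            total += weights[i];
        }
        double r = total * ((double)rand() / RAND_MAX);
        for (int i = 0; i < n; i++) {
            r -= weights[i];
            if (r <= 0.0)
                return i;   /* slow apps keep a small chance: re-sampling */
        }
        return n - 1;
    }

    int main(void)
    {
        struct candidate apps[] = { {0, 100.0}, {1, 200.0}, {2, 800.0} };
        int hits[3] = {0};
        for (int i = 0; i < 100000; i++)
            hits[pick_weighted(apps, 3)]++;
        printf("picks: %d %d %d\n", hits[0], hits[1], hits[2]);  /* roughly 8:4:1 */
        return 0;
    }

Because the weights are inverses, an application that is 8× slower still receives roughly 1/8 of the traffic of the fastest one, which is what lets the balancer notice recovery.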

In order to encourage locality, which improves latency and reduces the number of messages, PIKA penalizes offers (i.e., non-local applications) by a multiplicative tariff. Offers are further filtered so that only those which are competitive with local applications are available for selection. This is done because there is no need for CMs to sample non-local applications (that will be done by the local CM), so there is never any benefit to choosing a non-local application with a worse service time over a local one. Filtering allows PIKA to use low tariffs and still maintain locality.

PIKA selects between offers and local applications in a tournament of three rounds (Figure 2). This design addresses a scaling pathology that occurs with a single selection round: with a single round, the relative weight of local applications decreases as the number of offers increases, which means that as the number of CMs scales up, locality decreases. By using a tournament, the relative weight of local applications and offers is independent of the number of CMs.

Figure 3: Diagram of the PIKA implementation. All processes run independently, using message passing as the only means of communication and synchronization. A dotted box indicates components running on a single core.

Finally, PIKA employs high- and low-watermarking of filtered offers to prevent thrashing (not shown in Figure 2). Once an offer has passed the filter, its performance relative to local applications must fall below a low-watermark value before it is filtered again.
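Putting the pieces together, the sketch below shows one plausible shape for the tariff, the filter with watermark hysteresis, and the final local-vs-remote round. The tariff value of 2 comes from the evaluation; the watermark fraction and the deterministic pick_best() stub (standing in for the probabilistic selection sketched above) are our simplifications, not PIKA's code.

    #include <stdbool.h>
    #include <stdio.h>

    #define TARIFF        2.0   /* multiplicative penalty on remote offers   */
    #define LOW_WATERMARK 0.8   /* hysteresis: re-filter only well below par */

    struct offer {
        double service_time_us;
        bool   passed_filter;   /* sticky until low watermark (anti-thrash)  */
    };

    /* Stub for the weighted probabilistic pick shown earlier. */
    static int pick_best(const double *t, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (t[i] < t[best]) best = i;
        return best;
    }

    /* Returns -1 to keep the connection local, else the winning offer. */
    static int tournament(const double *local_t, int nlocal,
                          struct offer *offers, int noffers)
    {
        /* Round 1: best local application. */
        double local = local_t[pick_best(local_t, nlocal)];

        /* Filter with watermarking: an offer competes only if its
         * tariffed service time beats the local winner; once admitted
         * it stays until it falls below the low watermark. */
        int winner = -1;
        double winner_t = local;
        for (int i = 0; i < noffers; i++) {
            double penalized = offers[i].service_time_us * TARIFF;
            if (penalized < local)
                offers[i].passed_filter = true;
            else if (penalized > local / LOW_WATERMARK)
                offers[i].passed_filter = false;   /* fell below watermark */
            /* Rounds 2-3: best surviving offer, then offer vs. local. */
            if (offers[i].passed_filter && penalized < winner_t) {
                winner = i;
                winner_t = penalized;
            }
        }
        return winner;
    }

    int main(void)
    {
        double locals[] = { 300.0, 250.0 };
        struct offer remote[] = { { 100.0, false }, { 400.0, false } };
        int w = tournament(locals, 2, remote, 2);
        printf("winner: %s %d\n", w < 0 ? "local" : "offer", w);  /* offer 0 */
        return 0;
    }

Because the filter is evaluated against the local winner rather than all candidates, adding more CMs adds more offers without diluting the local applications' weight, which is the scaling property the tournament is designed for.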

4 IMPLEMENTATION

To facilitate PIKA's development, we implemented a multikernel emulation environment on top of the Linux kernel (Figure 3). This implementation allows PIKA to take advantage of Linux's high-performance NIC drivers, as well as many debugging and performance monitoring tools (such as gdb, perf, and Valgrind [4]). Note, however, that we use Linux only to bootstrap the infrastructure and to map the shared memory pages that are used solely by the user-space message passing library.

All application and server processes communicate and synchronize solely through message passing, without reliance on cache-coherent shared memory or locking primitives above this messaging layer. Hardware DMA rings are mapped to distinct PIKA servers, so no communication occurs between cores in kernel-space (see Subsection 5.1). Process boundaries enforce communication via messages in user-space. Thus the system emulates a multikernel architecture. This design can easily be ported to hardware that does not support cache-coherent shared memory and to other multikernels.

Alternatively, we could have simulated PIKA on a target architecture without cache-coherent shared memory. This approach would prevent any inadvertent use of cache coherence, but it has many problems. Modeling high-performance networking in a simulator is inherently difficult and error-prone. Because simulation runs orders of magnitude slower than native execution, faithfully modeling external network behavior (e.g., at the switch or client machines) under high load is impossible. Moreover, modeling the behavior of modern NICs is arcane and difficult to validate. Even disregarding these problems specific to PIKA, simulations of large multicore processors carry large margins of error and are difficult to validate. We therefore opted for a native implementation on commodity hardware and confirmed that no inadvertent use of cache-coherent shared memory takes place (Subsection 5.1).

Applications communicate with PIKA using the POSIX socket API. We use the GNU/Linux shared object interposition mechanism (through the LD_PRELOAD environment variable) to load a PIKA compatibility library that intercepts POSIX socket functions and translates them into messages to PIKA servers. Applications do not need to be modified or recompiled to use PIKA. This development infrastructure can be used to build a variety of multikernel services. We will release the source code publicly to encourage further development.
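For readers unfamiliar with the mechanism, a minimal interposer looks like the following sketch. Here pika_socket() is a hypothetical stand-in for the compatibility library's real message-based implementation, which we do not reproduce; for illustration it simply falls through to the C library via dlsym().

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/socket.h>

    /* Hypothetical PIKA-side implementation: the real library would
     * allocate a descriptor and message a PIKA server instead. */
    static int pika_socket(int domain, int type, int protocol)
    {
        int (*real_socket)(int, int, int) =
            (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
        return real_socket(domain, type, protocol);
    }

    /* Exported with the same name, so it shadows libc's socket() for
     * any unmodified binary started under LD_PRELOAD. */
    int socket(int domain, int type, int protocol)
    {
        return pika_socket(domain, type, protocol);
    }

Such a library would be built with, e.g., gcc -shared -fPIC -o libpika.so interpose.c -ldl and activated with LD_PRELOAD=./libpika.so ./webserver (file names hypothetical).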

4.1 Driver and Network Stack Processing

The PIKA LS manages the NIC using netmap [19], which gives user-space programs device-independent access to the network interface. netmap maintains a copy of the device state in user-space, as well as a shadow copy in the kernel. The user-space application (i.e., the LS) modifies its view of the device state to enqueue packets for transmission or dequeue packets from a ring. The interface into netmap consists of standard POSIX functions, which synchronize the driver state between the kernel and the application. The LS amortizes the cost of the system calls needed to interact with netmap by enqueuing and dequeuing batches of packets before each synchronization.

We use the version of netmap released in June 2012, which includes a modified ixgbe driver 3.3.8-k2 for the Intel 82599 10 Gigabit Ethernet Controller. We extend netmap to support additional features of our hardware, such as manipulating flow director tables and reading the MAC address.

Flow director is a feature of the Intel 82599EB 10 Gigabit Ethernet Controller that allows software to control which hardware DMA ring an incoming packet is routed to based on attributes of the packet [2]. PIKA uses this feature to route packets to individual hardware DMA rings based on the low 12 bits of the source port number, as suggested by Pesterev et al. [18]. Each hardware DMA ring is managed by a separate LS, ensuring that different connections are spread across the different servers while multiple packets for a given flow are always delivered to the same server.
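The routing computation itself is trivial; a sketch with hypothetical names is below. In PIKA this mapping is pushed into the NIC's flow director tables rather than computed in software on each packet.

    #include <stdint.h>

    /* Hedged sketch: map a TCP source port to a hardware DMA ring
     * using its low 12 bits, as described in the text. */
    static inline unsigned ring_for_port(uint16_t src_port, unsigned nrings)
    {
        return (src_port & 0x0FFF) % nrings;   /* low 12 bits -> ring index */
    }

With 16 rings, for example, source port 0x1234 masks to 0x234 (564) and maps to ring 564 % 16 = 4; all packets of that flow keep hitting the same ring, and hence the same LS.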

The network and transport layer processing code (e.g., IP, TCP, etc.) employed by the PIKA TS is largely adapted from the lightweight TCP/IP (lwIP) stack [3].

4.2 Programming Model

PIKA implements a high-performance message passing library whose fundamental abstraction is a first-in first-out (FIFO) queue. Each queue is unidirectional; two queues are combined to provide a bidirectional communication medium.
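A minimal sketch of such a unidirectional queue is shown below, assuming one producer and one consumer per queue and C11 atomics over coherent shared memory (the commodity-hardware case noted at the end of this subsection); the slot count, message size, and cache-line padding are our choices, not PIKA's.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    #define QUEUE_SLOTS 256               /* power of two: cheap masking */
    #define MSG_BYTES    64               /* one cache line per message  */

    struct fifo {
        _Alignas(64) _Atomic unsigned head;   /* advanced by consumer */
        _Alignas(64) _Atomic unsigned tail;   /* advanced by producer */
        _Alignas(64) char slots[QUEUE_SLOTS][MSG_BYTES];
    };

    static bool fifo_send(struct fifo *q, const void *msg)
    {
        unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
        if (t - h == QUEUE_SLOTS)
            return false;                      /* full: caller retries */
        memcpy(q->slots[t % QUEUE_SLOTS], msg, MSG_BYTES);
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        return true;
    }

    static bool fifo_recv(struct fifo *q, void *msg)
    {
        unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (h == t)
            return false;                      /* empty: poll again    */
        memcpy(msg, q->slots[h % QUEUE_SLOTS], MSG_BYTES);
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
        return true;
    }

On commodity hardware the acquire/release pairs ride on the cache coherence protocol; on hardware with native message passing, the same send/recv interface could instead wrap hardware channels, which is why the abstraction ports cleanly.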

PIKA servers are event-driven programs supported by a cooperative threading library and dispatcher. Each network stack component registers callbacks for unique message types and yields control to the dispatcher loop. Each callback is invoked in a cooperative thread with its own stack, which can yield back to the dispatcher while awaiting a response message from other PIKA servers or applications. The dispatcher provides its own message passing routines, implemented on top of the base message passing library, that handle all remote procedure call (RPC) traffic. The threading library implements microthreads [14] that avoid costly context switching. This model allows PIKA components to be easily combined into a single process, avoiding additional context switches among components.
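The dispatch loop itself can be pictured as a callback table keyed by message type, as in this sketch of ours; the microthread machinery (per-callback stacks, yields on pending RPCs) is elided, and the message header layout is an assumption.

    #include <stddef.h>

    enum { MSG_TYPE_MAX = 64 };

    struct msg {                  /* assumed header layout */
        unsigned type;
        char     payload[56];
    };

    typedef void (*handler_fn)(const struct msg *);
    typedef int  (*poll_fn)(struct msg *out);   /* e.g., a FIFO recv */

    static handler_fn handlers[MSG_TYPE_MAX];

    static void dispatcher_register(unsigned type, handler_fn fn)
    {
        if (type < MSG_TYPE_MAX)
            handlers[type] = fn;
    }

    static void dispatch_loop(poll_fn poll)
    {
        struct msg m;
        for (;;) {
            if (!poll(&m))
                continue;                 /* servers poll their queues   */
            if (m.type < MSG_TYPE_MAX && handlers[m.type])
                handlers[m.type](&m);     /* in PIKA this would run in a
                                             microthread that may yield  */
        }
    }

Combining components into one composite server then amounts to registering several components' handlers in the same table, so cross-component "messages" become function calls through this loop.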

The message passing library runs on commodity hardware and therefore relies on cache-coherent shared memory. The queue abstraction is general enough, however, to be portable to hardware supporting message passing [15].

5 EVALUATION

This section experimentally evaluates the following questions:

• Lacking cache-coherent shared memory, PIKA uses novel mechanisms to provide the illusion of a shared accept queue per port and to balance connections between applications. Does the overhead of maintaining this shared state affect the performance and scalability of PIKA with increasing core count?

• How well does PIKA's load balancing scheme maintain consistent service times? Can it adapt to changing workloads?

• To what extent do the various design trade-offs discussed in Subsection 3.1 impact the best configuration for PIKA, and how is the choice of best configuration affected by hardware parameters?

5.1 Experimental Setup

All of the results in this section are gathered on 4-socket, 40-core PowerEdge 910 servers with four Intel Xeon E7-4850 processors and an Intel 82599EB 10 Gigabit Ethernet Controller. The machines run Ubuntu Server 11.10 x86-64 with Linux kernel 3.0.0 and the "netmap-modified" ixgbe driver 3.3.8-k2. All machines are connected via an Arista 7124S 10 Gigabit Ethernet Switch.

In evaluating PIKA, we focus primarily on connection establishment, since it is the metric that depends the most on effective sharing of state to achieve good performance. Other metrics of network performance, such as bandwidth and small message throughput, are trivially parallelizable and are therefore not very interesting.

Our experimental setup is as follows unless stated otherwise. The application is a small multiprocess web server that serves pages from memory; this isolates the network stack and removes other potential performance bottlenecks. Since we wish to focus on connection establishment, all our experiments use short-lived connections (HTTP requests for a 4 B webpage without keep-alive). The client workload consists of 48 apachebench 2.3 [1] processes spread across four machines, each with 4 concurrent connections. For each result, clients are run long enough to capture steady-state behavior. A single web server can saturate a TS, so all of our configurations use an equal number of web servers and TSs. All processes run on separate cores, with communicating processes placed on the same socket where possible.

Figure 4: Scaling with core count: (a) connections per second, (b) requests per second. Clients perform HTTP GETs on a 4 B webpage with and without keep-alive. Results show that PIKA scales ideally up to 32 cores. Linux is included for reference.

                    DMA misses   Other misses   Total kernel
                    per req.     per req.       misses per req.
    Same socket     0.98         0.27           1.25
    Across sockets  0.84         0.30           1.14

Table 1: Last-level cache misses in the kernel per HTTP request. Four PIKA servers and four web servers are run within a single socket or across different sockets. Cache misses do not increase across sockets; Linux is faithfully modeling a multikernel.

We provide comparisons to the Linux kernel's network stack (Linux from now on) in many experiments to demonstrate that PIKA achieves good performance at low core counts. Due to the lack of other parallel network stacks that do not rely on cache-coherent shared memory, we evaluate PIKA in absolute terms against the network stack provided by Linux. To this end, we also include an "Ideal" configuration that scales Linux's performance at two cores (the smallest system that can run PIKA) perfectly to larger systems. PIKA also compares well with published results for research network stacks employing cache-coherent shared memory [18]; however, we cannot compare directly, as Affinity-Accept [18] does not run on recent versions of the Linux kernel.

Linux with netmap as a multikernel. By leveraging Linux on a cache-coherent shared memory multicore processor, one might worry that PIKA inadvertently benefits from cache coherence. To put this worry to rest, we measure the number of last-level cache (LLC) misses per HTTP request for a configuration of four combined PIKA servers and four web servers. Linux device interrupts for each hardware DMA ring are mapped to the corresponding PIKA server. The configuration is run with all processes on the same socket, and with server/application pairs striped across sockets. If any communication took place via cache-coherent shared memory to handle an HTTP request, then one would see an increase in LLC misses when running across sockets.

Table 1 shows that LLC misses do not increase across sockets; thus, there is no hidden communication among cores.2 The LLC misses that do occur are ≈ 1 miss per request in ixgbe_netmap_rxsync when accessing the hardware DMA ring, plus a small number of conflict or capacity misses elsewhere. Process boundaries ensure no inadvertent sharing occurs in user-space. The combination of Linux and netmap therefore behaves as a multikernel in our experimental setup.

In the remaining experiments, one core is dedicated to handling Linux device interrupts. Our driver uses polling, and device interrupts are only used to notify netmap of packets waiting in the hardware DMA rings. As a consequence, we do not count this core in our evaluation.

Message Passing Performance. Table 2 measures inter-process communication in PIKA. The first experiment shows the limits of the cache coherence hardware, using atomic memory instructions to receive a message and send a reply; a round-trip takes 186 cycles on average. For synchronous processes, this benchmark is the best possible result, because full-featured message passing must allow receivers to read a message and write a response. The remaining experiments evaluate PIKA's message passing and dispatching. With two synchronous processes, PIKA takes ≈ 600 cycles for a round-trip. The additional delay over the baseline is due to allocation overhead and polling misses. However, with many communicating processes, these misses are hidden and PIKA matches the baseline performance at 150 cycles per round-trip. Dispatch adds modest overhead in all cases. Finally, creation, scheduling, and execution of a microthread that does not yield takes 26 cycles.
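The baseline experiment can be pictured as a ping-pong microbenchmark like the following sketch (ours, not the paper's harness): two threads bounce a flag using C11 atomics and the round-trip is timed with the timestamp counter. Thread pinning, which the real measurement would require, is omitted for brevity; compile with -pthread on x86.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <x86intrin.h>

    #define ROUNDS 1000000

    static _Atomic int flag;   /* 0: ping's turn, 1: pong's turn */

    static void *pong(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                                    /* spin for the "message" */
            atomic_store_explicit(&flag, 0, memory_order_release);  /* reply */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, pong, NULL);

        unsigned long long start = __rdtsc();
        for (int i = 0; i < ROUNDS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);  /* send */
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                                    /* await the reply      */
        }
        unsigned long long cycles = __rdtsc() - start;

        pthread_join(t, NULL);
        printf("%.0f cycles per round-trip\n", (double)cycles / ROUNDS);
        return 0;
    }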

2 Note that we disabled hardware prefetching to avoid false sharing.


    Benchmark               Socket   Latency   Cache misses
    Shmem Micro             On       186       2.00
                            Off      964       2.00
    Synchronous Messaging   On       597       2.95
                            Off      1898      3.13
    Synchronous Dispatch    On       706
                            Off      2058
    Concurrent Messaging    On       150       2.00
                            Off      370       2.60
    Concurrent Dispatch     On       378
                            Off      857
    Microthread             −        26

Table 2: Latency (in cycles) and cache misses for various programming model operations. Latency is shown for both on- and off-socket. Cache misses are an average for a single round-trip.

5.2 Scalability

This section evaluates the ability of PIKA to scale to meet demand with increasing core counts. To apply maximum stress to PIKA, we focus on shared state using a connections-per-second microbenchmark. By using short connections and small request sizes, we can isolate PIKA's scaling without any limitations from the application. As past research has shown, the majority of connections in real-world Internet traffic are short-lived TCP connections [11]. We also present separate results for long-lived connections, validating PIKA's performance at the other extreme.

The following results were gathered using the PIKA configuration that we empirically established as having the best performance (Subsections 5.3 and 5.4): the TS, LS, and CM components combined in each server, using the Tournament scheme with a tariff of 2 for load balancing.

Figure 4a reports the throughput obtained by PIKA for short-lived connections at various core counts. The same metric for Linux is included for reference. The data shows that PIKA achieves performance competitive with Linux at small core counts while significantly outperforming it at large core counts, matching the Ideal configuration's scaling. PIKA achieves near-perfect scaling, chiefly because the use of individual accept queues by each CM removes the main source of contention in connection establishment.

The only possible non-architectural source of imperfect scaling is the messages exchanged by the CMs to maintain sharing semantics and to exchange service times. The number of these messages increases with the number of CMs. Table 3 shows the overhead of these updates. From this experiment, we can see that the number of updates per second increases as the number of CMs increases. However, the overhead of updating these local data structures is negligible, as only 0.34% of time is spent on these updates in the worst case.

It is worth noting that Linux scales poorly due to lock contention on the shared accept queue and a lack of connection affinity to cores. Both of these problems have been addressed in Affinity-Accept [18]; however, the proposed solutions have not yet been incorporated into the Linux kernel.

    Number of CMs   Updates/sec   Time updating   Overhead
    2               21            27 µs           0.03%
    4               63            65 µs           0.07%
    8               147           131 µs          0.13%
    16              315           342 µs          0.34%

Table 3: Time spent updating the list of potential remote applications between Connection Managers over a 10-second test. In the worst case, updates incur 0.34% overhead.

Figure 5: Service time percentiles for a connections-per-second focused workload with load imbalance (see text). Results indicate that PIKA adapts to the load imbalance with minimal impact on service time (the knee in the curve occurs past the 99th percentile). Linux is included for reference. Note the log scale on the x-axis.

In terms of absolute performance, the throughput numbers reported in the Affinity-Accept paper at high core counts are roughly similar to the throughput numbers that PIKA achieves at high core counts. PIKA's throughput numbers are much better than previously published numbers for multikernels [6, 24] (i.e., 1.9 M reqs/s vs. 19 K reqs/s). However, any comparison between published results is complicated by differences in experimental setup (e.g., processors, memory, NIC, etc.).

In addition to good scalability on connection establishment, Figure 4b demonstrates that PIKA achieves good performance and scalability on long-lived connections. As with short-lived connections, PIKA achieves throughput that is competitive with Linux at small core counts while outperforming it at large core counts, even besting the Ideal configuration. While the handling of long-lived connections can be trivially parallelized, these results demonstrate that the various techniques PIKA uses to optimize connection establishment do not adversely affect its performance on long-lived connections.

The combination of these experiments demonstrates that PIKA performs well both in absolute terms and in scalability, for both short- and long-lived connections, and thus it is our view that PIKA can achieve good performance across a wide range of workloads.


Figure 6: Web server service times, (a) actual and (b) measured, over the duration of the experiment. Three web servers experience transient delays of different magnitudes (yellow, green, magenta), while one web server remains consistently slow (purple). The four remaining web servers experience no artificial delay.

5.3 Load Balancing

This section evaluates PIKA's behavior under an imbalanced workload and the trade-offs among different load balancing schemes. The key questions are:

• Does PIKA maintain consistent service times under an imbalanced workload?

• Does PIKA accurately and stably predict application service times?

• How well do different load balancing schemes respond to imbalance?

• What is the impact of different load balancing schemes on performance and scalability?

To evaluate different load balancing schemes in the presence of imbalance, we artificially introduce delays in the web server, as shown in Figure 6a. The configuration is a 16-core system with 8 composite PIKA servers (TS, LS, and CM combined) and 8 web servers. The system is driven by a client that maintains a constant request rate of 30 K connections per second.

Given sufficient load, all applications can be kept busy by simply distributing incoming connections round-robin, ignoring application performance. So throughput, while important, does not indicate good load balancing. Worst-case or near-worst-case service time is an important metric for load balancing and can dominate overall response times in large-scale parallel computing [5, 9]. We therefore focus on service time percentiles, particularly at the 95th percentile and above.

Figure 5 plots service time percentiles for PIKA and Linux. PIKA uses the Tournament scheme with a tariff of 2. PIKA has consistent service times, with the knee in the curve occurring past the 99th percentile, indicating that PIKA can respond quickly and effectively to changes in the service times of individual applications, avoiding slow applications and ensuring that overall service time percentiles do not suffer. Comparing the shape of the curve to Linux, PIKA does better at higher percentiles, showing that PIKA's tournament selection outperforms traditional first-come-first-served allocation from a shared-memory accept queue.

Accurate, stable, and responsive service time estimation is critical to the load balancer. Figure 6b shows PIKA's estimates of service time over the duration of the experiment. These are nearly identical to the actual service times (Figure 6a), demonstrating that PIKA has accurate knowledge of the service times of each application. In these experiments, we used a moving median with a window size of 128 and a sample interval of 500 µs. We found that a fairly large range of values worked well; for example, very similar data was generated using a moving average with a window size of 256 and a sample interval of 1 ms.
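A sketch of such an estimator follows, using the window of 128 and the 500 µs refresh interval from the text; the ring-buffer layout and the qsort-based median are our choices.

    #include <stdlib.h>
    #include <string.h>

    #define WINDOW 128

    struct estimator {
        double   samples[WINDOW];   /* ring buffer of recent service times */
        unsigned next, count;
    };

    /* Called by a CM when a TS reports a closed connection. */
    static void estimator_record(struct estimator *e, double service_time_us)
    {
        e->samples[e->next] = service_time_us;
        e->next = (e->next + 1) % WINDOW;
        if (e->count < WINDOW) e->count++;
    }

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Called every sample interval (500 us) to refresh the advertised time. */
    static double estimator_median(const struct estimator *e)
    {
        if (e->count == 0)
            return 0.0;
        double sorted[WINDOW];
        memcpy(sorted, e->samples, e->count * sizeof(double));
        qsort(sorted, e->count, sizeof(double), cmp_double);
        return sorted[e->count / 2];
    }

A median rather than a mean keeps one pathologically long connection from distorting the advertised service time, which matters for the stability the text calls out.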

Next we compare the following load balancing schemes: Tournament 2, Filtered 2, Unfiltered 2, Unfiltered 100, and Naïve, where the number indicates the tariff in all cases. Naïve always gives an incoming connection to a local application. Unfiltered X does not filter out offers from other CMs and performs a single selection round. Filtered X is similar to Unfiltered X, except that all CMs whose offers do not outperform local applications with the tariff are ignored. Tournament X is the complete load balancing scheme described previously, including watermarking except where noted. The important measures are performance, as reflected in scalability, and quality of service, as reflected in service time percentiles.

Service Time. Figure 7 shows how well each scheme responds to the same imbalanced workload. Naïve's service time quickly degrades after the 80th percentile. Because Naïve never re-distributes connections, it always suffers the service time of its local application. Naïve has the steepest growth of any scheme, including Linux. Naïve represents a straightforward use of non-cache-coherent systems (e.g., running many independent Linux instances). Figure 7 shows that in the face of load imbalance, PIKA greatly outperforms systems without coordinated load balancing.

Unfiltered 100 also performs poorly relative to the other schemes, with its service time rising sharply before the 99th percentile and diverging from the other schemes at the 80th percentile. This reflects the high tariff, which heavily penalizes distribution of incoming connections to other CMs. It prevents Unfiltered 100 from responding except in the most extreme cases of load imbalance, and it responds hardly at all to minor imbalances like those at 0.5 s and 3 s (Figure 6a).

Figure 7: Service time percentiles for different load balancing schemes under an imbalanced workload. Tournament 2, Filtered 2, and Unfiltered 2 perform similarly and degrade little until the 99th percentile. Unfiltered 100 and Naïve perform poorly, diverging before the 80th percentile.

Figure 8: Scaling of connections per second with different load balancing schemes under a uniform workload. Naïve performs no load balancing – connections are always sent to the local web server. All load balancing schemes except Unfiltered 2 match Naïve's scaling.

The three remaining schemes – Tournament 2, Filtered 2, and Unfiltered 2 – perform nearly identically on this metric, all showing low growth in service time before the 99th percentile.

Scalability. Figure 8 shows the scalability of the five load balancing schemes under a uniform workload. These results show the impact of load balancing on a balanced workload, where the proper response is not to balance.

As expected, Naïve performs well, as the uniform workload does not require load balancing. Unfiltered 100 also does well, as the large tariff prevents connections from going to other CMs except in rare cases. Unfiltered 2, however, performs very poorly. With a low tariff, connections are often given to other CMs despite the uniform workload. This reduces locality and increases communication overheads, leading to a performance loss of over 2× at 32 cores. Filtering offers that do not beat the tariff addresses this problem, as shown by Filtered 2 and Tournament 2, which match Naïve's scaling and performance.

Figure 9: Throughput in connections/sec vs. time for an imbalanced workload. Throughput is partitioned by which web server received the connection. Ticks along the x-axis indicate changes in connection distribution as PIKA responds to imbalance.

These results, combined with those of the previous section, demonstrate that Filtered 2 and Tournament 2 are the only schemes that achieve both good scalability and good service times under a variety of workloads.

Temporal Analysis. This section evaluates Tournament 2 over the test duration, showing how it responds to load and improves upon Filtered 2 in some cases.

Figure 9 shows the behavior of the complete system during the experiment. System throughput is plotted on the y-axis versus time. Throughput is divided by the number of connections distributed to each application, represented by the same colors as in Figure 6a. Ticks along the x-axis show where imbalance changes. Throughput is maintained at the request rate throughout the experiment, with minor fluctuations when the load changes.

The response to load imbalance is clearly evident in the changing distribution of connections shortly after each tick. At 0.5 s, web server #1 (yellow) slows and receives sharply fewer connections. At 1 s, web server #1 recovers and its throughput regains its previous rate. Similarly, web server #3 (green) experiences a drop and recovery at 1.5 s and 4 s, respectively.

To demonstrate the value of the Tournament 2 load balancing scheme over the other schemes, Figure 10 plots the number of connections assigned locally for web server #7 (magenta), the one that slows slightly at 3 s. For Unfiltered 2 the tariff is inadequate to keep connections local, and it distributes connections to remote connection managers for the duration of the experiment, yielding poor locality. Filtered 2 keeps connections local while the web server is performing on par with the others; however, when the service time drops slightly at 3 s, it greatly prefers remote web servers, much like Unfiltered 2. Tournament 2 performs best: it keeps connections local until 3 s, after which it shows only a slight decrease in the number of local assignments, reflecting the slight increase in service time. Due to multiple selection rounds, it still prefers to assign locally and produces a similar response with two CMs as with sixteen. Lastly, watermarking prevents thrashing of the filter in Tournament 2.

Figure 10: A comparison of different load balancing schemes (No Watermark, Unfiltered 2, Filtered 2, and Tournament 2) for web server #7 (magenta in previous figures), which has a small, transient slowdown starting at 3 s. See text for discussion.

Discussion. PIKA's load balancer responds well to application demand without degrading scalability. With an imbalanced workload, naïve load balancing quickly degrades performance to unacceptable levels. Simple load balancing schemes are sufficient to address this, but filtering is necessary to avoid performance degradation under heavy, uniform load. The basic filtering scheme has undesirable scaling behavior, however, disfavoring locality as the number of CMs increases. This issue is addressed by the tournament selector, which allows a single tariff value to scale up to a large number of CMs.

5.4 Configurations

In any multikernel architecture, an important decision is how to split the system services and how many instances of each component to run. Other multikernels have proposed splitting the network service for both performance and reliability reasons [6, 16, 24]. As discussed in Subsection 3.1, a split design affords better control over parallelism (with the attendant ease of managing shared state) and may also provide cache benefits. This comes at the price of higher communication costs and a larger core budget. We use PIKA's flexible design to evaluate these trade-offs.

We describe configurations in the following notation:

1. XnYn implies that component X is combined with (i.e., in the same process as) component Y, and there are n instances of this combined server.

2. XnYm (m ≤ n) implies that there are n servers implementing component X, and m of those also have component Y attached.

3. Xn + Ym implies that the system has n instances of component X and m instances of component Y (in separate processes).

Figure 11: Comparison of different configurations with TS8 held constant. Additional cores are used as LSs, CMs, or both. The combined configuration outperforms all others, while using fewer cores.

As an example, the split system architecture diagram (Figure 1a) depicts the configuration TS2 + LS2 + CM1, which uses 5 cores, while the combined system architecture diagram (Figure 1b) depicts the configuration TS2LS2CM2, which uses only 2 cores.

Combined is Best. To evaluate the impact of using a split design, we fix the number of TSs and applications at 8 cores each while varying the number of LSs and CMs. This experiment offers the most favorable evaluation for a split design, as the additional cores used by the LSs and CMs are "free". Figure 11 shows the performance in connections per second from holding the number of TSs constant at 8 and adding "free" cores for other components as required by the configuration.

The far-right bar shows the performance of the baseline configuration where all of the components are combined, thus using the fewest cores. Results are divided into four groups, all of which perform worse than the combined configuration:

• Separate LSs. Performance is very poor until LS8. This is unsurprising: since the hardware has sufficient parallelism, lowering the number of LSs makes them the bottleneck. Separate LSs also suffer from additional message passing overhead for each TCP/IP packet.

• Separate CMs. Separating out the CM also degrades performance, although in this case not substantially from the baseline.

• Separate LSs and CMs. Given the prior results, it is expected that adding both LSs and CMs provides no net benefit, although TS8 + LS8 + CM8 is close to the baseline (while using twice as many cores for the full system).

• CMs in a subset of TSs. The subset configuration depicted in the last group shows acute performance degradation at CM1 and CM2. This configuration reduces the overhead incurred from updates between connection managers; however, our results demonstrate that this gain is not enough to improve overall performance.

In conclusion, the baseline (combined) configuration outperforms all split configurations while using fewer cores.



Configuration     TCM        Tbusy      Tmsg       Ttotal
TS1LS1CM1         0.31 µs    3.13 µs    7.12 µs    307 µs
TS1LS1 + CM1      9.24 µs    3.67 µs    7.34 µs    311 µs

Table 4: Time spent establishing a connection with a split or combined CM with a single client. Time spent in connection establishment increases significantly with a split CM.

Configuration     TCM          Tbusy      Tmsg
TS8LS8CM8         0.39 µs      3.26 µs    10.05 µs
TS8LS8 + CM1      17.78 µs     5.17 µs    8.61 µs
TS8LS8 + CM2      14.73 µs     4.96 µs    8.59 µs
TS8LS8 + CM4      13.67 µs     4.99 µs    10.05 µs
TS8LS8CM1         6766.68 µs   5.91 µs    10.91 µs
TS8LS8CM2         39.73 µs     5.92 µs    11.57 µs
TS8LS8CM4         10.46 µs     5.18 µs    12.38 µs

Table 5: Time spent establishing a connection over all CM configurations using a full client workload. Time spent in the CM increases significantly with all split configurations.

Since messaging overhead is low and pipeline parallelism is ample, it is surprising that using more cores does not provide an increase in performance. We now take a more detailed look at where the time is spent and at cache behavior to determine the performance trade-offs.

Performance Breakdown. To better understand the performance trade-offs of running various components separately, we instrumented our code to measure the cost of different phases of handling an incoming connection. In particular, we measure the connection management cost (TCM), defined as the time from when a SYN packet is received by the LS to when the packet is handed off to the TS to manage the connection. We also measure the time spent in the TS doing TCP/IP processing (Tbusy) and in the message passing code (Tmsg). A sketch of this accounting appears below.
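A minimal sketch of such per-connection accounting, assuming POSIX clock_gettime timestamps; the structure and function names are our own illustration, not PIKA's instrumentation API.

    #include <stdint.h>
    #include <time.h>

    static inline uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    struct conn_stats {
        uint64_t t_syn_rx;   /* timestamp: SYN received at the LS      */
        uint64_t t_handoff;  /* timestamp: connection handed to the TS */
        uint64_t busy_ns;    /* accumulated TCP/IP processing (Tbusy)  */
        uint64_t msg_ns;     /* accumulated message passing (Tmsg)     */
    };

    /* TCM: from SYN receipt at the LS until the TS takes over. */
    static uint64_t t_cm(const struct conn_stats *s) {
        return s->t_handoff - s->t_syn_rx;
    }

    /* Each phase brackets its work with timestamps and charges the
     * elapsed time to its bucket, e.g. charge(&s->busy_ns, start). */
    static void charge(uint64_t *bucket, uint64_t start) {
        *bucket += now_ns() - start;
    }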

To estimate the messaging cost incurred by a separate CM, we compare the costs for TS1LS1CM1 and TS1LS1 + CM1, where the application is a single web server and the client is an apachebench process with a concurrency of 1 (Table 4). The significant difference between the two is in TCM (8.93 µs higher with the split CM). This represents the additional communication cost incurred in TS1LS1 + CM1 and accounts for the slightly higher end-to-end latency (Ttotal) observed.

Table 5 presents these costs for all CM configurations, with the full experimental setup shown in Figure 11. Again, the most significant differences are in TCM. The difference in TCM between TS8LS8 + CM1 and TS8LS8CM8 is higher than can be accounted for by the communication cost alone. The balance is the queueing delay experienced when a single CM is overwhelmed doing connection management for 8 TSs. The queueing delay decreases as the number of CMs is increased, though not enough to be an advantage, while still incurring a communication overhead as well as using additional cores.
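To make this decomposition explicit (our own back-of-the-envelope estimate, assuming the 8.93 µs single-client messaging cost from Table 4 carries over to the loaded case): $T_{CM}^{split} \approx T_{CM}^{combined} + T_{comm} + T_{queue}$, so for TS8LS8 + CM1, $T_{queue} \approx 17.78 - 0.39 - 8.93 \approx 8.46\,\mu s$.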

In the subset configuration, TCM is several orders of magnitude higher for TS8LS8CM1. In this case, connection management time is made worse by the fact that the single CM additionally competes with heavily loaded TS/LS components, causing the queue length at the CM to grow without bound. These delays explain the reduced throughput for this configuration observed in Figure 11. The time decreases as the number of CMs is increased, but it remains much higher than in the combined configuration.

These results indicate that splitting the CM into a separate server or running it in a subset of the TSs does not provide any performance gains on our hardware, due to queueing delays and additional communication cost.

Cache Size. One of the main motivations for splitting components is to reduce the cache footprint of PIKA servers so they fit in the private cache levels. To determine the cache impact of splitting the various components of PIKA, we simulate PIKA's behavior for two configurations (TS1LS1CM1 and TS1 + LS1 + CM1) with a fixed workload of 10,000 requests for various cache sizes using Cachegrind [4], whose --I1 and --D1 options let the simulated instruction and data cache sizes be varied. Figure 12 presents the number of cache misses per request incurred for various cache sizes by these two configurations.

The results indicate that the total instruction footprint of all PIKA components fits in a cache of 64 KB (Figure 12a). Thus, there is no advantage in splitting up the components for cache sizes over 64 KB. For smaller caches, the number of cache misses in the two cases is very similar; therefore any possible benefit from splitting components, if present at all, is likely to be small.

For data caches (Figure 12b), the total cache misses incurred by the two configurations are very similar for caches bigger than 16 KB, and there are unlikely to be significant cache benefits from splitting the system. For cache sizes of 16 KB or smaller, the data indicates that there might be some advantageous cache effects to be gained from splitting the components.

In summary, for systems with private caches larger than 64 KB, splitting the components is unlikely to result in better cache characteristics. The combined configuration (TSnLSnCMn) is therefore better on commodity hardware, and likely on most multicore and manycore processors in the foreseeable future. For smaller cache sizes, or for other system services with a larger footprint, the choice is less clear, and the answer depends on the relative importance of the various factors discussed above.

6 CONCLUSION

This paper presented PIKA, a network stack for multikernel operating systems. PIKA maintains a shared accept queue and performs load balancing through explicit message passing, using a speculative 3-way handshake protocol and a novel distributed load balancing scheme for connection acceptance.



[Figure 12 plots: cache misses per request vs. cache size (16 KB–1 MB) for the Combined, Transport Server, Link Server, and Connection Manager components; panel (a) i-cache size (up to ~1000 misses per request), panel (b) d-cache size (up to ~500 misses per request).]

Figure 12: Cache misses vs. private cache size. For private cache sizes larger than 64 KB it is preferable to combine components.

Our prototype achieves good performance, scalability, and load balance on 32 cores for both uniform and skewed workloads. Moreover, an exploration of possible multikernel network stack designs suggests that a combined network stack on each individual core delivers the best performance on current "big" cores, and that splitting the stack across several cores would make sense only on cores with 16 KB private caches. We hope that our techniques and results can inform the design of other system services for multikernels that must maintain shared state on future multicore processors.

ACKNOWLEDGMENTS

This research was supported by Quanta.

REFERENCES

[1] ab - Apache HTTP server benchmarking tool. http://httpd.apache.org/docs/2.2/programs/ab.html.

[2] Intel 82599 10 GbE Controller Datasheet. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf.

[3] lwIP - A Lightweight TCP/IP stack. http://savannah.nongnu.org/projects/lwip/.

[4] Valgrind. http://www.valgrind.org.

[5] Luiz André Barroso. Warehouse-Scale Computing: Entering the Teenage Decade. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 2011.

[6] Andrew Baumann, Paul Barham, Pierre-Évariste Dagand, Timothy L. Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the 22nd Symposium on Operating Systems Principles (SOSP), 2009.

[7] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[8] Valeria Cardellini, Michele Colajanni, and Philip S. Yu. Dynamic Load Balancing on Web-Server Systems. IEEE Internet Computing, 3(3), 1999.

[9] Jeffrey Dean and Luiz André Barroso. The Tail at Scale. Commun. ACM, 56(2), 2013.

[10] Daniel Grosu. Load Balancing in Distributed Systems: A Game Theoretic Approach. PhD thesis, University of Texas at San Antonio, 2003.

[11] Liang Guo and Ibrahim Matta. The War between Mice and Elephants. In Proceedings of the 9th International Conference on Network Protocols (ICNP), 2001.

[12] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (OSDI), 2012.

[13] Nikhil Handigol, Srinivasan Seetharaman, Mario Flajslik, Nick McKeown, and Ramesh Johari. Plug-n-Serve: Load-Balancing Web Traffic using OpenFlow. http://conferences.sigcomm.org/sigcomm/2009/demos/sigcomm-pd-2009-final26.pdf, 2009.

[14] Joe Hoffert and Kenneth Goldman. Microthread - An Object Behavioral Pattern for Managing Object Execution. In Proceedings of the 5th Conference on Pattern Languages of Programs (PLoP), 1998.

[15] Jason Howard, Saurabh Dighe, Sriram R. Vangal, Gregory Ruhl, Nitin Borkar, Shailendra Jain, Vasantha Erraguntla, Michael Konow, Michael Riepen, Matthias Gries, Guido Droege, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek K. De, and Rob F. Van der Wijngaart. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling. J. Solid-State Circuits, 46(1), 2011.

[16] Tomas Hruby, Dirk Vogt, Herbert Bos, and Andrew S. Tanenbaum. Keep Net Working - On a Dependable and Fast Networking Stack. In Proceedings of the Annual International Conference on Dependable Systems and Networks (DSN), 2012.

[17] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why On-Chip Cache Coherence Is Here to Stay. Commun. ACM, 55(7), 2012.

[18] Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. Improving Network Connection Locality on Multicore Systems. In Proceedings of the 7th European Conference on Computer Systems (EuroSys), 2012.

[19] Luigi Rizzo. netmap: a novel framework for fast packet I/O. In Proceedings of the Annual Technical Conference (USENIX ATC), 2012.

[20] Leah Shalev, Julian Satran, Eran Borovik, and Muli Ben-Yehuda. IsoStack – Highly Efficient Network Processing on Dedicated Cores. In Proceedings of the Annual Technical Conference (USENIX ATC), 2010.

[21] Niranjan G. Shivaratri, Phillip Krueger, and Mukesh Singhal. Load Distributing for Locally Distributed Systems. IEEE Computer, 25(12), 1992.

[22] Livio Soares and Michael Stumm. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[23] Livio Soares and Michael Stumm. Exception-Less System Calls for Event-Driven Servers. In Proceedings of the Annual Technical Conference (USENIX ATC), 2011.

[24] David Wentzlaff, Charles Gruenwald III, Nathan Beckmann, Kevin Modzelewski, Adam Belay, Lamia Youseff, Jason Miller, and Anant Agarwal. An Operating System for Multicore and Clouds: Mechanisms and Implementation. In Proceedings of the 1st Symposium on Cloud Computing (SoCC), 2010.

