Parallel Operating System Services in a Factored Operating System

David Wentzlaff, Charles Gruenwald III, Nathan Beckmann, Kevin Modzelewski, Adam Belay, Harshad Kasture, Lamia Youseff, Jason Miller, Anant Agarwal

CSAIL, Massachusetts Institute of Technology

Abstract

Current monolithic operating systems are designed for uniprocessor systems, and their system architecture reflects this. The rise of multicore and cloud computing is drastically changing the tradeoffs in operating system design, replacing a culture of scarce computational resources with one of abundant cores. Efforts to parallelize monolithic kernels have been difficult and only marginally successful. This paper describes fos, a new factored operating system designed from the ground up with scalability as the first-order design constraint. To enable scalability, fos factors the functionality of a traditional OS into separate system services. Each system service is provided by a parallel fleet of communicating, message-passing servers. This paper details how the system architecture of fos, along with developer tools, enables the construction of scalable system services. We describe the design and implementation of several critical parallel operating system services (network stack, file system, page allocation) and compare them with Linux. These comparisons show that fos achieves competitive performance and has better scalability than Linux. Finally, we demonstrate how, with this design, fos can abstract a cloud computer as a single system and scale across a cloud.

1 Introduction

fos [30] is a new factored operating system (OS) designed for future multicores and cloud computers. These emerging architectures present new challenges to OS design, particularly in the management of a previously unprecedented number of computational cores.

Trends in multicore architectures point to an ever-increasing number of cores available on a single chip. Moore's law predicts an exponential increase in integrated circuit density; in the past, this increase in circuit density has translated into higher single-stream performance, but recently single-stream performance has plateaued and industry has turned to adding cores to increase processor performance. In only a few years, multicores have gone from esoteric to commonplace: 12-core single-chip offerings are available from major vendors [3] with research prototypes showing many more cores on the horizon [25, 19], and 64-core chips are available from embedded vendors [29] with 100-core chips available this year [2].

Given exponential scaling, it will not be long before chips with hundreds of cores are standard, with thousands of cores following close behind. Recent research, though, has demonstrated problems with scaling monolithic OS designs. In monolithic OSs, OS code executes in the kernel on the same core that makes an OS service request. Corey [10] showed that this led to significant performance degradation for important applications compared to intelligent, application-level management of system services. Our prior work [28] also showed significant cache pollution caused by running OS code on the same core as the application. This becomes an even greater problem if multicore trends lead to smaller per-core caches. We also showed severe scalability problems with OS microbenchmarks, even when scaling only up to 16 cores. Efforts to improve the scalability of these services have been difficult and only marginally successful.

A similar, independent trend can be seen in the growth of cloud computing. Rather than consolidating a large number of cores on a single chip, cloud computing consolidates cores within a data center. Current Infrastructure as a Service (IaaS) cloud management solutions require a cloud computer user to manage many virtual machines (VMs). Unfortunately, this presents a fractured and non-uniform view of resources to the programmer. For example, the user needs to manage communication differently depending on whether the communication is within a VM or between VMs. Also, the user of an IaaS system has to worry not only about constructing their application, but also about system concerns such as configuring and managing communicating operating systems. In fos, we put a cloud under a single system image OS, thereby improving manageability and resource utilization. There is much commonality between constructing OSs for clouds and multicores, such as large-scale resource management, heterogeneity, and a possible lack of widespread shared memory. These similarities allow fos to be designed for both multicore and cloud computers.

The primary question facing OS designers over the next ten years will be: What is the correct design of OS services that will scale up to hundreds or thousands of cores? We argue that the structure of monolithic OSs is fundamentally limited in how they can address this problem. In contrast, the structure of fos brings scalability concerns to the forefront by decomposing an OS into services, and then parallelizing within each service. To facilitate the conscious consideration of scalability, fos system services are moved into userspace and connected via messaging. In fos, a set of cooperating system servers which implement a single system service is called a fleet. This paper describes the implementation of several fleets in fos and the system architecture that makes it possible.

Monolithic operating systems are designed assuming that computation is the limiting resource. This assumption is proving incorrect as the advent of multicores and clouds provides abundant computational resources. This abundance presents opportunities for the OS to allocate computational resources to auxiliary purposes that accelerate application execution but do not run the application itself. fos's architecture leverages this insight by factoring OS code out of the kernel and running it on cores separate from application code. Once moved out of the kernel, each fos service is implemented as a parallel, cooperating set of servers.

As programmer effort shifts from maximizing per-core performance to producing parallel systems, the OS must shift to assist the programmer. Another goal of fos is to provide a system architecture that enables and encourages the design of parallel, scalable OS services. In this paper, we make these contributions:

• We present the system architecture of fos, focused on how the architecture encourages the design of parallel, scalable OS services.

• We present the parallel design of several critical OS services in fos. Additionally, we present the first scaling numbers for these services, showing that fos is competitive with Linux when considering application parallelism.

• We present a performance breakdown of fos's messaging abstraction, which presents a mailbox-based interface that transparently uses different message transports based on mailbox location.

This paper is organized as follows. Section 2 presents fos's architecture. Section 3 presents concepts and tools which enable building scalable OS services. Section 4 presents the design of fos parallel OS services. Section 5 presents scalability results for fos services. Section 6 relates fos to previous work, and Section 7 concludes.

Figure 1: A high-level diagram of the fos servers layout in a multicore machine. The figure illustrates that each OS service fleet consists of several servers which are assigned to different cores. Shown are file system servers, a block device driver, process management servers, name servers, page allocator servers, a network interface, and applications, each running on top of the microkernel and communicating through libfos.

2 System Architecture

Current OSs were designed in an era when computation was a limited resource. With the expected exponential increase in the number of cores, the landscape has fundamentally changed. The question is no longer how to cope with limited resources, but rather how to make the most of the abundant computation available. fos is designed with this in mind, and takes scalability and adaptability as the first-order design constraints. The goal of fos is to design system services that scale from a few to thousands of cores. fos's key design principles are:

• Space multiplexing replaces time multiplexing. Due to the growing bounty of cores, there will soon be a time when the number of cores in the system exceeds the number of fruitful active processes. At this point, scheduling becomes a layout problem, not a time multiplexing problem.

• The OS is factored into function-specific services, where each is implemented as a parallel, distributed service. The OS runs independent of the application on separate cores. Applications interact with servers through a library layer that translates requests into messages.

• The OS adapts resource utilization to the elasticity of system needs. The utilization of active services is measured, and highly loaded services are provisioned more cores.

Figure 1 shows the high-level architecture of fos. A small microkernel runs on every core. Operating system services and applications run on distinct cores. Applications can use shared memory, but OS services communicate via message passing. A library layer (libfos) translates traditional syscalls into messages to fos services. A naming service is used to find a message's destination server. The naming service is maintained by a distributed set (fleet) of naming servers. Finally, fos can run on top of a hypervisor and seamlessly span multiple machines, thereby providing a single system image across a cloud computer. The following subsections describe the architecture of fos, and the next section covers how fos supports building parallel services.

2.1 Microkernel

In order to factor OS services into separate processes, fos uses a minimal microkernel design. The microkernel provides only: (i) a protected messaging layer, (ii) a name cache to accelerate message delivery, (iii) rudimentary time multiplexing of cores, and (iv) an application programming interface (API) to allow the modification of address spaces and thread creation. All other OS functionality and applications execute in user space. OS system services execute in userspace, but possess capabilities to communicate with other system services which user applications do not.

Capabilities are extensively used to restrict access into the protected microkernel. For instance, the memory modification API allows a process on one core to modify the memory and address space on another core if appropriate capabilities are held. This approach allows fos to move significant memory management and scheduling logic into fos system services running in userland. By moving functionality out of the kernel and into userland OS system services, core OS functionality can run on distinct cores from application code and parallelization of system services can be done more effectively. Capabilities also determine who is allowed to send messages to whom.

Figure 2: fos messaging over the kernel (2a) and URPC (2b). Each core runs a microkernel providing kernel messaging and a kernel naming cache; an application or fos server above it links against libc and libfos, which provide a POSIX library, userland messaging (URPC), and a user-space naming cache (UNC).

2.2 Messaging

fos provides interprocess communication through a mailbox-based message-passing abstraction. The API allows processes to create mailboxes to receive messages, and to associate a mailbox with a name and capability. This design provides several advantages for a scalable OS on multicores and in the cloud. Messaging can be implemented via a variety of underlying mechanisms: shared memory, hardware message passing, TCP/IP, etc. This allows fos to run on a variety of architectures and environments.
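To make the mailbox abstraction concrete, the toy model below sketches one possible shape for such an API in C. The names (mbox_create, mbox_send, mbox_recv), the capability check, and the in-process delivery are illustrative assumptions, not fos's actual interface; in fos the send path would be a microkernel trap, a URPC channel, or an inter-machine proxy, selected transparently by libfos.

/* Toy mailbox model: names, capabilities, and polling receive.
 * The "transport" here is an in-process queue; the caller's code
 * would not change if the delivery mechanism did. */
#include <stdio.h>
#include <string.h>

#define MAX_MBOX 8
#define MAX_MSG  16
#define MSG_LEN  64

typedef struct {
    const char *name;                 /* hierarchical name, e.g. "/sys/fs/0" */
    unsigned    cap;                  /* capability a sender must present */
    char        q[MAX_MSG][MSG_LEN];  /* the mailbox's message queue */
    int         head, tail;
} mbox_t;

static mbox_t boxes[MAX_MBOX];
static int    nboxes;

/* Create a mailbox and register it under a name, guarded by a capability. */
mbox_t *mbox_create(const char *name, unsigned cap) {
    mbox_t *mb = &boxes[nboxes++];
    mb->name = name;
    mb->cap  = cap;
    return mb;
}

/* Send by name: the sender never sees which transport delivers the message. */
int mbox_send(const char *dest, unsigned cap, const char *msg) {
    for (int i = 0; i < nboxes; i++) {
        mbox_t *mb = &boxes[i];
        if (strcmp(mb->name, dest) != 0) continue;
        if (mb->cap != cap) return -1;                  /* capability check */
        if (mb->tail - mb->head == MAX_MSG) return -1;  /* queue full */
        strncpy(mb->q[mb->tail % MAX_MSG], msg, MSG_LEN - 1);
        mb->tail++;
        return 0;
    }
    return -1;                                          /* name not found */
}

/* Receive by polling, as fos's microkernel transport does. */
const char *mbox_recv(mbox_t *mb) {
    if (mb->head == mb->tail) return NULL;              /* empty: poll again */
    return mb->q[mb->head++ % MAX_MSG];
}

int main(void) {
    mbox_t *fs = mbox_create("/sys/fs/0", 0xF05);
    mbox_send("/sys/fs/0", 0xF05, "open /etc/motd");
    printf("fs server got: %s\n", mbox_recv(fs));
    return 0;
}

The point of the sketch is that the sender addresses a name and presents a capability; nothing in the calling code changes when the underlying transport does.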

The traditional alternative to message passing is shared memory. However, in many cases shared memory may be unavailable or inefficient: fos can support unconventional architectures [18, 19] that have no shared memory or only inefficient support for it, as well as future multicores with thousands of cores where global shared memory may prove unscalable. Relying on messaging is even more important in the cloud, where machines can potentially reside in different datacenters and inter-machine shared memory is unavailable.

A more subtle advantage of message passing is the programming model. Although perhaps less familiar to the programmer, a message-passing programming model makes data sharing much more explicit. This allows the programmer to carefully consider the data sharing patterns and find performance bottlenecks early in the design, which leads to more efficient and scalable designs. Through message passing, we achieve better encapsulation as well as scalability.

It bears noting that fos supports conventional multithreaded applications with shared memory, where the hardware supports it. This is in order to support legacy code as well as a variety of programming models. However, OS services are implemented strictly using messages.

Having the OS provide a single message-passing abstraction allows transparent scale-out of the system, since the system can decide where best to place processes without concern for straddling shared memory domains, as occurs in cloud systems. Also, the flat communication medium allows the OS to perform targeted optimizations across all active processes, such as placing heavily communicating processes near each other.

fos currently provides three different mechanisms for message delivery. These mechanisms are transparently multiplexed in the libfos library layer, based on the locations of the processes and their communication patterns.

Microkernel The fos microkernel provides a simple implementation of the mailbox API over shared memory. This is the default mechanism for delivering messages within a single machine. Mailboxes are created within the address space of the creating process. Messages are sent by trapping into the microkernel, which checks the capability and delivers the message to the mailbox. Messages are received without trapping into the microkernel by polling on the mailbox.

User-level For processes that communicate often, fos also provides a channel-based mechanism via URPC [8, 9]. The primary advantage of this mechanism is that it avoids system call overhead by running entirely in user space. Channels are created and destroyed dynamically, allowing compatibility with fos's mailbox messaging model. Outgoing channels are bound to names and stored in a user-level name cache. When a channel is established, the microkernel maps a shared page between the sender and receiver. This page is treated as a circular queue of messages. As shown later in this paper, this mechanism achieves much better per-message latency, at the cost of initial overhead to establish a connection.
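The following sketch shows the single-producer/single-consumer circular queue idea behind such a channel, assuming an illustrative slot layout within the shared page (fos's actual URPC format may differ). It requires C11 stdatomic.h; the release/acquire pairing ensures the payload is visible before a slot is published.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOTS   31           /* message slots in one direction       */
#define SLOT_SZ 128          /* bytes per slot: 4-byte length + body */

typedef struct {
    _Atomic uint32_t head;   /* next slot to read (consumer-owned)   */
    _Atomic uint32_t tail;   /* next slot to write (producer-owned)  */
    uint8_t slot[SLOTS][SLOT_SZ];
} urpc_queue_t;              /* lives in the page shared by the pair */

/* Sender side: returns 0 on success, -1 if the ring is full. */
int urpc_send(urpc_queue_t *q, const void *msg, uint32_t len) {
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if ((t + 1) % SLOTS == h || len > SLOT_SZ - 4) return -1;
    memcpy(q->slot[t], &len, 4);
    memcpy(q->slot[t] + 4, msg, len);
    /* Release: payload must be visible before the slot is published. */
    atomic_store_explicit(&q->tail, (t + 1) % SLOTS, memory_order_release);
    return 0;
}

/* Receiver side: polls in user space, never trapping into the kernel. */
int urpc_recv(urpc_queue_t *q, void *buf, uint32_t buflen) {
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return -1;                    /* empty */
    uint32_t len;
    memcpy(&len, q->slot[h], 4);
    if (len > buflen) return -1;
    memcpy(buf, q->slot[h] + 4, len);
    atomic_store_explicit(&q->head, (h + 1) % SLOTS, memory_order_release);
    return (int)len;
}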

Figure 2 shows and compares these two local messaging transports.

Intermachine Messages sent between machines go through a proxy server. This server is responsible for routing the message to the correct machine within the fos system, as well as maintaining the appropriate connections and state with other proxy servers in the system.

It is important to note that these mechanisms are completely transparent to the programmer. As messaging is fundamental to fos performance, making intelligent choices about which mechanism to use is critical. A key direction of future work is dynamically switching between the microkernel mechanism and URPC, as well as migrating processes to avoid inter-machine communication.

2.3 Name Service

Closely coupled with messaging, fos provides a name service to look up mailboxes throughout the system. The namespace is a hierarchical URI, much like a web address or filename. The namespace is populated by processes registering their mailboxes with the name service. The key advantage of the name service is the level of indirection between the symbolic identifier of a mailbox and its so-called "address" or actual location (machine, memory address, etc.). By dealing with names instead of addresses, the system can dynamically load balance and migrate processes.

The need for dynamic load balancing and process migration is a direct consequence of the massive scale of current cloud systems and future multicores. In addition to a greater amount of resources under management, there is also greater variability of demand. Static scheduling is inadequate, as even if demand is known it is rarely constant. It is, therefore, necessary to adapt the layout of processes in the system to respond to where the service is currently needed.

The advantage of naming is closely tied to parallel operating system services. In fos, a number of collaborating processes (a fleet) will implement a single service. These processes will each have an in-bound mailbox upon which they receive requests, and these mailboxes will all be registered under a single name. It is the responsibility of the name service to resolve a request to one member of a fleet.

Load balancing can be quite complicated and highly customized to a specific service. Each service can dynamically update the name system to control the load balancing policy for its fleet. The name service does not determine load balancing policy, but merely provides mechanisms to implement a policy. To support stateful operations, applications or libfos can cache the name lookups so that all messages for a transaction go to the same fleet member. Alternatively, the fleet can manage shared state so that all members can handle any request.
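A minimal sketch of this fleet-alias pattern appears below: several mailboxes registered under one alias, round-robin resolution in the name service, and a client-side cache that pins one fleet member for the duration of a stateful transaction. All names and structures are illustrative, not fos's actual name-service protocol.

#include <stdio.h>

#define MAX_MEMBERS 8

typedef struct {
    const char *alias;                   /* e.g. "/sys/fs" */
    const char *member[MAX_MEMBERS];     /* per-member mailbox addresses */
    int n, next;                         /* fleet size, round-robin cursor */
} name_entry_t;

/* Name service side: pick the next member for this alias. */
const char *ns_resolve(name_entry_t *e) {
    const char *m = e->member[e->next];
    e->next = (e->next + 1) % e->n;
    return m;
}

/* Client side: cache the resolution so every message in one stateful
 * transaction reaches the same fleet member. */
typedef struct { const char *alias, *cached; } name_cache_t;

const char *lookup(name_cache_t *c, name_entry_t *e) {
    if (!c->cached) c->cached = ns_resolve(e);   /* first use: resolve */
    return c->cached;                            /* later uses: pinned */
}

int main(void) {
    name_entry_t fs = { "/sys/fs", { "/sys/fs/0", "/sys/fs/1" }, 2, 0 };
    name_cache_t txn = { "/sys/fs", NULL };
    printf("txn uses %s, then %s again\n", lookup(&txn, &fs), lookup(&txn, &fs));
    return 0;
}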

Another design point in fos is to explicitly load balance within the fleet. This approach may be suitable when the routing decision is based on state information not available to the name service. In either approach, it is important to realize that by decoupling this mapping, the OS has the freedom to implement a given service through a dynamic number of servers, as long as the protocol and semantics remain the same between the application and the service.

The name service also enables migration of processes and their mailboxes. This is desirable for a number of reasons, chiefly to improve performance by moving communicating servers closer to each other. fos implements a spatial scheduler that observes communication patterns and dynamically migrates processes to optimize global latency. Migration is also useful to rebalance load as resources are freed. The name service provides the essential level of indirection that lets mailboxes move freely without interrupting communication.

2.4 Fleets

As mentioned previously, services in fos are implemented by cooperating, spatially-distributed sets of processes, or fleets. This idea is the main research thrust of fos. Whereas prior projects have demonstrated the viability of microkernel approaches, fos aims to implement a complete distributed, parallel OS by implementing scalable service fleets.

Each system service is implemented by a single fleet of servers. Within a single system, there will be a file system fleet, a page allocator fleet, a naming fleet, a process management fleet, etc. The non-uniform cost of communication, even in present-day multicores, increases the importance of correct spatial scheduling. Additionally, a fleet may span multiple machines where advantageous.

Fleet management is a challenging problem. Although the name service provides coarse load balancing, fleets often must route requests themselves. One important reason is resource affinity: if a request uses a resource under the management of a particular fleet member, then the request should be forwarded to that member.

Fleets also must support other management tasks. Fleets can grow and shrink to meet demand, and must support rebalancing when a new member joins or leaves the fleet.

In the future, these tasks will be handled in a distributed manner by all fleet members. However, currently many services designate a single member, termed the coordinator, to perform many of these tasks.

3 Scalability Support for Fleets

The previous section covered the high-level system architecture of fos and how it enables fleets. This section discusses at a higher level how fos supports building scalable system services, and the principles used in building them. fos provides a model that lets users code parallel, message-passing servers with straight-line, function-oriented code. Support for shared state is managed via a library of distributed data structures that also facilitates development of custom data structures. fos fleets are elastic, meaning they can grow and shrink to meet demand, which introduces several challenging problems to fleet design.

3.1 Programming Model

fos provides two tools that ease the development of parallel services and applications. These are designed to mitigate the complexity and unfamiliarity of message-passing parallel codes, thus allowing efficient servers to be written with simple, straight-line code. These tools are (i) a token-based cooperative threading model integrated with fos's messaging system, and (ii) a remote procedure call (RPC) code generation tool.

The cooperative threading model lets several active contexts multiplex within a single process. This enables one to write sequential code to handle an individual transaction, eliminating complex state machines to track transaction state. The most significant feature of the threading model is how it is integrated with fos's messaging system. The threading model provides a dispatcher, which implements a callback mechanism based on message types. When a message of a particular type arrives, a new thread is spawned to handle that message. This thread can send messages through the dispatcher, which returns a token for that message. The thread can then wait for the token, which puts the thread to sleep (and allows other threads to run) until a message arrives with that token.
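The fragment below sketches the control shape this gives a server, written against hypothetical stand-in primitives (dispatch_register, msg_send_token, token_wait, msg_reply, dispatch_run). It is a structural illustration rather than runnable code, since the real cooperative threading library supplies these primitives.

typedef unsigned long token_t;
typedef struct { int type; void *payload; } msg_t;

/* Primitives assumed to be provided by the threading library. */
void    dispatch_register(int msg_type, void (*handler)(msg_t *));
token_t msg_send_token(const char *dest, msg_t *m); /* async send       */
msg_t  *token_wait(token_t t);  /* sleep this thread; let others run    */
void    msg_reply(msg_t *request, void *result);
void    dispatch_run(void);     /* event loop; does not return          */

enum { MSG_OPEN = 1, MSG_BLOCK_READ = 2 };

/* One transaction, written as straight-line code. While this thread
 * sleeps on the token, the dispatcher runs other transactions. */
static void handle_open(msg_t *req) {
    msg_t blk = { MSG_BLOCK_READ, req->payload };
    token_t t = msg_send_token("/sys/blockdev", &blk);
    msg_t *resp = token_wait(t);
    msg_reply(req, resp->payload);
}

int main(void) {
    dispatch_register(MSG_OPEN, handle_open);
    dispatch_run();
}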

The RPC code generator provides the illusion of local function calls for services implemented in other processes. It parses regular C header files and generates server- and client-side libraries that marshal parameters between servers. The tool parses standard C, with some custom annotations via gccxml [4] indicating the semantics of each parameter. Additionally, custom serialization and deserialization routines can be supplied to handle arbitrary data structures. The RPC tool is built on top of the dispatcher, so that servers implicitly sleep on an RPC call to another process until it returns. This allows other threads in the service to make progress while the RPC is pending.
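As an illustration, such a generator's input might look like the annotated header below. The annotation spelling (a gccxml attribute carrying "in"/"out" markers and a length expression) is hypothetical; fos's actual annotation vocabulary may differ.

/* fs_rpc.h: illustrative input to the RPC generator. */
#ifdef __GCCXML__
#define IN        __attribute((gccxml("rpc:in")))
#define OUT(len)  __attribute((gccxml("rpc:out", len)))
#else
#define IN
#define OUT(len)
#endif

/* Read 'len' bytes at 'offset' from an open file into 'buf'. From this
 * declaration the tool would emit a client stub that marshals fd, len,
 * and offset into a request message and unmarshals 'buf' from the
 * reply, plus a server skeleton that dispatches to the implementation. */
int fs_read(int fd IN,
            char *buf OUT("len"),
            unsigned len IN,
            unsigned long offset IN);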

The result of these two tools is a parallel service that handles simultaneous transactions with full pipelining, with relatively low complexity. Further discussion of the programming model is presented in [30].

3.2 Managing Shared State

One enormous challenge in implementing distributed OS services is the management of shared state. This is the major issue that breaks the illusion of straight-line code discussed above. fos addresses distributed state by providing a library of distributed data structures that provide the illusion of local data access for distributed, shared state.

The goal of this library is to provide an easy way to distribute state while giving performance and consistency guarantees. Data structures are provided matching common usage patterns seen in the implementation of fos services. For example, fos provides a distributed hash table based on Chord [26] and a Pool data structure that abstracts the management of a shared resource pool.

The library also provides a few common constructs that can be used to build custom distributed data structures. One such construct is a key-to-node mapping, which is commonly seen in the implementation of distributed hash tables but is not considered a distributed data structure in its own right.
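The sketch below shows one minimal form of a key-to-node mapping, in the consistent-hashing style of Chord-like tables: each fleet member owns a point on a hash ring, and a key is served by the first member at or after the key's hash. The hash function and layout are illustrative.

#include <stdint.h>
#include <stdio.h>

static uint32_t fnv1a(const char *s) {       /* FNV-1a string hash */
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* node_pt[] holds each member's ring position, sorted ascending. */
static int key_to_node(uint32_t key_hash, const uint32_t *node_pt, int n) {
    for (int i = 0; i < n; i++)
        if (key_hash <= node_pt[i]) return i;
    return 0;                                /* wrap around the ring */
}

int main(void) {
    uint32_t ring[] = { 0x20000000, 0x80000000, 0xD0000000 }; /* 3 members */
    const char *key = "/home/user/file";
    printf("key %s -> member %d\n", key, key_to_node(fnv1a(key), ring, 3));
    return 0;
}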

Another issue is the management of the distributed state associated with the data structure. It is necessary for the data structure to perform background updates to maintain consistency with other instances in separate processes. This is achieved using the cooperative dispatcher discussed above. The distributed data structure registers a new mailbox with the dispatcher, along with its own callbacks and message types. This lets it transparently manage state without interrupting the application or server code. Being built on top of the dispatcher also lets the distributed data structure sleep calling threads until data becomes available. Thus, the programmer can access the distributed state in a simple, naive manner with straight-line code.

We are currently exploring the set of data structures that should be included in this library, and the common paradigms that should be captured to enable custom data structures. Later in the paper, we discuss in detail how each parallel service manages its shared data.

3.3 Elasticity

In addition to unprecedented amounts of resources, clouds and multicores introduce unprecedented variability in demand for these resources. Dynamic load balancing and migration of processes go a long way towards solving this problem, but still require over-provisioning of resources to meet demand. This would quickly become infeasible, as every service in the system would claim the maximum amount of resources it will ever need. Instead, fleets are elastic, meaning they can grow to meet increases in demand, and then shrink to free resources back to the OS.

Current OSes achieve elasticity by accident, as OS code runs on the same core as the application code. This design relinquishes control over how many cores are provisioned to a service. As discussed earlier, running the OS on the same core as the application can force unnecessary stalls waiting for OS requests to finish and also pollutes the application's cache. But it has other downsides with respect to elastic fos services as well. Running the OS on each core leads to unnecessary sharing of OS data in the cache of each application core. This creates extraneous cache misses every time the service executes on a new core, whereas factoring the service onto distinct cores avoids this. Furthermore, for applications that rely heavily on the OS, it may be best to provision more cores to the OS service than to the application. The servers can then collaborate to provide the service more efficiently. This design point is not possible in monolithic operating systems.

A fleet is grown by starting a new instance of the service on a new core. This instance joins the fleet by contacting other members (either the coordinator or individual members via a distributed discovery protocol) and synchronizing its state. Some of the distributed, shared state is migrated to the new member, along with the associated transactions. Transactions can be migrated in any number of ways, for example by sending a redirect message to the client from the "old" server. Shrinking the fleet can be accomplished in a similar manner.

A more interesting problem is determining when to grow and shrink the fleet. This problem is closely related to scheduling, as the new fleet member will consume a core and affect the spatial layout of processes in the system. It is not as simple as detecting when a service is saturated and growing the fleet: doing so may result in performance degradations if it forces migration of important processes. We are only starting to address this problem, and it is an important area of future work.

4 fos Operating System Services

This section presents parallel implementations of several system services in fos. We present four fos services: a TCP/IP network stack, a page allocator, a process management service, and a file system. Results for Linux and fos services are presented in the next section.

4.1 Network Stack

fos has a fully featured networking service responsible for packing and unpacking data for the various layers of the network stack as well as updating state information and tables associated with the various protocols (e.g., DHCP, ARP, and DNS). The stack was implemented by extending lwIP [14] with fos primitives for parallelization.

Figure 3: Echo client and echo server communicating via the networking service. The trace shows messages flowing between echo_client, proxy, netstack, and netif on one machine, across the Ethernet network, and through netif, netstack, proxy, and echo_server on the other, over roughly 2,000,000 processor cycles.

The design employs a network interface server and a fleet of network stack servers onto which packets can be multiplexed. The network interface server also serves as fleet coordinator and is responsible for several management tasks as well as for handling several of the protocols.

When the kernel receives data from the network interface, it delivers it to the network interface server. The network interface server then peeks into the packet and delivers it to one of the fleet members depending on the protocol the packet corresponds to. The handling of each protocol is discussed below.

ICMP (pings) is handled simply by forwarding the packet to any member of the fleet, since no state information needs to be shared. DHCP requests and state information are managed by the coordinator, and IP address updates are broadcast to fleet members when updates are required. ARP requests can be generated by any network stack, and replies are broadcast to all fleet members. This scheme allows each network stack to have its own consistent copy of the address resolution table. DNS requests are managed by the coordinator, as the frequency of these packets is low. The two protocols expected to see the highest amount of traffic, and thus requiring the most optimization for latency and throughput, are UDP and TCP. Each of these protocols requires the operating system to transport data between the application and the network interface.

UDP packets are distributed amongst the fleet members based on a hash of the (source IP address, source port, remote IP address, remote port) tuple. This requires no sharing of state information between fleet members, as the protocol is stateless.
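A sketch of this stateless distribution follows: hash the 4-tuple, index into the fleet, and every packet of a flow lands on the same member with no shared state. The particular hash mix is illustrative, not fos's exact function.

#include <stdint.h>

struct flow4 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

static uint32_t mix(uint32_t h, uint32_t v) {
    h ^= v;
    h *= 0x9E3779B1u;        /* multiplicative mixing constant */
    return h ^ (h >> 16);
}

/* Returns the index of the fleet member responsible for this flow;
 * the same tuple always hashes to the same member. */
int flow_to_member(const struct flow4 *f, int fleet_size) {
    uint32_t h = mix(0, f->src_ip);
    h = mix(h, f->dst_ip);
    h = mix(h, ((uint32_t)f->src_port << 16) | f->dst_port);
    return (int)(h % (uint32_t)fleet_size);
}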

Since TCP flows are stateful, they must be handled specially. This example demonstrates how fleet members can coordinate to handle a given OS service. When an application wishes to listen on a port, it sends a message to the coordinator, which adds state information associated with that application and port. Once a connection has been established, the coordinator passes responsibility for this flow to a fleet member. The coordinator also sets up a forwarding notification such that other packets destined for this flow already in the coordinator's queue simply get sent to the fleet member who is assigned this flow. While this approach potentially re-orders packets, as the forwarded packets can be interleaved with new input packets, TCP properly handles any re-ordering. Once the fleet member has accepted the stream, it notifies the network interface to forward flows based on a hash of the same tuple used for UDP. Once this flow forwarding information has been updated in the network interface server, packets are delivered directly to the fleet member and then to the application. Note that these mechanisms occur behind libfos and are abstracted from the application behind convenient interfaces.

As with other fos services, dedicating cores to networking prevents cache pollution and presents opportunities for the service to operate concurrently with the application.

Figure 3 shows a real network trace from fos of an echo client and echo server communicating via the network stack. Each arrow represents a message between servers. The request from the client is sent out over Ethernet via the proxy server, network stack, and network interface; the networking service on the server receives this message, passes it on to the server, and sends the reply back to the client.

4.2 Page Allocator

The page allocator fleet is responsible for knowing the status of all the physical pages accessible by the system. Whenever the microkernel needs more physical memory, such as to service an sbrk or spawn call, it requests a set of free pages from the page allocator. The page allocator must ensure that pages only get allocated to one user, and that after they are freed by the microkernel they can be allocated again. The page allocator is also able to do a combined allocate-and-map, which allows page-mapping functionality to be offloaded from normal processes to the page allocator servers.

The page allocator fleet is comprised of a set of identical servers, each capable of responding to any request. Collectively, the servers in the fleet implement a distributed buddy allocator, which supports two types of requests: allocate pages, and free pages. Like a normal buddy allocator, the page allocator fleet only supports allocating or freeing blocks of pages whose size is a power of two.

Each page allocator server manages a set of pages that it currently knows is free; at initialization, the set of all pages is evenly distributed among the initial servers. When a page allocator server receives an allocate request, it returns pages from its local set if it can. If not, it will forward the request to another server in the fleet that it believes has enough pages.
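The sketch below illustrates that allocate path for a single fleet member: satisfy a power-of-two request from local buddy free lists, splitting larger blocks as needed, and fall back to forwarding when the local set is exhausted. The data structures and the messaging stubs (pick_peer, forward_alloc) are illustrative, not fos's implementation.

#include <stdint.h>

#define MAX_ORDER 11                     /* block sizes 2^0 .. 2^10 pages */

struct block { uint64_t page; struct block *next; };

static struct block  pool[256];          /* toy node allocator */
static int           pool_n;
static struct block *freelist[MAX_ORDER];/* buddy free lists, by order */

static struct block *node(uint64_t page) {
    struct block *b = &pool[pool_n++];
    b->page = page;
    return b;
}

/* Stubs standing in for fos messaging: a real fleet member would send
 * a message to the chosen peer and sleep on the reply token. */
static int pick_peer(uint64_t pages_needed) { (void)pages_needed; return 1; }
static int forward_alloc(int peer, unsigned order, uint64_t *out) {
    (void)peer; (void)order; (void)out;
    return -1;                           /* remote path not modeled here */
}

/* Allocate 2^order pages: serve locally if possible, else forward. */
int pa_alloc(unsigned order, uint64_t *out_page) {
    for (unsigned o = order; o < MAX_ORDER; o++) {
        struct block *b = freelist[o];
        if (!b) continue;
        freelist[o] = b->next;
        while (o > order) {              /* split; keep the buddy half free */
            o--;
            struct block *buddy = node(b->page + (1ull << o));
            buddy->next = freelist[o];
            freelist[o] = buddy;
        }
        *out_page = b->page;
        return 0;
    }
    /* Local set exhausted: forward to a peer believed to have pages. */
    return forward_alloc(pick_peer(1ull << order), order, out_page);
}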

The page allocator fleet benefits from the design of fos: since the page allocator servers run on separate cores from their clients, they are free to do extra work in the background without introducing latency to client requests. The page allocator fleet uses this to asynchronously ensure that all the servers have roughly the same number of pages. When a single server notices that it is managing fewer pages than a threshold, it requests a set of ranges from another server. Since the servers support efficient allocation of large page ranges, and this rebalancing happens in the background, it does not affect client latency.

A distributed buddy allocator may suffer from a problem that normal buddy allocators do not: distributed fragmentation. It is possible that, although a large range of pages is free, the pages are distributed among different servers in the fleet and thus cannot be coalesced. To combat this, the page-space is divided into large chunks, and each chunk is assigned a "home" server. The fleet then tries to send pages back to their home servers, so that ranges in that chunk can be coalesced, even if they were freed back to different servers. To do this, servers maintain a threshold, above which they will (asynchronously) start sending pages back to their home servers.

4.3 Process Management

The process management service handles a variety of tasks including process creation, migration, and termination, as well as maintaining statistics about process execution. Like other system services in fos, the process management service is implemented as a fleet of servers.

The process management service performs most of the tasks involved in process creation, including setting up page tables and the process context, in user space. Traditional operating systems, including all microkernel-based operating systems that we are aware of, perform these tasks in kernel space. Pulling this functionality out of the kernel allows fos to significantly shrink the size of the kernel code base, and allows the implementation of a scalable process creation mechanism using the same techniques used for writing other scalable system services in fos. The role of the kernel is reduced to simply validating the changes made by the process management service; this validation is done via capabilities.

A process creation request from a user process is forwarded to one of the process management servers. The server then communicates with the file server to retrieve the executable to be spawned. The physical memory needed for the page tables, as well as for the code and data in the spawned process' address space, is allocated via the page allocation server. Writing to this memory requires special capabilities which need to be presented to the microkernel before each write operation. Finally, the service sets up the new process' context and adds the process to the scheduler's ready queue. This operation also has an associated capability that is checked by the microkernel.

The fact that process creations are handled by a separate set of servers allows fos to implement asynchronous process creation semantics, where a process can fire off multiple process creation requests and continue doing useful work while these processes are being set up. It can then check for a notification from the process management service when it needs to communicate with the newly spawned processes.
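The usage shape might look like the sketch below, with hypothetical names (spawn_async, spawn_wait) standing in for fos's actual process-management interface. The parent overlaps its own work with process setup and synchronizes only when it needs the new processes.

typedef int spawn_token_t;

/* Assumed asynchronous API: request creation, get a token immediately. */
extern spawn_token_t spawn_async(const char *path, char *const argv[]);
extern int  spawn_wait(spawn_token_t t);      /* returns pid when ready */
extern void do_other_work(void);
extern void register_worker(int pid);

void launch_workers(void) {
    spawn_token_t tok[4];
    char *const argv[] = { "worker", 0 };

    for (int i = 0; i < 4; i++)               /* fire off all requests */
        tok[i] = spawn_async("/bin/worker", argv);

    do_other_work();                          /* overlap with setup */

    for (int i = 0; i < 4; i++)               /* sync only when needed */
        register_worker(spawn_wait(tok[i]));
}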

The service also collects and maintains statistics during process execution, such as resource utilization and communication patterns. These statistics can then be made available to other system services such as the scheduler.

The service also handles process termination, including faults encountered by the process. Upon process termination, the resources used by the terminating process are freed, and the state maintained within the process management server is updated to reflect this.

4.4 File System

Another example of fos implementing OS services as a fleet of servers is the file system. We have ported an ext2 file system implementation into a file system server (FSS), which is supported by our implementation of a block device server (BDS). libfos intercepts POSIX file system calls and generates messages to the FSS.

In response to an incoming request, the FSS communicates with the BDS in order to serve the request, unless the block was cached. The BDS services disk I/O operations and access to the physical disk. The FSS encapsulates the required data into messages and sends them back to the application.

We built a parallel FSS fleet that consists of several file system servers interacting with the same BDS. When a client requests a file system operation, fos intercepts the POSIX file system calls and decides which FSS fleet member will serve the incoming request. In our current implementation, the incoming file service requests are distributed among the different members of the fleet in a simplified round-robin style. We are extending this implementation to include more intelligent assignment of requests to fleet members based on each member's current load and locality information. We have observed a significant difference in FS throughput for different server-to-core assignments and for the relative closeness of the FSS to the BDS and to the client. Our FSS fleet coordinator collects information about each fleet member's load as well as the server layout at runtime in order to make an informed assignment of which FSS member will serve each client request.

5 Evaluation

This section presents results for the parallel OS services described above. Additionally, we provide results and analysis for several of the underlying mechanisms that are critical to system performance. The methodology is first described. Performance of messaging is shown, including tradeoffs between different schemes and their performance bottlenecks. Then scaling numbers for each OS service are presented using microbenchmarks, with comparison against Linux included. We conclude with two studies: a study of elasticity in a file system fleet, and a cloud application to transcode video. The latter demonstrates fos's ability to use hardware resources across multiple machines.

5.1 Methodology

fos runs as a paravirtualized virtual machine under Xen 3.2 [7]. This is done in order to facilitate fos's goal of running across the cloud. All numbers presented in this paper, for both fos and Linux, were gathered running in Xen domU's. We use a bare Ubuntu Linux domU.

For scalability results, numbers are presented as service throughput. This throughput is calculated based on the median of service latency over n requests, where n is at least 1000 requests. Additionally, error bars are shown for the 25th and 75th percentiles. Throughput is plotted versus the number of clients (i.e., application cores) making requests to the OS service. This accurately represents the design point of fos, where bountiful cores allow OS code to be factored out without damaging application parallelism. However, note that separate cores are consumed to run fos services.

This data is gathered on currently available multicores. As fos separates OS and application code, this limits the scale of the studies that can be performed; therefore, in the discussion we must extrapolate based on the available scaling data. Numbers are gathered on four quad-core Xeon E7340 processors at 2.40 GHz with 16 GB of RAM.

5.2 Messaging

We evaluated the performance of two messaging schemes on the Intel Xeon E5530 processor: (i) communication over cache-coherent shared memory in userspace (URPC), and (ii) communication via privileged kernel calls to copy messages between address spaces. URPC shows significant latency benefits, but has some initial setup cost in mapping a shared memory segment.

Figure 4 shows the performance of the two schemes and a comparison with Unix domain sockets in Linux. Messaging performance was measured by recording the round-trip time of sending a small data packet to and from a message echo service. Two plots are shown for each message transport method: with a shared L3 cache between the communicating cores ("Sh") and without ("N-Sh"). All tests showed improved latency when the cores shared a cache. Both fos messaging schemes had substantially lower latency than Unix domain sockets.

Figure 4: Performance of messaging for different transports (cycles elapsed versus number of messages). Results are included for cores with a shared cache ("Sh") and without ("N-Sh"). For reference, UNIX domain sockets in Linux are also included ("Linux Sh" and "Linux N-Sh").

Operation                      Latency
fos URPC, shared cache         1503 (48.18)
fos URPC, no shared cache      2665 (161.94)
fos Kernel, shared cache       9616 (294.83)
fos Kernel, no shared cache    12551 (347.37)
DomU Linux syscall             1676 (271.17)

Table 1: A comparison of fos roundtrip messaging latency and Linux syscall latency. Measurements are in cycles (σ).

We also compared the time it takes to perform a null system call in Linux with fos roundtrip messaging latency, as shown in Table 1. Since most fos OS services are accessed via messages, system calls on monolithic OSs can often be considered comparable operations.

5.3 OS Services

Network Stack This test compares the latency experienced by small packets traversing the network stack in both Linux and fos for various numbers of streams. It demonstrates how the design of fos allows the network stack fleet to scale to match the application's requirements. The experimental setup for fos includes several daytime servers that simply receive a packet and provide an immediate response, and one client for each flow in dom0, each of which continuously performs daytime requests and reports the latency per request. Figure 5c presents our results. The figure plots the number of concurrent daytime clients on the horizontal axis and throughput on the vertical axis. Results are presented for fos using 1-4 network stacks as well as for Linux with varying numbers of clients. As the graph depicts, when the number of requesting clients increases, the throughput per client decreases for Linux as well as for fos with a low number of network stacks. However, as the number of network stacks in the fleet is increased, fos is able to achieve near-ideal scaling. Note also that although fos (with 4 network stacks) uses 8 cores to handle the 4 streams while Linux uses only 4, fos achieves more than 2x the performance of Linux. We believe this is because of reduced cache pollution in fos, as well as the fact that fos is able to pipeline the network stack processing with the application responses.

Page Allocator Figure 5a shows the throughput and scalability of the page allocator service for different numbers of clients and servers. Results are collected for Linux by running multiple, independent copies of a memory microbenchmark that allocates pages as quickly as possible, touching the first byte of each to guarantee that a page is actually allocated and mapped. Results are collected from fos by running multiple client processes and a page allocator fleet comprised of multiple servers. Again, the client processes allocate pages as quickly as possible and no lazy allocation occurs.

This figure shows how the system responds to an increasing page allocation request rate, as a function of the number of servers in the page allocator fleet. With one server in the fleet, the system quickly becomes bottlenecked on how quickly that server can map pages into the clients' page tables. Increasing the number of servers to two increases the maximum achievable throughput, at which point the Xen hypervisor starts to become a bottleneck. Adding a third server shows no throughput increase, and adding a fourth (and further) servers shows a throughput decrease.

Linux exhibits similar behavior, where overall throughput peaks (at 4 cores) and then decreases. This is, again, because of contention from multiple processes simultaneously accessing the hypervisor. As opposed to fos, however, the number of Linux cores accessing the hypervisor grows as the number of test processes increases. This means that as the number of cores allocating pages increases, the overall throughput decreases. In fos, the number of cores accessing the hypervisor is fixed at the number of servers in the page allocator fleet; as the number of clients increases, this number does not change. This is why the fos page allocator is able to maintain performance as the number of cores increases: the design of fos lets the page-mapping part of the system be factored out, limiting the number of cores simultaneously accessing a highly contended resource.

Figure 5: Performance and scalability of parallel services in fos. (a) Page allocator: throughput (pages per 10^6 cycles) versus number of clients (1-14), for Linux and fos with 1-5 servers. (b) File system: throughput (transactions per 10^6 cycles) versus number of clients (1-5). (c) Network stack: throughput (requests per 10^6 cycles) versus number of network clients (1-4).

Operation       fos (Sync)   fos (Async)   Linux
Latency (ms)    112.55       0.12          1.16

Table 2: Process creation latencies in fos and Linux.

Process Management The semantics of process creation in fos have no Linux equivalent, and it is therefore difficult to directly compare process creation latencies. The closest Linux equivalent to a synchronous spawn on fos is a fork followed by an exec, while the asynchronous spawn semantics are unique to fos. Table 2 presents latencies for these three operations, averaged over 100 operations. Linux outperforms fos on synchronous spawns, mainly because we have not yet spent time optimizing this aspect of fos. For example, all the pages in the process' address space are loaded into memory at process creation time in fos, while in Linux pages are only loaded on first access, resulting in significant gains in process creation latency. Further, the asynchronous process creation semantics in fos mean that process creation latency can be hidden by doing useful work concurrently in the parent process, as reflected in the significantly lower latency presented in Table 2.

File System The file system study evaluates the throughput achieved by the Linux file system as well as by the fos file system with different numbers of servers, as the number of clients is varied. For this study, each client performs 1000 transactions, where each transaction consists of opening a unique file, reading its contents, and closing the file. Figure 5b presents the overall throughput achieved in each configuration.

The decrease in file system throughput for Linux as the number of clients increases can be attributed to contention over shared data [10]. The fos file system achieves higher throughput than Linux for all configurations. For fos, the achieved throughput flattens for large numbers of clients when the number of servers is small; the small degradation as the number of clients is increased can be attributed to contention over the server mailbox. However, higher throughput can be achieved by using more servers in the fleet, as shown in the figure.

Note that unlike Linux, the fos block device server does not support multiple outstanding IO requests to the hypervisor, thus denying it the ability to coalesce or reorder requests to improve IO throughput. Once this and other optimizations have been implemented, we expect the performance of the fos file system to improve further.

5.4 Elasticity

This study focuses on the elastic features of fleets. In the experiment, two clients make queries to the file system service. Each client repeats a single transaction (open a file, read one kilobyte, and close the file) 5,000 times. The request rate is throttled initially to one transaction per two million cycles. The throttling gradually decreases until the service is saturated, before gradually increasing again to the initial value.

The file system service is implemented as an elastic fleet. The service starts initially with a single server, which acts as the coordinator. This server monitors the utilization of the service, and detects when the service has become saturated. At this point, the fleet grows by adding a new member to meet demand. Similarly, the coordinator detects if utilization has dropped sufficiently to shrink the fleet, freeing up system resources.
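A minimal sketch of such a coordinator policy is shown below: grow past a high-water utilization mark, shrink below a low-water mark, with a dead band between them to avoid oscillation. The thresholds and the grow/shrink hooks are illustrative, not fos's measured policy.

#define GROW_AT    0.90   /* utilization above this: add a member    */
#define SHRINK_AT  0.40   /* utilization below this: drop a member   */
#define MIN_FLEET  1

extern double fleet_utilization(void); /* busy cycles / total cycles  */
extern void   fleet_grow(void);        /* start member, migrate state */
extern void   fleet_shrink(void);      /* drain member, free its core */

static int fleet_size = 1;

/* Called periodically from the coordinator's dispatcher. The gap
 * between the two thresholds is a dead band that prevents the fleet
 * from oscillating around a single threshold. */
void elasticity_tick(void) {
    double u = fleet_utilization();
    if (u > GROW_AT) {
        fleet_grow();
        fleet_size++;
    } else if (u < SHRINK_AT && fleet_size > MIN_FLEET) {
        fleet_shrink();
        fleet_size--;
    }
}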

Load balancing is performed through the name server, as discussed previously. Each file system server is registered under a single alias. At the beginning of each transaction, a name look-up is performed through the name server. This name is cached for the remainder of the transaction. The names are served round-robin, performing dynamic load balancing across the two servers. This load balancing is performed transparently through libfos, by counting the number of open files and only refreshing the cache when no files are open.

  • 1 Server 2 Servers Elastic fleet

    GrowShrink

    0 1� 109 2� 109 3� 109 4� 109 5� 109 6� 1090

    1

    2

    3

    4

    Cycles elapsed

    Throughput�trans

    actions�106

    cycle�

    Figure 6: A demonstration of an elastic file system fleet.The number of servers in the fleet is adjusted on the flyto match demand.


Figure 6 shows the results for three different service configurations. Aggregate throughput from both clients is plotted against the number of elapsed cycles in the test. The first configuration uses a single file system server for both clients. As expected, the service saturates quickly and peaks at roughly 2 transactions per million cycles. This configuration also takes the longest to complete all transactions. The second configuration has two file system servers, with each client statically assigned to one of them. This avoids any contention overhead from dynamic load balancing, so it should provide peak performance. This configuration peaks at just below 4 transactions per million cycles. The last configuration is the elastic fleet. It briefly saturates at the same level as the single server before expanding the fleet (marked by the "grow" arrow on the graph). It then immediately achieves the throughput of the two-server, statically load-balanced configuration. Finally, when the request rate decreases again near the end of the experiment, the fleet shrinks to use fewer resources.

Figure 6 shows that the elastic fleet performs equivalently to an over-provisioned service, and that there is negligible overhead for performing dynamic load balancing through the name service. The elastic fleet consumes a single core for the majority of the experiment, expanding only between the "grow" and "shrink" arrows, or for 18.8% of the experiment's run-time.5 Compared to the two-server configuration, this translates into a resource savings of 40.6% at no performance cost; the freed core could be used for other computation or turned off to save power.
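The savings figure follows from simple core-time accounting (a sketch; the interval fractions come from the 18.8% expansion window reported above):

\begin{align*}
C_{\text{static}}  &= 2 \times 1.000 = 2.000 \text{ core-runtimes} \\
C_{\text{elastic}} &= 1 \times 0.812 + 2 \times 0.188 = 1.188 \text{ core-runtimes} \\
\text{savings}     &= (2.000 - 1.188)/2.000 \approx 40.6\%
\end{align*}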

5.5 Scaling Across the Cloud

A unique capability of fos is its ability to utilize more resources than can be contained within a single machine. To achieve this, fos launches additional fos VM instances on different machines. These new VMs natively join the currently running fos instance, and the building blocks of fos, naming and messaging, are transparently bridged via the proxy server. Our previous work [30] describes how a proxy server is used to transparently build a single system image across a cloud.

To demonstrate fos's cloud functionality, to show how fos scales across a cloud computer, and to serve as a full-system test, we constructed an application which mimics the key parts of a website such as YouTube. In this test, a user accesses a webpage served by the fos webserver; in response to this request, the webserver launches copies of ffmpeg 1.5.0 to transcode a video into two different file formats. Finally, the user downloads the files from the webserver. For this test, we used fos to transcode the first YouTube video, "Me at the zoo", from Flash Video (H.263) to two output formats: an AVI encoded in Windows Media 8, and an AVI encoded in MPEG2. Each output video is between 600 KB and 700 KB.
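The sketch below shows one way the transcode step might be driven, reusing the hypothetical spawn interface sketched earlier. The ffmpeg argument lists and file names are illustrative, not the exact invocations used in the test.

    /* Launch the two transcodes of the input FLV concurrently; file
     * I/O from ffmpeg is transparently proxied to the file system on
     * the other fos VM. Argument lists are illustrative. */
    void transcode_request(void)
    {
        char *const wmv_argv[]  = { "ffmpeg", "-i", "zoo.flv",
                                    "-vcodec", "wmv2", "zoo-wmv.avi", NULL };
        char *const mpeg_argv[] = { "ffmpeg", "-i", "zoo.flv",
                                    "-vcodec", "mpeg2video", "zoo-mpeg2.avi",
                                    NULL };

        fos_pid_t j1 = fos_spawn_async("/bin/ffmpeg", wmv_argv);
        fos_pid_t j2 = fos_spawn_async("/bin/ffmpeg", mpeg_argv);
        fos_spawn_wait(j1);   /* both jobs overlap; wait for each */
        fos_spawn_wait(j2);
    }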

[Figure 7 diagram: a Eucalyptus Cloud Manager above two multicore servers, each running a fos VM. One fos VM hosts ffmpeg along with a netstack, NetDriver, and Proxy; the other hosts httpd, the Cloud Manager server, the file system, a Block Driver, a netstack, a NetDriver, and a Proxy.]

    Figure 7: Servers used in the video transcode application.

Figure 7 shows the key components of this test. The test begins by spawning the fos VM on the right. The Cloud Manager server on that VM contacts the Eucalyptus cloud infrastructure to start a second fos VM on a different machine, shown as the left fos VM. This VM joins the currently running fos instance, and messaging and naming are transparently bridged via the Proxy servers. Next, a user accesses a CGI page hosted on the webserver to begin the transcode. The httpd server messages the left fos VM, which launches ffmpeg to transcode the input FLV file into two AVI outputs. The fos VM containing ffmpeg does not have a file system attached; therefore, all file system accesses are proxied over the network to the right fos VM. After the transcoding is complete, ffmpeg writes the result files to the file system, and the client downloads the transcoded videos from the httpd server. The whole transaction takes 4 minutes 45 seconds, with the bulk of the time spent in the transcode, followed by the time taken to serve the final transcoded files.


It should be pointed out that, due to the transparency at the messaging layer, no code in ffmpeg or its libraries needed to be changed to perform inter-machine file system access, thereby realizing fos's single system image across the cloud. A copy of the fos webserver running on fos can be accessed at http://fos.csail.mit.edu.

6 Related Work

There are several classes of systems which have similarities to fos: traditional microkernels, distributed OSes, and cloud computing infrastructure.

Traditional microkernels include Mach [5] and L4 [21]. fos is designed as a microkernel and extends microkernel design ideas. However, it is differentiated from previous microkernels in that, instead of simply exploiting parallelism between servers which provide different functions, this work seeks to distribute and parallelize within a server providing a single high-level function. fos also exploits the "spatialness" of massively multicore processors by spatially distributing servers which provide a common OS function.

Like Tornado [15] and K42 [6], fos explores how to parallelize microkernel-based OS data structures. They are differentiated from fos in that they require SMP and NUMA shared-memory machines, whereas fos targets loosely coupled single-chip massively multicore machines and clouds of multicores. fos also targets a much larger scale of machine than Tornado/K42. The recent Corey [10] OS shares the spatial-awareness aspect of fos, but does not address parallelization within a system server and focuses on smaller-scale configurations. fos also tackles many of the same problems as Barrelfish [8], but focuses more on how to parallelize the system servers and addresses scalability both on-chip and in the cloud.

Disco [11] and Cellular Disco [17] run multiple cooperating virtual machines on a single multiprocessor system. fos's spatial distribution of fleet resources is similar to the way that different VM system services communicate within Cellular Disco. Disco and Cellular Disco argue that leveraging traditional OSs is an advantage, but this approach is unlikely to reach the scalability of a purpose-built scalable OS such as fos. Also, the fixed boundaries imposed by VMs can impair dynamic resource allocation.

fos bears much similarity to distributed OSes such as Amoeba [27], Sprite [22], and Clouds [13]. One major difference is that fos's communication costs are much lower when executing on a single massive multicore, and its communication reliability is much higher. Also, when fos executes in the cloud, the trust and fault models differ from those of previous distributed OSes, where much of the computation took place on students' desktop machines.

The manner in which fos parallelizes system services into fleets of cooperating servers is inspired by distributed Internet services. For instance, load balancing is a technique taken from clustered webservers. The fos name server derives inspiration from the hierarchical caching in the Internet's DNS system. In future work, we hope to leverage distributed shared-state techniques such as those in peer-to-peer systems and distributed hash tables like BitTorrent [12] and Chord [26]. fos also takes inspiration from distributed file systems such as AFS [24], OceanStore [20], and the Google File System [16].

fos differs from existing cloud computing solutions in several aspects. Cloud (IaaS) systems, such as Amazon's Elastic Compute Cloud (EC2) [1] and VMware's vCloud, provide computing resources in the form of virtual machine (VM) instances and Linux kernel images. fos builds on top of these virtual machines to provide a single system image across an IaaS system. With the traditional VM approach, applications have poor control over the co-location of communicating applications/VMs. Furthermore, IaaS systems do not provide a uniform programming model for communication or resource allocation. Cloud aggregators such as RightScale [23] provide automatic cloud management and load balancing tools, but these are application-specific, whereas fos provides such features in an application-agnostic manner.

7 Conclusion

New computer architectures are forcing software to be designed in a parallel fashion. Traditional monolithic OSs are not suited to this design and become a hindrance to the application as the number of cores scales up. The unique design of fos allows the OS to scale to high core counts without interfering with the application. Furthermore, the abstractions provided allow an application written to our programming model to seamlessly span a cloud just as it would a multicore chip. We have demonstrated that this design is suitable for the multicore architectures of the future while remaining on par with or better than existing solutions in raw performance. While a functional base system has been presented here, many areas remain open for exploration.

References

[1] Amazon Elastic Compute Cloud (Amazon EC2), 2009. http://aws.amazon.com/ec2/.

[2] Tilera Announces the World's First 100-core Processor with the New TILE-Gx Family, Oct. 2009. http://www.tilera.com/.

[3] AMD Opteron 6000 Series Press Release, Mar. 2010. http://www.amd.com/us/press-releases/Pages/amd-sets-the-new-standard-29mar2010.aspx.


[4] GCC-XML project, 2010. http://www.gccxml.org/.

[5] ACCETTA, M., BARON, R., BOLOSKY, W., GOLUB, D., RASHID, R., TEVANIAN, A., AND YOUNG, M. Mach: A new kernel foundation for UNIX development. In Proceedings of the USENIX Summer Conference (June 1986), pp. 93–113.

[6] APPAVOO, J., AUSLANDER, M., BUTRICO, M., DA SILVA, D. M., KRIEGER, O., MERGEN, M. F., OSTROWSKI, M., ROSENBURG, B., WISNIEWSKI, R. W., AND XENIDIS, J. K42: An open-source Linux-compatible scalable operating system kernel. IBM Systems Journal 44, 2 (2005), 427–440.

[7] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), ACM, pp. 164–177.

[8] BAUMANN, A., BARHAM, P., DAGAND, P.-E., HARRIS, T., ISAACS, R., PETER, S., ROSCOE, T., SCHÜPBACH, A., AND SINGHANIA, A. The multikernel: A new OS architecture for scalable multicore systems. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), pp. 29–44.

[9] BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. User-level interprocess communication for shared memory multiprocessors. ACM Transactions on Computer Systems 9, 2 (May 1991), 175–198.

[10] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the Symposium on Operating Systems Design and Implementation (Dec. 2008).

[11] BUGNION, E., DEVINE, S., AND ROSENBLUM, M. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the ACM Symposium on Operating System Principles (1997), pp. 143–156.

[12] COHEN, B. Incentives build robustness in BitTorrent, 2003.

[13] DASGUPTA, P., CHEN, R., MENON, S., PEARSON, M., ANANTHANARAYANAN, R., RAMACHANDRAN, U., AHAMAD, M., LEBLANC, R. J., APPLEBE, W., BERNABEU-AUBAN, J. M., HUTTO, P., KHALIDI, M., AND WILKENLOH, C. J. The design and implementation of the Clouds distributed operating system. USENIX Computing Systems Journal 3, 1 (1990), 11–46.

[14] DUNKELS, A., WOESTENBERG, L., MANSLEY, K., AND MONOSES, J. lwIP embedded TCP/IP stack. http://savannah.nongnu.org/projects/lwip/, accessed 2004.

[15] GAMSA, B., KRIEGER, O., APPAVOO, J., AND STUMM, M. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proceedings of the Symposium on Operating Systems Design and Implementation (Feb. 1999), pp. 87–100.

[16] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google file system. In Proceedings of the ACM Symposium on Operating System Principles (Oct. 2003).

[17] GOVIL, K., TEODOSIU, D., HUANG, Y., AND ROSENBLUM, M. Cellular Disco: Resource management using virtual clusters on shared-memory multiprocessors. In Proceedings of the ACM Symposium on Operating System Principles (1999), pp. 154–169.

[18] GSCHWIND, M., HOFSTEE, H. P., FLACHS, B., HOPKINS, M., WATANABE, Y., AND YAMAZAKI, T. Synergistic processing in Cell's multicore architecture. IEEE Micro 26, 2 (March–April 2006), 10–24.

[19] HOWARD, J., DIGHE, S., HOSKOTE, Y., VANGAL, S., FINAN, D., RUHL, G., JENKINS, D., WILSON, H., BORKAR, N., SCHROM, G., PAILET, F., JAIN, S., JACOB, T., YADA, S., MARELLA, S., SALIHUNDAM, P., ERRAGUNTLA, V., KONOW, M., RIEPEN, M., DROEGE, G., LINDEMANN, J., GRIES, M., APEL, T., HENRISS, K., LUND-LARSEN, T., STEIBL, S., BORKAR, S., DE, V., VAN DER WIJNGAART, R., AND MATTSON, T. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International (Feb. 2010), pp. 108–109.

[20] KUBIATOWICZ, J., BINDEL, D., CHEN, Y., CZERWINSKI, S., EATON, P., GEELS, D., GUMMADI, R., RHEA, S., WEATHERSPOON, H., WEIMER, W., WELLS, C., AND ZHAO, B. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the Conference on Architectural Support for Programming Languages and Operating Systems (Nov. 2000), pp. 190–201.

[21] LIEDTKE, J. On microkernel construction. In Proceedings of the ACM Symposium on Operating System Principles (Dec. 1995), pp. 237–250.

[22] OUSTERHOUT, J. K., CHERENSON, A. R., DOUGLIS, F., NELSON, M. N., AND WELCH, B. B. The Sprite network operating system. IEEE Computer 21, 2 (Feb. 1988), 23–36.

[23] RightScale home page. http://www.rightscale.com/.

[24] SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Computer 23, 5 (May 1990), 9–18, 20–21.

[25] SEILER, L., CARMEAN, D., SPRANGLE, E., FORSYTH, T., ABRASH, M., DUBEY, P., JUNKINS, S., LAKE, A., SUGERMAN, J., CAVIN, R., ESPASA, R., GROCHOWSKI, E., JUAN, T., AND HANRAHAN, P. Larrabee: A many-core x86 architecture for visual computing. In SIGGRAPH '08: ACM SIGGRAPH 2008 Papers (New York, NY, USA, 2008), ACM, pp. 1–15.

[26] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM (Aug. 2001), pp. 149–160.

[27] TANENBAUM, A. S., MULLENDER, S. J., AND VAN RENESSE, R. Using sparse capabilities in a distributed operating system. In Proceedings of the International Conference on Distributed Computing Systems (May 1986), pp. 558–563.

[28] WENTZLAFF, D., AND AGARWAL, A. Factored operating systems (fos): The case for a scalable operating system for multicores. SIGOPS Oper. Syst. Rev. 43, 2 (2009), 76–85.

[29] WENTZLAFF, D., GRIFFIN, P., HOFFMANN, H., BAO, L., EDWARDS, B., RAMEY, C., MATTINA, M., MIAO, C.-C., BROWN III, J. F., AND AGARWAL, A. On-chip interconnection architecture of the Tile Processor. IEEE Micro 27, 5 (Sept. 2007), 15–31.

[30] WENTZLAFF, D., GRUENWALD III, C., BECKMANN, N., MODZELEWSKI, K., BELAY, A., YOUSEFF, L., MILLER, J., AND AGARWAL, A. An operating system for multicore and clouds: Mechanisms and implementation. In Proceedings of the ACM Symposium on Cloud Computing (SOCC) (June 2010).

Notes

1 In fact, even the name service itself is implemented as a fleet.

2 For example, the name lookup of /foo/bar results in the symbolic name /foo/bar/3, which is the third member of the fleet. This is the name that is cached, and subsequent requests forward to this name, wherever it should be.

3 A simple example of this is local state kept by each fleet member for a transaction, for example a TCP/IP connection. In this case, routing through the name service is insufficient because state has already been created during connection establishment, and the connection is associated with a particular fleet member when the first message on that connection arrives. Another example is a request that uses a hardware resource on a different machine. In this case, the request must be forwarded to the fleet member on the machine that has access to the hardware.

4 This ensures that no local state is stored on the file system server, and eliminates the need to distribute the state for open files.

5 Note that roughly one-third of the transactions lie within this range, but the fleet processes these transactions at nearly double the rate. Because of this, a fleet generally needs to be provisioned extra resources for less than the percentage of requests that saturate the service; demand on cores is therefore lower than one might otherwise expect.
