
Xen and the Art of Open Source Virtualization

Keir Fraser, Steven Hand, Christian Limpach, Ian Pratt
University of Cambridge Computer Laboratory

{first.last}@cl.cam.ac.uk

Dan Magenheimer
Hewlett-Packard Laboratories

{first.last}@hp.com

Abstract

Virtual machine (VM) technology has been around for 40 years and has been experiencing a resurgence with commodity machines. VMs have been shown to improve system and network flexibility, availability, and security in a variety of novel ways. This paper introduces Xen, an efficient secure open source VM monitor, to the Linux community.

Key features of Xen are:

1. supports different OSes (e.g. Linux 2.4, 2.6, NetBSD, FreeBSD, etc.)

2. provides secure protection between VMs

3. allows flexible partitioning of resources between VMs (CPU, memory, network bandwidth, disk space, and disk bandwidth)

4. very low overhead, even for demanding server applications

5. support for seamless, low-latency migration of running VMs within a cluster

We discuss the interface that Xen/x86 exports to guest operating systems, and the kernel changes required to port Linux to Xen. We compare Xen/Linux to User Mode Linux as well as to existing commercial VM products.

1 Introduction

Modern computers are sufficiently powerful to use virtualization to present the illusion of many smaller virtual machines (VMs), each running a separate operating system instance. This has led to a resurgence of interest in VM technology. In this paper we present Xen, a high performance resource-managed virtual machine monitor (VMM) which enables applications such as server consolidation, co-located hosting facilities, distributed web services, secure computing platforms, and application mobility.

Successful partitioning of a machine to support the concurrent execution of multiple operating systems poses several challenges. Firstly, virtual machines must be isolated from one another: it is not acceptable for the execution of one to adversely affect the performance of another. This is particularly true when virtual machines are owned by mutually untrusting users. Secondly, it is necessary to support a variety of different operating systems to accommodate the heterogeneity of popular applications. Thirdly, the performance overhead introduced by virtualization should be small.


Xen hosts commodity operating systems, albeit with some source modifications. The prototype described and evaluated in this paper can support multiple concurrent instances of our XenLinux guest operating system; each instance exports an application binary interface identical to a non-virtualized Linux 2.6. Xen ports of NetBSD and FreeBSD have been completed, along with a proof-of-concept port of Windows XP.¹

There are a number of ways to build a system to host multiple applications and servers on a shared machine. Perhaps the simplest is to deploy one or more hosts running a standard operating system such as Linux or Windows, and then to allow users to install files and start processes, with protection between applications provided by conventional OS techniques. Experience shows that system administration can quickly become a time-consuming task due to complex configuration interactions between supposedly disjoint applications.

More importantly, such systems do not adequately support performance isolation; the scheduling priority, memory demand, network traffic and disk accesses of one process impact the performance of others. This may be acceptable when there is adequate provisioning and a closed user group (such as in the case of computational grids, or the experimental PlanetLab platform [11]), but not when resources are oversubscribed, or users uncooperative.

One way to address this problem is to retrofit support for performance isolation to the operating system, but a difficulty with such approaches is ensuring that all resource usage is accounted to the correct process; consider, for example, the complex interactions between applications due to buffer cache or page replacement algorithms. Performing multiplexing at a low level can mitigate this problem; unintentional or undesired interactions between tasks are minimized. Xen multiplexes physical resources at the granularity of an entire operating system and is able to provide performance isolation between them. This allows a range of guest operating systems to gracefully coexist rather than mandating a specific application binary interface. There is a price to pay for this flexibility: running a full OS is more heavyweight than running a process, both in terms of initialization (e.g. booting or resuming an OS instance versus fork/exec), and in terms of resource consumption.

¹The Windows XP port required access to Microsoft source code, and hence distribution is currently restricted, even in binary form.

For our target of 10-100 hosted OS instances, we believe this price is worth paying: it allows individual users to run unmodified binaries, or collections of binaries, in a resource controlled fashion (for instance an Apache server along with a PostgreSQL backend). Furthermore it provides an extremely high level of flexibility since the user can dynamically create the precise execution environment their software requires. Unfortunate configuration interactions between various services and applications are avoided (for example, each Windows instance maintains its own registry).

Experience with deployed Xen systems suggests that the initialization overheads and additional resource requirements are in practice quite low: an operating system image may be resumed from an on-disk snapshot in typically just over a second (depending on image memory size), and although multiple copies of the operating system code and data are stored in memory, the memory requirements are typically small compared to those of the applications that will run on them. As we shall show later in the paper, the performance overhead of the virtualization provided by Xen is low, typically just a few percent, even for the most demanding applications.


2 XEN: Approach & Overview

In a traditional VMM the virtual hardware exposed is functionally identical to the underlying machine [14]. Although full virtualization has the obvious benefit of allowing unmodified operating systems to be hosted, it also has a number of drawbacks. This is particularly true for the prevalent Intel x86 architecture.

Support for full virtualization was never part of the x86 architectural design. Certain supervisor instructions must be handled by the VMM for correct virtualization, but executing these with insufficient privilege fails silently rather than causing a convenient trap [13]. Efficiently virtualizing the x86 MMU is also difficult. These problems can be solved, but only at the cost of increased complexity and reduced performance. VMware's ESX Server [3] dynamically rewrites portions of the hosted machine code to insert traps wherever VMM intervention might be required. This translation is applied to the entire guest OS kernel (with associated translation, execution, and caching costs) since all non-trapping privileged instructions must be caught and handled. ESX Server implements shadow versions of system structures such as page tables and maintains consistency with the virtual tables by trapping every update attempt; this approach has a high cost for update-intensive operations such as creating a new application process.

Notwithstanding the intricacies of the x86, there are other arguments against full virtualization. In particular, there are situations in which it is desirable for the hosted operating systems to see real as well as virtual resources: providing both real and virtual time allows a guest OS to better support time-sensitive tasks, and to correctly handle TCP timeouts and RTT estimates, while exposing real machine addresses allows a guest OS to improve performance by using superpages [10] or page coloring [7].

We avoid the drawbacks of full virtualization by presenting a virtual machine abstraction that is similar but not identical to the underlying hardware, an approach which has been dubbed paravirtualization [17]. This promises improved performance, although it does require modifications to the guest operating system. It is important to note, however, that we do not require changes to the application binary interface (ABI), and hence no modifications are required to guest applications.

We distill the discussion so far into a set of design principles:

1. Support for unmodified application binaries is essential, or users will not transition to Xen. Hence we must virtualize all architectural features required by existing standard ABIs.

2. Supporting full multi-application operating systems is important, as this allows complex server configurations to be virtualized within a single guest OS instance.

3. Paravirtualization is necessary to obtain high performance and strong resource isolation on uncooperative machine architectures such as x86.

4. Even on cooperative machine architectures, completely hiding the effects of resource virtualization from guest OSes risks both correctness and performance.

In the following section we describe the virtual machine abstraction exported by Xen and discuss how a guest OS must be modified to conform to this. Note that in this paper we reserve the term guest operating system to refer to one of the OSes that Xen can host and we use the term domain to refer to a running virtual machine within which a guest OS executes; the distinction is analogous to that between a program and a process in a conventional system. We call Xen itself the hypervisor since it operates at a higher privilege level than the supervisor code of the guest operating systems that it hosts.

2.1 The Virtual Machine Interface

The paravirtualized x86 interface can be factored into three broad aspects of the system: memory management, the CPU, and device I/O. In the following we address each machine subsystem in turn, and discuss how each is presented in our paravirtualized architecture. Note that although certain parts of our implementation, such as memory management, are specific to the x86, many aspects (such as our virtual CPU and I/O devices) can be readily applied to other machine architectures. Furthermore, x86 represents a worst case in the areas where it differs significantly from RISC-style processors; for example, efficiently virtualizing hardware page tables is more difficult than virtualizing a software-managed TLB.

2.1.1 Memory management

Virtualizing memory is undoubtedly the most difficult part of paravirtualizing an architecture, both in terms of the mechanisms required in the hypervisor and the modifications required to port each guest OS. The task is easier if the architecture provides a software-managed TLB as these can be efficiently virtualized in a simple manner [5]. A tagged TLB is another useful feature supported by most server-class RISC architectures, including Alpha, MIPS and SPARC. Associating an address-space identifier tag with each TLB entry allows the hypervisor and each guest OS to efficiently coexist in separate address spaces because there is no need to flush the entire TLB when transferring execution.

Unfortunately, x86 does not have a software-managed TLB; instead TLB misses are serviced automatically by the processor by walking the page table structure in hardware. Thus to achieve the best possible performance, all valid page translations for the current address space should be present in the hardware-accessible page table. Moreover, because the TLB is not tagged, address space switches typically require a complete TLB flush. Given these limitations, we made two decisions: (i) guest OSes are responsible for allocating and managing the hardware page tables, with minimal involvement from Xen to ensure safety and isolation; and (ii) Xen exists in a 64MB section at the top of every address space, thus avoiding a TLB flush when entering and leaving the hypervisor.

Each time a guest OS requires a new page table, perhaps because a new process is being created, it allocates and initializes a page from its own memory reservation and registers it with Xen. At this point the OS must relinquish direct write privileges to the page-table memory: all subsequent updates must be validated by Xen. This restricts updates in a number of ways, including only allowing an OS to map pages that it owns, and disallowing writable mappings of page tables. Guest OSes may batch update requests to amortize the overhead of entering the hypervisor. The top 64MB region of each address space, which is reserved for Xen, is not accessible or remappable by guest OSes. This address region is not used by any of the common x86 ABIs however, so this restriction does not break application compatibility.
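To make the batching mechanism concrete, the following is a minimal sketch of how a paravirtualized guest might queue page-table updates and flush them to the hypervisor in one trap. The structure and function names (pt_update, queue_pt_update, xen_flush_pt_updates, hypervisor_mmu_update) are illustrative assumptions, not the actual XenLinux or Xen interface.

```c
/* Illustrative sketch of batched page-table updates; the names and the
 * hypercall stub are hypothetical, not the real Xen/XenLinux interface. */
#include <stddef.h>
#include <stdint.h>

#define PT_UPDATE_BATCH 64              /* flush the queue when full */

struct pt_update {
    uint64_t ptr;                       /* machine address of the PTE  */
    uint64_t val;                       /* new value, validated by Xen */
};

static struct pt_update queue[PT_UPDATE_BATCH];
static size_t queued;

/* Stub standing in for a trap into the hypervisor: Xen checks that the
 * domain owns the pages and that no writable mapping of a page table is
 * created, then applies the whole list. */
static int hypervisor_mmu_update(struct pt_update *req, size_t count)
{
    (void)req; (void)count;
    return 0;
}

static void xen_flush_pt_updates(void)
{
    if (queued == 0)
        return;
    hypervisor_mmu_update(queue, queued);   /* one trap for many updates */
    queued = 0;
}

/* Called instead of writing the PTE directly, since registered page
 * tables are mapped read-only in the guest. */
static void queue_pt_update(uint64_t pte_machine_addr, uint64_t new_val)
{
    queue[queued].ptr = pte_machine_addr;
    queue[queued].val = new_val;
    if (++queued == PT_UPDATE_BATCH)
        xen_flush_pt_updates();
}
```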

Segmentation is virtualized in a similar way, by validating updates to hardware segment descriptor tables. The only restrictions on x86 segment descriptors are: (i) they must have lower privilege than Xen, and (ii) they may not allow any access to the Xen-reserved portion of the address space.

2.1.2 CPU

Virtualizing the CPU has several implications for guest OSes. Principally, the insertion of a hypervisor below the operating system violates the usual assumption that the OS is the most privileged entity in the system. In order to protect the hypervisor from OS misbehavior (and domains from one another) guest OSes must be modified to run at a lower privilege level.

Efficient virtualization of privilege levels is possible on x86 because it supports four distinct privilege levels in hardware. The x86 privilege levels are generally described as rings, and are numbered from zero (most privileged) to three (least privileged). OS code typically executes in ring 0 because no other ring can execute privileged instructions, while ring 3 is generally used for application code. To our knowledge, rings 1 and 2 have not been used by any well-known x86 OS since OS/2. Any OS which follows this common arrangement can be ported to Xen by modifying it to execute in ring 1. This prevents the guest OS from directly executing privileged instructions, yet it remains safely isolated from applications running in ring 3.

Privileged instructions are paravirtualized by requiring them to be validated and executed within Xen; this applies to operations such as installing a new page table, or yielding the processor when idle (rather than attempting to hlt it). Any guest OS attempt to directly execute a privileged instruction is failed by the processor, either silently or by taking a fault, since only Xen executes at a sufficiently privileged level.

Exceptions, including memory faults and software traps, are virtualized on x86 very straightforwardly. A table describing the handler for each type of exception is registered with Xen for validation. The handlers specified in this table are generally identical to those for real x86 hardware; this is possible because the exception stack frames are unmodified in our paravirtualized architecture. The sole modification is to the page fault handler, which would normally read the faulting address from a privileged processor register (CR2); since this is not possible, we write it into an extended stack frame.² When an exception occurs while executing outside ring 0, Xen's handler creates a copy of the exception stack frame on the guest OS stack and returns control to the appropriate registered handler.

Typically only two types of exception occur frequently enough to affect system performance: system calls (which are usually implemented via a software exception), and page faults. We improve the performance of system calls by allowing each guest OS to register a 'fast' exception handler which is accessed directly by the processor without indirecting via ring 0; this handler is validated before installing it in the hardware exception table. Unfortunately it is not possible to apply the same technique to the page fault handler because only code executing in ring 0 can read the faulting address from register CR2; page faults must therefore always be delivered via Xen so that this register value can be saved for access in ring 1.
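The sketch below illustrates the general shape of exception-handler registration from the guest side, under stated assumptions: the trap_entry layout, the register_trap_table and register_fast_syscall stubs, and the ring-1 code segment selector are all hypothetical, not the real Xen ABI.

```c
/* Illustrative sketch of exception-handler registration; the types,
 * helper functions and the segment selector are hypothetical. */
#include <stdint.h>

#define GUEST_KERNEL_CS 0x11            /* assumed ring-1 code segment */

struct trap_entry {
    uint8_t  vector;                    /* exception/interrupt number  */
    uint16_t cs;                        /* handler's code segment      */
    void    (*handler)(void);           /* entry point in the guest    */
};

/* Stubs standing in for traps into Xen.  Xen rejects any entry whose
 * code segment would execute in ring 0. */
static void register_trap_table(const struct trap_entry *t, int n)
{
    (void)t; (void)n;
}
static void register_fast_syscall(const struct trap_entry *t)
{
    (void)t;
}

static void guest_page_fault(void)
{
    /* Reads the faulting address from the extended stack frame,
     * since CR2 is not readable outside ring 0. */
}
static void guest_syscall(void) { /* normal system-call entry */ }

static void guest_install_handlers(void)
{
    static const struct trap_entry table[] = {
        { 14, GUEST_KERNEL_CS, guest_page_fault },  /* #PF: always via Xen */
    };
    static const struct trap_entry syscall_entry =
        { 0x80, GUEST_KERNEL_CS, guest_syscall };

    register_trap_table(table, 1);
    register_fast_syscall(&syscall_entry);  /* bypasses ring 0 once validated */
}
```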

Safety is ensured by validating exception handlers when they are presented to Xen. The only required check is that the handler's code segment does not specify execution in ring 0. Since no guest OS can create such a segment, it suffices to compare the specified segment selector to a small number of static values which are reserved by Xen. Apart from this, any other handler problems are fixed up during exception propagation; for example, if the handler's code segment is not present or if the handler is not paged into memory then an appropriate fault will be taken when Xen executes the iret instruction which returns to the handler. Xen detects these "double faults" by checking the faulting program counter value: if the address resides within the exception-virtualizing code then the offending guest OS is terminated.

²In hindsight, writing the value into a pre-agreed shared memory location rather than modifying the stack frame would have simplified the XP port.

Note that this "lazy" checking is safe even for the direct system-call handler: access faults will occur when the CPU attempts to directly jump to the guest OS handler. In this case the faulting address will be outside Xen (since Xen will never execute a guest OS system call) and so the fault is virtualized in the normal way. If propagation of the fault causes a further "double fault" then the guest OS is terminated as described above.

2.1.3 Device I/O

Rather than emulating existing hardware devices, as is typically done in fully-virtualized environments, Xen exposes a set of clean and simple device abstractions. This allows us to design an interface that is both efficient and satisfies our requirements for protection and isolation. To this end, I/O data is transferred to and from each domain via Xen, using shared-memory, asynchronous buffer-descriptor rings. These provide a high-performance communication mechanism for passing buffer information vertically through the system, while allowing Xen to efficiently perform validation checks (for example, checking that buffers are contained within a domain's memory reservation).

Linux subsection                    # lines
Architecture-independent                 78
Virtual network driver                  484
Virtual block-device driver            1070
Xen-specific (non-driver)              1363
Total                                  2995
Portion of total x86 code base        1.36%

Table 1: The simplicity of porting commodity OSes to Xen.

Similar to hardware interrupts, Xen supports a lightweight event-delivery mechanism which is used for sending asynchronous notifications to a domain. These notifications are made by updating a bitmap of pending event types and, optionally, by calling an event handler specified by the guest OS. These callbacks can be 'held off' at the discretion of the guest OS, to avoid extra costs incurred by frequent wake-up notifications, for example.

2.2 The Cost of Porting an OS to Xen

Table 1 demonstrates the cost, in lines of code, of porting commodity operating systems to Xen's paravirtualized x86 environment.

The architecture-specific sections are effectively a port of the x86 code to our paravirtualized architecture. This involved rewriting routines which used privileged instructions, and removing a large amount of low-level system initialization code.

2.3 Control and Management

Throughout the design and implementation of Xen, a goal has been to separate policy from mechanism wherever possible. Although the hypervisor must be involved in data-path aspects (for example, scheduling the CPU between domains, filtering network packets before transmission, or enforcing access control when reading data blocks), there is no need for it to be involved in, or even aware of, higher level issues such as how the CPU is to be shared, or which kinds of packet each domain may transmit.

Figure 1: The structure of a machine running the Xen hypervisor, hosting a number of different guest operating systems, including Domain0 running control software in a XenLinux environment. (The figure shows Xen layered above the hardware (SMP x86, physical memory, Ethernet, SCSI/IDE), exporting a virtual x86 CPU, virtual physical memory, a virtual network and virtual block devices to the guest OSes (XenoLinux, XenoBSD, XenoXP), each with Xeno-aware device drivers; Domain0 runs the control-plane software and uses the control interface.)

The resulting architecture is one in which the hypervisor itself provides only basic control operations. These are exported through an interface accessible from authorized domains; potentially complex policy decisions, such as admission control, are best performed by management software running over a guest OS rather than in privileged hypervisor code.

The overall system structure is illustrated in Figure 1. Note that a domain is created at boot time which is permitted to use the control interface. This initial domain, termed Domain0, is responsible for hosting the application-level management software. The control interface provides the ability to create and terminate other domains and to control their associated scheduling parameters, physical memory allocations and the access they are given to the machine's physical disks and network devices.

In addition to processor and memory resources, the control interface supports the creation and deletion of virtual network interfaces (VIFs) and block devices (VBDs). These virtual I/O devices have associated access-control information which determines which domains can access them, and with what restrictions (for example, a read-only VBD may be created, or a VIF may filter IP packets to prevent source-address spoofing or apply traffic shaping).

This control interface, together with profiling statistics on the current state of the system, is exported to a suite of application-level management software running in Domain0. This complement of administrative tools allows convenient management of the entire server: current tools can create and destroy domains, set network filters and routing rules, monitor per-domain network activity at packet and flow granularity, and create and delete virtual network interfaces and virtual block devices.

Snapshots of a domain's state may be captured and saved to disk, enabling rapid deployment of applications by bypassing the normal boot delay. Further, Xen supports live migration, which enables running VMs to be moved dynamically between different Xen servers, with execution interrupted only for a few milliseconds. We are in the process of developing higher-level tools to further automate the application of administrative policy, for example, load balancing VMs among a cluster of Xen servers.

3 Detailed Design

In this section we introduce the design of the major subsystems that make up a Xen-based server. In each case we present both Xen and guest OS functionality for clarity of exposition. In this paper, we focus on the XenLinux guest OS; the *BSD and Windows XP ports use the Xen interface in a similar manner.


3.1 Control Transfer: Hypercalls and Events

Two mechanisms exist for control interactions between Xen and an overlying domain: synchronous calls from a domain to Xen may be made using a hypercall, while notifications are delivered to domains from Xen using an asynchronous event mechanism.

The hypercall interface allows domains to perform a synchronous software trap into the hypervisor to perform a privileged operation, analogous to the use of system calls in conventional operating systems. An example use of a hypercall is to request a set of page-table updates, in which Xen validates and applies a list of updates, returning control to the calling domain when this is completed.
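For a concrete picture of the mechanism, a hypercall can be issued much like a conventional system call, by trapping into the more privileged layer with an operation number and arguments in registers. The sketch below assumes a software-interrupt vector of 0x82 and a particular register convention; treat both as illustrative rather than a definitive statement of Xen's calling convention.

```c
/* Minimal hypercall stub for x86-32.  The trap vector (0x82) and the
 * register convention are assumptions made for illustration. */
static inline long hypercall2(int op, long arg1, long arg2)
{
    long ret;
    /* Trap into the hypervisor, much as 'int 0x80' enters a conventional
     * kernel for a system call; Xen validates op and its arguments. */
    __asm__ __volatile__ ("int $0x82"
                          : "=a" (ret)
                          : "0" (op), "b" (arg1), "c" (arg2)
                          : "memory");
    return ret;
}
```

A guest would hide such a stub behind higher-level helpers, for example one that submits a whole batch of page-table updates in a single trap.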

Communication from Xen to a domain is provided through an asynchronous event mechanism, which replaces the usual delivery mechanisms for device interrupts and allows lightweight notification of important events such as domain-termination requests. Akin to traditional Unix signals, there are only a small number of events, each acting to flag a particular type of occurrence. For instance, events are used to indicate that new data has been received over the network, or that a virtual disk request has completed.

Pending events are stored in a per-domain bitmask which is updated by Xen before invoking an event-callback handler specified by the guest OS. The callback handler is responsible for resetting the set of pending events, and responding to the notifications in an appropriate manner. A domain may explicitly defer event handling by setting a Xen-readable software flag: this is analogous to disabling interrupts on a real processor.
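The following sketch shows one plausible shape for the guest-side callback: scan and clear the pending bitmap, dispatch per-event handlers, and honor a mask flag that defers delivery. The shared_info layout and field names here are assumptions for illustration, not Xen's actual shared-info page.

```c
/* Illustrative event-callback skeleton; the shared_info layout and the
 * field names are hypothetical, not Xen's actual shared-info page. */
#include <stdint.h>

#define NR_EVENT_TYPES 32

struct shared_info {
    volatile uint32_t events_pending;   /* bitmap, written by Xen           */
    volatile uint32_t events_masked;    /* set by the guest to defer events */
};

static struct shared_info shared;       /* in reality, a page shared with Xen */
static void (*event_handler[NR_EVENT_TYPES])(void);

/* Registered with Xen as the per-domain callback; analogous to a
 * top-level interrupt handler on real hardware. */
static void event_callback(void)
{
    while (!shared.events_masked && shared.events_pending) {
        /* Claim-and-clear the pending set before servicing it. */
        uint32_t pending = __atomic_exchange_n(&shared.events_pending, 0,
                                               __ATOMIC_ACQ_REL);
        for (int bit = 0; bit < NR_EVENT_TYPES; bit++)
            if ((pending & (1u << bit)) && event_handler[bit])
                event_handler[bit]();
    }
}
```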

Figure 2: The structure of asynchronous I/O rings, which are used for data transfer between Xen and guest OSes.

3.2 Data Transfer: I/O Rings

The presence of a hypervisor means there is an additional protection domain between guest OSes and I/O devices, so it is crucial that a data transfer mechanism be provided that allows data to move vertically through the system with as little overhead as possible.

Two main factors have shaped the design of our I/O-transfer mechanism: resource management and event notification. For resource accountability, we attempt to minimize the work required to demultiplex data to a specific domain when an interrupt is received from a device; the overhead of managing buffers is carried out later where computation may be accounted to the appropriate domain. Similarly, memory committed to device I/O is provided by the relevant domains wherever possible to prevent the crosstalk inherent in shared buffer pools; I/O buffers are protected during data transfer by pinning the underlying page frames within Xen.

Figure 2 shows the structure of our I/O descriptor rings. A ring is a circular queue of descriptors allocated by a domain but accessible from within Xen. Descriptors do not directly contain I/O data; instead, I/O data buffers are allocated out-of-band by the guest OS and indirectly referenced by I/O descriptors. Access to each ring is based around two pairs of producer-consumer pointers: domains place requests on a ring, advancing a request producer pointer, and Xen removes these requests for handling, advancing an associated request consumer pointer. Responses are placed back on the ring similarly, save with Xen as the producer and the guest OS as the consumer. There is no requirement that requests be processed in order: the guest OS associates a unique identifier with each request which is reproduced in the associated response. This allows Xen to unambiguously reorder I/O operations due to scheduling or priority considerations.

This structure is sufficiently generic to support a number of different device paradigms. For example, a set of 'requests' can provide buffers for network packet reception; subsequent 'responses' then signal the arrival of packets into these buffers. Reordering is useful when dealing with disk requests as it allows them to be scheduled within Xen for efficiency, and the use of descriptors with out-of-band buffers makes implementing zero-copy transfer easy.

We decouple the production of requests or responses from the notification of the other party: in the case of requests, a domain may enqueue multiple entries before invoking a hypercall to alert Xen; in the case of responses, a domain can defer delivery of a notification event by specifying a threshold number of responses. This allows each domain to trade off latency and throughput requirements, similarly to the flow-aware interrupt dispatch in the ArseNIC Gigabit Ethernet interface [12].
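A minimal sketch of such a descriptor ring is shown below, assuming an illustrative layout (the field names, the single shared descriptor array, and the fixed ring size are not Xen's actual ring format). The point is the decoupling: the producer index is advanced locally and cheaply, and notification is a separate, batched step.

```c
/* Sketch of a shared-memory descriptor ring with decoupled notification.
 * The layout and field names are illustrative, not Xen's ring format. */
#include <stdint.h>

#define RING_SIZE 64                    /* power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring_desc {
    uint64_t buffer_machine_addr;       /* out-of-band I/O buffer      */
    uint32_t length;
    uint32_t id;                        /* echoed back in the response */
};

struct io_ring {
    volatile uint32_t req_prod, req_cons;   /* guest produces requests   */
    volatile uint32_t rsp_prod, rsp_cons;   /* Xen produces responses    */
    uint32_t rsp_notify_threshold;          /* defer events until this   */
    struct ring_desc ring[RING_SIZE];       /* simplified: one array     */
};

/* Guest side: enqueue a request.  The caller may queue several entries
 * and then issue a single notification hypercall for the whole batch. */
static int ring_put_request(struct io_ring *r, const struct ring_desc *d)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                          /* ring full: caller must wait */
    r->ring[r->req_prod & RING_MASK] = *d;
    __atomic_thread_fence(__ATOMIC_RELEASE);/* publish data before index   */
    r->req_prod++;
    return 0;
}
```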

3.3 Subsystem Virtualization

The control and data transfer mechanisms described are used in our virtualization of the various subsystems. In the following, we discuss how this virtualization is achieved for CPU, timers, memory, network and disk.

3.3.1 CPU scheduling

Xen currently schedules domains according to the Borrowed Virtual Time (BVT) scheduling algorithm [4]. We chose this particular algorithm since it is both work-conserving and has a special mechanism for low-latency wake-up (or dispatch) of a domain when it receives an event. Fast dispatch is particularly important to minimize the effect of virtualization on OS subsystems that are designed to run in a timely fashion; for example, TCP relies on the timely delivery of acknowledgments to correctly estimate network round-trip times. BVT provides low-latency dispatch by using virtual-time warping, a mechanism which temporarily violates 'ideal' fair sharing to favor recently-woken domains. However, other scheduling algorithms could be trivially implemented over our generic scheduler abstraction. Per-domain scheduling parameters can be adjusted by management software running in Domain0.

3.3.2 Time and timers

Xen provides guest OSes with notions of real time, virtual time and wall-clock time. Real time is expressed in nanoseconds passed since machine boot, is maintained to the accuracy of the processor's cycle counter and can be frequency-locked to an external time source (for example, via NTP). A domain's virtual time only advances while it is executing: this is typically used by the guest OS scheduler to ensure correct sharing of its timeslice between application processes. Finally, wall-clock time is specified as an offset to be added to the current real time. This allows the wall-clock time to be adjusted without affecting the forward progress of real time.

Each guest OS can program a pair of alarm timers, one for real time and the other for virtual time. Guest OSes are expected to maintain internal timer queues and use the Xen-provided alarm timers to trigger the earliest timeout. Timeouts are delivered using Xen's event mechanism.
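A guest can therefore collapse its whole timer queue onto the single Xen alarm by always programming the earliest pending deadline, as in this small sketch (set_xen_alarm and the queue structure are hypothetical stand-ins, not the real interface):

```c
/* Sketch: fold the guest's timer queue onto one Xen alarm timer.
 * set_xen_alarm() is a hypothetical stand-in for the real hypercall. */
#include <stdint.h>

struct guest_timer {
    uint64_t expiry_ns;                 /* Xen real time: ns since boot */
    struct guest_timer *next;           /* queue kept sorted by expiry  */
};

static struct guest_timer *timer_queue; /* head = earliest deadline */

static void set_xen_alarm(uint64_t expiry_ns) { (void)expiry_ns; }

/* Called whenever the queue head changes, or from the timeout event
 * handler after expired timers have been run. */
static void reprogram_alarm(void)
{
    if (timer_queue)
        set_xen_alarm(timer_queue->expiry_ns);  /* only the earliest timeout */
}
```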

3.3.3 Virtual address translation

As with other subsystems, Xen attempts to virtualize memory access with as little overhead as possible. As discussed in Section 2.1.1, this goal is made somewhat more difficult by the x86 architecture's use of hardware page tables. The approach taken by VMware is to provide each guest OS with a virtual page table, not visible to the memory-management unit (MMU) [3]. The hypervisor is then responsible for trapping accesses to the virtual page table, validating updates, and propagating changes back and forth between it and the MMU-visible 'shadow' page table. This greatly increases the cost of certain guest OS operations, such as creating new virtual address spaces, and requires explicit propagation of hardware updates to 'accessed' and 'dirty' bits.

Although full virtualization forces the use of shadow page tables, to give the illusion of contiguous physical memory, Xen is not so constrained. Indeed, Xen need only be involved in page table updates, to prevent guest OSes from making unacceptable changes. Thus we avoid the overhead and additional complexity associated with the use of shadow page tables: the approach in Xen is to register guest OS page tables directly with the MMU, and restrict guest OSes to read-only access. Page table updates are passed to Xen via a hypercall; to ensure safety, requests are validated before being applied.

To aid validation, we associate a type and reference count with each machine page frame. A frame may have any one of the following mutually-exclusive types at any point in time: page directory (PD), page table (PT), local descriptor table (LDT), global descriptor table (GDT), or writable (RW). Note that a guest OS may always create readable mappings to its own page frames, regardless of their current types. A frame may only safely be retasked when its reference count is zero. This mechanism is used to maintain the invariants required for safety; for example, a domain cannot have a writable mapping to any part of a page table as this would require the frame concerned to simultaneously be of types PT and RW.

The type system is also used to track which frames have already been validated for use in page tables. To this end, guest OSes indicate when a frame is allocated for page-table use; this requires a one-off validation of every entry in the frame by Xen, after which its type is pinned to PD or PT as appropriate, until a subsequent unpin request from the guest OS. This is particularly useful when changing the page table base pointer, as it obviates the need to validate the new page table on every context switch. Note that a frame cannot be retasked until it is both unpinned and its reference count has reduced to zero; this prevents guest OSes from using unpin requests to circumvent the reference-counting mechanism.
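A simplified sketch of the invariants this type system enforces is given below; the representation (frame_info, the enum values, the helper names) is illustrative rather than Xen's internal data structures.

```c
/* Sketch of per-frame type and reference counting used to validate
 * page-table usage; the representation is illustrative only. */
#include <stdbool.h>
#include <stdint.h>

enum frame_type { FT_RW, FT_PT, FT_PD, FT_LDT, FT_GDT };

struct frame_info {
    enum frame_type type;
    uint32_t refcount;                  /* existing uses of the frame    */
    bool     pinned;                    /* validated and pinned as PT/PD */
};

/* A writable mapping may only refer to an RW frame: this is the invariant
 * that stops a domain writing its own page tables directly. */
static bool may_create_writable_mapping(struct frame_info *f)
{
    if (f->type != FT_RW)
        return false;
    f->refcount++;
    return true;
}

/* A frame may be retasked (e.g. RW -> PT) only when it is unpinned and no
 * references remain, so unpin requests cannot bypass the refcounts. */
static bool retask_frame(struct frame_info *f, enum frame_type new_type)
{
    if (f->pinned || f->refcount != 0)
        return false;
    f->type = new_type;
    return true;
}
```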

3.3.4 Physical memory

The initial memory allocation, or reservation, for each domain is specified at the time of its creation; memory is thus statically partitioned between domains, providing strong isolation. A maximum-allowable reservation may also be specified: if memory pressure within a domain increases, it may then attempt to claim additional memory pages from Xen, up to this reservation limit. Conversely, if a domain wishes to save resources, perhaps to avoid incurring unnecessary costs, it can reduce its memory reservation by releasing memory pages back to Xen.

XenLinux implements a balloon driver [16], which adjusts a domain's memory usage by passing memory pages back and forth between Xen and XenLinux's page allocator. Although we could modify Linux's memory-management routines directly, the balloon driver makes adjustments by using existing OS functions, thus simplifying the Linux porting effort. However, paravirtualization can be used to extend the capabilities of the balloon driver; for example, the out-of-memory handling mechanism in the guest OS can be modified to automatically alleviate memory pressure by requesting more memory from Xen.
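The following caricature shows the inflate path of such a balloon driver, using hypothetical helpers (alloc_guest_page, return_page_to_xen) in place of the real guest allocator and Xen memory interface:

```c
/* Caricature of balloon inflation: shrink the domain's footprint by taking
 * pages from the guest allocator and handing them back to Xen.  Both
 * helpers below are hypothetical stubs. */
#include <stddef.h>

static void *alloc_guest_page(void)          { return 0; }   /* guest allocator */
static void  return_page_to_xen(void *page)  { (void)page; } /* releases frame  */

/* Returns the number of pages actually released to Xen. */
static size_t balloon_inflate(size_t nr_pages)
{
    size_t done;
    for (done = 0; done < nr_pages; done++) {
        void *page = alloc_guest_page();     /* page now owned by the balloon */
        if (!page)
            break;                           /* guest has no more free memory */
        return_page_to_xen(page);            /* reservation shrinks by one    */
    }
    return done;
}
```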

Most operating systems assume that memory comprises at most a few large contiguous extents. Because Xen does not guarantee to allocate contiguous regions of memory, guest OSes will typically create for themselves the illusion of contiguous physical memory, even though their underlying allocation of hardware memory is sparse. Mapping from physical to hardware addresses is entirely the responsibility of the guest OS, which can simply maintain an array indexed by physical page frame number. Xen supports efficient hardware-to-physical mapping by providing a shared translation array that is directly readable by all domains; updates to this array are validated by Xen to ensure that the OS concerned owns the relevant hardware page frames.

Note that even if a guest OS chooses to ignore hardware addresses in most cases, it must use the translation tables when accessing its page tables (which necessarily use hardware addresses). Hardware addresses may also be exposed to limited parts of the OS's memory-management system to optimize memory access. For example, a guest OS might allocate particular hardware pages so as to optimize placement within a physically indexed cache [7], or map naturally aligned contiguous portions of hardware memory using superpages [10].

3.3.5 Network

Xen provides the abstraction of a virtual firewall-router (VFR), where each domain has one or more network interfaces (VIFs) logically attached to the VFR. A VIF looks somewhat like a modern network interface card: there are two I/O rings of buffer descriptors, one for transmit and one for receive. Each direction also has a list of associated rules of the form (<pattern>, <action>); if the pattern matches then the associated action is applied.

Domain0 is responsible for inserting and removing rules. In typical cases, rules will be installed to prevent IP source address spoofing, and to ensure correct demultiplexing based on destination IP address and port. Rules may also be associated with hardware interfaces on the VFR. In particular, we may install rules to perform traditional firewalling functions such as preventing incoming connection attempts on insecure ports.
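The sketch below shows what matching a (<pattern>, <action>) rule list against a packet header might look like; the rule structure, the fields matched, and the default action are illustrative choices, not Xen's actual rule format.

```c
/* Sketch of (<pattern>, <action>) rule matching on a VIF; the structure,
 * matched fields and default action are illustrative choices. */
#include <stdbool.h>
#include <stdint.h>

enum rule_action { ACT_ACCEPT, ACT_DROP };

struct vif_rule {
    uint32_t src_ip, src_mask;          /* pattern: masked source address */
    uint16_t dst_port;                  /* pattern: 0 matches any port    */
    enum rule_action action;
};

/* Applied by Xen to each packet header; Domain0 installs the rules. */
static enum rule_action apply_rules(const struct vif_rule *rules, int n,
                                    uint32_t src_ip, uint16_t dst_port)
{
    for (int i = 0; i < n; i++) {
        bool ip_match   = (src_ip & rules[i].src_mask) == rules[i].src_ip;
        bool port_match = rules[i].dst_port == 0 ||
                          rules[i].dst_port == dst_port;
        if (ip_match && port_match)
            return rules[i].action;
    }
    return ACT_ACCEPT;                  /* default policy is a local choice */
}
```

An anti-spoofing policy, for instance, would install an accept rule for the domain's own source address followed by a drop rule matching everything else.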

To transmit a packet, the guest OS simply enqueues a buffer descriptor onto the transmit ring. Xen copies the descriptor and, to ensure safety, then copies the packet header and executes any matching filter rules. The packet payload is not copied since we use scatter-gather DMA; however note that the relevant page frames must be pinned until transmission is complete. To ensure fairness, Xen implements a simple round-robin packet scheduler.

To efficiently implement packet reception, we require the guest OS to exchange an unused page frame for each packet it receives; this avoids the need to copy the packet between Xen and the guest OS, although it requires that page-aligned receive buffers be queued at the network interface. When a packet is received, Xen immediately checks the set of receive rules to determine the destination VIF, and exchanges the packet buffer for a page frame on the relevant receive ring. If no frame is available, the packet is dropped.

3.3.6 Disk

Only Domain0 has direct unchecked access to physical (IDE and SCSI) disks. All other domains access persistent storage through the abstraction of virtual block devices (VBDs), which are created and configured by management software running within Domain0. Allowing Domain0 to manage the VBDs keeps the mechanisms within Xen very simple and avoids more intricate solutions such as the UDFs used by the Exokernel [6].

A VBD comprises a list of extents with associated ownership and access control information, and is accessed via the I/O ring mechanism. A typical guest OS disk scheduling algorithm will reorder requests prior to enqueuing them on the ring in an attempt to reduce response time, and to apply differentiated service (for example, it may choose to aggressively schedule synchronous metadata requests at the expense of speculative readahead requests). However, because Xen has more complete knowledge of the actual disk layout, we also support reordering within Xen, and so responses may be returned out of order. A VBD thus appears to the guest OS somewhat like a SCSI disk.

A translation table is maintained within the hypervisor for each VBD; the entries within this table are installed and managed by Domain0 via a privileged control interface. On receiving a disk request, Xen inspects the VBD identifier and offset and produces the corresponding sector address and physical device. Permission checks also take place at this time. Zero-copy data transfer takes place using DMA between the disk and pinned memory pages in the requesting domain.
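The translation step can be pictured as a search over the VBD's extent list, as in this illustrative sketch (the structures are not Xen's internal representation):

```c
/* Sketch of VBD-to-physical translation over a per-VBD extent table;
 * the structures are illustrative, not Xen's internal representation. */
#include <stdbool.h>
#include <stdint.h>

struct vbd_extent {
    uint64_t start_sector;              /* first sector within the VBD    */
    uint64_t nr_sectors;                /* length of this extent          */
    uint16_t phys_device;               /* underlying IDE/SCSI device     */
    uint64_t phys_start;                /* first sector on that device    */
    bool     writable;                  /* per-extent access control      */
};

/* Maps an offset inside the VBD to (device, sector); permission checks
 * happen here too.  Returns false if the request is rejected. */
static bool vbd_translate(const struct vbd_extent *ext, int n,
                          uint64_t sector, bool write,
                          uint16_t *dev, uint64_t *phys_sector)
{
    for (int i = 0; i < n; i++) {
        const struct vbd_extent *e = &ext[i];
        if (sector >= e->start_sector &&
            sector <  e->start_sector + e->nr_sectors) {
            if (write && !e->writable)
                return false;           /* read-only VBD                  */
            *dev = e->phys_device;
            *phys_sector = e->phys_start + (sector - e->start_sector);
            return true;
        }
    }
    return false;                       /* offset not covered by any extent */
}
```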

Xen services batches of requests from competing domains in a simple round-robin fashion; these are then passed to a standard elevator scheduler before reaching the disk hardware. Domains may explicitly pass down reorder barriers to prevent reordering when this is necessary to maintain higher level semantics (e.g. when using a write-ahead log). The low-level scheduling gives us good throughput, while the batching of requests provides reasonably fair access. Future work will investigate providing more predictable isolation and differentiated service, perhaps using existing techniques and schedulers [15].

4 Evaluation

In this section we present a subset of our evaluation of Xen against a number of alternative virtualization techniques. A more complete evaluation, as well as detailed configuration and benchmark specifications, can be found in [1]. For these measurements, we used our 2.4.21-based XenLinux port as, at the time of this writing, the 2.6 port was not stable enough for a full battery of tests.

There are a number of preexisting solutions for running multiple copies of Linux on the same machine. VMware offers several commercial products that provide virtual x86 machines on which unmodified copies of Linux may be booted. The most commonly used version is VMware Workstation, which consists of a set of privileged kernel extensions to a 'host' operating system. Both Windows and Linux hosts are supported. VMware also offer an enhanced product called ESX Server which replaces the host OS with a dedicated kernel. By doing so, it gains some performance benefit over the workstation product. We have subjected ESX Server to the benchmark suites described below, but sadly are prevented from reporting quantitative results due to the terms of the product's End User License Agreement. Instead we present results from VMware Workstation 3.2, running on top of a Linux host OS, as it is the most recent VMware product without that benchmark publication restriction. ESX Server takes advantage of its native architecture to equal or outperform VMware Workstation and its hosted architecture. While Xen of course requires guest OSes to be ported, it takes advantage of paravirtualization to noticeably outperform ESX Server.

We also present results for User-mode Linux (UML), an increasingly popular platform for virtual hosting. UML is a port of Linux to run as a user-space process on a Linux host. Like XenLinux, the changes required are restricted to the architecture-dependent code base. However, the UML code bears little similarity to the native x86 port due to the very different nature of the execution environments. Although UML can run on an unmodified Linux host, we present results for the 'Single Kernel Address Space' (skas3) variant that exploits patches to the host OS to improve performance.

We also investigated three other virtualization techniques for running ported versions of Linux on the same x86 machine. Connectix's Virtual PC and forthcoming Virtual Server products (now acquired by Microsoft) are similar in design to VMware's, providing full x86 virtualization. Since all versions of Virtual PC have benchmarking restrictions in their license agreements we did not subject them to closer analysis. UMLinux is similar in concept to UML but is a different code base and has yet to achieve the same level of performance, so we omit the results. Work to improve the performance of UMLinux through host OS modifications is ongoing [8]. Although Plex86 was originally a general purpose x86 VMM, it has now been retargeted to support just Linux guest OSes. The guest OS must be specially compiled to run on Plex86, but the source changes from native x86 are trivial. The performance of Plex86 is currently well below the other techniques.

4.1 Relative Performance

The first cluster of bars in Figure 3 represents a relatively easy scenario for the VMMs. The SPEC CPU suite contains a series of long-running computationally-intensive applications intended to measure the performance of a system's processor, memory system, and compiler quality. The suite performs little I/O and has little interaction with the OS. With almost all CPU time spent executing in user-space code, all three VMMs exhibit low overhead.

The next set of bars show the total elapsed time taken to build a default configuration of the Linux 2.4.21 kernel on a local ext3 file system with gcc 2.96. Native Linux spends about 7% of the CPU time in the OS, mainly performing file I/O, scheduling and memory management. In the case of the VMMs, this 'system time' is expanded to a greater or lesser degree: whereas Xen incurs a mere 3% overhead, the other VMMs experience a more significant slowdown.

Figure 3: Relative performance of native Linux (L), XenLinux (X), VMware Workstation 3.2 (V) and User-Mode Linux (U). (Bars plot each result relative to native Linux; the underlying scores are: SPEC INT2000 (score) L 567, X 567, V 554, U 550; Linux build time (s) L 263, X 271, V 334, U 535; OSDB-IR (tup/s) L 172, X 158, V 80, U 65; OSDB-OLTP (tup/s) L 1714, X 1633, V 199, U 306; dbench (score) L 418, X 400, V 310, U 111; SPEC WEB99 (score) L 518, X 514, V 150, U 172.)

Two experiments were performed using the PostgreSQL 7.1.3 database, exercised by the Open Source Database Benchmark suite (OSDB) in its default configuration. We present results for the multi-user Information Retrieval (IR) and On-Line Transaction Processing (OLTP) workloads, both measured in tuples per second. PostgreSQL places considerable load on the operating system, and this is reflected in the substantial virtualization overheads experienced by VMware and UML. In particular, the OLTP benchmark requires many synchronous disk operations, resulting in many protection domain transitions.

The dbench program is a file system benchmark derived from the industry-standard 'NetBench'. It emulates the load placed on a file server by Windows 95 clients. Here, we examine the throughput experienced by a single client performing around 90,000 file system operations.

SPEC WEB99 is a complex application-level benchmark for evaluating web servers and the systems that host them. The benchmark is CPU-bound, and a significant proportion of the time is spent within the guest OS kernel, performing network stack processing, file system operations, and scheduling between the many httpd processes that Apache needs to handle the offered load. XenLinux fares well, achieving within 1% of native Linux performance. VMware and UML both struggle, supporting less than a third of the number of clients of the native Linux system.

4.2 Operating System Benchmarks

To more precisely measure the areas of overhead within Xen and the other VMMs, we performed a number of smaller experiments targeting particular subsystems. We examined the overhead of virtualization as measured by McVoy's lmbench program [9]. The OS performance subset of the lmbench suite consists of 37 microbenchmarks.

In 24 of the 37 microbenchmarks, XenLinux performs similarly to native Linux, tracking the Linux kernel performance closely. In Tables 2 to 4 we show results which exhibit interesting performance variations among the test systems.

Config   null    null    open    slct    sig     sig     fork    exec    sh
         call    I/O     close   TCP     inst    hndl    proc    proc    proc
Linux    0.45    0.50    1.92    5.70    0.68    2.49    110     530     4k0
Xen      0.46    0.50    1.88    5.69    0.69    1.75    198     768     4k8
VMW      0.73    0.83    2.99    11.1    1.02    4.63    874     2k3     10k
UML      24.7    25.1    62.8    39.9    26.0    46.0    21k     33k     58k

Table 2: lmbench: Processes - times in µs

Config   2p/0K   2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
Linux    0.77    0.91    1.06    1.03    24.3    3.61     37.6
Xen      1.97    2.22    2.67    3.07    28.7    7.08     39.4
VMW      18.1    17.6    21.3    22.4    51.6    41.7     72.2
UML      15.5    14.6    14.4    16.3    36.8    23.6     52.0

Table 3: lmbench: Context switching times in µs

Config   0K file create   0K file delete   10K file create   10K file delete   Mmap lat   Prot fault   Page fault
Linux    32.1             6.08             66.0              12.5              68.0       1.06         1.42
Xen      32.5             5.86             68.2              13.6              139        1.40         2.73
VMW      35.3             9.3              85.6              21.4              620        7.53         12.4
UML      130              65.7             250               113               1k4        21.8         26.3

Table 4: lmbench: File & VM system latencies in µs

In the process microbenchmarks (Table 2), Xen exhibits slower fork, exec, and sh performance than native Linux. This is expected, since these operations require large numbers of page table updates which must all be verified by Xen. However, the paravirtualization approach allows XenLinux to batch update requests. Creating new page tables presents an ideal case: because there is no reason to commit pending updates sooner, XenLinux can amortize each hypercall across 2048 updates (the maximum size of its batch buffer). Hence each update hypercall constructs 8MB of address space.
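As a quick sanity check on that figure, assuming the standard 4 KiB x86 page size: 2048 page-table entry updates × 4 KiB mapped per entry = 8192 KiB = 8 MB of newly constructed address space per batched hypercall.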

Table 3 shows context switch times between different numbers of processes with different working set sizes. Xen incurs an extra overhead between 1µs and 3µs, as it executes a hypercall to change the page table base. However, context switch results for larger working set sizes (perhaps more representative of real applications) show that the overhead is small compared with cache effects. Unusually, VMware Workstation is inferior to UML on these microbenchmarks; however, this is one area where enhancements in ESX Server are able to reduce the overhead.

The mmap latency and page fault latency results shown in Table 4 are interesting since they require two transitions into Xen per page: one to take the hardware fault and pass the details to the guest OS, and a second to install the updated page table entry on the guest OS's behalf. Despite this, the overhead is relatively modest.

One small anomaly in Table 2 is that XenLinux has lower signal-handling latency than native Linux. This benchmark does not require any calls into Xen at all, and the 0.75µs (30%) speedup is presumably due to a fortuitous cache alignment in XenLinux, hence underlining the dangers of taking microbenchmarks too seriously.

4.3 Additional Benchmarks

We have also conducted comprehensive experiments that: evaluate the overhead of virtualizing the network; compare the performance of running multiple applications in their own guest OS against running them on the same native operating system; demonstrate performance isolation provided by Xen; and examine Xen's ability to scale to its target of 100 domains. All of the experiments showed promising results and details have been separately published [1].

5 Conclusion

We have presented the Xen hypervisor which partitions the resources of a computer between domains running guest operating systems. Our paravirtualizing design places a particular emphasis on protection, performance and resource management. We have also described and evaluated XenLinux, a fully-featured port of the Linux kernel that runs over Xen.

Xen and the 2.4-based XenLinux are sufficiently stable to be useful to a wide audience. Indeed, some web hosting providers are already selling Xen-based virtual servers. Sources, documentation, and a demo ISO can be found on our project page.³

Although the 2.4-based XenLinux was the basis of our performance evaluation, a 2.6-based port is well underway. In this port, much care is being given to minimizing and isolating the necessary changes to the Linux kernel and measuring the changes against benchmark results. As paravirtualization techniques become more prevalent, kernel changes would ideally be part of the main tree. We have experimented with various source structures including a separate architecture (à la UML), a subarchitecture, and a CONFIG option. We eagerly solicit input and discussion from the kernel developers to guide our approach. We also have considered transparent paravirtualization [2] techniques to allow a single distro image to adapt dynamically between a VMM-based configuration and bare metal.

As well as further guest OS ports, Xen itself is being ported to other architectures. An x86_64 port is well underway, and we are keen to see Xen ported to RISC-style architectures (such as PPC) where virtual memory virtualization will be much easier due to the software-managed TLB.

Much new functionality has been added since the first public availability of Xen last October. Of particular note are a completely revamped I/O subsystem capable of directly utilizing Linux driver source, suspend/resume and live migration features, much improved console access, etc. Though final implementation, testing, and documentation was not complete at the deadline for this paper, we hope to describe these in more detail at the symposium and in future publications.

³http://www.cl.cam.ac.uk/netos/xen

As always, there are more tasks to do than there are resources to do them. We would like to grow Xen into the premier open source virtualization solution, with breadth and features that rival proprietary commercial products.

We enthusiastically welcome the help and contributions of the Linux community.

Acknowledgments

Xen has been a big team effort. In particular, we would like to thank Zachary Amsden, Andrew Chung, Richard Coles, Boris Dragovich, Evangelos Kotsovinos, Tim Harris, Alex Ho, Kip Macy, Rolf Neugebauer, Bin Ren, Russ Ross, James Scott, Steven Smith, Andrew Warfield, Mark Williamson, and Mike Wray. Work on Xen has been supported by a UK EPSRC grant, Intel Research, Microsoft Research, and Hewlett-Packard Labs.

References

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM SIGOPS Symposium on Operating Systems Principles, volume 37(5) of ACM Operating Systems Review, pages 164–177, Bolton Landing, NY, USA, Dec. 2003.

[2] D. Magenheimer and T. Christian. vBlades: Optimized Paravirtualization for the Itanium Processor Family. In Proceedings of the USENIX 3rd Virtual Machine Research and Technology Symposium, May 2004.

[3] S. Devine, E. Bugnion, and M. Rosenblum. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent 6397242, Oct. 1998.

[4] K. J. Duda and D. R. Cheriton. Borrowed-Virtual-Time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In Proceedings of the 17th ACM SIGOPS Symposium on Operating Systems Principles, volume 33(5) of ACM Operating Systems Review, pages 261–276, Kiawah Island Resort, SC, USA, Dec. 1999.

[5] D. Engler, S. K. Gupta, and F. Kaashoek. AVM: Application-level virtual memory. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems, pages 72–77, May 1995.

[6] M. F. Kaashoek, D. R. Engler, G. R. Granger, H. M. Briceño, R. Hunt, D. Mazières, T. Pinckney, R. Grimm, J. Jannotti, and K. Mackenzie. Application performance and flexibility on Exokernel systems. In Proceedings of the 16th ACM SIGOPS Symposium on Operating Systems Principles, volume 31(5) of ACM Operating Systems Review, pages 52–65, Oct. 1997.

[7] R. Kessler and M. Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4):338–359, Nov. 1992.

[8] S. T. King, G. W. Dunlap, and P. M. Chen. Operating System Support for Virtual Machines. In Proceedings of the 2003 Annual USENIX Technical Conference, June 2003.

[9] L. McVoy and C. Staelin. lmbench: Portable tools for performance analysis. In Proceedings of the USENIX Annual Technical Conference, pages 279–294, Berkeley, Jan. 1996. USENIX Association.

[10] J. Navarro, S. Iyer, P. Druschel, and A. Cox. Practical, transparent operating system support for superpages. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), ACM Operating Systems Review, Winter 2002 Special Issue, pages 89–104, Boston, MA, USA, Dec. 2002.

[11] L. Peterson, D. Culler, T. Anderson, and T. Roscoe. A blueprint for introducing disruptive technology into the internet. In Proceedings of the 1st Workshop on Hot Topics in Networks (HotNets-I), Princeton, NJ, USA, Oct. 2002.

[12] I. Pratt and K. Fraser. Arsenic: A user-accessible gigabit ethernet interface. In Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM-01), pages 67–76, Los Alamitos, CA, USA, Apr. 22–26, 2001. IEEE Computer Society.

[13] J. S. Robin and C. E. Irvine. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In Proceedings of the 9th USENIX Security Symposium, Denver, CO, USA, pages 129–144, Aug. 2000.

[14] L. Seawright and R. MacKinnon. VM/370 – a study of multiplicity and usefulness. IBM Systems Journal, pages 4–17, 1979.

[15] P. Shenoy and H. Vin. Cello: A Disk Scheduling Framework for Next-generation Operating Systems. In Proceedings of ACM SIGMETRICS'98, the International Conference on Measurement and Modeling of Computer Systems, pages 44–55, June 1998.

[16] C. A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), ACM Operating Systems Review, Winter 2002 Special Issue, pages 181–194, Boston, MA, USA, Dec. 2002.


[17] A. Whitaker, M. Shaw, and S. D. Gribble. Denali: Lightweight Virtual Machines for Distributed and Networked Applications. Technical Report 02-02-01, University of Washington, 2002.
