
    Evaluation of a Server-Grade Software-Only

    ARM Hypervisor

    Alexey Smirnov

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Mikhail Zhidko

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Yingshiuan Pan

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Po-Jui Tsao

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Kuang-Chih Liu

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Tzi-Cker Chiueh

    Industrial Technology Research Institute

    Hsinchu, Taiwan

    Email: [email protected]

    Abstract: Because of its enormous popularity in embedded systems and mobile devices, the ARM CPU is arguably the most used CPU in the world. The resulting economies-of-scale benefit entices system architects to ponder the feasibility of building lower-cost and lower-power-consumption servers using ARM CPUs. In modern data centers, especially those built to host cloud applications, virtualization is a must, so how to support virtualization on ARM CPUs becomes a major issue for constructing ARM-based servers. Although the latest versions of the ARM architecture (Cortex-A15 and beyond) provide hardware support for virtualization, the majority of ARM-based SOCs (system-on-chip) currently available on the market do not. This paper presents the results of an evaluation study of a fully operational hypervisor that successfully runs multiple VMs on an ARM Cortex-A9-based server, which is architecturally non-virtualizable, and that supports VM migration. This hypervisor features several optimizations that significantly reduce the performance overhead of virtualization, including physical memory remapping and batching of sensitive/privileged instruction emulation.

    I. INTRODUCTION

    Today the compute server market is dominated by x86-based CPUs. The resulting servers are relatively power-hungry and expensive. In contrast, the ARM CPU has been the dominant CPU in embedded systems and mobile devices, because of its emphasis on low-power design and lower cost. As a result, ARM-based server boards are emerging from a number of hardware vendors such as Marvell [1], Calxeda [2], and ZT Systems [3]. These boards are based on ARM Cortex-A9 CPUs, the most popular ARM CPUs commercially available today, which however lack server-grade hypervisor support.

    Compared with their x86 counterparts, these server boards may have lower absolute performance, but their lower cost and lower power consumption allow them to fare well in terms of metrics such as the amount of work done per watt and the amount of work done per dollar.

    Because cloud data centers use server virtualization technology extensively in their resource allocation and management, an immediate requirement for ARM-based servers is that they provide the same server virtualization capability available on current x86 servers. Although the latest ARM architecture, Cortex-A15 and beyond, does provide architectural support for virtualization similar to the VT technology [4] in x86, production-grade SOCs based on this new architecture remain unavailable. Judging from the market adoption rate of the recent generations of ARM CPUs, it is expected to take several years for ARM CPUs with hardware virtualization support to become mainstream in the marketplace. In the meantime, a hypervisor that provides virtualization support for currently available ARM CPUs is needed to fill the gap. Without such a hypervisor, it is impossible for servers using current-generation ARM CPUs to effectively compete with x86 servers.

    Because current-generation ARM CPUs do not have hardware virtualization support, we set out to build a production-grade software-only hypervisor that guarantees isolation between VMs and between VMs and the hypervisor, minimizes the amount of paravirtualization required on the guest OSs, and, most importantly, incurs acceptable performance overhead. The resulting hypervisor is the ITRI ARM hypervisor.

    Although hypervisors already exist for ARM-based mobile devices [5], [6], [7] and embedded systems [8], none of them satisfies all three requirements listed above, and they are therefore not ready for deployment on data center servers. In contrast, the ITRI ARM hypervisor evaluated in this paper is the first known server-grade hypervisor that runs on a state-of-the-art ARM-based SOC, namely the Marvell ARMADA-XP SOC, which comes with a quad-core 1.6GHz Cortex-A9 CPU and consumes less than 10W, and that supports concurrent execution of multiple VMs as well as VM migration.

    The rest of the paper is organized as follows. In Section II we review the related work in the area of ARM-based hypervisors. In Section III we present the overall architecture of our hypervisor and discuss the challenges associated with implementing a hypervisor on a CPU without hardware support for virtualization. In this section, we also present the memory virtualization and exception handling framework and the binary rewriting approach for CPU virtualization. We present the performance evaluation of our hypervisor in Section IV. Section V concludes the paper with a summary of our contributions and directions for future work.

    2013 IEEE Sixth International Conference on Cloud Computing

    978-0-7695-5028-2/13 $26.00 © 2013 IEEE

    DOI 10.1109/CLOUD.2013.71


    II. RELATED WORK

    Open-source hypervisors from the x86 world such as Xen [9] and KVM [10] have been ported to the ARM platform by enthusiast developers [5], [6], [8]. Xen-based hypervisors [6], [8] provide virtualization by replacing the privileged operations inside the source code of the guest OS with hypercalls (traps to the hypervisor). On the other hand, KVM-ARM [5] and ARMvisor [11] leverage the fact that a guest VM runs as an unprivileged process inside the host OS, so any attempt to execute a privileged operation results in a trap that the host OS can intercept. Still, the KVM-ARM hypervisor needs to resort to para-virtualization as well, since there are some instructions with side effects, called sensitive instructions, that do not generate traps.

    Most hypervisors, both on x86 and ARM, employ a shadow page table (SPT) [12] as the memory virtualization technique. Its idea is that the hypervisor injects a so-called shadow page table between the guest VM and the hardware, which maps guest physical addresses into machine physical addresses. The hypervisor also carries the burden of allocating memory for the guest OS and mapping those ranges into the guest physical address space. This approach is favorable only when supported by the hardware, and is notoriously difficult to implement correctly in software due to a number of challenges. Some advanced hardware-assisted techniques for managing the shadow page table on the x86 architecture have been proposed [13], [14]. In addition to virtualizing hardware resources, a hypervisor needs to perform privileged operations on behalf of the guest (such as changing the page table base pointer). There are two well-known techniques that a hypervisor can employ to deal with this problem [5]: 1) binary translation, where each sensitive instruction is translated to a block of non-sensitive instructions, and 2) paravirtualization, where every guest kernel function containing SIs is replaced with a hypercall. Our hypervisor uses a hybrid approach: it employs both minimal lightweight paravirtualization and binary translation.

    The industry's progress on ARM virtualization is rapid. ARM Cortex-A15 CPUs provide hardware support for virtualization, similar to capabilities found in x86 CPUs. NICTA [15] has done a quantitative analysis for Cortex-A15. Virtual Open Systems [16] is a company that ported KVM-ARM to the Cortex-A15 platform and submitted the patches to the Linux kernel maintainers. The Xen community started porting the Xen hypervisor to ARMv7 with the virtualization extension and submitted patches to the mailing list as well [17]. VMware Horizon Mobile [7] is the most prominent commercial hypervisor that can also support Cortex-A9; its main purpose is to help IT manage a corporate mobile workspace on employees' smartphones. Other popular commercial hypervisors are vLogix Mobile [18] and the OKL4 microvisor [19]. ARM announced a technology codenamed big.LITTLE [20], whose goal is to combine the performance of Cortex-A15 cores with the energy efficiency of Cortex-A7 processors. The most advanced architecture is the 64-bit ARMv8 core. We expect a whole variety of commercial and open-source solutions to be released in the near future [21].

    III. ITRI ARM HYPERVISOR

    In this section we describe the relevant architectural features of Marvell ARMADA-XP, the target of the ITRI ARM hypervisor, and the ITRI ARM hypervisor itself, particularly its performance optimizations.

    A. Marvell ARMADA-XP

    Marvell ARMADA-XP is a SOC containing a quad-core Cortex-A9 CPU, which supports 7 modes of operation, such as user mode, supervisor mode, abort mode, undefined mode, etc. The ITRI ARM hypervisor only uses user (unprivileged) mode and supervisor mode (also known as SVC mode or kernel mode).

    All ARM instructions are of the same size, four bytes, which greatly simplifies binary rewriting. However, not every instruction can run in every possible mode. Some privileged instructions (PIs) can only run on a specific co-processor. For example, most MMU instructions can run only on co-processor 15 and in SVC mode. When these instructions are executed in any other mode, the CPU generates an undefined-instruction exception. In addition, there are so-called sensitive instructions (SIs), which can only be executed correctly in SVC mode but do not trap when executed in user mode; the effect of their execution in user mode is undefined. Examples of such instructions are MSR and MRS, which write to and read from the program status registers, respectively. Finally, the ARM architecture supports a domain mechanism [22], which allows a process's address space to be partitioned into up to 16 protection domains whose accessibility can be dynamically controlled.
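    As a concrete illustration of the domain mechanism (a sketch only: how the ITRI ARM hypervisor assigns domains is described later only at a high level, so the value written below is merely an example), the domain access control register (DACR) on co-processor 15 holds a 2-bit access field for each of the 16 domains and can be reprogrammed at run time:

        #include <stdint.h>

        /* DACR: 2 bits per domain (0 = no access, 1 = client, i.e. checked
         * against page-table permissions, 3 = manager, i.e. unchecked). */
        static inline void write_dacr(uint32_t value)
        {
            asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(value) : "memory");
        }

        /* Example: grant client access to domain 0 only and deny all others. */
        static inline void allow_only_domain0(void)
        {
            write_dacr(0x1u);   /* domain 0 -> 01 (client), domains 1..15 -> 00 */
        }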

    In addition to a Cortex-A9 CPU, Marvell ARMADA-XP includes an L2 cache, a memory controller, a set of peripheral devices, and a system interconnect called Mbus, which, just like the PCI Express architecture, allows each Mbus device to own a window of the physical memory space and even to remap each incoming physical memory address to something else through a remapping register. An unusual feature of ARMADA-XP is that it allows regions of the main memory (SDRAM) to be accessed as an Mbus device. Therefore, an access to the main memory can be captured by an Mbus device and then remapped. Because of this remapping capability, the physical memory space of an ARMADA-XP server is 16GB rather than 4GB. Finally, the physical memory address window associated with an Mbus device can be turned on and off at run time. When an Mbus device's physical memory address window is turned off, any physical memory access that targets that window triggers an exception. When an Mbus device's physical memory address window is turned on (off), it is on (off) to all cores of the CPU.

    As will become clear later, ARMADA-XP's ability to treat regions of main memory as Mbus devices offers a poor man's alternative to memory virtualization support such as Intel's extended page table (EPT) [4] or AMD's nested page table [23], because it offers the ITRI ARM hypervisor a hardware-based check-and-remap capability for every physical memory access coming from a guest VM.

    B. Baseline Implementation

    The ITRI ARM hypervisor is similar to KVM in architecture, and runs on top of an Ubuntu Linux distribution that comes with the ARMADA-XP reference design.


    The host OS runs on the ARMADA-XP SOC in SVC mode, and every guest VM runs entirely in user mode and consists of one or multiple processes running on the host OS. Isolation between the host OS and the guest VMs is guaranteed through the Mbus mechanism or the shadow page table, whereas isolation between a guest OS and the user processes running on top of it is enforced through the domain mechanism [5]. The current implementation of the

    ITRI ARM hypervisor consists of the following components:

    1) A loadable kernel module, which is a significantly revised version of KVM and is responsible for allocating memory for newly started guest VMs, handling exceptions while guest VMs run, and performing world context switches, including changing the operation mode, the page table and the exception handler table.

    2) Para-virtualization of the supported guest OS, specifically the Linux kernel 2.6.35, including removing unnecessary privileged instructions, e.g. redundant cache and TLB flushing, and simplifying the architecture-dependent code in the booting process.

    3) An ARM binary rewriting tool used to patch the binary image of the supported guest OS.

    4) A user-level machine emulator, QEMU [24], that allows users to configure and launch guest VMs, and interfaces with real device drivers and backend device drivers.

    To run a guest VM, the QEMU process loads the guest kernel image into its address space, performs a few initialization tasks and then enters the main loop, which makes an ioctl() call to the revised KVM module, which in turn transfers control to the guest OS by making a world context switch. Execution of the guest VM occasionally triggers exceptions, because of privileged instructions, sensitive instructions, system calls or hypercalls. The revised KVM module either handles these exceptions itself or delegates them to the QEMU emulator or some backend device drivers. The set of exception handlers used when a guest VM runs is different from the set used when the host OS runs. Therefore, the world context switch includes a switch of the exception handler table.
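    The split between the user-level launcher and the in-kernel module mirrors the standard KVM design. The following sketch assumes the revised module keeps the usual /dev/kvm ioctl interface (KVM_RUN and the standard exit reasons); the actual ioctl numbers and exit codes used by the ITRI module are not specified in this paper.

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Rough sketch of the user-level run loop. */
        static void run_vcpu(int vcpu_fd, struct kvm_run *run)
        {
            for (;;) {
                /* The world context switch into the guest happens inside this ioctl. */
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                    perror("KVM_RUN");
                    return;
                }
                switch (run->exit_reason) {
                case KVM_EXIT_MMIO:
                    /* Delegate the memory-mapped I/O access to the device model,
                     * e.g. a virtio backend. */
                    break;
                case KVM_EXIT_SHUTDOWN:
                    return;
                default:
                    /* Exceptions the kernel module did not handle by itself. */
                    break;
                }
            }
        }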

    The ITRI ARM hypervisor employs static binary rewriting to recognize sensitive instructions in a guest kernel binary and replace them with non-sensitive instructions. There is no need to rewrite privileged instructions, since they cause traps to the host OS anyway and thus can be emulated afterward. Our first implementation replaces every recognized sensitive instruction with a software interrupt instruction (SWI), which causes a trap to the host OS. The original instruction being replaced was encoded as the replacing SWI instruction's 24-bit argument, the same way as was done in [5]. However, the performance overhead of this implementation is too high to be acceptable. The work flow of this binary rewriter is shown in Figure 1. The current prototype of the ITRI ARM hypervisor does not support run-time sensitive instruction rewriting, and therefore does not allow new code to be added to the VM's kernel, either through dynamic code generation or through kernel module loading.
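    For illustration, the first-cut rewriting step can be pictured as follows. The exact encoding of the original instruction is not spelled out here, so the index-into-a-side-table scheme below is only an assumption; the point is that the 24-bit immediate of an unconditional SWI is large enough to identify the replaced instruction.

        #include <stdint.h>

        /* ARM SWI encoding: cond in bits [31:28], 1111 in bits [27:24], imm24 in
         * bits [23:0].  With the "always" (AL) condition the top byte is 0xEF. */
        #define ARM_SWI_AL    0xEF000000u
        #define SWI_IMM_MASK  0x00FFFFFFu

        /* Build the trapping SWI that replaces one sensitive instruction.  Here
         * the immediate is an index into a side table kept by the rewriter; this
         * indexing scheme is an assumption made for the example. */
        static uint32_t encode_si_trap(uint32_t si_index)
        {
            return ARM_SWI_AL | (si_index & SWI_IMM_MASK);
        }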

    The ITRI ARM hypervisor virtualizes the memory resource using a shadow page table (SPT) [12] implementation, which creates a shadow copy of each guest page table so that the MMU actually points to the shadow copy of each guest page table rather than the guest page table itself. Whenever a modification is made to a guest page table, this modification must be propagated to its corresponding shadow page table. Automatically detecting changes to guest page tables is the most complicated and time-consuming part of an SPT implementation. The ITRI ARM hypervisor relies on para-virtualization to detect such changes.
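    A guest-side paravirtualization hook of this kind might look as follows; the hypercall number and register ABI are invented for illustration, since the paper does not document the actual interface.

        /* Guest-kernel-side hook: a software interrupt traps to the host OS, which
         * updates the corresponding shadow page table entry before resuming. */
        #define HCALL_SET_PTE 0x10   /* hypothetical hypercall number */

        static inline void hcall_set_pte(unsigned long pte_va, unsigned long pte_val)
        {
            register unsigned long nr asm("r7") = HCALL_SET_PTE;
            register unsigned long a0 asm("r0") = pte_va;
            register unsigned long a1 asm("r1") = pte_val;

            asm volatile("swi 0" : : "r"(nr), "r"(a0), "r"(a1) : "memory");
        }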

    The ITRI ARM hypervisor virtualizes I/O devices using the Virtio framework [25], which requires a front-end driver installed in every guest VM and a back-end driver inside the host OS (vhost). Because ARM does not support I/O instructions, the ITRI ARM hypervisor needs to intercept memory-mapped I/O accesses by marking their target pages as invalid and forcing them to trap to the host OS.
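    On the host side, this interception amounts to a check in the data-abort handler followed by decoding the faulting instruction and forwarding the access to the device model. The window base and size below are illustrative assumptions only.

        #include <stdbool.h>
        #include <stdint.h>

        /* Assumed placement of the per-guest MMIO window; the paper only says that
         * every guest is handed the same unmapped range. */
        #define GUEST_MMIO_BASE 0xf1000000u
        #define GUEST_MMIO_SIZE 0x00100000u

        /* Is the faulting access a memory-mapped I/O access that must be forwarded
         * to QEMU or a vhost backend? */
        static bool is_guest_mmio_fault(uint32_t fault_addr)
        {
            return fault_addr >= GUEST_MMIO_BASE &&
                   fault_addr <  GUEST_MMIO_BASE + GUEST_MMIO_SIZE;
        }

        /* For a single-register word/byte LDR or STR, bit 20 of the instruction
         * word is the load/store (L) bit; a full handler would also decode the
         * destination register, access width and addressing mode. */
        static bool mmio_access_is_load(uint32_t insn)
        {
            return (insn >> 20) & 1;
        }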

    C. Performance Optimizations

    1) Light-Weight Context Switch: The most common type of exception is one that could be handled by the guest OS itself without involving the host OS, e.g. a user process in a guest VM making a system call or executing certain privileged instructions. A naive implementation requires two world context switches, one from the guest VM to the host OS, and then from the host OS back to the guest VM. The ITRI ARM hypervisor features a light-weight context switch mechanism that embeds in the exception handlers the logic to recognize this type of exception and immediately transfer control back to the guest OS without performing any world context switch. A light-weight context switch triggers a protection domain crossing, but does not require page table or exception handler table switching, and thus avoids the associated performance overhead due to flushing of the TLB and L2 cache.
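    Conceptually, the fast path looks like the sketch below; the structure fields and helper are invented for illustration and do not reflect the actual ITRI implementation.

        #include <stdbool.h>
        #include <stdint.h>

        /* Invented per-VCPU state: the hypervisor tracks whether the guest is in
         * its virtual kernel or virtual user mode and where the guest kernel's
         * exception vectors live. */
        struct vcpu {
            bool     in_guest_kernel;
            uint32_t guest_vector_base;
            uint32_t guest_pc;
        };

        /* Fast path inside the host exception handler: if the trap is one the
         * guest OS can service itself (e.g. a system call from a guest user
         * process), redirect execution to the guest kernel's handler and return
         * to the guest without switching the page table or the exception handler
         * table. */
        static bool try_lightweight_switch(struct vcpu *v, uint32_t vector_offset)
        {
            if (v->in_guest_kernel)
                return false;                 /* take the full world-switch path */

            v->in_guest_kernel = true;        /* the guest enters its kernel */
            v->guest_pc = v->guest_vector_base + vector_offset;
            /* Only the protection-domain settings change here, so the TLB and
             * L2 cache are not flushed. */
            return true;
        }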

    2) Static Memory Partitioning: The idea of static memory partitioning is to partition the physical memory on an ARM server into multiple contiguous chunks and assign a chunk to the host and to each guest VM. The Linux kernel supports a kernel boot option that specifies the physical memory address range given to that kernel, e.g. 16GB to 24GB. With this option, a well-behaved kernel, i.e., one that is not infected by a virus or malicious code, puts only physical page numbers from its assigned chunk in its page tables, and thus makes it impossible to access physical memory pages outside its chunk. However, this approach does not prevent a kernel rootkit in one guest VM from tampering with the page tables and thus accessing physical memory pages of other guest VMs.

    The ITRI ARM hypervisor closes this security hole by leveraging the physical address window mechanism in ARMADA-XP. More specifically, the ITRI ARM hypervisor assigns each guest VM a physical memory chunk, and associates each physical memory chunk with a separate Mbus device by setting the device's physical memory address window to the chunk's physical address range. When the hypervisor schedules a guest VM for execution, it turns on the physical memory address window of the Mbus device associated with the physical memory chunk of that guest VM, and leaves all other Mbus devices' physical memory address windows off. With this set-up, it is impossible for a guest VM to access any physical memory page outside its chunk.
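    A sketch of the window switching performed at guest scheduling time is shown below; the register layout and bit definitions are placeholders, since the real ones are defined by the ARMADA-XP Mbus documentation rather than by this paper.

        #include <stdint.h>

        /* Hypothetical register layout: one window-control register per guest chunk. */
        #define MBUS_WIN_CTRL_STRIDE  0x10u
        #define MBUS_WIN_ENABLE_BIT   0x1u

        static volatile uint32_t *mbus_win_ctrl(uintptr_t mbus_base, int win)
        {
            return (volatile uint32_t *)(mbus_base + (uintptr_t)win * MBUS_WIN_CTRL_STRIDE);
        }

        /* Called on a world context switch into guest `next`: enable only the SDRAM
         * window backing that guest's physical memory chunk and disable the windows
         * of all other guests, so any access outside the chunk faults. */
        static void switch_guest_memory_window(uintptr_t mbus_base, int next, int num_guests)
        {
            for (int i = 0; i < num_guests; i++) {
                volatile uint32_t *ctrl = mbus_win_ctrl(mbus_base, i);
                if (i == next)
                    *ctrl |= MBUS_WIN_ENABLE_BIT;
                else
                    *ctrl &= ~MBUS_WIN_ENABLE_BIT;
            }
        }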


    Fig. 1. Binary rewriting algorithm flowchart, showing the process of finding sensitive instructions (SIs) in the guest image, generating assembly wrappers that invoke an emulation function for each SI, replacing each SI with a branch to the corresponding wrapper, and appending and linking the emulation code to the guest image.

    Because a physical address window can service memory access requests from any core in ARMADA-XP, simultaneously running multiple guest VMs on different cores introduces the security risk of one guest VM corrupting another guest VM's physical memory chunk. Therefore, for absolute security, it is advisable to run only one guest VM on all CPU cores at a time, rather than multiple guest VMs each on a separate CPU core.

    The main advantage of this physical memory remapping approach to memory virtualization is that it greatly simplifies the implementation complexity and reduces the performance overhead associated with shadow page table-based implementations, especially with respect to detecting guest page table modifications when para-virtualization is not used.

    In addition to physical memory pages, a guest VM is also given a physical memory address range as the target of memory-mapped I/O accesses when it starts up. The ITRI ARM hypervisor gives each guest VM the same unmapped range, so that when a guest VM accesses any location in this range, a trap is generated and control is transferred to the host OS.

    3) Streamlined PI/SI Emulation: Emulation of many sensitive instructions does not always need to run in SVC mode. Therefore, replacing all sensitive instructions with SWI causes traps to the host OS and incurs world context switch overheads unnecessarily. The ITRI ARM hypervisor solves this problem by creating a dedicated emulation instruction sequence for each sensitive instruction and then replacing each sensitive instruction with a branch instruction to its associated emulation instruction sequence. Instead of performing instruction decoding, i.e., parsing the opcode, operands and addressing modes, at run time, the binary rewriter performs these operations statically and outputs the result to the emulation instruction sequences for a selected set of sensitive instructions.

    Many SI emulation instruction sequences can run in user mode just fine because the host resources these SIs access are virtualized, e.g. the program status register and the fault status register (FSR). As a result, accesses to these virtualized host resources do not need to go through the host OS. Unfortunately, not all host resources can be virtualized this way. That is why some SI emulation instruction sequences still need to call on the help of the host OS, e.g. those that change the VCPU mode, because the hypervisor needs to keep track of whether a guest VM runs in the guest user or guest kernel mode.
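    As an illustration of an emulation sequence that never leaves user mode, consider MRS/MSR accesses to the program status register. The per-VCPU shadow structure below is an assumption rather than the actual ITRI data structure, and the logic is expressed in C for readability even though the generated sequences are ARM instructions.

        #include <stdint.h>

        /* Invented per-VCPU shadow of host resources that can be virtualized
         * entirely in user space. */
        struct vcpu_shadow {
            uint32_t virt_cpsr;   /* guest-visible program status register */
            uint32_t virt_fsr;    /* guest-visible fault status register   */
        };

        /* Emulation of "MRS rd, CPSR": instead of trapping to the host OS, the
         * rewritten guest code branches here and reads the virtualized value. */
        static inline uint32_t emulate_mrs_cpsr(const struct vcpu_shadow *v)
        {
            return v->virt_cpsr;
        }

        /* "MSR CPSR_f, rm" (flag fields only) can likewise update the shadow copy;
         * mode-changing writes still need help from the host OS. */
        static inline void emulate_msr_cpsr_flags(struct vcpu_shadow *v, uint32_t rm)
        {
            const uint32_t FLAG_MASK = 0xF0000000u;  /* N, Z, C, V bits */
            v->virt_cpsr = (v->virt_cpsr & ~FLAG_MASK) | (rm & FLAG_MASK);
        }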

    Some benchmarks, such as iperf and hdparm, incur significant performance overhead when emulating sensitive instructions. After investigating the root cause, we found that the performance loss is due to extensive execution of load and store with translation instructions (LDR[B]T/STR[B]T) in kernel routines such as __copy_to_user_std, __copy_from_user, etc. For example, in the __copy_from_user function a bunch of LDRT instructions follow one after another:

    (1) ldrt r3, [r1], #4
    (2) ldrt r4, [r1], #4
    (3) ldrt r5, [r1], #4
    (4) ...
    (5) ldrt r8, [r1], #4
    (6) ldrt ip, [r1], #4
    (7) ldrt lr, [r1], #4
    (8) subs r2, r2, #32

    Fig. 2. To detect pages that are modified within a period of time, the hypervisor marks them as read-only at the beginning of the period, forces write protection faults when the pages are modified, and records those pages in the dirty page bitmap.

    Our original solution to this problem was to recognize such sensitive instruction sequences and replace each sequence with a single branch to a specific emulation instruction sequence that emulates the aggregated effects of these instructions and then returns to the first instruction after the bunch ((8) in the above example). All the sensitive instructions in the bunch except the first one are replaced with a NOP (No Operation) instruction. Unfortunately, this solution did not always work, because in the guest kernel image there exist instruction bunches in which some indirect branch actually jumps to an instruction within the bunch. To address this problem, our final solution is to replace each instruction in the bunch with a branch to an emulation instruction sequence that performs the loads or stores associated with the replaced instruction and those following it in the bunch. In the above example, the emulation instruction sequence for instruction (5) emulates instructions (5), (6) and (7). This way, even if instruction (5) is the target of some indirect branch, the execution result remains correct. For some workloads this technique does not provide any performance improvement, but for the iperf benchmark it reduces the number of branches due to sensitive instructions by more than 4 times.
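    The final scheme can be sketched as follows, expressed in C rather than in the generated ARM code and with invented helper and structure names: the stub generated for the k-th LDRT in a bunch emulates that load and every following load in the bunch, then resumes after the bunch.

        #include <stdint.h>

        /* guest_user_read32() stands for whatever the hypervisor uses to read a
         * word from the guest user address space with the guest's permissions. */
        extern uint32_t guest_user_read32(uint32_t guest_va);

        struct guest_regs {
            uint32_t r[16];   /* r0-r12, sp, lr, pc */
        };

        /* Emulation routine for LDRT number k in a bunch of n LDRTs that all load
         * through r1 with post-increment.  It performs the loads for instructions
         * k..n-1 in one call, so the result is correct even when an indirect
         * branch enters the bunch at instruction k rather than at its head. */
        static void emulate_ldrt_tail(struct guest_regs *regs,
                                      const uint8_t *dst_reg, int k, int n)
        {
            uint32_t addr = regs->r[1];

            for (int i = k; i < n; i++) {
                regs->r[dst_reg[i]] = guest_user_read32(addr);
                addr += 4;            /* each LDRT post-increments the base by 4 */
            }
            regs->r[1] = addr;        /* write back the post-incremented base */
        }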

    The binary rewriter also strives to reduce the run-time overhead of guest VMs by avoiding unnecessary traps into the hypervisor. In the original ARM Linux kernel, there are many instances in which one or multiple privileged instructions are embedded inside a loop, but another instruction outside the loop actually achieves the same effect as these privileged instructions. For example, a loop containing several cache cleaning instructions is followed by another instruction that triggers a context switch, which automatically entails a cache cleaning operation. In this case, all the cache cleaning instructions can be safely removed. Our binary rewriter is able to identify these types of loops and remove the unnecessary privileged instructions in them.

    D. Live Migration

    The ITRI hypervisor also supports live migration [26], which iteratively copies the memory state of a VM from the source machine to the destination machine, and stops the VM in order to copy the residual memory state over to complete the migration when the residual memory state is sufficiently small. To keep track of the set of dirty pages that still need to be migrated, the ITRI ARM hypervisor implements a dirty page tracking mechanism, as shown in Figure 2.

    TABLE I. THE SET OF BENCHMARKS USED IN THE PERFORMANCE EVALUATION STUDY, INCLUDING THE TYPE OF SYSTEM RESOURCES THEY STRESS AND THEIR INVOCATION PARAMETERS

    Name              | Type        | Parameters
    sysbench-cpu      | CPU         | cpu-max-prime=100
    sysbench-threads  | CPU, Memory | num-threads=64, thread-yields=100, thread-locks=2
    sysbench-memory   | Memory      | memory-total-size=100M
    Apache Benchmark  | CPU, IO     | requests=10000, concurrency=100
    iperf             | IO          | time=60
    hdparm            | IO          | -t
    dd                | IO          | write: if=/dev/zero of=200MB file; read: if=200MB file of=/dev/null
    Unixbench-double  | CPU         | n/a
    Unixbench-spawn   | Memory      | n/a

    At the beginning of every iteration, the hypervisor marks all the pages as read-only in the shadow page table. When a page is first modified in an iteration, a write protection fault occurs, and the hypervisor restores the page's original permission, as indicated in the corresponding guest page table entry, and records the page in the dirty page bitmap. Pages in the dirty page bitmap are the targets of transfer in the next iteration.
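    A minimal sketch of the two halves of this mechanism, with invented structure and helper names, is shown below.

        #include <stdbool.h>
        #include <stdint.h>

        /* Simplified shadow-page-table entry and dirty bitmap. */
        struct spt_entry { uint32_t pfn; bool writable; };

        #define GUEST_PAGES 4096
        static struct spt_entry shadow_pt[GUEST_PAGES];
        static uint64_t dirty_bitmap[GUEST_PAGES / 64];

        /* Start of a pre-copy iteration: write-protect every page in the shadow
         * page table so the first write in this iteration raises a fault. */
        static void begin_dirty_tracking(void)
        {
            for (int i = 0; i < GUEST_PAGES; i++)
                shadow_pt[i].writable = false;
            for (int i = 0; i < GUEST_PAGES / 64; i++)
                dirty_bitmap[i] = 0;
        }

        /* Write-protection fault handler: restore the permission recorded in the
         * guest page table entry and remember the page for the next copy round. */
        static void on_write_protection_fault(uint32_t gfn, bool guest_pte_writable)
        {
            shadow_pt[gfn].writable = guest_pte_writable;
            dirty_bitmap[gfn / 64] |= 1ull << (gfn % 64);
        }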

    IV. PERFORMANCE EVALUATION

    A. Evaluation Methodology

    The hardware test-bed used to evaluate the ITRI ARM hypervisor prototype is the Marvell ARMADA-XP development board [27] with a quad-core Sheeva Cortex-A9 CPU. The board has 4GB of RAM and is equipped with a SATA disk. It runs a Linux 2.6.35 kernel with Marvell's proprietary patch. The host OS is Ubuntu 9.10 (karmic) for ARM. The guest VM uses the same Linux kernel version without the Marvell patch, and boots from the same root file system. We use the arm-softmmu target of QEMU v0.14.50 with virtio patches.

    Table I lists the set of benchmarks used in this performance evaluation study. Sysbench [28] is an open-source benchmark with many testing options. The Sysbench CPU test finds the maximum prime number below 100. The Sysbench threads test creates 64 threads, and each thread processes 100 requests, each of which consists of locking 2 mutexes, yielding the CPU, and then unlocking when the thread is rescheduled. The Sysbench memory test reads/writes 100MB of data.

    Apache Benchmark (ab) [29] is a benchmark designed to test the service performance and quality of web sites. It measures the total time a web server spends handling 10000 requests that come in through 100 simultaneous connections. Iperf [30] is a network benchmark that tests how much data can be transmitted within 60 seconds. Hdparm [31] is a benchmark for testing the read/write performance of hard disks. The tool dd is for disk data reading or writing; the dd test writes 200MB to disk from the zero device (/dev/zero) and reads this file back to the null device (/dev/null).

    Unixbench [32] is an open source benchmark that consists of a number of specific tests. Every test uses an INDEX value to report the performance result; the higher the INDEX value, the better. In this paper we choose the Whetstone and Process Creation tests. The Unixbench Whetstone test measures the execution time of floating-point operations such as sin, cos, sqrt, exp, and log.


    TABLE II. MEASURED PERFORMANCE OF THE SET OF BENCHMARKS WHEN EXECUTED NATIVELY ON THE MARVELL BOARD, ON A VM RUNNING ON QEMU ON THE MARVELL BOARD, AND ON A VM RUNNING ON THE ITRI ARM HYPERVISOR ON THE MARVELL BOARD. THE ITRI ARM HYPERVISOR USES STATIC MEMORY PARTITIONING, EMPLOYS PARA-VIRTUALIZATION RATHER THAN SENSITIVE INSTRUCTION EMULATION FOR SENSITIVE INSTRUCTIONS, AND ENABLES LIGHT-WEIGHT CONTEXT SWITCHING.

    Benchmark            | Native     | TCG       | ITRI-ARM
    sysbench-cpu (s)     | 0.8        | 45.1      | 0.8
    sysbench-threads (s) | 1.2        | 388.7     | 6.3
    sysbench-memory (s)  | 0.5        | 43.3      | 0.9
    Apache Benchmark (s) | 3.7        | 106.1     | 7.7
    Iperf (Mb/s)         | 761        | 19.5      | 435
    hdparm (MB/s)        | 125        | 11.1      | 17.2
    dd read/write (MB/s) | 68.1 / 118 | 2.6 / 8.8 | 19.2 / 9.6
    Unixbench-double     | 72.7       | 0.6       | 71.7
    Unixbench-spawn      | 222.3      | 7.1       | 182.8

    The Unixbench Process Creation test measures the number of processes that can be forked within 30 seconds when processes are forked and then closed.

    For the Sysbench and Apache benchmarks, the performance measure is the total execution time, so the smaller the better. For Iperf and hdparm, the performance measure is the average throughput, so the higher the better. For the Unixbench benchmarks, the performance measure is the INDEX number, so the higher the better.

    B. Performance Overhead of ITRI ARM Hypervisor

    We ran the test benchmarks natively on the Marvell board (Native), on a VM running on QEMU which in turn runs on the Marvell board (TCG, for Tiny Code Generator [24]), and on a VM running on the ITRI ARM hypervisor, which in turn runs on the Marvell board (ITRI-ARM). The native configuration does not have any virtualization overhead. The performance overhead of the TCG configuration represents the upper bound of the virtualization overhead, because QEMU emulates every instruction executed by a VM. The version of the ITRI ARM hypervisor used in this test adopts static memory partitioning rather than shadow page tables for memory virtualization, employs light-weight context switching to eliminate unnecessary world context switches, and uses manual para-virtualization rather than SI emulation to handle sensitive instructions. Therefore this version represents the most performant configuration of the ITRI ARM hypervisor.

    For CPU-intensive benchmarks, i.e., sysbench-cpu, sysbench-threads, ab and Unixbench-double, TCG's performance is 30 to 60 times slower than Native's, but ITRI-ARM's performance is almost the same as Native's, except for sysbench-threads, which requires the guest OS to execute many privileged instructions and trap to the hypervisor. As a result, the performance of sysbench-threads under ITRI-ARM is five times slower than that under Native, and this performance difference mainly comes from the additional context switches caused by the privileged instructions executed by the guest OS. Because the ITRI ARM hypervisor runs guest VMs as user processes, it is inevitable that privileged instructions trigger exceptions.

    For memory-intensive benchmarks, i.e., sysbench-threads, sysbench-memory, and Unixbench-spawn, TCG's performance is again 30 to 400 times slower than Native's. But excluding sysbench-threads, ITRI-ARM's performance is between 1.3 and 1.8 times slower than Native's. Because more than 90% of the PI exceptions in Unixbench-spawn are handled by the light-weight context switch, its relative performance penalty is less than that of sysbench-threads.

    For network-intensive benchmarks, i.e., iperf and ab, TCG's performance is 30 to 40 times slower than Native's, while the slow-down of ITRI-ARM compared with Native is around a factor of 2. The performance difference between ITRI-ARM and Native comes from additional context switches on the network packet delivery path induced by the virtio architecture. For disk-intensive benchmarks, i.e., hdparm and dd, TCG's performance is 10 to 20 times slower than Native's, and ITRI-ARM is between 7 and 10 times slower than Native. The substantial performance difference between ITRI-ARM and Native comes from the fact that the former uses readv/writev for file I/O whereas the latter uses read/write. The reason ITRI-ARM uses readv/writev is that the guest physical addresses of the payload of a file I/O operation issued from a guest VM are in general not consecutive.

    To better understand the performance gap between Native and ITRI-ARM under sysbench-threads and sysbench-memory, we decompose the performance overhead of the ITRI ARM hypervisor into the following five categories: (1) emulation of privileged instructions (PI), (2) handling of data access exceptions (DA), (3) processing due to interrupts (IRQ), (4) handling of system calls (SC), and (5) emulation of sensitive instructions (SI), as shown in Table III. Without any optimization, processing of these five types of exceptions requires full context switching. The light-weight context switch optimization in the ITRI ARM hypervisor reduces the context switching overheads of SC and some percentage of PI, DA and SI. In addition, the ITRI ARM hypervisor emulates some sensitive instructions in user space, thus incurring no context switch at all. The fact that a significant percentage of sensitive instructions is handled completely in user space demonstrates the effectiveness of the SI emulation optimizations built into the ITRI ARM hypervisor.

    This overhead breakdown shows that the main reason behind the significant performance penalty incurred by the ITRI ARM hypervisor in the sysbench-threads and sysbench-memory workloads is the increased number of PI exceptions that require a full context switch, compared to other workloads. In addition, sysbench-threads also makes many more system calls, each of which requires a number of sensitive instructions to return from the guest kernel to the guest user.

    C. Effectiveness of Performance Optimizations

    Table IV presents the performance benefit of each of the three main performance optimizations: static memory partitioning, light-weight context switch, and para-virtualization to remove SI emulation. The "with SPT" column of Table IV shows that implementing memory virtualization using a shadow page table, rather than the static memory partitioning used by the fastest version of the ITRI ARM hypervisor (the "ITRI-ARM" column), incurs no performance loss for all benchmarks except Unixbench-spawn. This is because Unixbench-spawn creates and destroys many processes, causing many updates to the guest page tables and eventually to their shadow copies.


    TABLE III. COUNT OF AND TIME SPENT IN FIVE TYPES OF TRAPS WHEN THE ITRI ARM HYPERVISOR RUNS UNDER SYSBENCH-THREADS AND SYSBENCH-MEMORY: (1) PRIVILEGED INSTRUCTION (PI), (2) DATA ACCESS EXCEPTION (DA), (3) INTERRUPT (IRQ), (4) SYSTEM CALL (SC), AND (5) SENSITIVE INSTRUCTION (SI). LIGHT-WEIGHT CONTEXT SWITCH CAN ONLY BE APPLIED TO SC AND SOME PERCENTAGES OF DA, PI AND SI.

    Benchmark                           | Metric     | Full CS: PI / DA / IRQ | LWCS: PI / DA / SC / SI            | No CS: SI
    sysbench-threads (total = 9.2 sec)  | # of traps | 531275 / 180 / 579     | 386481 / 1060 / 1161816 / 1161184  | 12882150
                                        | time (sec) | 3.03 / 0.0024 / 0.024  | 0.5 / 0.0006 / 0.64 / 0.5          | 2.4
    sysbench-memory (total = 3.9 sec)   | # of traps | 891 / 104 / 169        | 2187 / 708 / 409801 / 414384       | 2908546
                                        | time (sec) | 0.0092 / 0.002 / 0.007 | 0.192 / 0.0005 / 0.238 / 2.3       | 0.7

    TABLE IV. PERFORMANCE IMPACT OF SHADOW PAGE TABLE (SPT), LIGHT-WEIGHT CONTEXT SWITCHING (LWCS), AND EMULATION OF SENSITIVE INSTRUCTIONS IN EXCEPTION HANDLERS (SIE) ON THE ITRI ARM HYPERVISOR, WHOSE OPTIMIZED SETTING IS SPT OFF, LWCS ON AND SIE OFF.

    Benchmark            | ITRI-ARM   | Four Guests | with SPT   | without LWCS | with SIE
    sysbench-cpu (s)     | 0.8        | 0.8         | 0.8        | 1.1          | 0.8
    sysbench-threads (s) | 6.3        | 6.6         | 6.6        | 34.2         | 9.2
    sysbench-memory (s)  | 0.9        | 0.9         | 0.9        | 6.3          | 3.9
    Apache Benchmark (s) | 7.7        | 14.1        | 7.7        | 10.9         | 12.7
    Iperf (Mbps)         | 435        | 123         | 434        | 428          | 109
    hdparm (MB/s)        | 17.2       | 16.5        | 17.1       | 17.1         | 16.7
    dd read/write (MB/s) | 19.2 / 9.6 | 11.3 / 4.8  | 18.6 / 9.1 | 12.5 / 8.8   | 1.4 / 9.5
    Unixbench-double     | 71.7       | 71.6        | 71.5       | 71.6         | 71.7
    Unixbench-spawn      | 182.8      | 117.5       | 40.5       | 44.1         | 95.3

    These guest page table updates eventually result in a more than 4-times slow-down (from 182.8 to 40.5). This degradation is surprisingly low, and one explanation is that the current SPT implementation requires para-virtualization to capture guest page table modifications.

    The substantial difference between the "without LWCS" column and the "ITRI-ARM" column across all benchmarks confirms that the light-weight context switch is a useful optimization that effectively cuts down unnecessary full context switches. The "with SIE" column corresponds to the guest OS configuration that uses emulation rather than para-virtualization to handle sensitive instructions. As a result, for the Unixbench-spawn, iperf and sysbench-memory benchmarks, which trigger many SI exceptions, the difference between the "with SIE" column and the "ITRI-ARM" column is substantial. For the other benchmarks, such as sysbench-cpu and Unixbench-double, these two columns are essentially the same. The performance penalty of each test benchmark when SIE is turned on is proportional to the number of sensitive instructions encountered in the test run. For example, the performance penalty of SIE for sysbench-memory is more significant than that for sysbench-cpu because the number of sensitive instructions executed in the former is more than 8 times that in the latter.

    Because the Marvell board has a quad-core ARM CPU, it is possible to run four guest VMs on it simultaneously, each on a separate core. The "Four Guests" column in Table IV shows the average measured performance of each test benchmark running on each guest VM when four guest VMs run on the Marvell board. For CPU-intensive benchmarks, there is nearly no difference between the one-guest and four-guest configurations. This means the ITRI ARM hypervisor is able to efficiently utilize all four cores to run the four guest VMs. However, for some memory-intensive benchmarks, e.g. Unixbench-spawn, the four-guest configuration shows a non-trivial performance degradation, because the L2 cache and memory are shared among the guest VMs. For iperf, the performance of the four-guest configuration is roughly one quarter of that of the one-guest configuration because the network card of the Marvell board is shared among the four guest VMs.

    TABLE V. THE PERFORMANCE OF A SET OF BENCHMARKS WHEN THEY RUN IN A NORMAL VM AND WHEN THEY RUN IN A MIGRATED VM, AND THE SERVICE DOWN TIME OF THE TEST VMS WHEN THEY ARE MIGRATED.

    Benchmark                 | Normal     | In Migration | Down Time (s)
    sysbench-cpu2 (s)         | 10.77      | 25.91        | 2.05
    sysbench-threads (s)      | 5.46       | 19.42        | 2.27
    sysbench-memory2 (s)      | 8.28       | 25.73        | 2.43
    hdparm (MB/s)             | 10.24      | 2.40         | 2.53
    dd for write/read (MB/s)  | 7.3 / 10.0 | 6.0 / 8.0    | 4.05
    Unixbench-double          | 72.4       | 72.2         | 2.05
    Unixbench-spawn           | 198.5      | 197.3        | 2.13

    For dd, surprisingly, the performance of the four-guest configuration is one half rather than one quarter of that of the one-guest configuration, because the bottleneck in this case is the virtio block device rather than the underlying disk, and so the one-guest configuration's performance is not the best it could be.

    D. Live Migration Performance

    Table V shows the performance of a set of benchmarks when they run in a normal VM and when they run in a VM that is being migrated from one physical machine to another. The performance differences between these two columns represent the performance impact of migration on the test applications. The "Normal" column has the same configuration as the "ITRI-ARM" column in the previous tables, except for sysbench-cpu and sysbench-memory: because their original elapsed times were too short to measure reliably, we adjusted their parameters to make them execute for a longer period. Also, we put the virtual disk image on a network file system for migration. In general, the more memory state is modified, the larger the performance degradation. For CPU/memory-intensive workloads, the performance degradation is particularly significant because they modify more memory pages, and thus cause higher dirty page tracking and memory state copying overhead. The "Down Time" column corresponds to the freeze time of the VM being migrated in each test, and ranges from 2 to 4 seconds. These down times are higher than expected because the current ITRI ARM hypervisor prototype uses only two iterations in each VM migration operation, rather than multiple iterations, which could make the residual memory state small enough.

    V. CONCLUSION

    In this paper we present a fully functional KVM-ARM hypervisor running on the Marvell ARMADA-XP board, which has already been deployed in some server products. To the best of our knowledge, this is the first hypervisor that successfully runs multiple VMs with full isolation on a real-world ARM Cortex-A9-based SOC, which does not provide any hardware support for virtualization. Although the KVM-ARM hypervisor described here borrows heavily from the ideas and implementations of the KVM x86 project, it still features several innovations that are tailored to the Marvell platform, including binary rewriting to remove unnecessary cache/TLB flushing, memory virtualization using static memory partitioning and physical memory remapping, and the light-weight context switch.

    With these optimizations, the performance of CPU-intensive benchmarks running on VMs that sit on top of the KVM-ARM hypervisor is almost the same as native performance, but the performance of system-activity-intensive benchmarks fares much worse, sometimes with up to a factor of 5 slow-down. Most of the performance penalty associated with system-activity-intensive benchmarks is due to sensitive instruction emulation and exception handling for privileged instructions. We plan to further optimize this KVM-ARM hypervisor, and expect to port advanced features such as VM migration, and to develop VM fault tolerance, for the hypervisor on Cortex-A15 and the next-generation Cortex-A53/A57.

    REFERENCES

    [1] Marvell powers Dell "Copper" ARM server. http://www.marvell.com/company/news/pressDetail.do?releaseID=2396
    [2] Calxeda ECX-1000. http://www.calxeda.com/technology/products/processors/ecx-1000-series/
    [3] ZT Systems announces ARM-based server solution. http://ztsystems.com/Default.aspx?tabid=1484
    [4] Intel Virtualization Technology. http://www.intel.com/technology/virtualization/
    [5] C. Dall and J. Nieh, "KVM for ARM," in Proceedings of the Ottawa Linux Symposium, Ottawa, Canada, 2010.
    [6] S. Sang-bum, "Secure Xen on ARM: Status and driver domain separation," 5th Xen Summit, 2007.
    [7] VMware, Horizon Mobile. http://www.vmware.com/products/desktop virtualization/mobile/overview.html
    [8] EmbeddedXEN virtualization framework. http://sourceforge.net/projects/embeddedxen/
    [9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164-177, October 2003. Available: http://doi.acm.org/10.1145/1165389.945462
    [10] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux Symposium, vol. 1, 2007, pp. 225-230.
    [11] J. Ding, C. Lin, P. Chang, C. Tsang, W. Hsu, and Y. Chung, "ARMvisor: System virtualization for ARM," in Linux Symposium, 2012, p. 95.
    [12] K. Adams and O. Agesen, "A comparison of software and hardware techniques for x86 virtualization," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII). New York, NY, USA: ACM, 2006, pp. 2-13. Available: http://doi.acm.org/10.1145/1168857.1168860
    [13] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating two-dimensional page walks for virtualized systems," in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIII). New York, NY, USA: ACM, 2008, pp. 26-35. Available: http://doi.acm.org/10.1145/1346281.1346286
    [14] G. Hoang, C. Bae, J. Lange, L. Zhang, P. Dinda, and R. Joseph, "A case for alternative nested paging models for virtualized systems," Computer Architecture Letters, vol. 9, no. 1, pp. 17-20, Jan. 2010.
    [15] P. Varanasi and G. Heiser, "Hardware-supported virtualization on ARM," in Proceedings of the Second Asia-Pacific Workshop on Systems (APSys '11). New York, NY, USA: ACM, 2011, pp. 11:1-11:5. Available: http://doi.acm.org/10.1145/2103799.2103813
    [16] Virtual Open Systems. http://www.virtualopensystems.com/
    [17] S. Stabellini and I. Campbell, "Xen on ARM Cortex A15," Xen Summit, 2012.
    [18] Red Bend Software: vLogix Mobile for mobile virtualization. http://www.redbend.com/en/products-services/mobile-virtualization/how-it-works
    [19] G. Heiser and B. Leslie, "The OKL4 microvisor: convergence point of microkernels and hypervisors," in Proceedings of the First ACM Asia-Pacific Workshop on Systems (APSys '10). New York, NY, USA: ACM, 2010, pp. 19-24. Available: http://doi.acm.org/10.1145/1851276.1851282
    [20] ARM, big.LITTLE processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php
    [21] M. Zyngier, "KVM on arm64." http://lwn.net/Articles/529848/
    [22] ARM, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
    [23] AMD, "AMD64 virtualization codenamed 'Pacifica' technology: Secure Virtual Machine architecture reference manual." http://www.cs.utexas.edu/users/hunt/class/2005-fall/cs352/docs-em64t/AMD/virtualization-33047.pdf
    [24] F. Bellard, "QEMU, a fast and portable dynamic translator," in Proceedings of the USENIX Annual Technical Conference (ATEC '05). Berkeley, CA, USA: USENIX Association, 2005, pp. 41-41. Available: http://dl.acm.org/citation.cfm?id=1247360.1247401
    [25] R. Russell, "virtio: towards a de-facto standard for virtual I/O devices," SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 95-103, Jul. 2008. Available: http://doi.acm.org/10.1145/1400097.1400108
    [26] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI '05). Berkeley, CA, USA: USENIX Association, 2005, pp. 273-286. Available: http://dl.acm.org/citation.cfm?id=1251203.1251223
    [27] Marvell, ARMADA-XP overview. http://www.marvell.com/embedded-processors/armada-xp/
    [28] A. Kopytov, SysBench: a system performance benchmark. http://sysbench.sourceforge.net/
    [29] The Apache Software Foundation, ab - Apache HTTP server benchmarking tool. http://httpd.apache.org/docs/2.0/programs/ab.html
    [30] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs, Iperf: the TCP/UDP bandwidth measurement tool. http://iperf.sourceforge.net/
    [31] M. Lord, hdparm. http://hdparm.sourceforge.net/
    [32] I. Smith and K. Lucas, UnixBench. http://code.google.com/p/byte-unixbench/
