
Efficient Shared Memory Orchestration Towards Demand Driven Memory Slicing

Qi Zhang (1,2), Ling Liu (1), Calton Pu (1), Wenqi Cao (1), and Semih Sahin (1)

(1) Georgia Institute of Technology, (2) IBM Thomas J. Watson Research Center

Abstract—Memory is increasingly becoming a bottleneck for big data and latency-demanding applications in virtualized systems. Memory efficiency is critical for high-performance execution of virtual machines (VMs). Mechanisms proposed for improving memory utilization often rely on accurate estimation of VM working set size at runtime, which is difficult under changing workloads. This paper studies memory efficiency and its impact on the performance of VMs. Through MemLego, a lightweight shared memory based system, we show several opportunities for improving memory efficiency for high performance VM execution. First, we show that if each VM is initialized with an application-specified lower bound of memory, then by maintaining a shared memory region across VMs in the presence of temporal memory usage variations on the host, VMs under high memory pressure can minimize their performance loss by opportunistically and transparently harvesting idle memory on other VMs. Second, we show that by enabling on-demand VM memory allocation and deallocation in the presence of changing workloads, VM performance degradation due to memory swapping can be effectively reduced, compared to the conventional VM configuration scenario in which all VMs are allocated the upper bound of memory requested by their applications. Third, we show that by providing shared memory pipes between co-resident VMs, one can speed up inter-VM communication and data transfer by leveraging the network-memory latency gap and avoiding the unnecessary overhead of establishing network connections. MemLego achieves all of this without requiring any modification to user applications or the OSes. We demonstrate the effectiveness of these opportunities through extensive experiments on unmodified Redis and Memcached. Using MemLego, the throughput of Redis and Memcached improves by up to 4x over the native system without MemLego, and by up to two orders of magnitude when the applications' working set does not fit in memory.

I. INTRODUCTION

Main memory is a critical and shared resource in virtualized computing environments. While the number of CPU cores doubles approximately every 2 years, DRAM capacity doubles roughly every 3 years [1]. As a result, the memory capacity per core is expected to drop by about 30% every two years [2]. The trend is worse for memory bandwidth per core [3]. At the same time, with the upsurge of big data processing and big data analytics, main memory is increasingly becoming a bottleneck in virtualized systems. A large portion of big data software and applications aims to maximize the use of memory and minimize access to high-latency backing storage. For example, Redis [4] and Memcached [5] are popular in-memory stores for big data workloads, and Spark [6], Flume [7], and Kafka [8] are popular big data computing platforms for memory-intensive applications.

However, memory utilization is far from satisfactory, and improving memory efficiency is becoming an important optimization goal for many big data systems and applications. Recent statistics from Google datacenter traces [9], [10] show that many applications allocate more than 90% of the memory available to them, but less than 50% of the allocated memory is actually used. One reason is that predicting the memory working set is difficult, and the memory allocation configured for each virtual machine (VM) tends to be conservative in order to meet the expected and often overestimated peak demand, resulting in temporal memory usage variations and idle memory across VMs most of the time.

Advances in computer hardware technologies tackle the memory bottleneck problem along two dimensions. (1) Hardware enhancements on a single machine: various DRAM optimization technologies have been proposed and developed for improving DRAM parallelism [11], [12], latency and energy [13], [14], [15], [16], [17], and minimizing memory capacity and bandwidth waste [18], [19], [20]. In addition, emerging memory technologies, such as non-volatile random-access memory (NVRAM) and hybrid main memory systems, are being proposed and deployed. Research and development (R&D) efforts along both dimensions confirm the importance of achieving efficient memory utilization under changing workloads; [3] provides a good review of problems and challenges in memory systems from a computer hardware research perspective. (2) New architectures and new hardware designs for memory disaggregation: examples include HP's The Machine project [21], Intel RSA [22], and proposals for disaggregated memory and new network protocols for cutting down network communication cost [23], [24].

Orthogonal research efforts for improving memory efficiency have been pursued in software design and optimization, such as dynamic memory balancing through estimating the working set of each VM [25], [26], [27], [28] and moving memory from one VM to another via the balloon driver [29] or application-level ballooning [30]. However, accurate estimation of VM working set size and accurate detection of the right time to trigger ballooning are both difficult problems [31], [28].

It is widely recognized that virtualization technologies have fundamentally changed how computing resources are shared in two notable ways. (1) They provide a Xen-style hypervisor-coordinated, or KVM-style host-coordinated, CPU scheduler that allows VMs to simultaneously share the CPU capacity of their host through demand-driven CPU time slicing [32], [33], [34], transparently to guest OSes and user applications. (2) They provide a uniform virtual block device driver interface that enables transparent and selective sharing of external computing resources as I/O block devices (e.g., NIC, GPU, disk, SSD). However, demand-driven memory slicing remains an open challenge. This paper is geared towards exploiting the potential of creating host-coordinated shared memory pools to enable demand-driven memory slicing, without requiring any modifications to applications or the OSes. We present MemLego, a lightweight, shared memory based approach for improving memory efficiency and achieving high performance VM execution, with transparency to applications and OSes. Through MemLego, we show the opportunity and feasibility of using a shared memory based approach to provide demand-driven memory slicing along three dimensions. First, we show that by reserving and managing a shared memory region at the host and providing efficient memory sharing across its VMs, a lightweight mechanism can be effective for VMs to enjoy on-demand memory allocation and de-allocation proportional to their workloads' demands. Second, by presenting a shared memory backed guest paging facility as an integral part of the MemLego system, we show that a shared memory based solution can effectively utilize host idle memory when the working set of applications does not fit fully in memory. Third, by establishing inter-VM shared memory pipes in MemLego, we show the opportunity to further improve the communication efficiency of data transfer across VMs on a host, which results in better VM execution performance and application throughput. We demonstrate the effectiveness of these opportunities through extensive experiments on unmodified Redis and Memcached workloads, showing that with MemLego, throughput of Memcached and Redis improves by up to 4x on average. When the working set of applications does not fit in memory, using MemLego's shared memory paging capability, throughput of Redis and Memcached improves by up to two orders of magnitude over the conventional OS swap facility.

II. MOTIVATION AND OBSERVATIONS

In this section, we motivate the design of MemLego.

(1) Barriers of Dynamic Memory Balancing. In-memory computing is gaining popularity for big data analytics applications for one obvious reason: when applications are served completely from memory and disk accesses are minimized, they enjoy the full benefit of the large disk-memory latency gap with high throughput and low latency. However, as soon as their working sets no longer fully fit in memory, these applications suffer large performance loss, even when idle memory exists in other VMs on the host.

Hence, existing dynamic memory balancing solutions have centered on accurate working set size estimation and memory ballooning. Just-in-time ballooning depends heavily on accurate working set size estimation across VMs. When the estimation is pessimistic, ballooning might be triggered too early, and a VM may receive more memory than it actually needs, at the cost of memory access interception [28]. When the estimation is optimistic, the working sets of applications no longer fully fit in memory, memory paging has already happened, and applications have started suffering performance loss before ballooning is triggered. In addition to the timing delay in triggering ballooning, VMs under memory pressure may experience the balloon driver delay in moving sufficient memory between host and guest VMs, as well as the VM swap-in paging delay. Each of these three types of delay may incur detrimental performance degradation for applications. The timing delay can cause a guest VM to start swapping even when there is enough idle memory at the host, and VM performance falls to the floor [35]. The balloon driver delay may occur in the presence of high CPU utilization at the host or guest VMs, which further exacerbates guest memory paging, decreases application throughput, and increases the likelihood of out-of-memory (OOM) induced crashes.


Fig. 1: Balancing delay
Fig. 2: Memory inflating delay
Fig. 3: Redis performance

Also, per-page swap-in upon page fault may cause further delay that prevents applications from benefiting immediately from the additional ballooned memory.

To illustrate these problems, we conduct a set of experiments on an Intel server with two 6-core Intel Xeon-Westmere X5675 processors, 24GB DDR3 physical memory, a 1.5TB SCSI hard disk, and a 1Gbit Ethernet interface. The host machine runs Ubuntu 14.04 with kernel version 4.1.0 and uses KVM 1.2.0 with QEMU 2.0.0 as the virtualization platform. The guest VMs also run Ubuntu 14.04 with kernel version 4.1.0, and 4 VMs with 4GB each run simultaneously on the host. We refer to this setup as the native system for presentation convenience. We run Redis [4] with a YCSB workload [36] on a VM. Peak memory usage for loading 3GB of data into Redis was close to 4GB, so 100% of its working set fits in memory. We then increase the Redis workload to 6GB of data such that only 75% of its working set fits in memory. Figure 1 shows the throughput of Redis. Figure 2 shows the elapsed time of ballooning varying amounts of memory to the VM running Redis under varying CPU utilization ratios.

We observe that as soon as guest memory paging occurs, the throughput of Redis drops sharply. After a sufficient amount of additional memory is added by ballooning, the throughput of Redis starts to recover, but at a very slow pace: it takes more than 60 seconds, because of the per-page swap-in upon page fault in the conventional OS disk swap facility. The elapsed time in Figure 2 includes the CPU time consumed by the balloon driver and the waiting time for the balloon driver to be scheduled. Clearly, the elapsed time increases as the CPU utilization on the VM grows. If high CPU utilization is present in both the VM and its host, even longer CPU waiting time is introduced and thus a longer total elapsed time to move the scheduled amount of memory. Through this experiment, we show the barriers of dynamic memory balancing via the three types of delay. This motivates us to design and implement MemLego, a shared memory based solution for exploiting idle memory across VMs on the host in a lightweight and non-intrusive manner.

(2) Benefits of the MemLego Approach. Instead of using a reactive approach like existing proposals for dynamic memory balancing, MemLego improves memory efficiency by taking a proactive approach. Instead of depending on estimating working set sizes and detecting memory imbalance across VMs, MemLego exploits temporal and transient memory usage variations to utilize idle memory proactively in three steps. First, MemLego allocates the initial memory for each VM according to its lower-bound memory capacity. Second, MemLego reserves and manages a host memory region that is shared across co-hosted VMs. Third, MemLego provides a lightweight, on-demand memory allocation and de-allocation mechanism that allows VMs to add memory from, or release idle memory to, the shared memory pool according to their changing workloads. This design provides three benefits: (i) it conserves more free memory at the host, keeping it ready to use, (ii) it allocates additional memory to a VM immediately upon workload demand without relying on constant monitoring, and (iii) it avoids the unnecessary ballooning delay of moving memory between VMs via the host.

Figure 3 compares MemLego with both the native system and the native system with ballooning enabled. The experimental setup is the same as the one used in Figure 1. We observe that without ballooning, the Redis server crashed when only 65% of its working set fit in memory. With the balloon driver enabled on the native system, Redis avoids crashing because additional memory is ballooned to the VM, but it takes more than 60 seconds for the throughput of Redis to return to its expected level after the VM has obtained sufficient memory. This long delay is due to the page-fault-triggered, per-page swap-in. Using MemLego, Redis responds to guest paging gracefully through its shared memory based paging (MemSwap) and responds seamlessly to the high memory pressure on the Redis server by dynamically allocating additional memory from the host-managed shared memory to the Redis server. These experiments demonstrate the benefits of MemLego over the conventional (native) system and over the native system with an existing memory balancing solution, such as the balloon driver, enabled.

III. MEMLEGO SHARED MEMORY SYSTEM

This section describes the design of the MemLego core components: ShmManager and MemExpand. The former is responsible for establishing a shared memory channel between the host and the VMs to enable flexible, on-demand sharing of the free memory at the host. The latter is responsible for providing dynamic allocation and de-allocation of shared memory based on the workload demands of the respective VMs. Figure 4 shows a sketch of the MemLego system architecture. We illustrate the details in the subsequent sections.

Fig. 4: MemLego Architecture

A. Establishing Shared Memory Channel

MemLego is implemented as a shared memory optimization layer between the host kernel and the VMs running on the host. The main job of the ShmManager is to enable VMs to access the host-guest shared memory region in a coordinated manner. Before launching VMs on the host, the MemLego Initiator allocates and initializes a segment of the host free memory as the reserved shared memory region managed by MemLego. It also creates an emulated virtual PCI device as the bridge between the host and the VMs, such that a virtual PCI device can be mounted in each VM. A PCI device has several base address registers (BARs), which are used to specify the virtual memory regions that this PCI device can use. Therefore, the shared memory allocated by the Initiator is assigned to one of the BARs of the PCI device as its memory region. Inside a VM, a PCI device driver maps the PCI device memory into the VM's kernel address space when the driver is loaded. For example, pci_ioremap_bar(pci_dev, 2) requests the kernel to map the memory region specified by BAR2 into the VM's kernel address space. After the VM kernel maps a memory region into its kernel address space, other kernel modules as well as user-level applications can access the shared memory.
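The guest-side mapping can be illustrated with a minimal PCI driver sketch. This is not the actual MemLego driver; the vendor/device IDs, module name, and the global shm_base pointer below are assumptions, and only pci_ioremap_bar() on BAR2 reflects the mechanism described above.

/* Minimal sketch of a guest-side PCI driver that maps the emulated
 * device's BAR2 into the VM kernel address space. The vendor/device
 * IDs and the shm_base pointer are illustrative assumptions. */
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/io.h>

#define SHM_VENDOR_ID 0x1af4   /* hypothetical vendor ID */
#define SHM_DEVICE_ID 0x1110   /* hypothetical device ID */

static void __iomem *shm_base;  /* start of the mapped shared region */

static int shm_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int err = pci_enable_device(pdev);
        if (err)
                return err;

        /* Map the memory region described by BAR2 into kernel space. */
        shm_base = pci_ioremap_bar(pdev, 2);
        if (!shm_base) {
                pci_disable_device(pdev);
                return -ENOMEM;
        }
        return 0;
}

static void shm_remove(struct pci_dev *pdev)
{
        iounmap(shm_base);
        pci_disable_device(pdev);
}

static const struct pci_device_id shm_ids[] = {
        { PCI_DEVICE(SHM_VENDOR_ID, SHM_DEVICE_ID) },
        { 0, }
};
MODULE_DEVICE_TABLE(pci, shm_ids);

static struct pci_driver shm_driver = {
        .name     = "memlego_shm",
        .id_table = shm_ids,
        .probe    = shm_probe,
        .remove   = shm_remove,
};
module_pci_driver(shm_driver);
MODULE_LICENSE("GPL");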

The two interface functions that ShmManager provides to the VMs are unsigned long shm_malloc(size_t size) and void shm_free(unsigned long offset, unsigned long len). The former allocates a piece of shared memory from the shared memory region with the capacity specified by the input parameter size, and returns to the VM the offset within the shared memory region at which the allocated piece starts; shm_malloc() returns -1 when there is insufficient free shared memory to satisfy the allocation request. The latter function allows the VM to free a previously allocated piece of shared memory.
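The interface can be summarized as a small header. The sketch below only restates the two calls described above; the header layout, the SHM_ALLOC_FAILED macro, and the example function are assumptions rather than MemLego's published API.

/* Sketch of the ShmManager interface as described in the text. */
#include <stddef.h>

#define SHM_ALLOC_FAILED ((unsigned long)-1)

/* Allocate `size` bytes from the host-managed shared memory region.
 * Returns the offset of the allocated piece within the region, or
 * (unsigned long)-1 when not enough free shared memory is available. */
unsigned long shm_malloc(size_t size);

/* Free a previously allocated piece identified by offset and length. */
void shm_free(unsigned long offset, unsigned long len);

/* Example: reserve one 4KB piece and release it again. */
static inline int shm_smoke_test(void)
{
        unsigned long off = shm_malloc(4096);
        if (off == SHM_ALLOC_FAILED)
                return -1;              /* pool exhausted */
        shm_free(off, 4096);
        return 0;
}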

B. Organizing Shared Memory

To support flexible shared memory management, the shared memory region in the host is divided into shared memory slabs of fixed size (e.g., 32KB, 128KB, or 512KB per slab), and free slabs are maintained in a linked list. This enables the shared memory to be allocated to or revoked from the VMs slab by slab. In MemLego, we organize the shared memory slabs into three types of linked lists, representing three states of slabs: active, inactive, and idle. Active slabs are those that have been allocated to some VM and currently contain valid data. Inactive slabs are those that have been allocated to some VM but are not, or are no longer, actively used by the applications running on that VM. When a slab does not belong to any VM, it is idle. Slabs may dynamically move from one state to another based on their current usage. For each VM, we maintain an active_slabs list and an inactive_slabs list. In addition, all the free slabs that have not been allocated to any VM are kept together in a linked list named free_slabs.

Three types of metadata are maintained to facilitate VM access to the host-coordinated shared memory pool: free_shm, list_descriptor, and slab_descriptor. free_shm is a global piece of metadata located at the front of the shared memory region; it records the offset of the first element of free_slabs. Whenever MemLego wants to allocate a shared memory slab to a VM, it needs to check this metadata first. Since this global metadata is concurrently accessed and updated by multiple VMs, a global lock is assigned to free_shm to guarantee update consistency, and every VM must successfully acquire this lock before reading or updating free_shm. The list_descriptor contains the information about a single linked list that belongs to some VM; for example, it maintains two pointers to the start slab and the end slab that store valid data, so that newly arriving data can be appended right after the end of the current valid data. Since a slab may be partially used, each slab_descriptor records which part of the slab contains valid data, while the rest of the slab is free.
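The exact field layout of this metadata is not given here; the following is one possible encoding, a sketch under the assumption that lists are linked by offsets within the shared region. Field names and sizes are assumptions; only their roles (free list head under a global lock, per-VM start/end pointers, per-slab valid range) come from the description above.

/* Sketch of the shared-memory metadata described in the text. */
#include <stdint.h>

#define SLAB_SIZE (128 * 1024)          /* e.g., 128KB per slab */

enum slab_state { SLAB_IDLE, SLAB_INACTIVE, SLAB_ACTIVE };

/* Global metadata at the front of the shared region: offset of the
 * first element of free_slabs, protected by a global lock. */
struct free_shm {
        uint64_t lock;                   /* global lock word */
        uint64_t first_free_offset;      /* head of the free_slabs list */
};

/* Per-slab metadata: next link plus the valid data range, since a
 * slab may be only partially used. */
struct slab_descriptor {
        uint64_t next_offset;            /* next slab in the same list */
        uint32_t state;                  /* enum slab_state */
        uint32_t valid_start;            /* first valid byte in the slab */
        uint32_t valid_end;              /* one past the last valid byte */
};

/* Per-VM list metadata: start and end slabs of a list, so new data
 * can be appended right after the current valid data. */
struct list_descriptor {
        uint64_t start_offset;           /* first slab holding valid data */
        uint64_t end_offset;             /* last slab holding valid data */
};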

C. On Demand Memory Allocation

MemExpand is the second core component of MemLego. It is built on top of ShmManager to provide on-demand memory allocation from the host shared memory to a VM that is under memory pressure. The implementation of MemExpand needs to address two issues: (1) how to allocate and deallocate shared memory, and (2) how to trigger shared memory allocation on demand when a VM is under memory pressure.

To address the first issue, MemExpand provides two basic interface functions, shm_malloc() and shm_revoke(). When a VM needs more memory, it uses shm_malloc() to allocate free shared memory slabs from the shared memory pool maintained by MemLego, for example by removing the newly allocated slabs from the free list and adding them to the corresponding active slab list of the respective VM. In addition, MemExpand maintains a separate thread that periodically checks the utilization of the linked lists in each host-guest shared memory pool and uses shm_revoke() to reclaim inactive slabs when possible. To prevent information leakage, when memory slabs are removed from a VM's list, all data content is fully erased before the slabs are returned to the free list. Thus, each newly allocated shared memory slab is completely clean (empty).
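The revocation path can be sketched as follows. This is a simplified, self-contained illustration, assuming slabs linked by pointers rather than by offsets in the shared region; the struct and function names are assumptions, but the erase-before-reuse step mirrors the description above.

/* Sketch of shm_revoke(): inactive slabs are zeroed before being
 * returned to the free list so a later owner never sees stale data. */
#include <string.h>
#include <stddef.h>

#define SLAB_SIZE (128 * 1024)

/* Simplified in-memory slab record for illustration only. */
struct slab {
        struct slab *next;
        char data[SLAB_SIZE];
};

/* Move every inactive slab back to the free list, erasing its
 * content first to prevent information leakage across VMs. */
static void shm_revoke(struct slab **inactive_list, struct slab **free_list)
{
        struct slab *s;
        while ((s = *inactive_list) != NULL) {
                *inactive_list = s->next;
                memset(s->data, 0, SLAB_SIZE);   /* full erase before reuse */
                s->next = *free_list;
                *free_list = s;
        }
}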

To address the second issue, we first provide some background. Generally speaking, applications request memory from the operating system either by declaring an array or by using the library call malloc(). Declaring an array gives the application memory from its stack space, and the lifetime of this memory region is limited to the function in which the array is declared; in other words, the allocated array is automatically destroyed when the function returns. Alternatively, applications can obtain memory from their heap space by using malloc(), and a memory region allocated by malloc() lasts until free() is explicitly invoked. Although allocating memory from the stack is easier and faster, allocating from the heap using malloc() is a more flexible approach for large memory allocations and is widely used by a large number of applications for several reasons: (1) since the heap is much larger than the stack, large blocks of memory are usually allocated from the heap using malloc(); (2) the lifetime of memory allocated from the heap can be explicitly controlled by the application; and (3) the size of a memory region allocated by malloc() can be changed dynamically, while array-based allocation from the stack is static and fixed.

In the first prototype implementation of MemLego, MemExpand intercepts the library call malloc() in the guest and replaces it with its own function named expand_malloc(). This approach is lightweight and transparent to both the applications and the VM kernel; neither the applications nor the guest OS kernel need to be modified. We achieve application-level transparency by leveraging the LD_PRELOAD feature of the dynamic linker, which allows users to selectively override functions in shared libraries. The new memory allocation function expand_malloc() is first compiled into a shared library (a .so file), and applications are then executed with this library preloaded. After that, all malloc() calls in the application are automatically redirected to expand_malloc().

The implementation of expand_malloc() needs to determine when it should resort to the default malloc() and when it should allocate memory from the shared memory pool. This is critical, since shared memory allocation should be triggered as soon as the VM experiences memory pressure. Given the high cost of scanning memory to estimate the working set size, MemExpand instead consults MemLego's shared memory paging module to decide when to trigger shared memory allocation. Concretely, when expand_malloc() is called, it first checks whether the usage of the VM swap area has increased, since MemLego also manages guest paging through shared memory (see the next section). If not, the working set of the VM likely still fits in memory, and expand_malloc() invokes the original malloc() from glibc, which allocates memory in the traditional manner via the guest OS. If increased utilization of the shared memory swap partition is observed, the VM may no longer fit 100% of its working set in memory; allocating memory via the default malloc() would then incur even more guest memory paging, and the VM may suffer large performance loss. Therefore, expand_malloc() calls shm_malloc() to allocate memory from the shared memory pool. We show through experiments in Section VI that MemExpand enables efficient memory sharing across VMs through host-coordinated shared memory. Using MemExpand, VM memory can be expanded just in time, without the overhead of monitoring the working set size of each VM, so that the amount of memory swapping traffic is effectively reduced.
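The interposition mechanism can be sketched as a small preloaded library. This is an illustrative sketch, not MemLego's implementation: swap_pressure_detected(), shm_malloc(), and shm_to_ptr() below are placeholder stand-ins for MemLego internals, and a full implementation would also have to interpose free() for shared-memory-backed pointers.

/* Sketch of malloc interposition via LD_PRELOAD in the spirit of
 * expand_malloc(): use glibc malloc while the VM shows no swap
 * pressure, and switch to the shared memory pool otherwise. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* --- placeholder stubs so the sketch compiles standalone --- */
static int swap_pressure_detected(void) { return 0; }
static unsigned long shm_malloc(size_t size) { (void)size; return (unsigned long)-1; }
static void *shm_to_ptr(unsigned long offset) { (void)offset; return NULL; }
/* ----------------------------------------------------------- */

static void *(*real_malloc)(size_t) = NULL;

void *malloc(size_t size)
{
        if (!real_malloc)
                real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

        /* No swap pressure observed: the working set still fits, so the
         * default glibc allocator is used. */
        if (!swap_pressure_detected())
                return real_malloc(size);

        /* Pressure observed: try the host shared memory pool first. */
        unsigned long off = shm_malloc(size);
        if (off == (unsigned long)-1)
                return real_malloc(size);       /* pool exhausted: fall back */
        return shm_to_ptr(off);
}

Built as a shared library (for example, gcc -shared -fPIC -o libexpand.so expand_malloc.c -ldl) and preloaded with LD_PRELOAD=./libexpand.so before launching the unmodified application, this interposes every malloc() call without recompiling the application; the file and library names here are only examples.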

Shared Memory Access Protection. In the first prototype implementation, the memory region allocated by the host is shared among all co-hosted VMs, though each VM can locate only the data in its own shared memory pool through metadata such as the list_descriptor and slab_descriptor. To mitigate the risk that a malicious attacker may eavesdrop on or manipulate the data of other VMs, one approach is to use trust grouping and to allocate a different shared memory region from the host for each trust group. VMs in one group then cannot access the shared memory pools used by VMs in other groups. By placing VMs that have mutual trust into the same trust group, MemLego can provide efficient memory sharing within each trust group instead of across all co-hosted VMs. An alternative and complementary mechanism is to audit access to the shared memory slabs and constrain a VM to access only its own list_descriptor and slab_descriptor metadata through metadata encryption. This reduces the risk of targeted attacks based on eavesdropping or malicious manipulation.

IV. SHARED MEMORY PAGING OPTIMIZATION

We have shown via the MemLego design how to leverage shared memory to enable memory expansion and memory sharing across VMs. In this section, we describe another shared memory based optimization for improving memory paging efficiency. Recall from Figure 1 in Section II that VM memory paging happens when a VM can no longer fit its working set fully in memory. Throughput suffers significantly in the presence of excessive VM memory swapping (and thus thrashing), even though idle memory is available on other VMs on the host.

In MemLego, we provide a shared memory based swap facility, MemSwap, as an integral part of the system. It intercepts VM paging traffic and redirects swap-out and swap-in requests to MemSwap. This shared memory based swap optimization is novel in two aspects. First, it provides a hybrid memory swapping model, which treats a fast but small shared memory swap partition as the primary swap area whenever possible and strategically transitions to conventional disk-based VM swapping on demand. Second, it provides a fast swap-in optimization, which enables the VM to proactively swap in pages from the shared memory using an efficient batch implementation. Concretely, a dedicated shared memory pool is created and initialized for each VM by MemSwap, independent of the shared memory region reserved for MemExpand. Figure 5 gives an overview of the MemSwap design. In the current implementation of MemLego, the shared memory used for swap optimization is managed by MemSwap, and MemExpand is dedicated to dynamic shared memory allocation. Our experiments show that with both MemExpand and MemSwap enabled, VM execution performance outperforms that of MemLego with only MemExpand turned on.

Fig. 5: MemLego Shared Memory Paging via MemSwap

When the shared memory pool for swap-out reaches a pre-defined capacity, MemSwap triggers the hybrid swap process. We use a least-recent-pages-to-disk policy, which moves older pages to the disk swap area when the shared memory swap partition is full, so that the most recently swapped out VM pages are kept in shared memory. We maintain a shm_start pointer and a shm_end pointer pointing to the swapped pages with the smallest and largest offsets in the buffer, respectively. If a page fault arrives with an offset between these two pointers, the access can be served from the shared memory; otherwise, the conventional disk-based swap path is invoked. In this design, a separate worker thread stands by, ready to be triggered to flush pages from the shared memory to the disk swap area. In order to overlap disk I/O with page swap operations, the worker thread starts flushing pages when the shared memory is partially full, say m% full. More details can be found in [37], [38].
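The two decisions just described, where to serve a faulting page from and when to start flushing, can be sketched as follows. The struct layout, the helper names, and the 80% threshold standing in for m% are assumptions; only the offset-range check and the partial-fullness trigger come from the text.

/* Sketch of MemSwap's hybrid swap decisions as described above. */
#include <stdbool.h>

struct memswap {
        unsigned long shm_start;   /* smallest swapped-out page offset in shm */
        unsigned long shm_end;     /* largest swapped-out page offset in shm */
        unsigned long used_slots;  /* pages currently held in shared memory */
        unsigned long total_slots; /* capacity of the shared memory swap area */
};

/* True if a faulting page at `offset` can be served from shared memory;
 * otherwise the conventional disk swap path is taken. */
static bool served_from_shm(const struct memswap *ms, unsigned long offset)
{
        return offset >= ms->shm_start && offset <= ms->shm_end;
}

/* True if the background flusher should start writing the oldest pages
 * to the disk swap area (here when the region is 80% full, i.e. m = 80). */
static bool should_flush_to_disk(const struct memswap *ms)
{
        return ms->used_slots * 100 >= ms->total_slots * 80;
}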

A traditional OS provides two mechanisms to swap pages in from the disk swap area: page faults and swapoff. It is known that relying on page faults to swap pages in from the swap disk is expensive and sometimes results in unacceptable delay in application performance recovery. Therefore, MemLego implements proactive swap-in by extending the swapoff mechanism. Concretely, swapoff is an OS syscall that swaps pages in from the disk swap area in a batched manner. However, directly applying swapoff does not provide the speed-up required for fast application performance recovery: in order to swap in a page X with a corresponding page table entry X_pte reachable from the page global directory (PGD), the OS has to start from the PGD, traverse the PUDs and PMDs, and compare against all the PTEs until X_pte is found.

An efficient implementation of proactive swap-in is to maintain some metadata in the swap area that can quickly locate the corresponding PTEs in the page table without walking the entire PGD. Concretely, when a page is swapped out, the address of the corresponding PTE is kept as metadata in front of the swapped-out page in the shared memory swap partition. For shared pages that have multiple PTEs, we allocate a specific area in the shared memory as a PTE store. In this case, the first byte of the metadata specifies the number of PTEs related to this page, while the last three bytes are an index pointing to the first related PTE in the PTE store. When a page is swapped in, MemSwap can quickly locate the PTE(s) that need to be updated by consulting this metadata, without scanning the page table. Thus, the time spent accessing the PTE of a page to be swapped in from shared memory is a single memory access, and it does not increase as the system-wide page tables grow [37], [38].
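One way to picture this per-page metadata is the following sketch. The one-byte count plus three-byte index packing for shared pages follows the description above; the union layout and helper names are assumptions.

/* Sketch of the swap-out metadata stored in front of each swapped-out
 * page in the shared memory swap partition. */
#include <stdint.h>

union swap_page_meta {
        uint64_t pte_addr;   /* single mapping: address of the page's PTE */
        uint32_t shared;     /* shared page: count (1 byte) + PTE-store index (3 bytes) */
};

static inline uint32_t pte_meta_pack(uint8_t npte, uint32_t store_index)
{
        return ((uint32_t)npte << 24) | (store_index & 0x00FFFFFFu);
}

static inline uint8_t pte_meta_count(uint32_t meta)  { return (uint8_t)(meta >> 24); }
static inline uint32_t pte_meta_index(uint32_t meta) { return meta & 0x00FFFFFFu; }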

V. SHARED MEMORY BASED COMMUNICATION

To further improve VM execution performance in the presence of inter-VM communication, we implement MemPipe, a shared memory based optimization for more efficient inter-VM communication, on top of ShmManager in MemLego. Network packets between co-located VMs are transferred through the shared memory channel. There are several advantages of using shared memory via MemPipe over traditional network communication: (1) the communication path is shorter, since packets are intercepted inside the VM and redirected to shared memory, skipping the network stack in both VMs; (2) since the shared memory establishes a direct channel for inter-VM communication, the hypervisor does not need to be involved in transferring the packets, so context switches between the VMs and the hypervisor are avoided; and (3) network packets do not have to be copied between the VMs and the host.

Fig. 6: MemLego Shared Memory Pipe via MemPipe

Figure 6 gives an overview of the MemPipe design. MemPipe uses an event-driven and decentralized approach to detect whether a pair of sender and receiver VMs are co-located. If they are, it creates a piece of shared memory using the interfaces of ShmManager, which allows an appropriate number of shared memory slabs to be allocated initially and to be expanded depending on the VMs' network traffic. Within a VM, all outgoing packets are intercepted and the destination address is checked for co-location by the packet interceptor. If a packet is destined for a co-located VM, it is redirected to the shared memory instead of going through the VM network driver. At the same time, the events manager in each VM sends a notification to the VM that the packet is heading to, so that the receiver VM can fetch the packet from the shared memory.

To further improve the performance of shared memory pipes for inter-VM communication, we introduce socket buffer redirection, which allows the sender VM's packets to be copied directly from user space to the shared memory, skipping the VM kernel buffer. We also introduce an anticipated time window (ATW) to optimize per-packet notification: ATW-based notification grouping with bounded delay, tuned by the time window and batch size, effectively reduces the number of notifications between sender and receiver VMs and significantly cuts down the number of software interrupts to be handled in both the sending and receiving VMs.
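The ATW grouping logic can be sketched as a small sender-side state machine. The thresholds and struct layout below are assumptions chosen for illustration; only the rule "signal when either the batch size or the bounded delay is reached" comes from the description above.

/* Sketch of anticipated-time-window (ATW) notification batching. */
#include <stdbool.h>
#include <stdint.h>

struct atw_batcher {
        uint32_t pending;        /* packets placed in shared memory, not yet signaled */
        uint64_t window_start;   /* timestamp (ns) when the current window opened */
        uint32_t max_batch;      /* e.g., 32 packets per notification */
        uint64_t max_delay_ns;   /* e.g., 100 microseconds of bounded delay */
};

/* Called after each packet is copied into the shared memory pipe.
 * Returns true when a single notification should be sent to the
 * receiver VM, covering the whole accumulated batch. */
static bool atw_should_notify(struct atw_batcher *b, uint64_t now_ns)
{
        if (b->pending == 0)
                b->window_start = now_ns;
        b->pending++;

        if (b->pending >= b->max_batch ||
            now_ns - b->window_start >= b->max_delay_ns) {
                b->pending = 0;          /* one interrupt covers the batch */
                return true;
        }
        return false;
}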

VI. EVALUATION

The experiments are conducted on an Intel Xeon based server provisioned from the SoftLayer cloud [39], with two 6-core Intel Xeon-Westmere X5675 processors, 24GB of DDR3 physical memory available for the guest VMs, a 1.5TB SCSI hard disk, and a 1Gbit Ethernet interface. The host machine runs Ubuntu 14.04 with kernel version 4.1.0 and uses KVM 1.2.0 with QEMU 2.0.0 as the virtualization platform. The guest VMs also run Ubuntu 14.04 with kernel version 4.1.0, and 4 VMs run simultaneously on the host.


Fig. 7: Effectiveness of MemLego on Memcached. (a) Write only, (b) Write intensive, (c) Read only, (d) Read intensive.

Fig. 8: Effectiveness of MemLego vs. balloon driver. (a) Redis, (b) Memcached.

We measure the performance of MemLego on unmodified Redis [4] and Memcached [5].

A. MemLego with MemExpand

We first measure the effectiveness of the MemLego core with only MemExpand turned on and with MemSwap and MemPipe disabled. We use YCSB [36] to generate the client workloads for Redis, and Memtier [40] to create the client workloads for Memcached, since YCSB does not support Memcached. In this set of experiments, 4 VMs run on the host machine, each initialized with 4GB of memory and a 4GB disk swap area. When measuring Memcached (or Redis) performance, we set VM1 as the server VM. We measure the performance of the server VM with four types of workloads: write only, write intensive, read only, and read intensive. The first two workloads inject 6GB of data into the server VM, while the latter two uniformly read the injected data. The read-write ratio is 100:1 in the read intensive workloads and 1:100 in the write intensive workloads.

Figure 7 shows the results for the four typical types of workloads on Memcached. We make two observations. First, for the write only and write intensive workloads, MemLego and the native system (without MemLego) start with similar performance at the beginning, because the Memcached server VM has sufficient memory and every write operation corresponds to a single memory access. However, the performance of the native case starts to drop drastically around the 33rd second. This is because there is insufficient memory in the Memcached server VM and memory swapping starts to increase, which causes much of the written data to be swapped out to the disk swap partition. In comparison, MemLego handles this surge of memory pressure calmly and smoothly: the performance of both the write only and write intensive workloads is not affected much by the memory pressure experienced in VM1, benefiting from the just-in-time addition of memory from the shared memory pool to VM1 by MemExpand. For the write only workload, throughput stays roughly constant at around 20MB/sec. In addition, MemLego improves the read only and read intensive workloads over the native case as well. For the read intensive workload, MemLego improves throughput by 75%, from around 12MB/sec to 21MB/sec. In the native case, 6GB of data is bulk loaded into the Memcached server (VM1), which has only 4GB of DRAM; thus about 2GB of data is placed in the swap disk mounted to VM1. For read workloads that read uniformly over the 6GB of data, some read requests must fetch data from disk, incurring higher overhead due to the slow disk I/O involved. This explains why the performance of read operations in the native case is consistently worse than that of MemLego. In addition, we observe that, for the read only and read intensive workloads, the overhead of swapping from the shared memory swap area is negligible. Similar results are observed when performing the same set of experiments on Redis; due to space constraints we omit them here, and readers may refer to [37], [38] for details.

Next, we compare the performance of MemLego to that of the native system with the balloon driver enabled, using Redis and Memcached. Following the suggestion of [29], a periodic interval of 30 seconds is used as the trigger interval.


Fig. 9: Throughput of Redis server measured by YCSB workloads. (a) Insert, (b) Update, (c) Read, (d) Scan.

Fig. 10: Performance of Memcached under different MemLego configurations

Figure 8 shows similar trends for both the Redis and Memcached workloads. The native case performs worst due to heavy paging traffic under memory pressure. The balloon-driver-enabled native system improves performance soon after ballooning is triggered (see the middle curve): throughput of Redis starts to recover after the 30th second and throughput of Memcached starts to recover around the 60th second. In comparison, MemLego responds to the same memory pressure more gracefully, stably maintaining the throughput of both Redis and Memcached around peak performance. This shows that allocating additional shared memory from the shared memory pool to VM1 via MemExpand incurs negligible overhead and trivial delay compared to ballooning.

B. MemLego Shared Memory Paging Optimization

We compare MemLego with MemSwap enabled against the native system on Redis workloads. In this set of experiments, 4 VMs run on the host, each initialized with 4GB of memory. In order to evaluate the effectiveness of MemLego's shared memory paging, we turn off MemExpand and let both systems use the balloon driver to move idle memory from the other 3 VMs. VM1 runs Redis with 5GB of data pre-loaded in memory, while the other 3 VMs are idle. When another 7GB of workload data is added to the Redis server VM, the total memory demand of VM1 grows to 12GB, so VM1 can fit only about one third of its working set in memory.

Figure 9 displays the throughput results. First, the Redis throughput drops significantly around the 10th second for both systems, due to the increased memory pressure. Note that MemLego has MemExpand disabled and MemSwap has only 1GB in the shared memory swap area. Second, MemLego shows a slightly smaller performance drop and its throughput recovers quickly, significantly faster than the native system with ballooning. Consider the Read workload in Figure 9(c): its throughput drops from 17813 OP/sec to 7049 OP/sec at the 10th second, but it takes MemLego only about 5 seconds to fully recover, whereas it takes more than 80 seconds for the native system to recover to its previous peak throughput.

C. MemLego with Different Configurations

In this section we evaluate MemLego with three settings: MemExpand enabled; MemExpand and MemSwap both enabled; and MemExpand, MemSwap, and MemPipe all enabled. We measure how much performance improvement each of the three components contributes to the overall effectiveness of MemLego. Two VMs run on the same host, each initialized with 4GB of memory. A 3GB host shared memory area is reserved, equally divided into three regions (1GB for MemExpand, 1GB for MemSwap, 1GB for MemPipe), and shared across the VMs. One of the VMs runs as a Memcached server, while the other runs as a client. The write only and write intensive workloads inject 6GB of data into the Memcached server, while the read only and read intensive workloads read uniformly over this data.

Figure 10 shows the results. The performance comparison is consistent: MemLego with all three components enabled delivers the highest throughput for all four Memcached workloads, followed by MemLego with both MemExpand and MemSwap, and then MemLego with MemExpand only. The native system with ballooning has the lowest throughput. Compared to the native case, MemLego with only MemExpand enabled improves the average throughput of the write only workload from 8MB/s to 14MB/s. Turning on MemSwap further improves performance by 50%, since another 1GB of shared memory can then be used by the server VM as its swap area. Finally, by integrating MemPipe, the throughput of the write intensive workload is improved from 21MB/s to 27MB/s, as MemPipe accelerates the data transfer between the client VM and the Memcached server VM.

VII. RELATED WORK

The transcendent memory (tmem) feature on Linux by Oracle and active memory sharing on AIX by IBM PowerVM are two representative efforts in dynamic memory balancing via host-guest coordination. Transcendent memory [41] allows a VM to directly access a free memory pool in the host, which can be used by the guest operating system (OS) to invoke host OS services, and by the host OS to obtain memory usage information from the guest VM [42]. For applications that implement their own memory resource management, such as database engines and Java virtual machines (JVMs), [30] proposes application-level ballooning mechanisms to reclaim and free memory. However, most of the proposals in this thread rely on substantial changes to the guest OS or to applications, making the solutions harder to deploy widely.

Some research focuses on host-coordinated ballooning for VM memory overcommitment. The main idea is to embed a driver module into the guest OS to reclaim or recharge VM memory via the host. The balloon driver, proposed in 2002 [29], has been widely adopted in mainstream virtualization platforms, such as VMware [43], KVM [44], and Xen [45]. A fair amount of research has been devoted to periodic estimation of VM working set size, because an accurate estimation is essential for dynamic memory balancing using the balloon driver. For example, VMware introduced statistical sampling to estimate the active memory of VMs [29], [46]. Alternatively, [47] developed a lightweight, accurate, and transparent prediction based mechanism to enable more customizable and efficient ballooning policies for rebalancing memory resources among VMs.

Other relevant research efforts concern shared memory based performance optimization. The most representative work is distributed shared memory (DSM) [48], [49], [50], [51], which benefits applications by providing them with a globally shared virtual memory even though they execute on separate physical nodes. MemPipe [52] designed a dynamic shared memory management system for high performance network I/O among virtual machines (VMs) located on the same host.

To the best of our knowledge, MemLego is the first effort that explores the efficient use of host idle memory as a shared memory resource to provide lightweight, on-demand memory allocation, fast and hybrid memory swapping, and shared memory pipes for optimizing co-resident inter-VM communication.

VIII. CONCLUSION

Dynamic, demand-driven memory slicing is an important problem for improving memory efficiency and enhancing VM execution performance in the presence of imbalanced memory utilization and idle memory across VMs on the same host, or even across a cluster. In this paper, we presented MemLego, a lightweight, shared memory based optimization system that demonstrates the potential of efficient shared memory management towards demand-driven memory slicing. We presented three shared memory based mechanisms through MemLego and demonstrated how shared memory can be efficiently orchestrated to provide on-demand memory slicing, as well as its impact on improving memory efficiency and VM execution performance for memory-intensive applications. MemExpand offers elastic shared memory allocation and deallocation in the presence of changing workloads. MemSwap offers a shared memory paging facility that can significantly improve memory swapping efficiency and speed up throughput recovery of VM execution. MemPipe provides shared memory pipes for high-performance communication between co-resident VMs. All three shared memory based mechanisms demonstrate the feasibility and effectiveness of non-intrusive, demand-driven memory slicing and its practical impact on improving memory efficiency and VM execution performance for memory-intensive applications.

REFERENCES

[1] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, “Disaggregated memory for expansion and sharing in blade servers,” in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 267–278.


[2] “Advances in memory management in a virtual environment,”https://oss.oracle.com/projects/tmem/dist/documentation/presentations/MemMgmtVirtEnv-LPC2010-Final.pdf.

[3] O. Mutlu and L. Subramanian, “Research problems and opportunities in memory systems,” Supercomputing Frontiers and Innovations, vol. 1, no. 3, p. 19, 2014.

[4] “Redis,” https://redis.io/.

[5] “Memcached,” https://memcached.org/.

[6] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, vol. 10, pp. 10–10, 2010.

[7] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop. Packt Publishing Ltd, 2013.

[8] N. Garg, Apache Kafka. Packt Publishing Ltd, 2013.

[9] Y. Chen, A. S. Ganapathi, R. Griffith, and R. H. Katz, “Analysis and lessons from a publicly available Google cluster trace,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2010-95, vol. 94, 2010.

[10] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, “Heterogeneity and dynamicity of clouds at scale: Google trace analysis,” in Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012, p. 7.

[11] K. K.-W. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu, “Improving DRAM performance by parallelizing refreshes with accesses,” in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014, pp. 356–367.

[12] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case for exploiting subarray-level parallelism (SALP) in DRAM,” in Computer Architecture (ISCA), 2012 39th Annual International Symposium on. IEEE, 2012, pp. 368–379.

[13] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, “Memory power management via dynamic voltage/frequency scaling,” in Proceedings of the 8th ACM International Conference on Autonomic Computing. ACM, 2011, pp. 31–40.

[14] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.

[15] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, and R. Bianchini, “MemScale: active low-power modes for main memory,” ACM SIGARCH Computer Architecture News, vol. 39, no. 1, pp. 225–238, 2011.

[16] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency DRAM: A low latency and low cost DRAM architecture,” in High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. IEEE, 2013, pp. 615–626.

[17] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu, “Adaptive-latency DRAM: Optimizing DRAM timing for the common-case,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 489–501.

[18] G. Pekhimenko, V. Seshadri, O. Mutlu, T. C. Mowry, P. B. Gibbons, and M. A. Kozuch, “Base-delta-immediate compression: A practical data compression mechanism for on-chip caches,” PACT-21, 2012.

[19] G. Pekhimenko, T. C. Mowry, and O. Mutlu, “Linearly compressed pages: A main memory compression framework with low complexity and low latency,” in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 2012, pp. 489–490.

[20] G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Exploiting compressed block size as an indicator of future reuse,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 51–63.

[21] “HP The Machine project,” https://www.labs.hpe.com/the-machine.

[22] “Intel Rack Scale Architecture in action,” https://www.intel.com/content/www/us/en/architecture-and-technology/rsa-demo-x264.html.

[23] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, “Network requirements for resource disaggregation,” in OSDI, vol. 16, 2016, pp. 249–264.

[24] S. Han, N. Egi, A. Panda, S. Ratnasamy, G. Shi, and S. Shenker, “Network support for resource disaggregation in next-generation datacenters,” in Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks. ACM, 2013, p. 10.

[25] M. R. Hines, A. Gordon, M. Silva, D. Da Silva, K. D. Ryu, and M. Ben-Yehuda, “Applications know best: Performance-driven memory overcommit with Ginkgo,” in Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, 2011, pp. 130–137.

[26] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Geiger: monitoring the buffer cache in a virtual machine environment,” in ACM SIGOPS Operating Systems Review, vol. 40, no. 5. ACM, 2006, pp. 14–24.

[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar, “Dynamic tracking of page miss ratio curve for memory management,” in ACM SIGOPS Operating Systems Review, vol. 38, no. 5. ACM, 2004, pp. 177–188.

[28] W. Zhao, Z. Wang, and Y. Luo, “Dynamic memory balancing for virtual machines,” ACM SIGOPS Operating Systems Review, vol. 43, no. 3, pp. 37–47, 2009.

[29] C. A. Waldspurger, “Memory resource management in VMware ESX server,” ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 181–194, 2002.

[30] T.-I. Salomie, G. Alonso, T. Roscoe, and K. Elphinstone, “Application level ballooning for efficient server consolidation,” in Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013, pp. 337–350.

[31] N. Amit, D. Tsafrir, and A. Schuster, “VSwapper: A memory swapper for virtualized environments,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2014, pp. 349–366.

[32] J. Ahn, C. Kim, J. Han, Y.-r. Choi, and J. Huh, “Dynamic virtual machine scheduling in clouds for architectural shared resources,” in Presented as part of the, 2012.

[33] J. Rao, K. Wang, X. Zhou, and C.-Z. Xu, “Optimizing virtual machine scheduling in NUMA multicore systems,” in High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. IEEE, 2013, pp. 306–317.

[34] C.-C. Lin, P. Liu, and J.-J. Wu, “Energy-aware virtual machine dynamic provision and scheduling for cloud computing,” in Cloud Computing (CLOUD), 2011 IEEE International Conference on. IEEE, 2011, pp. 736–737.

[35] “Frontswap,” https://www.kernel.org/doc/Documentation/vm/frontswap.txt.

[36] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010, pp. 143–154.

[37] Q. Zhang, “Dynamic shared memory architecture, systems, and optimizations for high performance and secure virtualized cloud,” Ph.D. dissertation, Georgia Institute of Technology, 2017.

[38] Q. Zhang, L. Liu, G. Su, and A. Iyengar, “MemFlex: A shared memory swapper for high performance VM execution,” IEEE Transactions on Computers, vol. 66, no. 9, pp. 1645–1652, 2017.

[39] “SoftLayer,” http://www.softlayer.com/.

[40] “Memtier benchmark,” https://github.com/RedisLabs.

[41] D. Magenheimer, C. Mason, D. McCracken, and K. Hackel, “Transcendent memory and Linux,” in Proceedings of the Linux Symposium, 2009, pp. 191–200.


[42] J. R. Lange and P. Dinda, “SymCall: Symbiotic virtualization through VMM-to-guest upcalls,” in ACM SIGPLAN Notices, vol. 46, no. 7. ACM, 2011, pp. 193–204.

[43] M. Rosenblum, “VMware's virtual platform,” in Proceedings of Hot Chips, vol. 1999, 1999, pp. 185–196.

[44] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, “kvm: the Linux virtual machine monitor,” in Proceedings of the Linux Symposium, vol. 1, 2007, pp. 225–230.

[45] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the art of virtualization,” ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 164–177, 2003.

[46] “Understanding Memory Resource Management in VMwarevSphere 5.0,” VMware Technical White Paper, August 2011.

[47] Q. Zhang, L. Liu, J. Ren, G. Su, and A. Iyengar, “iBalloon: Efficient VM memory balancing as a service,” in Web Services (ICWS), 2016 IEEE International Conference on. IEEE, 2016, pp. 33–40.

[48] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, “TreadMarks: Shared memory computing on networks of workstations,” Computer, vol. 29, no. 2, pp. 18–28, 1996.

[49] M. J. Feeley, W. E. Morgan, E. Pighin, A. R. Karlin, H. M. Levy, and C. A. Thekkath, Implementing global memory management in a workstation cluster. ACM, 1995, vol. 29, no. 5.

[50] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach, CRL: High-performance all-software distributed shared memory. ACM, 1995, vol. 29, no. 5.

[51] M. Kistler and L. Alvisi, “Improving the performance of software distributed shared memory with speculation,” Parallel and Distributed Systems, IEEE Transactions on, vol. 16, no. 9, pp. 885–896, 2005.

[52] Q. Zhang and L. Liu, “Workload adaptive shared memory management for high performance network I/O in virtualized cloud,” IEEE Transactions on Computers, vol. 65, no. 11, pp. 3480–3494, 2016.
