
Accelerating The Cloud with Heterogeneous Computing

Sahil Suneja, Elliott Baron, Eyal de Lara, Ryan Johnson

University of Toronto

Abstract

Heterogeneous multiprocessors that combine multiple CPUs and GPUs on a single die are poised to become commonplace in the market. As seen recently from the high performance computing community, leveraging a GPU can yield performance increases of several orders of magnitude. We propose using GPU acceleration to greatly speed up cloud management tasks in VMMs. This is only becoming possible now that the GPU is moving on-chip, since the latency across the PCIe bus was too great to make fast, informed decisions about the state of a system at any given point. We explore various examples of cloud management tasks that can greatly benefit from GPU acceleration. We also tackle tough questions of how to manage this hardware in a multi-tenant system. Finally, we present a case study that explores a common cloud operation, memory deduplication, and show that GPU acceleration can improve the performance of its hashing component by a factor of over 80.

1 Introduction

Over the last decade, Graphics Processing Units (GPUs) have been extensively used to speed up the performance of applications from a wide range of domains beyond image processing, including bioinformatics, fluid dynamics, computational finance, weather and ocean modeling, data mining, analytics, and databases, among others.

We argue that GPUs could be used to greatly accelerate common systems and management tasks in cloud environments, such as page table manipulation during domain creation and migration, memory zeroing, memory deduplication, and guest domain virus scanning. This class of tasks has so far not been amenable to GPU acceleration due to the need to perform DMA transfers over the PCIe bus associated with traditional discrete GPUs. We argue that the rollout of heterogeneous architectures, such as the AMD Fusion and Intel Sandy Bridge, which include a GPU on-socket with direct access to main memory (as shown in Figure 1), is a game changer that motivates a re-evaluation of how system-level tasks are implemented in cloud environments.

With heterogeneous architectures rapidly becoming mainstream, we will soon begin to see these processors in data centers.

[Figure 1: AMD Fusion architecture [1]. The APU places CPU cores and GPU cores on a single die with a shared memory controller and system bus; labeled components: Main Memory, GPU Cores, CPU Cores, Memory Controller, System Bus, PCIe Bus.]

The GPU cores they provide will certainly not be used to display graphics in a data center, and are unlikely to be explicitly programmed for general-purpose computation by a client's software. This means that these GPU cores are likely to be idle or under-utilized in data centers. We argue that these GPU cores can be leveraged to offload administrative tasks in the data center. Previous work demonstrates that intelligently scheduling a task's execution between both GPU and CPU cores can yield significantly better performance and energy efficiency than strictly using CPU or GPU cores alone [8]. Additionally, since the GPU cores will have their own cache hierarchy, any memory-intensive tasks will not pollute the CPUs' caches.

In this paper, we give examples of system-level operations in a virtualization-based cloud that can benefit from GPU acceleration, and we discuss the challenges associated with efficiently sharing and managing the GPU in a cloud environment. To illustrate the potential benefits of GPU acceleration of common cloud management tasks, we conduct a case study which uses GPU acceleration to speed up the hashing component of a memory deduplication task. Experiments with two off-the-shelf discrete GPU cards (a heterogeneous chip with an on-socket GPU, such as the AMD Fusion, was not commercially available at the time we ran our experiments) show that when memory transfer time is not included, it is possible to achieve speedups of more than 80 times over sequential CPU hashing. While not definitive, these results give a good indication of the potential performance that an on-socket GPU with direct access to main memory could achieve. Even with the current memory transfer limitations, we observed a 6x speedup.


2 Accelerating The Cloud

This section details several examples of data-parallel cloud management tasks that a VMM routinely performs, which can benefit from GPU acceleration.

Memory Cleanup: When a VMM destroys a VM, it is expected that none of the potentially sensitive information residing in the VM's memory is leaked when it is reused. Thus, at some point before the memory is to be allocated to a new VM, it should be cleared out. Using a GPU with direct access to the memory in question, a large number of memory pages could be zeroed in parallel. This would free the CPU to perform other hypervisor management tasks or service guest VMs, and allow these pages to be reused much sooner.
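As a rough illustration, the sketch below (in CUDA) zeroes a batch of 4 KB pages with one thread block per page. It assumes the pages to be scrubbed are visible to the GPU; on a Fusion-like APU they would simply be host memory, and a device buffer stands in here. The function and buffer names are ours, not part of any VMM.

// Minimal sketch: parallel page scrubbing, one block per 4 KB page.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void zero_pages(unsigned char *pages, size_t page_size) {
    unsigned char *page = pages + (size_t)blockIdx.x * page_size;
    for (size_t i = threadIdx.x; i < page_size; i += blockDim.x)
        page[i] = 0;                        // threads stride across the page
}

int main() {
    const size_t PAGE = 4096, NPAGES = 25600;       // 100 MB worth of pages
    unsigned char *buf;
    cudaMalloc((void **)&buf, PAGE * NPAGES);
    zero_pages<<<NPAGES, 256>>>(buf, PAGE);         // one block per page
    cudaDeviceSynchronize();
    printf("scrubbed %zu pages\n", NPAGES);
    cudaFree(buf);
    return 0;
}

Wider (e.g. 16-byte) stores would make better use of memory bandwidth; the byte loop is kept for clarity.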

Batch Page Table Updates: These updates involveremapping a guest OS’ pseudo-physical page framenumbers to the actual machine page frame numbers asthe guest VM is transported to a different physical host,such as in VM migration and cloning, or is suspendedand resumed at a later point in time either on the same ora different host machine. This mapping process is alsorequired upon the creation of a fresh VM. These pagetable updates are all independent of each other, and canbe accelerated via the SIMD (Single Instruction, Multi-ple Data) processing capabilities of a GPU.
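The CUDA sketch below shows the data-parallel structure of such a batch rewrite. The 64-bit PTE layout (frame number above bit 12, flag bits below) and the p2m table are deliberate simplifications of Xen's real structures, used only to show that every entry can be updated independently.

// Sketch: rewrite each PTE's pseudo-physical frame to the new machine frame.
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void remap_ptes(uint64_t *ptes, size_t n_ptes,
                           const uint64_t *p2m, size_t n_pfns) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_ptes) return;
    uint64_t pte   = ptes[i];
    uint64_t flags = pte & 0xFFFULL;   // present, writable, ... (assumed layout)
    uint64_t pfn   = pte >> 12;        // guest pseudo-physical frame number
    if (pfn < n_pfns)
        ptes[i] = (p2m[pfn] << 12) | flags;   // machine frame on the new host
}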

Memory Hashing: Offloading compute-intensive hashing to a GPU provides a double benefit. Not only does it lower CPU overhead, but the GPU can process orders of magnitude more data per unit time; this acceleration would prove advantageous to at least two aspects of the cloud infrastructure: effective memory deduplication through page sharing, and improved flexibility in VM migration and cloning.

Page sharing [2, 5, 10] allows over-provisioning of physical memory in a virtualized environment by exploiting homogeneity in cloud workloads, i.e., multiple co-located VMs running the same operating system and similar application stacks. Page sharing eliminates all but one copy of each duplicated page, modifying guest address translations to point to that single master copy. One popular technique scans the system periodically and builds a hash table of all memory pages [2, 10]; pages which hash to the same hash table index are candidates for page sharing. Faster hashing allows more frequent scans, thereby discovering more opportunities for page sharing.
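Once the digests exist, candidate detection is simple bookkeeping. The host-side sketch below (ordinary C++, as it would appear in the host part of a CUDA program; all names are ours) assumes the 16-byte per-page digests have already been produced by the GPU and groups page indexes by digest; any bucket with more than one page holds sharing candidates, which a real system would still byte-compare before actually sharing.

// Sketch: group pages by digest to find sharing candidates.
#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Digest = std::array<uint8_t, 16>;

std::vector<std::vector<size_t>> share_candidates(const std::vector<Digest> &digests) {
    std::unordered_map<std::string, std::vector<size_t>> buckets;
    for (size_t page = 0; page < digests.size(); ++page) {
        const Digest &d = digests[page];
        buckets[std::string(d.begin(), d.end())].push_back(page);  // key by digest bytes
    }
    std::vector<std::vector<size_t>> candidates;
    for (auto &kv : buckets)
        if (kv.second.size() > 1)          // two or more pages hash alike
            candidates.push_back(kv.second);
    return candidates;
}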

Similarly, accelerated page hashing (digest generation) enables an efficient realization of content-addressable storage (CAS), which can be used to accelerate VM migration and cloning. Through the high-speed page matching offered by CAS, it may not be necessary to request the network transfer of all memory pages when rebuilding the working memory set of a cloned or migrated VM; pages might be locally available on the host system, and a rapid local retrieval may thus satisfy the page requirement.

Memory Compression: Another technique for saving memory is to compress infrequently accessed pages. The data-parallel nature of this operation at the memory page level makes it yet another candidate for GPU acceleration, allowing extra memory to become available faster.

Virus Signature Scanning: Searching for virus signatures in physical memory or in incoming network data packets is another service that a hypervisor may provide to all its guests. Scanning memory or packet data for pattern matching of known signatures can be accelerated considerably by GPU processing, leading to more aggressive security monitoring [7, 9].
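A minimal sketch of the data-parallel structure is shown below in CUDA: each thread tests one candidate offset against a single signature held in constant memory. Production scanners match many signatures at once with string-matching automata; this only illustrates why the scan parallelizes well, and all names are ours.

// Sketch: naive parallel signature scan, one candidate offset per thread.
#include <cuda_runtime.h>
#include <stdint.h>

__constant__ uint8_t d_sig[16];    // signature bytes, set via cudaMemcpyToSymbol
__device__ unsigned int d_hits;    // match count, read via cudaMemcpyFromSymbol

__global__ void scan_signature(const uint8_t *mem, size_t len, int sig_len) {
    size_t off = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (off + sig_len > len) return;
    for (int j = 0; j < sig_len; ++j)
        if (mem[off + j] != d_sig[j]) return;   // mismatch: this offset is clean
    atomicAdd(&d_hits, 1u);                     // record a match
}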

3 Managing a Virtualized GPU

This section outlines various hardware management challenges that arise in heterogeneous processors. We show that a VMM can address these challenges elegantly on a multi-tenant system. While on-chip graphics are the present focus, this discussion is certainly relevant to other kinds of accelerators we may see in future processors.

Exploiting processor heterogeneity for hypervisor-level management tasks will usually require escalated privileges. Thus, an important challenge lies in sharing hardware resources effectively between privileged systems code (hypervisor and control-domain functions) and a non-privileged guest's applications. Further, any such sharing must maintain security and performance isolation across multiple guest VMs. In particular, a rogue or buggy process must not interfere with the confidentiality, integrity, or availability of another that happens to share the same GPU.

An easy, but suboptimal, solution is to restrict access to the accelerator to just the hypervisor or a privileged control domain, such as Xen's Dom0. Security would not be a concern, as the GPU hardware is not even exposed to unprivileged guests, but these resources would likely remain underutilized.

A better alternative would expose the GPU to the guests in time slices. The VMM can use traditional CPU scheduling techniques to manage execution on the GPU, with fine-grained control over the amount of time it allocates for itself and its guests. Furthermore, tasks cannot interfere with each other, as the device is allocated to only one task at a time. Again, however, the GPU may go underutilized if that task does not make full use of the resources it was granted. Although only one task runs at a time, it is also worth noting that the VMM will still need to impose memory protection on this task. For instance, the VMM must prevent tasks from accessing data left behind by previously completed or preempted tasks.

A superior policy might introduce space multiplexing, partitioning GPU resources concurrently among tasks from several guests. Space-based multiplexing could be implemented in several ways. If hardware support for virtualization were to become available on the GPU, then individual GPU slices could be mapped directly onto guests. For this to be possible, the GPU would have to police memory accesses itself. The VMM would then be in charge of managing slice allocation.

Given the current lack of hardware support for virtualization on existing GPUs, an alternative software-based approach would have the VMM, or rather its privileged domain, arbitrate access to the GPU by virtualizing the General Purpose GPU (GPGPU) API. However, the VMM would also have to provide mechanisms to support simultaneous execution of multiple compute kernels. CUDA 3.0 [6], for example, enables simultaneous execution of tasks, but provides no memory protection to isolate them from each other.
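For concreteness, the CUDA sketch below enqueues two kernels on separate streams, the mechanism through which concurrent kernel execution is exposed. Nothing in the programming model prevents kernel_a from reading or writing buf_b, which is exactly the missing isolation discussed above. The kernels and buffers are placeholders of our own.

// Sketch: two "guests'" kernels on separate streams; they may run concurrently.
#include <cuda_runtime.h>

__global__ void kernel_a(int *buf) { buf[threadIdx.x] = 1; }   // guest A's work
__global__ void kernel_b(int *buf) { buf[threadIdx.x] = 2; }   // guest B's work

int main() {
    int *buf_a, *buf_b;
    cudaMalloc((void **)&buf_a, 256 * sizeof(int));
    cudaMalloc((void **)&buf_b, 256 * sizeof(int));

    cudaStream_t s_a, s_b;
    cudaStreamCreate(&s_a);
    cudaStreamCreate(&s_b);

    kernel_a<<<1, 256, 0, s_a>>>(buf_a);
    kernel_b<<<1, 256, 0, s_b>>>(buf_b);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s_a);
    cudaStreamDestroy(s_b);
    cudaFree(buf_a);
    cudaFree(buf_b);
    return 0;
}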

In a software-based implementation, the guest VM would access the GPU by sending the code of the kernel to run and any associated parameters to the VMM, which would then determine which of the kernels that have been sent to it can execute simultaneously. To perform this evaluation, the VMM must be able to determine which regions of memory the kernel attempts to access. This will likely require support from the kernel compiler, and may also demand a more restrictive API for the guest (e.g., disallowing arbitrary memory access via pointer arithmetic). Once the VMM is certain of the memory the kernel will access, it can map kernels to the GPU's execution units in a way that is both safe and gives high occupancy. Finally, the VMM enqueues the kernels using the GPGPU API. If the API supports concurrent kernel execution, the VMM simply enqueues the kernels directly. If it does not, a possible workaround is for the VMM to recompile the kernels into a single "meta-kernel," which uses thread indexes to execute each kernel's code on a subset of the hardware.
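A hand-written sketch of such a meta-kernel is shown below in CUDA. In the scheme described above, the VMM's recompiler would generate this fusion automatically; here the two guest kernels (guest_a_body, guest_b_body) are trivial placeholders of our own, and the block index partitions the hardware between them.

// Sketch: two guest kernels fused into one launch, split by block index.
#include <cuda_runtime.h>

__device__ void guest_a_body(float *data, int i) { data[i] *= 2.0f; }
__device__ void guest_b_body(float *data, int i) { data[i] += 1.0f; }

__global__ void meta_kernel(float *a_data, int a_blocks, float *b_data) {
    int t = threadIdx.x;
    if (blockIdx.x < a_blocks)                      // guest A's share of blocks
        guest_a_body(a_data, blockIdx.x * blockDim.x + t);
    else                                            // remaining blocks go to guest B
        guest_b_body(b_data, (blockIdx.x - a_blocks) * blockDim.x + t);
}

// Launch example: guest A gets 4 blocks, guest B gets 12:
//   meta_kernel<<<16, 256>>>(a_data, 4, b_data);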

An advantage of using the VMM to control access to the GPU is that the VMM has a global view of the system. As such, the VMM can make informed decisions about scheduling time and space on the GPU to achieve better system performance. In contrast, in the case of an entirely hardware-managed virtualized GPU, where each guest is assigned a slice, these slices may go unused if the guest does not have work to perform at that time.

4 Case Study: VM Page Sharing

To illustrate the benefits and challenges of accelerating hypervisor functions, we evaluated the use of GPUs to optimize the MD5 hashing component of a page sharing task.

[Figure 2: The OpenCL memory hierarchy. Work-items with private memory are grouped into compute units, each with its own low-latency shared local memory; work-groups are scheduled to compute units. Global memory, in compute device memory, is off-chip and has much higher latency.]

Since a heterogeneous chip with an on-socket GPU, e.g., AMD Fusion, was not commercially available at the time of writing, we conducted our case study using two off-the-shelf discrete GPU cards. While this experimental setup is not ideal, we believe that by isolating the memory access and computation components of the task we can get a rough estimate of the potential performance of an on-socket GPU with direct access to main memory.

4.1 General Purpose GPU Computing

We use OpenCL, a vendor-neutral heterogeneous computing framework, for GPGPU computing. In OpenCL, a compute kernel defines the data-parallel operation we wish to apply to the input data. Kernels are written as C functions, but when invoked they are executed across a range of GPU threads, or work-items, grouped together into work-groups. All work-items within a work-group can access a faster shared on-chip local memory (see Figure 2). A kernel can be thought of as the body of a nested loop, where the outer loop corresponds to work-groups and the inner loop corresponds to work-items.
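The analogy can be made concrete with a small example, written in CUDA purely for brevity (a CUDA thread block corresponds to an OpenCL work-group and a CUDA thread to a work-item); the kernel and the commented loop nest compute the same thing.

// Sketch: the kernel body is what each (work-group, work-item) pair executes.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // (group, item) flattened
    if (i < n)
        data[i] *= factor;
}

// Sequential equivalent of the launch scale<<<num_groups, group_size>>>(...):
// for (int group = 0; group < num_groups; ++group)      // outer loop: work-groups
//     for (int item = 0; item < group_size; ++item) {   // inner loop: work-items
//         int i = group * group_size + item;
//         if (i < n) data[i] *= factor;
//     }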

4.2 GPGPU Page Sharing

We ported the MD5 implementation by Juric [4] from CUDA to OpenCL. The MD5 algorithm performs a sequence of 64 operations on each 512-bit (64-byte) chunk of the input. The resulting hash of one chunk is then fed as an input to the algorithm as it processes the next chunk. Each 4 KB memory page can be viewed as 64 chunks of 64 bytes each.
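The chaining structure is sketched below. The compression step shown is a trivial placeholder, not MD5's real 64-operation round; the point is only that chunk c cannot be processed until the state produced by chunk c-1 is available.

// Sketch of the sequential chunk chaining that limits per-page parallelism.
#include <stdint.h>

struct HashState { uint32_t a, b, c, d; };

// Placeholder for MD5's compression function: NOT real MD5, present only so
// the chaining structure below is complete.
__device__ void compress_chunk(HashState *s, const uint8_t *chunk) {
    for (int i = 0; i < 64; ++i)
        s->a = (s->a ^ chunk[i]) * 16777619u;   // toy mixing step
}

// One thread hashes an entire 4 KB page: each chunk depends on the previous one.
__device__ void hash_page_sequential(const uint8_t *page, HashState *out) {
    HashState s = {0x67452301u, 0xefcdab89u, 0x98badcfeu, 0x10325476u};  // MD5 IV
    for (int c = 0; c < 64; ++c)               // 4 KB = 64 chunks of 64 B
        compress_chunk(&s, page + c * 64);     // serial dependence on s
    *out = s;
}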

The need to feed the hash output of one chunk into the next severely constrains the granularity of parallelism, limiting the assignment to a single work-item per memory page.


[Figure 3: Our parallel hashing algorithm operating on a single 4 KB page. Kernel 1 hashes 64-byte chunks into intermediate values. Kernel 2 hashes the intermediate values into the final value.]

To improve parallelism, we adopted optimizations proposed by Hu et al. [3] for hierarchical hashing. The optimized algorithm has a much finer granularity of one 64-byte chunk per work-item, enabling 64 work-items to operate in parallel on one memory page. While the value of the hash generated for a page differs from the standard MD5 hash, its cryptographic strength is maintained.

Assigning one chunk per work-item leads to three rounds of hashing: 4 KB is reduced to 64 16-byte hashes (1024 bytes), which are in turn reduced to 256-byte and finally 64-byte hashes. The number of successive kernel (hashing) calls can be reduced by using multiple chunks per work-item. Figure 3 demonstrates hashing a page with 4 chunks per work-item in 2 kernel rounds.

We present two versions of the kernel, as it is the GPU architecture that drives the optimizations needed to extract maximum parallelization benefit, and no single implementation performed best on both of the GPUs we used. The kernel versions are: glmem, which accesses the global device memory directly; and shmem, which makes use of the faster per-work-group local (shared) memory.
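The difference between the two versions is sketched below in CUDA terms (CUDA shared memory plays the role of OpenCL local memory). The per-chunk work is a trivial placeholder rather than our MD5 code; only the memory access pattern differs between the two kernels.

// Sketch: glmem-style vs shmem-style access, one 64-byte chunk per thread.
#include <stdint.h>

// Placeholder for the real per-chunk hashing work.
__device__ uint32_t process_chunk(const uint8_t *chunk) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < 64; ++i)
        h = (h ^ chunk[i]) * 16777619u;
    return h;
}

// glmem-style: every thread reads its chunk straight from global memory.
__global__ void hash_glmem(const uint8_t *pages, uint32_t *out) {
    int chunk_id = blockIdx.x * blockDim.x + threadIdx.x;
    out[chunk_id] = process_chunk(pages + (size_t)chunk_id * 64);
}

// shmem-style: the block first stages its chunks into on-chip shared memory
// (launched with blockDim.x * 64 bytes of dynamic shared memory).
__global__ void hash_shmem(const uint8_t *pages, uint32_t *out) {
    extern __shared__ uint8_t staged[];
    int chunk_id = blockIdx.x * blockDim.x + threadIdx.x;
    const uint8_t *block_src = pages + (size_t)blockIdx.x * blockDim.x * 64;
    for (int i = threadIdx.x; i < blockDim.x * 64; i += blockDim.x)
        staged[i] = block_src[i];                 // cooperative copy into shared memory
    __syncthreads();
    out[chunk_id] = process_chunk(staged + threadIdx.x * 64);
}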

4.3 Hypervisor Integration

Integration of the accelerated hashing scheme with a virtualization environment is left for future work. We anticipate that this task would be best realized outside the hypervisor, as a user-space program in the privileged domain. This would avoid expanding the hypervisor's trusted code base and having the hypervisor depend on any GPGPU framework. The user-space process would be responsible for managing hash tables and driving GPU computation and communication, while the actual sharing of pages via pseudo-physical-to-machine address remapping, and unsharing using copy-on-write semantics, would still be handled by the hypervisor.

                        GTX 280    HD 5870
Processing cores        240        1600
Global memory size      1 GB       1 GB
Max work-group size     512        256
Wavefront size          32         64
Local memory size       16 KB      32 KB

Table 1: Testbed specifications.

5 Experiments

We conducted our experiments on two GPUs: an NVIDIA GeForce GTX 280 (GT200) and an ATI Radeon HD 5870. The system hosting the NVIDIA GPU contains a 2.83 GHz Intel Core 2 Quad CPU (Q9550). The ATI GPU is hosted on a system with a 2.80 GHz Intel Core i5-760 CPU. Table 1 presents a few relevant specifications of the two GPUs we used.

The benefit of offloading hash computation to the GPU is two-fold. First, faster hashing allows for a greater memory scan frequency, thereby exploiting more opportunities for page sharing. Second, the computation offload results in a reduced load on the CPU, allowing it to better service the guest VMs.

5.1 Speedup

We measure the time it takes to hash 100 MB of randomly generated data, consisting of 25,600 4 KB pages of memory. For comparison, our baseline sequential MD5 implementation takes 346 ms and 314 ms to complete when running on the main CPUs of the systems hosting the NVIDIA and ATI cards, respectively.

Tables 2 and 3 show the speedups we obtained over the sequential CPU hashing implementation for the NVIDIA and ATI cards, respectively, when the data to be hashed is pre-copied to the GPU's global memory, i.e., the memory transfer time is not included.

Implementation    Work-group size    Chunks per work-item    Speedup
glmem             512                1                       19.5x
shmem             240                1                       40x
glmem             512                4                       38.5x

Table 2: Results on NVIDIA GeForce GTX 280.

Implementation    Work-group size    Chunks per work-item    Speedup
glmem             256                1                       63x
shmem             64                 1                       54x
glmem             256                2                       87x

Table 3: Results on ATI Radeon HD 5870.

On the NVIDIA GPU, the best speedup achieved was 40x over sequential hashing, using the shmem kernel implementation, while a glmem kernel yielded the highest speedup of 87x on the ATI GPU. The difference in performance between the two GPUs is a result of architecture-specific bottlenecks.

When the memory transfer times are included in our total running times, the speedups observed are about 6x for the NVIDIA GPU and 2x for the ATI GPU. The reduced speedup is due to the overhead of transferring data from host (CPU) memory to device (GPU) memory before the GPU kernel initiates computation. Processors like AMD Fusion, which house a CPU and GPU on a single die, will enable the GPU to access host memory directly. On the other hand, we expect the integrated GPUs to be less powerful than their discrete off-chip counterparts, at least for the initial architectures. Thus, we expect the on-chip performance to lie somewhere between these lower (2x-6x) and upper (40x-87x) bounds.

5.2 Hashing Overhead

Table 4 shows the overhead imposed by the CPU and GPU versions of the hashing process when they execute concurrently with a computationally intensive process, cpu-heavy, on the machine hosting the ATI GPU. The cpu-heavy process performs 400 million floating-point multiplications. Concurrently, in the background, we hash 1.5 GB of memory. The experiment gpu-hash-i first copies the input data to the GPU, and only then does cpu-heavy begin its execution; thus, gpu-hash-i does not include the memory transfer overhead. The other experiment, gpu-hash-ii, begins executing cpu-heavy immediately and thus includes the CPU-to-GPU memory transfer overhead. We expect the performance of a Fusion-like architecture to lie somewhere between what is reflected in experiment gpu-hash-i and in gpu-hash-ii, which represents the current state of the art.

Experiment     CPU version    GPU version    Relative difference
gpu-hash-i     50.0%          10.7%          78.5%
gpu-hash-ii    50.3%          24.8%          50.8%

Table 4: Runtime overhead on the cpu-heavy process.

As the results indicate, offloading the background hash computation to the GPU greatly reduces the load on the CPU. The lower overhead frees the main CPUs to better service the guest VMs, while the GPU facilitates much more aggressive page sharing and thus efficient memory deduplication in the cloud.

6 Conclusion

With new heterogeneous multiprocessors on the horizon, we argue for their potential to benefit various aspects of virtualization-driven clouds. To summarize, incorporating GPU processing into the cloud infrastructure benefits cloud service providers by accelerating hypervisor-level management tasks, increasing the flexibility of the cloud infrastructure through improved VM migration and cloning, and providing better resource utilization. This enables higher server consolidation ratios, lowers costs by allowing idle servers to be shut down, and could decrease the size of the data center. Users experience improved VM performance and lower usage costs due to better resource consolidation on the cloud hosts (in a pay-per-use cloud service model).

Our case study explores in detail the benefits of using heterogeneous hardware for memory deduplication in the cloud via virtual machine page sharing. While we experiment with discrete graphics processors, we expect comparable results for on-chip graphics processing, without the latency issues of a discrete processor.

We are working towards integrating our GPGPU-driven memory hashing implementation as a scanning-based page sharing module on top of Satori's copy-on-write disk sharing implementation in Xen. In the future, we plan to validate our findings on a heterogeneous architecture like Fusion, as well as focus on the various other candidates for specialized hardware acceleration inside the cloud infrastructure. Finally, we will research mechanisms for managing heterogeneous hardware.

References

[1] Advanced Micro Devices. AMD Fusion Family of APUs Technology Overview. http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf, 2010.

[2] D. Gupta, S. Lee, M. Vrable, et al. Difference Engine: Harnessing Memory Redundancy in Virtual Machines. In 8th USENIX Conference on Operating Systems Design and Implementation, 2008.

[3] G. Hu, J. Ma, and B. Huang. High Throughput Implementation of MD5 Algorithm on GPU. In International Conference on Ubiquitous Information Technologies & Applications (ICUT), 2009.

[4] M. Juric. Notes: CUDA MD5 Hashing Experiments. http://majuric.org/software/cudamd5/.

[5] G. Miłos, D. G. Murray, S. Hand, and M. A. Fetterman. Satori: Enlightened Page Sharing. In USENIX Annual Technical Conference, 2009.

[6] NVIDIA. CUDA Toolkit 3.2. http://developer.nvidia.com/object/cuda_3_2_downloads.html.

[7] E. Seamans and E. Alexander. Fast Virus Signature Matching on the GPU. In GPU Gems 3, chapter 35, 2008.

[8] M. Silberstein and N. Maruyama. An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures. In 4th Annual International Systems and Storage Conference, 2011.

[9] R. Smith, N. Goyal, J. Ormont, et al. Evaluating GPUs for Network Packet Signature Matching. In IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[10] C. A. Waldspurger. Memory Resource Management in VMware ESX Server. In 5th Symposium on Operating Systems Design and Implementation, 2002.
