
Analysis and Enhancement for Interactive-Oriented Virtual Machine Scheduling

Yubin Xia, Yan Niu, Yansong Zheng, Ning Jia, Chun Yang, Xu Cheng
Microprocessor Research and Development Center

Peking University

Abstract

Recently, there has been increasing interest in hosting desktop applications in virtual machine environments and accessing them with thin-client devices. However, the interactive performance of such scenarios has not yet been fully investigated. We therefore measure the performance of an interactive desktop system hosted by the Xen VMM. Our experimental results show that with heavy-workload VMs co-existing on the same physical server, a guest domain may suffer poor interactive performance due to occasional large "latency peaks".

We argue that for interactive operations, the large variance of response latency, rather than the length of the latency itself, is the most important performance problem of the virtualized approach. We enhance Xen's Credit scheduler to minimize both the height and the frequency of latency peaks. The results show improvements of up to 97.4% in average latency-peak height and up to 94.2% in latency-peak frequency for a variety of consolidation scenarios.

1. Introduction

Recently, Virtual Machine Monitors (VMMs) have been gaining popularity in enterprise environments as a foundation for shared computing infrastructures. This has led to an emerging interest in hosting desktop applications on VMMs and accessing them with thin-client devices over networks, such as VMware's VDI (Virtual Desktop Infrastructure) [14] and Citrix's XenDesktop [4]. Using a remote display protocol such as X [12], RDP [5], VNC [10], SLIM [13] or THINC [1], a thin-client device connects to a virtual desktop server in the pool, transmits user input to the server, and receives screen updates of the user interface from the server. For large enterprises, such a virtual-machine-based thin-client computing approach offers a more centralized, secure, and manageable computing platform for desktop applications, with the promises of fault containment, session migration, and application compatibility.

Previous studies on VMMs [2] have shown that virtualization of the underlying physical machine can be implemented efficiently. Barham et al. [2] reported that the performance overhead of hosting commodity operating systems on Xen is negligible, at most a few percent compared with the unvirtualized case. However, most designs and evaluations of VMMs take server applications as their target workloads, and the interactive performance of concurrent desktop sessions in a virtualized environment has not received the same attention.

For users of virtual desktop systems, the primary concern is a superior interactive experience. Endo et al. [7] emphasized that response latency is the critical performance criterion for interactive systems. Our previous work [15] showed that stable response latency is necessary for an interactive system, and that large variance in response latency seriously hurts interactive end-users.

The scheduler within the VMM plays a key role in determining the overall fairness and performance characteristics of a VDI system. Traditionally, VMM schedulers have focused on fairly sharing the processor resources among domains while leaving the scheduling of I/O resources as a secondary concern [9]. However, this can cause long and/or unpredictable response latency, which makes virtualization less desirable for interactive applications.

This paper explores the most significant interactive-performance problem in Xen's current scheduler: the latency peak of an operation. We conclude that the scheduler is the main cause of latency peaks and present several enhancements to address it. The results show improvements of up to 97.4% in average latency-peak height and up to 94.2% in latency-peak frequency for a variety of consolidation scenarios.

The rest of the paper is organized as follows. Section 2 discusses related work. We present background material on the Xen VMM and its scheduler in Section 3. Section 4 describes and analyzes the latency-peak phenomenon. Section 5 discusses modifications to the Credit scheduler to enhance interactive performance. We describe our evaluation in Section 6. Finally, we present concluding remarks in Section 7.


2. Related Work

Desktop-server consolidation has become more and more popular recently, e.g., VMware's VDI [14] and Citrix's XenDesktop [4], and interactive performance is important in such scenarios. There has been some research on interactive-oriented scheduling; however, little of it considers the virtualization environment, which is the main focus of our work.

Cherkasova et al. [3] studied the three different schedulers of Xen. Their study differs from this one in that they evaluate how the scheduler divides processing resources between the guest domain and the driver domain. In contrast, this paper focuses on the interactive performance of a guest domain, especially when it co-exists with CPU-intensive domains.

VSched, developed by Bin Lin et al. [8], has a motivation similar to ours. However, while VSched is built on EDF to meet real-time constraints, our approach focuses on enhancing Xen's default Credit scheduler, and we do not impose any real-time constraints.

Diego Ongaro et al. [9] offer insight into the key problems of VMM scheduling for I/O, and our work is mainly motivated by their paper. They pointed out the latency-peak problem but did not provide a solution to it. Our work can be seen as complementary to theirs.

3. Background

We use the Xen VMM [2] in our research and conduct the remaining discussion in its context. Xen effectively consists of two elements: the hypervisor, which provides an abstraction layer between the guest operating systems running in their own domains and the actual hardware, and the driver domain, which provides access to the physical I/O devices.

3.1. Xen’s Network I/O Virtualization

In a thin-client system, all user input and screen updates are transferred through the network. Therefore, the interactive performance of a VDI system depends heavily on the efficiency of network virtualization.

In Xen, each guest domain implements a driver for its virtual NIC, called its frontend driver. The driver domain implements a backend driver, which acts as an intermediary between the frontend drivers and the device driver for the physical NIC. Event channels are used to communicate virtual interrupts, and events are also used for inter-domain communication: after a domain sends an event to another domain, the latter sees and processes the event when it is next scheduled to run. Figure 1 illustrates the process of a guest domain replying to a request.

[Figure 1. Network Virtualization in Xen]

In this paper, we use domain0 as the driver domain by default.

Whenever an event is sent to a domain, the hypervisor wakes the target domain if it is idle. Then, the hypervisor tickles the scheduler to trigger a domain switch. Tickling the scheduler potentially reduces the communication latency for both hardware interrupts and inter-domain notifications by immediately running the domain receiving the event.
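As a concrete illustration, the following is a minimal sketch of this event-delivery path. All names here (struct domain, evtchn_send, scheduler_wake, scheduler_tickle) are our own simplifications for exposition, not the actual Xen source.

    #include <stdbool.h>

    struct domain {
        bool blocked;           /* true while the domain is idle      */
        unsigned long pending;  /* bitmap of pending event channels   */
    };

    /* Assumed helpers, provided elsewhere in this sketch. */
    void scheduler_wake(struct domain *d);   /* put d on a run queue  */
    void scheduler_tickle(void);             /* re-run the scheduler  */

    /* Deliver an event on channel `port` to domain `target`. */
    void evtchn_send(struct domain *target, unsigned int port)
    {
        target->pending |= 1UL << port;      /* mark the virtual IRQ  */

        if (target->blocked) {               /* wake an idle target   */
            target->blocked = false;
            scheduler_wake(target);
        }

        /* Tickling may preempt the running domain immediately,
           shortening event-delivery latency. */
        scheduler_tickle();
    }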

3.2. Xen’s Credit Scheduler

Over the evolution of Xen, three schedulers have been introduced in turn: BVT (Borrowed Virtual Time) [6], SEDF (Simple Earliest Deadline First), and the Credit scheduler [11]. The current version of Xen uses the Credit scheduler by default; it performs better when scheduling on multiprocessors and provides better QoS controls [9].

The Credit scheduler works as follows. Domains have three states: OVER, UNDER, and BOOST. The UNDER state means a domain has credits remaining; the OVER state means a domain has gone over its credit allocation. UNDER domains have higher priority than OVER domains, and domains in the same state are simply run in a first-in, first-out manner. When a domain is scheduled, it is allowed to run for three scheduling intervals (30ms in total) as long as it has sufficient credits.

When a domain receives an event over an event channel while it is blocked, it is woken up and enters the BOOST state, which has the highest priority. This prevents the domain from entering the run queue at the tail and having to wait for all other active domains before being executed. Since the scheduler is tickled when an event is sent, a boosted domain will very likely preempt the current domain and begin running immediately.
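The priority scheme and FIFO selection can be sketched as follows. This is a simplified model for exposition, assuming one FIFO list per priority class; the real scheduler is per-CPU and more involved.

    enum csched_pri { PRI_BOOST = 0, PRI_UNDER = 1, PRI_OVER = 2 };

    struct vcpu {
        enum csched_pri pri;
        int             credits;   /* remaining credits */
        struct vcpu    *next;      /* run-queue link    */
    };

    /* One FIFO list per priority class. */
    static struct vcpu *runq[3];

    struct vcpu *pick_next(void)
    {
        for (int p = PRI_BOOST; p <= PRI_OVER; p++) {
            if (runq[p] != NULL) {
                struct vcpu *v = runq[p];  /* FIFO: take the head */
                runq[p] = v->next;
                return v;                  /* may run up to 30ms  */
            }
        }
        return NULL;                       /* nothing runnable: idle */
    }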


[Figure 2. Peak of ping latency. (a) Ping latency (ms, log scale) of one idle domain with 7 other CPU-intensive domains, over 5000 pings. (b) Frequency and average height of latency peaks increase as the number of domains increases.]

4. Latency Peak

In this section, we use several experiments to demonstrate the latency-peak phenomenon, and we analyze its causes in detail.

4.1. Phenomenon of Latency Peak

In a virtual desktop-server environment, we found that operation latency differs from that of a PC environment. Occasionally, the responses to mouse clicks or key presses lag measurably, especially with other guest domains running concurrently.

We analyze this phenomenon quantitatively with the following experiment. First, we run only one guest domain (not counting domain0). The guest domain stays idle and waits for ping requests from another machine, which pings it every 1s; the ping latencies are recorded. We found that the latencies fall within a certain region, which we call the normal region. This is the baseline environment.

Next, we run several guest domains concurrently on the same server. One of them stays idle, waiting for the other machine's ping requests, while the remaining domains run CPU-intensive applications that keep the CPU busy. The result is shown in Figure 2(a); the x axis is the ping number and the y axis is the ping latency. As the figure shows, there are many peaks of ping latency, which we call latency peaks. We define a latency peak as a latency that falls outside the normal region of the baseline environment.

As Figure 2(b) shows, both the height and the frequency of latency peaks increase as the number of CPU-intensive domains increases. We argue that latency peaks depend heavily on the scheduler, which is responsible for performance isolation between guest domains; the presence of other CPU-intensive domains amplifies the problem.
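For clarity, the two metrics used throughout this paper can be computed as follows. This is our reconstruction of the measurement post-processing, assuming only that normal_max is the upper edge of the normal region from the baseline run.

    /* Average height (ms) and frequency (%) of latency peaks. */
    double peak_stats(const double lat_ms[], int n, double normal_max,
                      double *freq_pct)
    {
        int    peaks = 0;
        double sum   = 0.0;

        for (int i = 0; i < n; i++) {
            if (lat_ms[i] > normal_max) {  /* outside the normal region */
                peaks++;
                sum += lat_ms[i];
            }
        }

        *freq_pct = 100.0 * peaks / n;     /* latency-peak frequency */
        return peaks ? sum / peaks : 0.0;  /* average peak height    */
    }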

4.2. Reasons of Latency Peak

There are three situations in which a latency peak occurs:

• No-boost: When an event is sent to a domain, the domain is blocked but does not qualify for a boost when it wakes up. In the Credit scheduler, a domain that is in the OVER state when it wakes up does not get the BOOST priority, no matter how long it has been blocked (see the sketch after this list). This is unfair: a domain that has occasionally run out of credits should not be punished by a long wait in the run queue after a long block.

• Already-in-runq: When an event is sent to a domain that is already waiting in the run queue, the arrival of the event has no effect on the scheduling of the target domain. To minimize event-delivery delay, the target domain should be scheduled just after the event arrives. It should be given a short time slice to process the event, borrowed from its own CPU time in the next time slice; the next time slice should then be adjusted to keep its original CPU share.

• Preempted: While a domain is processing an event, it is preempted by another domain. The processing is usually quite short; however, if credit accounting (whose period is 30ms) happens to occur within that short window, the domain is inserted into the run queue and has to wait. This situation is rare enough in practice that we do not consider it in this paper.
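The default wake-up path that produces the No-boost situation can be sketched as below, reusing the types from the sketch in Section 3.2. The helper runq_insert_tail is hypothetical.

    void runq_insert_tail(enum csched_pri pri, struct vcpu *v);

    void vcpu_wake_default(struct vcpu *v)
    {
        if (v->pri == PRI_UNDER)        /* still has credits: boost */
            v->pri = PRI_BOOST;
        /* An OVER vcpu keeps its low priority no matter how long it
           was blocked -- the No-boost situation described above.  */
        runq_insert_tail(v->pri, v);    /* FIFO within its class    */
        scheduler_tickle();
    }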


[Figure 3. Comparison of latency peaks under the SEDF and Credit schedulers (latency in ms, log scale, sampled over time).]

4.3. Latency Peak in SEDF Scheduler

As Figure 3 shows, the SEDF scheduler controls latency variance better than the Credit scheduler does. However, the average latency under SEDF is about 10 times longer than under Credit. Considering the other benefits that the Credit scheduler brings, we focus on enhancing the Credit scheduler instead of SEDF.

5. Scheduler Enhancement

Based on the analysis of latency peaks, we make the following four enhancements to the Credit scheduler to improve response latency.

5.1. Credit Bonus

Under the Credit scheduler, I/O-intensive domains often consume their credits more slowly than CPU-intensive domains. In fact, an I/O-intensive domain is not debited any credits if it happens to block before the periodic scheduler interrupt, which occurs every 10ms. However, once a domain does run out of credits, it has to wait in the run queue for hundreds of milliseconds to recharge them before running again, which is the main cause of latency peaks. To reduce this latency, we add a small amount of credit as a bonus every time a VCPU wakes. The longer a domain blocks, the larger the bonus it receives. Block time is tracked in a variable called block_avg: when a VCPU is woken up from a block, its total block time is added to block_avg; when a VCPU gives up the CPU, voluntarily or involuntarily, the time it spent running is subtracted from block_avg. The higher a VCPU's block_avg, the larger its credit bonus.
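A minimal sketch of this enhancement follows. The bonus formula and the BONUS_SCALE constant are illustrative assumptions; only the block_avg bookkeeping mirrors the description above.

    #define BONUS_SCALE 10          /* assumed tuning constant */

    struct vcpu_ext {
        struct vcpu base;           /* from the Section 3.2 sketch  */
        long block_avg;             /* net blocked time (ms)        */
        long block_start;           /* when it last blocked         */
        long run_start;             /* when it last started running */
    };

    void on_wake(struct vcpu_ext *v, long now_ms)
    {
        v->block_avg += now_ms - v->block_start;        /* add block time */
        v->base.credits += v->block_avg / BONUS_SCALE;  /* grant bonus    */
    }

    void on_deschedule(struct vcpu_ext *v, long now_ms)
    {
        /* Voluntary or involuntary: subtract the time spent running. */
        v->block_avg -= now_ms - v->run_start;
        if (v->block_avg < 0)       /* clamp at zero (an assumption) */
            v->block_avg = 0;
    }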

5.2. Sort the Run Queue

As noted above, an I/O-intensive domain is often debited few or no credits while it blocks. However, when it later becomes runnable, its remaining credits have only a limited effect on its place in the run queue: as described in Section 3.2, the number of remaining credits only determines the domain's state, and the domain is always enqueued after the last domain in the same state. Intuitively, sorting the run queue by the number of remaining credits each domain possesses should reduce the latency before an I/O-intensive domain is executed. To evaluate this optimization, we modified the Credit scheduler to insert domains into the run queue based on their remaining credits.
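A sketch of the sorted insertion, again on the simplified singly linked run queue from Section 3.2:

    /* Insert v before the first vcpu holding fewer credits, instead of
       always appending at the tail.  I/O-bound vcpus, which retain more
       credits, therefore land near the front of their priority class. */
    void runq_insert_sorted(struct vcpu **head, struct vcpu *v)
    {
        while (*head != NULL && (*head)->credits >= v->credits)
            head = &(*head)->next;      /* skip vcpus with >= credits */
        v->next = *head;
        *head   = v;
    }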

5.3. Borrow Time from Future

This enhancement aims to reduce latency in situation 2 of Section 4.2. When an event arrives, instead of waiting in the run queue for a long and unpredictable time, the target domain should be scheduled to handle the event immediately. However, it should not use any "extra time" beyond its own share. It therefore borrows time from itself, i.e., from its next time slice, to get a chance to run early. The borrowed time slice is short and is subtracted from the next time slice to preserve fairness.

There are several issues with this enhancement. The first is the length of the borrowed time slice: if it is too short, it is not enough to handle the event and has no effect at all; if it is too long, the target domain may use it for CPU-intensive work, which is unfair to the other domains. There is a tradeoff between efficiency and fairness.
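The mechanism can be sketched as follows. BORROW_US, runq_remove, and run_for are hypothetical names; the repayment in next_slice_len is the fairness adjustment described above.

    #define BORROW_US 500           /* assumed borrowed-slice length */

    void runq_remove(struct vcpu *v);          /* hypothetical helper */
    void run_for(struct vcpu *v, long us);     /* hypothetical helper */

    struct vcpu_bt {
        struct vcpu base;
        long borrowed_us;           /* debt to repay from the next slice */
    };

    void on_event_in_runq(struct vcpu_bt *v)
    {
        v->borrowed_us += BORROW_US;   /* remember the debt             */
        runq_remove(&v->base);         /* jump the queue...             */
        run_for(&v->base, BORROW_US);  /* ...but only for a short slice */
    }

    long next_slice_len(struct vcpu_bt *v, long full_slice_us)
    {
        long slice = full_slice_us - v->borrowed_us;  /* repay the debt */
        v->borrowed_us = 0;
        return slice > 0 ? slice : 0;
    }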

5.4. More Fine-Grained Scheduling

When a domain has to wait in the run queue, a simple way to reduce the waiting time is to use a finer-grained time slice. In the Credit scheduler the default tick time is 10ms, whereas the value is 1ms in the Linux 2.6 kernel. This value can usually be tuned to the environment: longer for server use and shorter for desktops. We tune the tick time to 1ms to reduce the height of latency peaks. The experimental results show that this significantly reduces peak height, but has much less effect on peak frequency.
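Conceptually, the tick is just the period of the accounting timer, as in the sketch below; shortening it bounds how long a scheduling decision can lag. The helper names are hypothetical.

    void burn_credits_of_current(long us);  /* hypothetical helpers */
    void maybe_preempt(void);
    void set_timer(long when_us);

    static long tick_us = 10000;    /* default 10ms; 1ms for desktop use */

    void scheduler_tick(long now_us)
    {
        burn_credits_of_current(tick_us); /* debit the running vcpu    */
        maybe_preempt();                  /* re-evaluate the run queue */
        set_timer(now_us + tick_us);      /* re-arm with the tick      */
    }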

6. Experimental Results

In this section we evaluate the effect of our Credit scheduler enhancements on latency peaks.


[Figure 4. The height and frequency of latency peaks. (a) Latency-peak height (ms) for the default, bs, 1ms, and 1ms-bs configurations. (b) Latency-peak frequency (%): default 2.78, bs 0.16, 1ms 4.93, 1ms-bs 1.01.]

6.1. Benchmarks

Two simple microbenchmarks were used to characterize the behavior of the VM scheduler across workloads:

cpubomb This microbenchmark attempts to fully utilize a guest domain's processor resources. It simply runs an infinite computation loop within the guest to consume as much processor time as the VMM will allow.

udping This microbenchmark measures response latency. The guest runs a server program that replies to any packet it receives with the same content. A remote system runs a client program that sends packets to the server; the client waits for the previous reply before sending a new packet.

These benchmarks are simple representatives of a processor-intensive workload and a latency-sensitive workload, respectively. In the following experiments, 8 domains run concurrently: 7 domains run cpubomb and 1 otherwise-idle domain runs udping.
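For concreteness, a minimal udping client along these lines is shown below. This is our reconstruction rather than the original benchmark source; SERVER_IP, PORT, and the 100ms probe interval are assumptions, and the server side is any UDP echo service.

    #include <arpa/inet.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define SERVER_IP "192.168.0.2"   /* assumed address of the guest */
    #define PORT      7777            /* assumed echo port            */
    #define COUNT     10000

    static double now_ms(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port   = htons(PORT) };
        inet_pton(AF_INET, SERVER_IP, &srv.sin_addr);

        char buf[64] = "ping";
        for (int i = 0; i < COUNT; i++) {
            double t0 = now_ms();
            sendto(s, buf, sizeof(buf), 0,
                   (struct sockaddr *)&srv, sizeof(srv));
            recvfrom(s, buf, sizeof(buf), 0, NULL, NULL); /* wait for echo */
            printf("%d %.3f\n", i, now_ms() - t0);        /* RTT in ms     */
            usleep(100 * 1000);       /* assumed 100ms between probes */
        }
        close(s);
        return 0;
    }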

6.2. Scheduler Configurations

Table 1 shows the four scheduler configurations that were evaluated. The "default" configuration is the default configuration of Xen 3.2.0; the other three use different combinations of the scheduler enhancements.

6.3. Experimental System

All experiments were run using Xen 3.2.0 on an AMD Opteron-based system with an Opteron 246 processor, 2GB of memory, and one 100Mbps Ethernet network interface. A PC is used as the other endpoint for the udping tests; it was never the bottleneck in any experiment.

Label     Scheduler   Bonus & Sort   1ms tick
default   Credit      off            off
bs        Credit      on             off
1ms       Credit      off            on
1ms-bs    Credit      on             on

Table 1. Configurations of the scheduler tested.

The driver domain and all guest domains ran the Fedora 8 Linux distribution with the Linux 2.6.18 kernel. All guest domains were identical in all experiments.

6.4. Latency-Peak Height

Figure 4(a) presents the height of latency peaks under the configurations described in Section 6.2. As the figure shows, credit bonus and run-queue sorting reduce the latency height from 152ms to 52ms, i.e., to 34% of the default configuration. Tuning the time tick to 1ms reduces the height from 152ms to 18ms, only 12% of the default. The combination of both enhancements achieves 4ms, 3% of the default. The data show that credit bonus, run-queue sorting, and tick tuning are all effective at reducing latency-peak height, and that their effects are orthogonal.

6.5. Latency-Peak Frequency

Figure 4(b) presents the frequency of latency peaks under the different configurations. As the figure shows, tuning the tick time to 1ms gives the worst frequency, 4.93%. This means that although finer-grained scheduling can reduce the height of latency peaks, it cannot reduce their frequency. Meanwhile, credit bonus and run-queue sorting have a significant effect on lowering the frequency, eliminating 94% of the ping peaks in the 10ms-tick configuration and 80% in the 1ms-tick configuration. Note that the frequency under the 1ms configuration is higher than under the default: with finer-grained scheduling the guest domain's credits are debited more often, so the domain waits in the run queue more often, which leads to more latency peaks.

397397397

Page 6: [IEEE 2008 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC) - Shanghai, China (2008.12.17-2008.12.20)] 2008 IEEE/IFIP International Conference on Embedded

[Figure 5. Latency comparison over time between the default and bs configurations.]

The experimental results show that our enhancements significantly reduce both the frequency and the height of abnormal latency peaks. Figure 5 compares the latency under the original Xen and the enhanced Xen.

7. Conclusions and Future Work

Advances in virtualization technologies have created a lot of interest among large companies in exploiting VMM features for cost-cutting and ubiquitous computing via improved desktop-server consolidation. We identified and analyzed the most significant problem in Xen's current scheduler: the latency peak of an operation. Several scheduler enhancements were applied, and the experimental results show that average latency-peak height decreases by as much as 97.4% and latency-peak frequency by as much as 94.2% compared with Xen's default scheduler.

We are currently working on the time-borrowing enhancement. In future work, we plan to measure response latency with real-life scenarios in place of udping, and to further study performance interference between and within domains.

References

[1] R. Baratto, L. Kim, and J. Nieh. THINC: A virtual display architecture for thin-client computing. In Proceedings of the 20th Symposium on Operating Systems Principles (SOSP'05), pages 277–290, Brighton, United Kingdom, 2005.

[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th Symposium on Operating Systems Principles (SOSP'03), pages 164–177, Bolton Landing, New York, USA, 2003.

[3] L. Cherkasova, D. Gupta, and A. Vahdat. Comparison of the three CPU schedulers in Xen. ACM SIGMETRICS Performance Evaluation Review, 35(2):42–51, 2007.

[4] Citrix. XenDesktop: http://www.citrix.com.

[5] B. C. Cumberland, G. Carius, and A. Muir. Microsoft Windows NT Server 4.0 Terminal Server Edition: Technical Reference. Microsoft Press, Redmond, WA, 1999.

[6] K. J. Duda and D. R. Cheriton. Borrowed-virtual-time (BVT) scheduling: Supporting latency-sensitive threads in a general-purpose scheduler. In Proceedings of the 17th Symposium on Operating Systems Principles (SOSP'99), pages 261–276, Kiawah Island Resort, South Carolina, USA, 1999.

[7] Y. Endo, Z. Wang, J. B. Chen, and M. Seltzer. Using latency to evaluate interactive system performance. In Proceedings of the 2nd Symposium on Operating Systems Design and Implementation (OSDI'96), 1996.

[8] B. Lin and P. A. Dinda. VSched: Mixing batch and interactive virtual machines using periodic real-time scheduling. In Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, page 8, Seattle, WA, USA, 2005.

[9] D. Ongaro, A. L. Cox, and S. Rixner. Scheduling I/O in virtual machine monitors. In Proceedings of the 4th International Conference on Virtual Execution Environments (VEE'08), pages 1–10, Seattle, WA, USA, 2008.

[10] T. Richardson, Q. Stafford-Fraser, K. R. Wood, and A. Hopper. Virtual network computing. IEEE Internet Computing, 2(1):33–38, 1998.

[11] Credit scheduler. http://wiki.xensource.com/xenwiki/creditscheduler.

[12] R. W. Scheifler and J. Gettys. The X Window System. ACM Transactions on Graphics (TOG), 5(2):79–109, 1986.

[13] B. K. Schmidt, M. S. Lam, and J. D. Northcutt. The interactive performance of SLIM: A stateless, thin-client architecture. In Proceedings of the 17th Symposium on Operating Systems Principles (SOSP'99), pages 32–47, 1999.

[14] VMware. VDI: A new desktop strategy. Technical report, VMware, 2006.

[15] C. Yang, Y. Niu, Y. Xia, and X. Cheng. Performance analysis of interactive desktop applications in virtual machine environment. The Chinese Journal of Electronics, to appear.
