Forgoing hypervisor fidelity for measuring virtual machine
performance
Oliver R. A. Chick
Gonville and Caius College
This dissertation is submitted for the degree of Doctor of Philosophy
http://orcid.org/0000-0002-6889-8561
-
FORGOING HYPERVISOR FIDELITY FOR MEASURING VIRTUAL MACHINE PERFORMANCE
OLIVER R. A. CHICK
For the last ten years there has been rapid growth in cloud computing, which
has largely been powered by virtual machines. Understanding the performance
of a virtual machine is hard: There is limited access to hardware counters, tech-
niques for probing have higher probe effect than on physical machines, and per-
formance is tightly coupled with the hypervisor's scheduling decisions. Yet, the
need for measuring virtual machine performance is high as virtual machines are
slower than physical machines and have highly-variable performance.
Current performance-measurement techniques demand hypervisor fidelity:
They execute the same instructions on a virtual machine and physical machine.
Whilst fidelity has historically been considered an advantage as it allows the hy-
pervisor to be transparent to virtual machines, the use case of hypervisors has
changed from multiplexing access to a single mainframe across an institution to
forming a building block of the cloud.
In this dissertation I reconsider the argument for hypervisor fidelity and show
the advantages of software that co-operates with the hypervisor. I focus on pro-
ducing software that explains the performance of virtual machines by forgoing
hypervisor fidelity. To this end, I develop three methods of exposing the hy-
pervisor interface to performance measurement tools: (i) Kamprobes is a tech-
nique for probing virtual machines that uses unprivileged instructions rather
than interrupt-based techniques. I show that this brings the time required to
fire a probe in a virtual machine to within twelve cycles of native performance.
(ii) Shadow Kernels is a technique that uses the hypervisor's memory manage-
ment unit so that an operating system kernel can have per-process specialisation,
which can be used to selectively fire probes, with low overheads (835 ± 354 cycles
per page) and minimal operating system changes (340 LoC). (iii) Soroban uses
machine learning on the hypervisor's scheduling activity to report the virtualisa-
tion overhead in servicing requests and can distinguish between latency caused
by high virtual machine load and latency caused by the hypervisor.
Understanding the performance of a machine is particularly difficult when
executing in the cloud due to the combination of the hypervisor and other virtual
-
machines. This dissertation shows that it is worthwhile forgoing hypervisor
fidelity to improve the visibility of virtual machine performance.
-
DECLARATION
This dissertation is my own work and contains nothing which is the outcome
of work done in collaboration with others, except where specified in the text.
This dissertation is not substantially the same as any that I have submitted for a
degree or diploma or other qualification at any other university. This dissertation
does not exceed the prescribed limit of 60 000 words.
Oliver R. A. Chick
November 30, 2015
http://orcid.org/0000-0002-6889-8561
-
ACKNOWLEDGEMENTS
This work was principally supported by the Engineering and Physical Sciences
Research Council [grant number EP/K503009/1] and by internal funds from the
University of Cambridge Computer Laboratory.
I should like to pay personal thanks to Dr Andrew Rice and Dr Ripduman So-
han for their countless hours of supervision and technical expertise, without
which I would have been unable to conduct my research. Further thanks to
Dr Ramsey M. Faragher for encouragement and help in wide-ranging areas.
Special thanks to Lucian Carata and James Snee for their efforts in cod-
ing reviews and being prudent collaborators, as well as Dr Jeunese A. Payne,
Daniel R. Thomas, and Diana A. Vasile for proof reading this dissertation.
My gratitude goes to Prof. Andy Hopper for his support for the Resourceful
project.
All members of the DTG, especially Daniel R. Thomas and other inhabitants
of SN14 have provided me with both wonderful friendships and technical assis-
tance, which has been invaluable throughout my Ph.D.
Final thanks naturally go to my parents for their perpetual support.
-
CONTENTS
1 Introduction 15
1.1 Defining forgoing hypervisor fidelity . . . . . . . . . . . . . . . . 16
1.2 Limitations of hypervisor fidelity in performance measurement tools 17
1.3 The case for forgoing hypervisor fidelity in performance measure-
ment tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7 Scope of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7.1 Xen hypervisor . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7.2 GNU/Linux operating system . . . . . . . . . . . . . . . . 23
1.7.3 Paravirtualised guests . . . . . . . . . . . . . . . . . . . . . 24
1.7.4 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Background 27
2.1 Historical justification for hypervisor fidelity . . . . . . . . . . . . 28
2.2 Contemporary uses for virtualisation . . . . . . . . . . . . . . . . 29
2.3 Virtualisation performance problems . . . . . . . . . . . . . . . . 33
2.3.1 Privileged instructions . . . . . . . . . . . . . . . . . . . . 33
2.3.2 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.4 Increased contention . . . . . . . . . . . . . . . . . . . . . 34
2.3.5 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.6 Unpredictable timing . . . . . . . . . . . . . . . . . . . . . 35
2.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 The changing state of hypervisor fidelity . . . . . . . . . . . . . . 35
2.4.1 Historical changes to hypervisor fidelity . . . . . . . . . . 35
2.4.2 Recent changes to hypervisor fidelity . . . . . . . . . . . . 36
2.4.3 Current state of hypervisor fidelity . . . . . . . . . . . . . 38
-
2.4.3.1 Installing guest additions . . . . . . . . . . . . . 38
2.4.3.2 Moving services into dedicated domains . . . . . 38
2.4.3.3 Lack of transparency of HVM containers . . . . 39
2.4.3.4 Hypervisor/operating system semantic gap . . . . 39
2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Rethinking operating system design for hypervisors . . . . . . . . 40
2.6 Virtual machine performance measurement . . . . . . . . . . . . . 41
2.6.1 Kernel probing . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.2 Kernel specialisation . . . . . . . . . . . . . . . . . . . . . 42
2.6.3 Performance interference . . . . . . . . . . . . . . . . . . . 43
2.6.3.1 Measurement . . . . . . . . . . . . . . . . . . . . 43
2.6.3.2 Modelling . . . . . . . . . . . . . . . . . . . . . . 44
2.6.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Application to a broader context . . . . . . . . . . . . . . . . . . 46
2.7.1 Containers . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.2 Microkernels . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 Kamprobes: Probing designed for virtualised operating systems 49
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Current probing techniques . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Linux: Kprobes . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Windows: Detours . . . . . . . . . . . . . . . . . . . . . . 52
3.2.3 FreeBSD, NetBSD, OS X: DTrace function boundary tracers 53
3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Experimental evidence against virtualising current probing tech-
niques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.1 Cost of virtualising Kprobes . . . . . . . . . . . . . . . . . 54
3.3.2 Cost of virtualised interrupts . . . . . . . . . . . . . . . . . 57
3.3.3 Other causes of slower performance when virtualised . . . 58
3.4 Kamprobes design . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.1 Kamprobes API . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2 Kernel module . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.3 Changes to the x86-64 instruction stream . . . . . . . . . 61
-
3.5.3.1 Inserting Kamprobes into an instruction stream . 61
3.5.3.2 Kamprobe wrappers . . . . . . . . . . . . . . . . 62
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.1 Inserting probes . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.2 Firing probes . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.3 Kamprobes executing on bare metal . . . . . . . . . . . . . 74
3.7 Evaluation summary . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8.1 Backtraces . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8.2 FTrace compatibility . . . . . . . . . . . . . . . . . . . . . 76
3.8.3 Instruction limitations . . . . . . . . . . . . . . . . . . . . 76
3.8.4 Applicability to other instruction sets and ABIs . . . . . . 76
3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4 Shadow kernels: A general mechanism for kernel specialisation in exist-
ing operating systems 79
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.1 Shadow Kernels for probing . . . . . . . . . . . . . . . . . 82
4.2.2 Per-process kernel profile-guided optimisation . . . . . . . 84
4.2.3 Kernel optimisation and fast-paths . . . . . . . . . . . . . 84
4.2.4 Kernel updates . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Design and implementation . . . . . . . . . . . . . . . . . . . . . 86
4.3.1 User space API . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 87
4.3.2.1 Module insertion . . . . . . . . . . . . . . . . . . 88
4.3.2.2 Initialisation of a shadow kernel . . . . . . . . . 88
4.3.2.3 Adding pages to the shadow kernel . . . . . . . . 89
4.3.2.4 Switching shadow kernel . . . . . . . . . . . . . 89
4.3.2.5 Interaction with other kernel modules . . . . . . 90
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 Creating a shadow kernel . . . . . . . . . . . . . . . . . . 91
4.4.2 Switching shadow kernel . . . . . . . . . . . . . . . . . . . 93
4.4.2.1 Switching time . . . . . . . . . . . . . . . . . . . 93
4.4.2.2 Effects on caching . . . . . . . . . . . . . . . . . 95
-
4.4.3 Kamprobes and Shadow Kernels . . . . . . . . . . . . . . 97
4.4.4 Application to web workload . . . . . . . . . . . . . . . . 102
4.4.5 Evaluation summary . . . . . . . . . . . . . . . . . . . . . 103
4.5 Alternative approaches . . . . . . . . . . . . . . . . . . . . . . . . 103
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.6.1 Modifications required to kernel debuggers . . . . . . . . . 105
4.6.2 Software guard extensions . . . . . . . . . . . . . . . . . . 105
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5 Soroban: Attributing latency in virtualised environments 107
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.1 Performance monitoring . . . . . . . . . . . . . . . . . . . 110
5.2.2 Virtualisation-aware timeouts . . . . . . . . . . . . . . . . 110
5.2.3 Dynamic allocation . . . . . . . . . . . . . . . . . . . . . . 111
5.2.4 QoS-based, fine-grained charging . . . . . . . . . . . . . . 111
5.2.5 Diagnosing performance anomalies . . . . . . . . . . . . . 112
5.3 Sources of virtualisation overhead . . . . . . . . . . . . . . . . . . 112
5.4 Effect of virtualisation overhead on end-to-end latency . . . . . . 116
5.5 Attributing latency . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5.1 Justification of Gaussian processes . . . . . . . . . . . . . 121
5.5.2 Alternative approaches . . . . . . . . . . . . . . . . . . . . 122
5.6 Choice of feature vector elements . . . . . . . . . . . . . . . . . . 123
5.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7.1 Xen modifications . . . . . . . . . . . . . . . . . . . . . . 126
5.7.1.1 Exposing scheduler data . . . . . . . . . . . . . . 126
5.7.1.2 Sharing scheduler data between Xen and its vir-
tual machines . . . . . . . . . . . . . . . . . . . . 127
5.7.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 127
5.7.3 Application modifications . . . . . . . . . . . . . . . . . . 128
5.7.3.1 Soroban API . . . . . . . . . . . . . . . . . . . . 128
5.7.3.2 Using the Soroban API . . . . . . . . . . . . . . . 129
5.7.4 Data processing . . . . . . . . . . . . . . . . . . . . . . . . 129
5.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.8.1 Validation of model . . . . . . . . . . . . . . . . . . . . . 130
-
5.8.1.1 Mapping scheduling data to virtualisation over-
head . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.8.1.2 Negative virtualisation overhead . . . . . . . . . 133
5.8.2 Validating virtualisation overhead . . . . . . . . . . . . . . 137
5.8.3 Detecting increased load from the cloud provider . . . . . . 140
5.8.4 Performance overheads of Soroban . . . . . . . . . . . . . 141
5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.9.1 Increased programmer burden of program annotations . . 142
5.9.2 Scope of performance isolation considered by Soroban . . 143
5.9.3 Limitation to uptake . . . . . . . . . . . . . . . . . . . . . 143
5.9.4 Improvements to machine learning . . . . . . . . . . . . . 143
5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6 Conclusion 145
6.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.4 Other performance measurement techniques that forgo hy-
pervisor fidelity . . . . . . . . . . . . . . . . . . . . . . . . 150
6.5 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
-
CHAPTER 1
INTRODUCTION
The recent emergence of cloud computing is largely dependent on the populari-
sation of high-performance and secure x86-64 virtualisation. By using a hypervi-
sor, cloud operators are able to multiplex their hardware, with high performance
and strong data isolation, between multiple competing users. This multiplexing
allows cloud providers to increase machine utilisation and increase service scal-
ability. Moreover, the hypervisor eases system management with maintenance
features such as snapshotting and live migration.
Yet, despite the advantages of virtual machines they remain slower than phys-
ical machines and have highly-variable performance [60]. Whilst efforts have
improved both the raw performance and performance isolation of virtual ma-
chines, the increased indirection and additional complexity in virtualising privi-
leged instructions make it unlikely that we shall achieve parity of performance.
Developers therefore need techniques to help them measure how much slower
their applications execute in a virtual machine than they would have done on
bare metal. Furthermore, they need to be able to diagnose and fix performance
issues that occur in virtualised production systems.
However, using current techniques it is difficult to measure the performance
of software when it executes in virtual machines. Many of the methods used
to measure the performance of software when executing on bare metal, such as
raw access to performance counters, processor tracing, and visibility of hard-
ware performance metrics are not directly accessible [18], expensive [105], or
inaccurate [105, 71] when executing in a virtual machine. The combination of
less predictable performance and unavailability of performance-debugging tech-
niques makes it hard to measure the performance of an application executing in
a virtual machine.
One technique is to optimise software on bare metal, where access to more
hardware features is available, and then to virtualise the software. However,
15
-
this is a poor approach as virtualisation has different performance impacts on
different operations.1
Currently, the main virtualisation techniques used by hypervisors either have
guests execute unmodified code, relying on hardware virtualisation extensions
to emulate bare-metal hardware from the point of view of the guest, or exe-
cute paravirtualised guests whereby the virtual machines are made aware that
they are executing on a hypervisor and issue hypercalls, as opposed to execut-
ing privileged instructions. But such paravirtualisation of mainstream operating
systems only applies to the low-level hardware interfaces, typically restricted to
the architecture-dependent (arch/) code. As such, performance measurement
techniques that execute on a virtual machine exhibit hypervisor fidelity: They
execute without consideration of the fact that they are executing in a virtual ma-
chine. They are therefore unable to access the same set of counters that they can
on physical machines and are unable to explain performance issues, such as CPU
starvation of the entire operating system, that do not exist on physical machines.
Slower and less-predictable performance of software executing in a virtual
machine are two of the greatest disadvantages of executing software using a
virtual machine, yet current techniques for measuring this performance do not
consider the role of the virtualisation in slow performance. In this dissertation I
argue the benefits of forgoing hypervisor fidelity to measure performance. That
is, given the importance of measuring the performance of virtual machines we
should turn to forgoing fidelity, in the same way as we have previously forgone
fidelity to ameliorate previous problems with virtualisation, such as slow perfor-
mance and the difficulties in virtualising classical x86.
I show that by forgoing hypervisor fidelity it is possible to build performance-
analysis techniques that reduce the probe effect of measuring virtual machines
and explain performance characteristics of software that one cannot measure
without considering the role of the hypervisor in executing software.
1.1 Defining forgoing hypervisor fidelity
Hypervisor fidelity is a well-defined concept [115]. However, the concept of for-
going hypervisor fidelity is less well defined. In this dissertation I define forgoing
1 Indeed, I show in Chapter 4 and Chapter 5 that, depending on the operation performed, virtualisation overheads can vary to the extent of changing the shape of a distribution.
16
-
hypervisor fidelity as a property of software that is designed for execution on a
virtual machine and makes use of the properties of the hypervisor.
1.2 Limitations of hypervisor fidelity in performance
measurement tools
Hypervisors date back to early work by IBM in the 1960s, where they were
initially used to multiplex access to a scarce, expensive mainframe. However,
the current trend of using hypervisors to virtualise cloud infrastructure has its
roots in the renaissance that followed fast and secure techniques to virtualise the
x86-64 instruction set. The re-emergence of paravirtualisation, addition of hard-
ware virtualisation extensions, and servers with plentiful memory and CPU capacity
throughout the 2000s made it possible to execute many virtual machines on a
single server to increase utilisation. This, combined with a consumer movement
to performing computations and storing data on servers, made virtualisation at-
tractive to industry as virtualisation is cheaper and more scalable than executing
on dedicated machines.
The rise of cloud computing in recent years has been impressive. Amazon
EC2 alone has grown from nine million to twenty eight million public IP ad-
dresses in the past two years [143]. This number is clearly an underestimate for
the actual use of virtual machines as it doesn't include other cloud providers, or
non-public IP addresses.
However, the performance of virtual machines executing in the cloud is highly-
variable [39, 49], with cloud providers now competing on the predictability of
their services [9]. Despite this, the tools available to users to measure the per-
formance of their virtual machines have not kept up with the growth in cloud
computing. Given the difficulty in correctly virtualising all hardware counters
and eliminating performance interference, I show how by forgoing hypervisor
fidelity we can build tools that aid with measuring the performance of a virtual
machine.
17
-
1.3 The case for forgoing hypervisor fidelity in perfor-
mance measurement tools
Forgoing hypervisor fidelity to ameliorate problems in the virtualisation domain
has been repeatedly used in the past. I now explore previous times that we have
forgone hypervisor fidelity to improve the utility of virtual machines and argue
that contemporary problems mean that it is time to forgo hypervisor fidelity of
performance measurement techniques.
The concept of forgoing hypervisor fidelity is almost as old as virtualisation
itself. The early literature relating to OS/360 and OS/370 considers the role
of pure against impure virtual machines, whereby an impure virtual machine
executes differently as it has been virtualised. The advantage of impure virtual
machines was that they could execute faster than pure virtual machines. In the
end, pure virtual machines became the dominant virtual machine type, although
techniques such as paravirtualisation borrow from the ideas of impure virtual
machines.
More recently, forgoing hypervisor fidelity has been used to overcome classi-
cal limitations of the x86 instruction set that meant it was not virtualisable in a
way that provided both security and performance. By adopting paravirtualisa-
tion to overcome the limitations of classical x86, Xen forgoes hypervisor fidelity
since virtual machines execute with knowledge of the hypervisor and issue hy-
percalls rather than executing non-virtualisable instructions.
Even today, we forgo hypervisor fidelity to overcome performance problems
with virtualisation. One problem that virtual machines face is the possibility of
not being scheduled when they need to execute, for instance after packets have
arrived for the virtual machine. In order for the hypervisor to more-favourably
schedule the virtual machine when it has work to do, under Xen there are two
hypercalls that allow guests to deschedule themselves: yield and block. When a
guest is waiting for I/O or the network they can execute the block hypercall, pa-
rameterised on the event that they are waiting for. The hypervisor then preempts
the guest until the corresponding event is placed on the guest's event channel, at
which point the hypervisor wakes the guest. The advantage in this case of the
guest acknowledging the presence of the hypervisor is that by blocking when it
cannot make progress the scheduling algorithm stops consuming credit from the
18
-
domain. Therefore, when the guest is able to execute, the scheduling algorithm
will be more favourable to the domain. Similarly, the yield hypercall allows
guests to relinquish their slot on the CPU, without parameterisation, such that
they will later be scheduled more favourably. Both the block and yield hypercalls
improve the performance of the guest, through forgoing hypervisor fidelity.
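The sketch below shows how these two hypercalls look from a paravirtualised Linux guest. It is illustrative only: it assumes the standard Xen interfaces (HYPERVISOR_sched_op with SCHEDOP_block and SCHEDOP_yield) and glosses over the event-channel and interrupt management that a real guest performs around blocking.

/*
 * Illustrative sketch of the two descheduling hypercalls, as issued by a
 * Xen paravirtualised Linux guest.
 */
#include <xen/interface/sched.h>   /* SCHEDOP_block, SCHEDOP_yield */
#include <asm/xen/hypercall.h>     /* HYPERVISOR_sched_op()        */

/* Block: deschedule this vCPU until an event is pending on one of the
 * guest's event channels, so the guest is not charged for idle time. */
static void guest_block_until_event(void)
{
        HYPERVISOR_sched_op(SCHEDOP_block, NULL);
}

/* Yield: give up the remainder of the current scheduling slot without
 * waiting for a particular event, so as to be scheduled more favourably
 * later. */
static void guest_yield(void)
{
        HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
}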
Even with the advent of hardware virtualisation that allows unmodified vir-
tual machines to execute, we still forgo hypervisor fidelity in the drivers on vir-
tual machines to improve performance. On hardware virtual machines (HVM)
the emulation of connected devices (which a tool such as QEMU can provide)
is slow, therefore HVM guests that need more performance are often converted
to PV on HVM guests, using virtualisation drivers that replace the emulated
devices with a driver that directly issues hypercalls. This allows guests to use the
hardware-assisted virtualisation interface where it is fastest, for example when executing
a system call, since the lack of rings one and two on x86-64 requires all pure-
paravirtualised system calls to perform a context switch through the hypervisor,
and to use the paravirtualised interface where it is faster, for example to avoid hard-
ware emulation. This is an example of the virtual machine forgoing hypervisor
fidelity to improve the performance of a virtual machine.
As we have seen, forgoing hypervisor fidelity is an oft-used technique for
solving problems in the virtualisation domain, in particular for solving perfor-
mance issues. A significant issue facing virtualisation today is that performance
is variable and yet techniques for measuring the performance of virtual machines
have lower utility than techniques for measuring the performance of physical ma-
chines. I propose rethinking where we forgo hypervisor fidelity in a mainstream
operating system, designed to execute in a contemporary cloud environment.
In this dissertation I show that by building performance measurement tools
that don't have strict hypervisor fidelity it is possible to mitigate many of the
issues of measuring the performance of a virtual machine. Forgoing hypervisor
fidelity should not be controversial given the trend of forgoing hypervisor fidelity
to solve performance-related issues.
In the remainder of this chapter I introduce three key methods by which
forgoing hypervisor fidelity allows software to report better performance mea-
surements when virtualised. Later, I present each contribution in detail.
19
-
1.4 Kamprobes
Current kernel probing mechanisms are built without forgoing hypervisor fi-
delity. That is, developers execute the same types of probes on virtual machines
as they do on physical machines. However, these methods usually rely on set-
ting software interrupts in an instruction stream. Whilst these generally execute
well on physical hardware, I show in Chapter 3 that interrupts on a virtual ma-
chine are 1.81 times more expensive than interrupts on hardware (3.3.2), as the hypervisor has to execute.
Probes are a common technique for measuring the performance of computer
software. By allowing developers to add additional code at a program's runtime,
probes allow developers to execute code that measures wall-clock time, cycles,
or other resources used by a piece of code without the burden of modifying the
software's source code, recompiling and re-executing the software. However,
a problem with probes is that when they fire they consume resources, thereby
affecting the performance of the application that they try to measure.
Whilst this probe effect impacts both physical machines and virtual machines,
the overheads are 2.28 times higher on virtual machines than on physical machines (3.3).
Moreover, virtualisation increases the standard deviation of the number of cycles
required to fire a probe from 8 cycles to 869 cycles (3.6.2).
By having higher overheads, probing mechanisms on virtual machines ex-
acerbate the probe effect. This makes it harder to identify the cause of poor
performance of applications on virtual machines.
Kamprobes is a technique for probing virtual machines that only uses unpriv-
ileged instructions, such that the hypervisor is not involved in a probe firing, and
avoids other operations that are expensive in a virtual machine, such as hold-
ing locks. Kamprobes forgoes hypervisor fidelity by being designed to execute
with maximum performance on a virtual machine. For instance, by only using
non-privileged instructions, the design of Kamprobes forgoes hypervisor fidelity.
There is only a modest difference between executing in a virtual machine and
on a physical machine in the number of cycles taken (twelve cycles) and the variability
(two cycles of standard deviation). Moreover, Kamprobes execute much faster
than Kprobes (the current state-of-the-art in Linux kernel probing), with a Kam-
probe taking 69 ± 16 cycles to execute, whereas a Kprobe takes 6980 ± 869 cycles
20
-
to execute (3.6.2). Furthermore, whilst not an issue of virtualisation, when
Kprobes determines which handler to execute it performs a lookup that scales
as O(n) with the number of probes inserted. The technique that Kamprobes
uses does not need to perform a lookup, and so runs in constant time (O(1)). Kam-
probes can therefore be used in circumstances that require many probes, such
as for a function boundary tracer, for which Kprobes is too slow.
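The program below is not Kamprobes itself, but a small user-space illustration of the cost gap that motivates it: it times an int3-based probe (the software-interrupt mechanism on which Kprobes relies) against an ordinary, unprivileged call to an empty handler. It assumes GCC on x86-64 Linux; the figures it prints are machine-dependent, and on a paravirtualised guest the trap path additionally involves the hypervisor, widening the gap further.

/* probe_cost.c: compare an interrupt-based probe with a call-based probe.
 * Build with: gcc -O2 probe_cost.c -o probe_cost
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define ITERATIONS 100000

/* Empty probe handler reached via a trap (SIGTRAP raised by int3). */
static void trap_handler(int sig, siginfo_t *info, void *ctx)
{
        (void)sig; (void)info; (void)ctx;
}

/* Empty probe handler reached via an ordinary call instruction. */
static void __attribute__((noinline)) call_handler(void)
{
        __asm__ volatile("" ::: "memory");   /* keep the call in place */
}

int main(void)
{
        struct sigaction sa;
        uint64_t start, end;
        int i;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = trap_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTRAP, &sa, NULL);

        /* Interrupt-based probe: every firing traps into the kernel. */
        start = __rdtsc();
        for (i = 0; i < ITERATIONS; i++)
                __asm__ volatile("int3");
        end = __rdtsc();
        printf("int3 probe: %llu cycles/fire\n",
               (unsigned long long)((end - start) / ITERATIONS));

        /* Unprivileged probe: an ordinary call to an empty handler. */
        start = __rdtsc();
        for (i = 0; i < ITERATIONS; i++)
                call_handler();
        end = __rdtsc();
        printf("call probe: %llu cycles/fire\n",
               (unsigned long long)((end - start) / ITERATIONS));
        return 0;
}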
1.5 Shadow Kernels
Whilst Kamprobes are a low-overhead technique for probing virtual machines, if
they are used, even with empty probe handlers, on hot codepaths the overhead
of them repeatedly firing can significantly reduce performance. In principle this
shouldn't be an issue because much of the time developers want to measure the
performance of one particular process's interactions with the kernel in isolation.
But there is no current way of setting kernel probes that only fire when one
particular process executes.
Shadow Kernels is a technique I developed by which specialisation, such as
setting probes, can be applied to a kernel instruction stream on a fine-grained
basis such that the specialisation applies to a subset of the processes or system
calls executing on the system. Currently, specialising the operating system kernel
makes changes to the kernel instruction stream that affect all processes executing
on the system. This is because whenever the kernel instruction stream is modified
the address space of every process is modified as each process maps the shared
kernel into its own address space. The underlying issue is that modifications to
the instruction stream of the kernel are a global operation, in that the shared
instruction stream is executed by all processes. I therefore show that the effect
of this is to reduce the performance of all processes executing on the system,
regardless of whether their interaction with the kernel was the target of specialisation.
Shadow Kernels requires co-operation of virtual machines with the hypervi-
sor since the virtual machines execute hypercalls that cause the hypervisor to
modify the physical-to-machine memory mappings such that the virtual mem-
ory containing the kernel instruction stream maps to different machine-physical
memory depending on the calling context.
Shadow Kernels is a technique that utilises the indirection of virtualised page
21
-
tables such that multiple copies of the kernel instruction stream co-exist within
a single domain. This allows processes that are not the target of instrumentation
to execute their original kernel instruction stream, whilst applications whose
interaction with the kernel is the target of specialisation execute a specialised
instruction stream.
Building Shadow Kernels without a hypervisor would be challenging: Oper-
ating systems are designed with a memory layout such that the kernel resides at
a fixed offset in physical memory. However, with Shadow Kernels there are mul-
tiple copies of pages that include the kernel instruction stream, with the memory
management unit changing which page virtual addresses resolve to. Therefore,
there is no longer a fixed mapping between physical and virtual pages in the ker-
nel instruction stream. Furthermore, the hypervisor-based approach makes it easy to
port Shadow Kernels to other operating systems.
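To make the remapping step concrete, the fragment below sketches how a Xen paravirtualised Linux guest can repoint one page of kernel text to a different machine frame holding a specialised copy of that page. It is a minimal sketch assuming the standard Xen PV Linux interfaces (mfn_pte and HYPERVISOR_update_va_mapping); it is not the Shadow Kernels implementation, which Chapter 4 describes.

/*
 * Sketch: map the kernel-text page at virtual address `va` onto the machine
 * frame `new_mfn`, which holds a specialised copy of that page. On a Xen PV
 * guest, page-table updates such as this are performed via hypercalls.
 */
#include <linux/mm.h>
#include <xen/interface/xen.h>     /* UVMF_INVLPG                    */
#include <asm/xen/hypercall.h>     /* HYPERVISOR_update_va_mapping() */
#include <asm/xen/page.h>          /* mfn_pte()                      */

static int shadow_switch_text_page(unsigned long va, unsigned long new_mfn)
{
        /* Build a PTE with executable kernel permissions that points the
         * virtual address at the specialised machine frame. */
        pte_t pte = mfn_pte(new_mfn, PAGE_KERNEL_EXEC);

        /* Ask Xen to install the mapping and invalidate the stale TLB
         * entry for this address on the current vCPU. */
        return HYPERVISOR_update_va_mapping(va, pte, UVMF_INVLPG);
}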
1.6 Soroban
A key issue with executing software in the cloud is that applications often exe-
cute more slowly and sometimes with performance interference from other vir-
tual machines [60]. For latency-sensitive applications, in particular, this virtuali-
sation overhead prevents users from switching to virtual machines [142]. How-
ever, current application monitoring systems are built with hypervisor fidelity, in
that they report the same metrics if they execute on a physical machine or a vir-
tual machine. As the performance of an application is affected by the hypervisor
in a way that is hard to predict, it is currently difficult to measure how much of
the latency of a program executing in the cloud is caused by the overheads of
virtualisation and how much is due to other causes, such as a high load on the vir-
tual machine. Soroban is a technique that forgoes hypervisor fidelity to measure
how much of the latency of a request is due to the overheads of virtualisation.
By forgoing hypervisor fidelity throughout the software stack, up to the appli-
cation, Soroban reports the additional latency imposed on servicing individual
requests in a request-response system. This allows developers to measure the
additional overheads that their application experiences due to executing in a
virtual machine, as opposed to executing on bare metal. By reporting the virtu-
alisation overhead, developers can decide whether the additional overheads are
22
-
worthwhile.
Soroban uses a modified version of Xen that shares with each domain the
activity performed on it by the scheduler, such as timestamps of when the virtual
machine is scheduled in and out. Soroban then trains a Gaussian process on the
relationship between these variables and the response time of a request-response
system. The result of the learning phase is a model that, given a feature
vector of scheduling activity on a domain, reports the impact that these events
have on the response time of a request.
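For reference, the model class is standard Gaussian-process regression (these are the textbook equations, not a Soroban-specific derivation; the covariance function and the feature-vector elements Soroban uses are described in Sections 5.5 and 5.6). Given training feature vectors X = (x_1, ..., x_n) of scheduling activity, observed response times y, a covariance function k and observation-noise variance \sigma_n^2, the prediction for a new feature vector x_* is the posterior mean and variance:

\mu(x_*) = k(x_*, X)\, [K(X, X) + \sigma_n^2 I]^{-1}\, y

\sigma^2(x_*) = k(x_*, x_*) - k(x_*, X)\, [K(X, X) + \sigma_n^2 I]^{-1}\, k(X, x_*)

where K(X, X) is the matrix of pairwise covariances between the training inputs.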
I evaluate Soroban, showing that the technique can be applied to a web server
and measure the increase in latency due to virtualisation in servicing requests. I
demonstrate that as more virtual machines execute concurrently, Soroban re-
ports a greater latency attributed to virtualisation, but when the web server exe-
cutes requests slowly due to high load, Soroban does not increase its measure
of virtualisation overhead.
1.7 Scope of thesis
In this dissertation, I primarily focus on Xen, executing paravirtualised GNU/Linux
on x86-64 hardware. I now justify this choice.
1.7.1 Xen hypervisor
Xen is the hypervisor used by Amazon EC2 [139], which as of May 2015 is ten
times larger than the combined size of all its competitors [85]. Given the clear
dominance of Xen in the cloud, solutions to problems of measuring performance
when virtualised with Xen have a high impact. However, the key contributions
of my thesis can be ported to other hypervisors.
1.7.2 GNU/Linux operating system
As of December 2014, 75% of enterprises report using Linux as their primary
cloud platform [52], with the market share of Linux virtual machines increasing.
Most of the remainder are Windows virtual machines; however, the number of
these is falling.
23
-
1.7.3 Paravirtualised guests
Presently there are two main techniques for virtualising an operating system in
the cloud. (i) Hardware extensions allow an unmodified operating system to ex-
ecute on a hypervisor (HVM). This is a common way of virtualising proprietary
operating systems, such as Microsoft Windows. (ii) Modifying the guest oper-
ating system such that it is aware that it is executing in a virtualised environment
and directly issues hypercalls rather than performing privileged instructions.
The performance of paravirtualised guests is comparable with the perfor-
mance of hardware virtual machines, with regular changes as to which one is
the faster form of virtualisation.
In this dissertation I use paravirtualised virtual machines as they have an ex-
isting interface with the hypervisor, through which virtual machines can issue
hypercalls. As this dissertation proposes forgoing hypervisor fidelity, and as such
creating paravirtualised performance measurement techniques, it is more nat-
ural to build these on paravirtual virtual machines. However, hardware virtual
machines often have a paravirtualised interface through which drivers can oper-
ate, so many of the ideas could be ported to hardware virtual machines.
1.7.4 x86-64
Whilst instruction sets other than x86-64 are virtualisable, Intel currently has
a 98.5% market share in server processors (as measured by number of proces-
sors) [77], with much of the remainder being taken by AMD x86-64 processors.
As such, I do not consider other instruction sets.
The contributions of Kamprobes in Chapter 3 are particularly tightly coupled
with the x86-64 instruction set. However, the fundamental idea of using unpriv-
ileged instructions to build a probing system holds true across other instruction
sets. Indeed, on a fixed-width instruction set, such as ARM, this technique is
both easier to implement and can be used on more opcodes than on x86-64.
Both Shadow Kernels and Soroban are less reliant on any particular instruc-
tion set.
24
-
1.8 Overview
In summary, the key contributions of this dissertation are:
Kamprobes. Current probing techniques are built to execute on a physical ma-
chine and as such rely on interrupts to obtain an execution context. How-
ever, on a virtual machine interrupts involve privileged operations, so are
expensive. Kamprobes is a low-overhead probing technique for x86-64
virtual machines that executes with near-native performance in a virtual
machine.
Shadow Kernels. By forgoing hypervisor fidelity, virtual machines can remap
their text section to allow virtual machines to specialise shared text re-
gions, in particular the kernel. Whilst I focus on the use case of scoping
kernel probes, the technique can be applied to other types of kernel text
specialisation, such as profile-guided optimisation.
Soroban. A key concern that prevents the uptake of virtualisation is the impact
of the virtualisation overhead. I show that by building software that ac-
knowledges the presence of the hypervisor in its own monitoring, it is pos-
sible to measure the virtualisation overhead of fine-grained activities, such
as serving an HTTP request.
The remainder of this dissertation is structured as follows. I explore the back-
ground for my thesis in Chapter 2, arguing that the requirement of hypervisor
fidelity for performance measurement techniques is a relic of classical hypervisor
use cases and can be forgone for contemporary operating systems. In Chapter 3
I introduce Kamprobes, a probing technique for virtualised x86-64 operating
systems. In Chapter 4 I propose Shadow Kernels as a solution for specialisation,
such as scoping the firing of probes. In Chapter 5 I present Soroban, a technique
for using machine learning to report, for each request-response, the additional
latency added by executing on the hypervisor.
25
-
26
-
CHAPTER 2
BACKGROUND
In their 1974 paper Popek and Goldberg state the classical definition of a hyper-
visor as having three properties: Fidelity, performance and safety [115].
Fidelity. Fidelity represents the concept that a hypervisor should portray an accu-
rate representation of the underlying hardware, such that software can exe-
cute on the hypervisor without requiring modification, or being aware that
it executes in a virtualised environment. As such, the results of software
executing in a virtualised environment must be identical to those obtained
when executing on physical hardware, barring any effects of different tim-
ing whilst executing on virtualised hardware.
Performance. The performance of a virtual machine must not be substantially
slower than when executing on physical hardware. In particular, most in-
structions that execute must run unmodified, without trap-and-emulation
techniques (trap-and-emulation is the only virtualisation technique that
Popek and Goldberg consider).
Safety. Virtual machines must act independently, without the ability to interfere
with other domains executing on the system. Particularly, virtual machines
should not have direct access to shared hardware, with which they can
modify the state of another virtual machine in a way that would not be
expected of that machine executing on physical hardware.
In this dissertation I propose performance-analysis techniques that are de-
signed to complement virtualisation, by either using code that virtualises well or
by using techniques that interact with the hypervisor. As such, this work breaks
the traditional definition of a hypervisor in that it no longer offers fidelity. In this
chapter, I consider related work to argue that the difficulty of measuring the per-
formance of virtual machines is exacerbated by the requirement of fidelity and
27
-
that this should be relaxed given the changing uses of hypervisors. Throughout
the rest of this dissertation I use this argument to justify techniques that require
performance-analysis techniques that are tightly-coupled with the hypervisor.
2.1 Historical justification for hypervisor fidelity
In this section I consider the historical justification for hypervisors, especially for
hypervisor fidelity. I later show that the use cases of hypervisors have changed
and as such we should reconsider the hypervisor's original design principles.
The concept of hypervisor fidelity, whilst formalised in 1974 [115], dates
back to the start of research into virtual machines by IBM. IBM built early hy-
pervisors that allowed multiple users to concurrently execute on a rare and ex-
pensive mainframe with the illusion of being the only user of the machine. That
is, each user had the illusion of being the sole user of the machine's hardware,
with their operating system being the only one executing. The key issues that
early hypervisors attempt to fix are that OS/360 uses the now-common [75]
architecture of a machine executing a single kernel that is shared with every
process executing on the system: (i) Different users are unable to execute dif-
ferent operating system versions. Due to the lack of availability of mainframes,
users were unable to obtain another machine to execute their own operating
system version. (ii) Users cannot develop new operating system features in isola-
tion from other users. For instance, if a developer were to extend the operating
system, but their code contains a bug, with OS/360 it is not possible to prevent
this from affecting concurrent users. As traditional abstractions are lower-level
than contemporary abstractions, it was commonplace for developers to regularly
need to modify or extend their operating system.
CP-40 is considered to be the first hypervisor, being released in 1967 and able
to concurrently execute fourteen virtual machines. As the complexity of hard-
ware increased through the 1970s the use of hypervisors became more practical
and featured in the development of OS/360 and OS/370 [62, 127]. Behind all
IBM work is the control program (CP), which allows concurrent execution of
operating systems, each of which has the illusion of executing on physical hard-
ware [56]. The original versions of CP allow an unmodified operating system to
execute in a virtualised environment in which CP configures the hardware such
28
-
that whenever a virtual machine executes a privileged instruction the hardware
induces a trap, which CP catches, decodes and emulates in a safe way. There
were other early hypervisors, such as the FIGARO system, part of the
Cambridge Multiple-Access System, which had similar design goals [147]. As such,
these early hypervisors do provide fidelity, in that the software that executes on
them has the same side effects, ignoring timing effects, on both physical and
virtual hardware.
2.2 Contemporary uses for virtualisation
Having shown the historical justification for hypervisor fidelity, I now argue that
the use case for virtualisation is now different from that of the 1960s and 1970s. As such,
it is time to reassess the requirement of virtual machine fidelity, in particular to
help developers measure the performance of their virtual machines.
Rather than building performance tools that explain a subset of what can be
viewed on a physical machine, due to limited access to performance counters,
we should forgo hypervisor fidelity by building performance analysis techniques
that are designed to execute on a virtual machine.
Compared with when hypervisors were pioneered, hardware is now cheaper
and more readily available; as such, the original requirements for virtualisation
no longer hold: (i) In contemporary computing users have access to many ma-
chines, as such they are usually able to execute an operating system of choice
on a different computer. (ii) The influx of additional hardware also means that
development of operating system features can be performed on dedicated devel-
opment hardware. Indeed, executing production services on the same hardware
that is used for operating system development, even when a hypervisor is used,
would be unconventional in the current era. In comparison to when virtualisa-
tion was pioneered, it is standard practice to have fleets of physical machines just
testing changes to operating system source code. Moreover, higher-level abstrac-
tions reduce the requirement of most development work to involve modifying
the operating system.
In the last ten years virtualisation has underpinned the move to cloud com-
puting, which in turn has revolutionised computing [6]. A lower-bound indica-
tor of the growth of cloud computing is that Amazon AWS alone has increased
29
-
from nine million to twenty eight million EC2 public IP addresses in the past two
years [143]. The key benefit of the hypervisor in these cloud computing environ-
ments is allowing operators to provide virtual machines to their customers, so
that multiple customers can share the same physical server without interference.
In particular, hypervisors give a number of advantages to cloud providers:
Higher machine utilisation. By co-hosting virtual machines on a physical server
the utilisation of the physical server increases when compared with execut-
ing each service on a dedicated physical machine. Whilst higher utilisation
was a key factor in the early work on hypervisors, this was because the
mainframes that they executed on were scarce and highly-contested. How-
ever for cloud providers, servers are readily-available, but higher utilisa-
tion decreases power consumption, cooling, maintenance and real-estate
expenditure. In order to increase utilisation, hypervisors now offer fea-
tures such as memory overcommitting through ballooning [144] and pre-
allocation [94]. Although such higher utilisation has remained a benefit of
using a hypervisor, the reasons for desiring higher utilisation have changed,
as such the role of the hypervisor has changed. The downside to higher
utilisation is that it risks starving virtual machines of resources, thereby
reducing their performance. Operating system starvation is not a problem
that exists when executing on bare metal, therefore tools that do not forgo
hypervisor fidelity cannot report this effect.
Creating virtual machines is fast and cheap. Users can spawn a new, booted vir-
tual machine in less than one second [84]. This is not possible without
a hypervisor, since fast boot up is achieved by forking an already-booted
virtual machine, such that the two have the same state. With physical
machines, the closest alternatives are techniques such as PXE that aid in
reducing the time between connecting a server and it being fully-booted.
However, for most use cases the main time cost in running a new physical
server is actually in finding server hosting and obtaining a physical server.
With hypervisors, there is no need for most users to purchase physical host-
ing and servers, as they can simply pay for a virtual machine from their
cloud provider. Moreover, the economics of cloud computing often make
it cheaper to execute in the cloud than building a data centre [136]. This
clearly differs from the original use case of a hypervisor in which being
30
-
able to rapidly spawn a new machine was not a desired feature.
Scalability to near-infinite computing resource on demand. Usage patterns of Internet-
connected applications are highly-variable [119]. In order to respond to
spikes in demand they need to be elastic, in that they need to execute using
more machines during spikes to maintain a quality of service. Hypervisors
allow scalability of virtual machines to up to 3 000 virtual machines in a
32 host pool [61]. In cloud computing environments, where hypervisor
pools are less common, the bottleneck on the number of virtual machines
that can execute is bounded by economic factors. As virtual machines are
fast to spawn, users can build more scalable software that responds to
changes in demand by creating more virtual machines. Such requirements
were never present in the early forms of virtualisation, as they operated
before the creation of the Internet, so contemporary issues such as the
slashdot effect and viral trends did not exist. Furthermore, the original
workloads that executed on a hypervisor were non-interactive batch jobs,
therefore they had different performance requirements to contemporary
clouds, where request-latency is a key metric.
Live migration of virtual machines. Modern hypervisors can transparently mi-
grate virtual machines between physical hosts [124] without downtime [34]
and similarly migrate and load-balance [59] storage between repositories
without downtime [94]. This allows system administrators to perform
maintenance on physical machines without disrupting a service executing
on the virtual machines, since they can first migrate the instance onto an-
other host. As organisations rarely had more than one mainframe when
hypervisors were initially designed, this was not a use case of the pioneer-
ing work. The downside of live migration is that if the virtual machine
is migrated onto a highly-loaded or less powerful host then it may exe-
cute more slowly. However, this decrease in performance is caused by the cloud
provider, so is hard to detect with existing techniques.
High isolation compared with other virtualisation techniques. Kernel security vul-
nerabilities only affect the domain in which the vulnerability is used. Be-
tween 2011 and 2013 there were 147 such exploits for Linux [2]. Compared
with other virtualisation techniques that share the same kernel, hypervi-
31
-
sor exploits are more rare, with Xen having had just one privilege escala-
tion vulnerability from paravirtualised guests [140]. Since the invention of
hypervisors this requirement has increased: Attack vectors are now more
readily exploited and there are more commercial requirements for isolation
of services.
Backup and restore. There are advantages to providing backup and restore from
outside of a domain [152], since it is fast [37] and does not require operat-
ing system co-operation to access locked files, and cannot be disabled by
malicious software. Backup and restore was not a concern for hypervisor
design in the 1960s.
Accountability. Accountable virtual machines allow users to audit the software
executing on remote hosts by having the software execute on top of a
hypervisor that performs tamper-evident logging [64]. Using virtualisation
for accountability is a new use-case for hypervisors that they were not
originally designed for.
Emulating legacy software. Windows 7 and later versions contain a hypervisor
to execute Windows XP. When the Windows instance is a virtual machine
the emulator then executes using nested virtualisation [66]. Whilst nested
virtual machines were considered in early work [147], this was mainly a
point of academic enlightenment.
Emulating advances in time. As hypervisors emulate wall-clock time to their guests,
they can be used to discover how software will behave at a future point in
time [35] or when executing under future, faster hardware [109]. Emulat-
ing changes in time was not an original design goal of hypervisors.
I have described a number of ways in which hypervisors are used as part of
mainstream cloud-computing environments. In particular, I have shown how
the use cases for the hypervisor in 2015 differ from those in the 1960s and
1970s when the classical definition of the hypervisor was developed. Due to
this change in use case, it is reasonable to argue that strict adherence to an out-
dated definition of the hypervisor should be challenged. One of the recurring
themes is the change related to moving from serving a batch-processing work-
load, to a request-response system in which users need high scalability, and low
32
-
latency in their serving of requests. Concurrently, virtual machines now execute
in a less predictable environment, with untrusted parties, malicious actors and
automated scheduling all affecting the performance of virtual
machines in ways that early virtual machines did not experience. As such, the
importance of measuring performance has increased, such that fidelity now has
lower utility than measuring the performance of virtual machines.
2.3 Virtualisation performance problems
Despite its popularity, a particular problem with virtualisation is that the per-
formance of virtual machines is slower and more variable than the performance
of physical machines, yet it is difficult to measure the performance of a virtual
machine.
As well as contention for shared resources [117] there are other sources of
slow performance, which I now explore.
2.3.1 Privileged instructions
Under virtualisation certain instructions become more expensive, such as vmexit,
whose cost increases by a factor of between five and twenty-five when virtualised [122].
Also, as AMD64 only has two rings, paravirtualised guests have a
user space and kernel space that both execute in ring one and the hypervisor has
to mediate every system call. This makes system calls more expensive in virtual
machines than on physical machines, although by how much varies depending
on hardware [31].
2.3.2 I/O
I/O on virtual machines involves a longer data path than on physical machines
since the hypervisor has to map blocks from the virtual disks exposed to its
guests to physical blocks on storage that is often remote. I/O operations are a
regular source of slow performance [57, 26, 100] and are around 20% slower,
depending on configuration. Furthermore, the hypervisor's batching of I/O re-
quests can lead to extreme arrival patterns [22].
33
-
2.3.3 Networking
Networking in virtual machines can be unpredictable [98]: When executing on
a CPU-contended host compared with a CPU-uncontended host, throughput
can decrease by up to 87% and round trip time can increase from 10 ms to
67 ms [129]. On Xen, two causes of this are the back end of the split-driver
being starved of CPU resource as the driver domain is not scheduled, and the front
end of the split-driver being starved as the scheduler in the virtual machine does
not schedule the driver during its scheduling quanta.
The effect of poor networking performance is that there are significant reduc-
tions in quality of service as observed by end-users in throughput and delay [26].
2.3.4 Increased contention
When executing as a virtual machine there is higher contention, caused by two
sources: Other virtual machines being scheduled and the hypervisor/domain zero
executing. The hypervisor increases contention when executing as a virtual ma-
chine due to switches to the hypervisor, through executing a vm-exit instruction
that needs to save the state of the virtual machine and restore the state of the
next domain [3]. Other virtual machines also cause performance interference,
especially for micro virtual machines, which execute on physical hosts with low
priority to use the spare CPU cycles left by other virtual machines. Such mi-
cro virtual machines are serviced poorly and, to get maximum performance
for the instance type, virtual machines need to inject delays to be scheduled
favourably [146].
2.3.5 Locking
Locking has long been known to be problematic on virtual machines. When designing
operating systems, programmers often protect data structures with mutexes and
assume that each mutex is held for a short period of time, since holding a mutex
on a shared data structure for a long time is expensive [114]. However, when
executing in a virtual machine there is the possibility of a vCPU being preempted
whilst it holds a mutex, preventing other threads from making progress [40].
Another problem is lock scalability: unless locks are modified to perform better
under a hypervisor, they scale poorly with the number of vCPUs [76].
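To make lock-holder preemption concrete, below is a minimal test-and-set spinlock
of my own (a sketch, not code from this dissertation or from Linux). If the
hypervisor deschedules the vCPU that holds the lock between spin_lock and
spin_unlock, every other vCPU that reaches spin_lock burns its whole scheduling
quantum spinning without making progress; paravirtual spinlocks that yield to the
hypervisor after a bounded number of spins are the usual mitigation.

    /* Naive test-and-set spinlock: a sketch to illustrate lock-holder
     * preemption, not a production lock. */
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct {
        atomic_flag held;
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* If the holder's vCPU is preempted by the hypervisor while the flag
         * is set, this loop wastes the waiter's entire quantum. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;                                   /* busy-wait */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

    int main(void)
    {
        spinlock_t lock = { ATOMIC_FLAG_INIT };
        spin_lock(&lock);
        /* ... critical section: short on a physical CPU, but potentially
         * spanning a vCPU preemption in a virtual machine ... */
        spin_unlock(&lock);
        puts("done");
        return 0;
    }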
2.3.6 Unpredictable timing
When executing inside a virtual machine, time becomes unpredictable, as
virtualised time sources are unreliable and behave poorly under live
migration [19]. Operations that one expects to take constant time can also take an
unpredictable amount of time. For instance, kernel same-page merging reduces the
memory overhead of virtualisation by sharing identical pages between virtual
machines [101]. However, when a virtual machine modifies a shared page, the
hypervisor traps and creates a copy of the page specifically for that virtual
machine to modify. This makes page-access times unpredictable from within the
virtual machine [135].
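A guest can observe this effect by timing the first write to a page whose contents
are likely to be duplicated elsewhere, in the spirit of [135]. The sketch below is
my own illustration and assumes Linux; whether the page actually gets merged, and
therefore whether the write incurs a copy-on-write fault, depends entirely on the
host or hypervisor configuration.

    /* Illustrative sketch: time the first write to a page that may have been
     * merged with an identical page elsewhere (for example by kernel
     * same-page merging).  Only demonstrates the measurement idea. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void)
    {
        const size_t page = 4096;
        char *buf = aligned_alloc(page, page);

        memset(buf, 0x5a, page);   /* contents likely to match another page  */
        sleep(30);                 /* give the sharing mechanism time to run */

        uint64_t t0 = now_ns();
        buf[0] = 0;                /* a merged page takes a copy-on-write fault here */
        uint64_t t1 = now_ns();

        printf("first write took %lu ns\n", (unsigned long)(t1 - t0));
        free(buf);
        return 0;
    }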
2.3.7 Summary
Despite many advances, virtual machines remain slower and less predictable
than physical machines. As it is unlikely that these issues will be completely
removed, it is important that users of virtual machines are able to measure the
performance of their virtual machine.
2.4 The changing state of hypervisor fidelity
Given the performance overhead of executing in a virtualised environment and the
difficulty of measuring this performance from within a virtual machine, I propose
that virtual machines should forgo hypervisor fidelity for performance measurement
techniques. Rather than treating the virtual machine as a physical machine for
everything except the lowest layers of the kernel, performance measurement tools
should be designed to execute well in a virtual environment and should co-operate
with the hypervisor to maximise visibility of performance. Whilst this does
involve changing the accepted use of the interface between virtual machines and
hypervisors, I now show that changes to this interface have previously been used
to ameliorate performance problems in the virtualisation domain.
2.4.1 Historical changes to hypervisor fidelity
Even in the earliest work on hypervisors, there was acceptance that pure
virtualisation may not be practical. A concern with the early versions of CP was
that it performed slowly, largely because trap-and-emulate was used to prevent
virtual machines from executing privileged instructions and to cause them to
execute an emulated version instead. To address this, the evolution into OS/370
introduced the idea of a hypercall [150], in which the virtualised operating
system sets up some state to communicate with CP and then uses the DIAGNOSE
instruction to transition context into CP [36]. By introducing the concept of a
hypercall, IBM acknowledged that building operating systems that adhere strictly
to the definition of fidelity is not necessary. Rather, in cases where full
emulation of physical hardware has a high cost, it is better to forgo fidelity by
making the virtual machine aware that it is executing on a hypervisor and issue a
hypercall rather than perform the expensive operation.
I argue that we have the same issue today, whereby current techniques for
measuring the performance of a virtual machine execute the same code on virtual
machines as they do on physical machines. Therefore, performance measurement tools
have lower utility on virtual machines than on physical machines, as they use code
that virtualises poorly and cannot report the cost incurred due to virtualisation.
As such, we should reconsider whether applying the technique employed by IBM in
1973 to solve the problem of the day, namely poor performance, can solve the
contemporary issue of it being difficult to measure the performance of virtual
machines. In particular, we should consider using paravirtualised performance
measurement techniques.
The invention of the hypercall created a debate that continued throughout the
1970s [56] regarding pure versus impure virtual machines, in which a pure virtual
machine is a guest that runs unmodified code, whereas an impure virtual machine
runs modified code. In particular, there was consideration of the position of the
hypervisor interface, since the hypervisor can either simulate high-level actions,
such as reading a line, or simulate the individual instructions involved in
performing the high-level action [16].
2.4.2 Recent changes to hypervisor fidelity
With the popularisation of (early versions of) x86, virtualisation became harder,
as the instruction set does not provide trap-and-emulate behaviour for privileged
instructions such as SIDT, SGDT and SLDT [121]. Therefore, to virtualise
traditional x86, one has to use binary translation, the process by which the
instruction stream is scanned and privileged instructions are rewritten as calls
to emulating functions. Performing full binary translation is a slow process [78],
so early x86-64 hypervisors were either slow or insecure [121]. Those that are
slow fail the hypervisor definition as they do not provide the performance
property. Furthermore, as Popek and Goldberg's hypervisor definition is tightly
coupled with trap-and-emulate techniques in its formalisation of fidelity, such
that virtual machines cannot execute a modified instruction stream, systems based
on binary rewriting are not considered classically virtualisable [1].
To resolve the issues of virtualisation on traditional x86, Barham et al. built
Xen, a hypervisor that uses paravirtualisation to emulate x86 with performance,
strong isolation and unreduced functionality [12]. In using paravirtualisation,
Xen requires that operating systems be modified to issue hypercalls, rather than
executing with true fidelity when issuing privileged instructions. One
contribution of Xen was to paravirtualise the memory management unit: guests' page
tables are mapped read-only and a guest has to issue a hypercall to update them.
This design allows virtual machines to map virtual addresses directly to the
addresses of the memory on the physical server (machine physical frames), rather
than relying on shadow page tables that give the illusion of executing in an
independent address space. In overcoming the shortcomings of x86 by forgoing
hypervisor fidelity, Xen is much like my proposal of forgoing fidelity to overcome
the shortcomings of performance measurement in virtual machines.
More recent advances in the x86-64 instruction set undeniably restore a degree of
fidelity to the hypervisor by allowing unmodified virtual machines to execute in a
hardware virtual machine (HVM) container [141]. HVM containers extend the x86-64
architecture with a privileged mode [123], sometimes described as negative rings,
into which the processor transitions when a guest executes a privileged
instruction, and in which the hypervisor executes. Whilst this increase in
fidelity does create some advantages, for instance operating systems can migrate
between executing as a physical and a virtual instance [83], I nevertheless argue
that this increase in fidelity only came when hardware had advanced sufficiently
(for instance with Intel VT-x) that fast and secure x86 virtualisation was no
longer problematic. Should future hardware allow virtual machines to measure their
performance to the same degree as physical machines, then restoring fidelity to
measuring the performance of virtual machines may be reasonable. There is already
limited evidence of hardware advances increasing the ability of a virtual machine
to measure its performance [104].
2.4.3 Current state of hypervisor fidelity
Despite the increase in hardware virtualisation, I argue that it remains
commonplace for the software stacks that execute on the hypervisor not to exhibit
strict fidelity. This is principally due to the process of re-hosting an
application on infrastructure as a service, during which developers are encouraged
to make use of properties of the cloud, such as the scalability of virtual
machines [103]. As a result, within virtual machines there are differences in the
software stack when compared with physical machines. As such, forgoing hypervisor
fidelity in performance measurement techniques is not a radical move.
2.4.3.1 Installing guest additions
All high-performance hypervisors that use hardware virtualisation techniques
still provide extensions to improve the performance of their guests: XenServer
Guest Tools, VirtualBox Guest Additions, and VMware Tools are some exam-
ples. These typically provide drivers that allow the guest operating system to
communicate directly with the hypervisor so that full emulation of devices is not
required. However, installing such extensions reduces the fidelity of the virtual
machine, since by using different drivers, the virtual machine executes differently
on physical and virtual hardware.
2.4.3.2 Moving services into dedicated domains
There is a growing trend to use virtual machine introspection to provide services
that would traditionally have been provided by processes or by the operating
system [24]. For example, Bitdefender performs malware detection from a separate,
privileged domain, which prevents malware from attacking the malware-detection
program, as it is commonplace for viruses to attack antivirus mechanisms [88].
Furthermore, most commercial hypervisors now support virtual machine snapshotting,
a feature traditionally provided by the filesystem. There are also proposals to
move monitoring into a separate domain [82]. Given the trend of separating
services out such that they execute outside of the original domain, I argue that
hardware virtualisation does not achieve full fidelity: if those operating systems
were to execute on physical hardware, they would need reconfiguring so that they
run processes to provide all of these features.
2.4.3.3 Lack of transparency of HVM containers
Even when executing inside a hardware virtual machine container, which is supposed
to provide fidelity, the interface with the hypervisor still differs from that
provided by exclusive use of the hardware. One demonstration of this difference is
malware that detects the presence of a hypervisor through irregularities in the
availability of resources, such as CPU cycles, caches and the TLB, and refuses to
execute its payload [149]. Furthermore, the timing properties of a virtual machine
differ from those of a physical machine, due to virtualisation overhead, changes
in the time required to access hardware that is emulated by the hypervisor, hidden
page faults caused by accesses to hypervisor-protected pages, and differences in
timing between virtualised instructions (such as cpuid) and non-virtualised
instructions (such as NOP) [54]. Given that the interface to the hypervisor is
leaky, I argue that we should acknowledge this difference throughout the software
stack, rather than maintaining fidelity.
2.4.3.4 Hypervisor/operating system semantic gap
The performance of a virtual machine can be improved if the hypervisor is better
able to predict the virtual machine's actions. There are two main techniques for
improving prediction rates: monitoring the virtual machine with knowledge of its
data structures, so as to improve decisions and policies, which can increase the
cache hit ratio of a virtual machine by up to 28% [73]; or moving functionality
from the guest into the hypervisor [87]. The latter reduces fidelity and the
former requires co-operation, so in either case we observe deviation from the
standard definition of a hypervisor.
2.4.4 Summary
I have now shown that, since the advent of the hypervisor, forgoing hypervisor
fidelity has been a common solution to problems in the realm of virtualisation.
Even today, with hardware virtual machines, virtual machines do not strictly
provide fidelity. This demonstration that forgoing hypervisor fidelity has
successfully been used to solve past problems with virtualisation supports my
thesis that the use of the interface should change so as to improve the utility of
performance measurement tools.
2.5 Rethinking operating system design for hypervisors
There is considerable research literature that reconsiders, from the ground up,
the role of the operating system when executing in the cloud, often forgoing
fidelity to increase utility.
Library operating systems, such as OSv, recognise that in a typical cloud software
stack there is a hypervisor, an operating system and a language runtime [80]. Each
of these performs abstraction and protection, at the cost of an increased
footprint and performance overhead, such as a 22% impact on the throughput of
lighttpd. Library operating systems replace everything that executes above the
hypervisor with a single binary, so that the hypervisor alone performs abstraction
and protection [80]. Similarly, Mirage is designed to execute only on a
hypervisor, making use of the small hypervisor interface [91], thereby improving
on Linux in terms of boot time, I/O throughput and memory footprint.
SR-IOV increases fidelity by letting operating systems interact directly with the
network interface card, with the hardware ensuring isolation [41]. However, such
hardware support can be used in unconventional ways: Dune is a hypervisor-like
project that uses hardware virtualisation features to give user space direct
access to safe hardware features, such as ring protection, page tables and the
TLB [14]. Belay et al. achieve this by using hardware extensions built for
virtualisation, but have their lowest layer of software still expose the
abstraction of a process, rather than of hardware. Furthermore, Arrakis [113] and
IX [15] use SR-IOV to separate the control and data planes so as to increase the
networking throughput of commodity hardware.
The work that I present in this dissertation focuses on applying performance
measurement techniques to mainstream operating systems in the cloud. As research
operating systems are not yet mainstream, I do not explicitly show the benefits
that they would receive. However, the key techniques in all three of my
contributions could be applied to such operating systems without causing divergent
behaviour between virtual and physical machines.
2.6 Virtual machine performance measurement
Having argued that the requirement for hypervisors to exhibit fidelity is overly-
restrictive and that forgoing hypervisor fidelity has been previously used to solve
problems in the virtualisation domain, I now explore work related to virtual
machine performance.
2.6.1 Kernel probing
Probing has a rich history that goes back to the dawn of computing. The first use
of probing is believed to have been by Maurice Wilkes, who inserted subroutines
into code executing on the EDSAC. These subroutines would print distinctive
symbols at intervals throughout a program so that the operator could locate an
error [55]. Later computers, starting with the UNIVAC M-460, included programs
such as DEBUG that let operators specify addresses at which to insert additional
code that could be used for debugging [47].
Contemporary operating systems have probing systems that allow users to debug
their software and measure its performance. Linux uses Kprobes [107] and Microsoft
Windows uses Detours [69]. NetBSD [106], FreeBSD [96], and OS X all use DTrace,
which embeds a probing system within a wider instrumentation system. There has
been further work to optimise these systems [68], as the benefits of fast probing
have long been known [79]. However, with the exception of Windows Detours, these
all use interrupt-based probing techniques.
Previous work has shown another technique for probing, based on jumps, which are
often faster than interrupts [137, 138]. Windows Detours was the first of these
jump-based probing systems to preserve the semantics of the target function as a
callable subroutine [69]. However, whilst there is some benefit from using
jump-based techniques on physical machines, I show that their utility when applied
to virtual machines is much higher, because interrupt-based techniques virtualise
poorly.
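The distinction can be made concrete by looking at what each technique writes over
the first bytes of a probed function. The sketch below is mine, purely
illustrative, and user-space rather than kernel code: an interrupt-based probe
plants a one-byte int3 (0xCC), whose trap must be delivered through the kernel
(and, in a guest, through the hypervisor), whereas a jump-based probe overwrites
five bytes with an E9 rel32 near jump to a handler, which executes as ordinary
unprivileged control flow. The addresses used are hypothetical; the x86 encodings
are standard.

    /* Illustrative only: show the byte patterns an interrupt-based probe and
     * a jump-based probe would write at a probe site.  Real probing
     * frameworks must also handle instruction boundaries, trampolines back to
     * the displaced instructions, and concurrent execution of patched code. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void emit_int3_patch(uint8_t patch[1])
    {
        patch[0] = 0xCC;              /* int3: traps to the kernel/hypervisor */
    }

    static void emit_jmp_patch(uint8_t patch[5], uintptr_t site, uintptr_t handler)
    {
        int32_t rel = (int32_t)(handler - (site + 5)); /* rel32 is relative to
                                                          the end of the jump */
        patch[0] = 0xE9;              /* jmp rel32: stays in unprivileged code */
        memcpy(&patch[1], &rel, sizeof rel);
    }

    int main(void)
    {
        uint8_t int3[1], jmp[5];
        uintptr_t site = 0x400000, handler = 0x400100;  /* hypothetical addresses */

        emit_int3_patch(int3);
        emit_jmp_patch(jmp, site, handler);

        printf("int3 probe patch: %02x\n", int3[0]);
        printf("jmp probe patch:  %02x %02x %02x %02x %02x\n",
               jmp[0], jmp[1], jmp[2], jmp[3], jmp[4]);
        return 0;
    }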
There has been some consideration of changing the nature of operating system
probing in the virtualised environment by disaggregating probe handlers into a
separate domain [118]. However, this has not received widespread uptake.
2.6.2 Kernel specialisation
Kernel specialisation is not a new concept: early work on the Synthesis kernel
pioneered kernel specialisation by generating efficient kernel code that acts as
fast paths for applications [116]. The advantages of kernel specialisation are
well known [23, 17]: profile-guided optimisation of Linux improves kernel
performance by up to 10% [151], and exokernels [45] remove kernel abstractions so
that applications interact with hardware through fewer layers of indirection,
thereby reducing kernel overheads. For instance, Xok is an operating system with
an exokernel on which a specialised web server has over four times the throughput
of a non-specialised web server [74]. Indeed, the benefits of specialisation are a
key feature of Barrelfish, an operating system redesigned so that different cores
can run different kernels [125], and of Dune, which gives applications access to
privileged CPU features [14]. Another possible operating system redesign that
allows kernel specialisation is the microkernel, since only a small set of
features is then provided by the kernel mapped into every process; instead,
user-space services can provide competing specialised implementations of
features [86].
In Chapter 4 I introduce Shadow Kernels, a technique that allows per-process
kernel specialisation: applications acknowledge the presence of the hypervisor and
execute code that causes the hypervisor to switch the underlying memory of the
domain's kernel. The key benefit of Shadow Kernels is to allow multiple kernel
instruction streams to execute on a single machine. Techniques for executing
multiple kernels already exist, but they all differ from Shadow Kernels. Executing
processes inside virtual machines allows multiple kernels to execute on a single
machine [36], but each kernel still typically supports multiple processes, whereas
Shadow Kernels can target individual processes.
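The underlying mechanism, switching which physical memory backs a fixed virtual
address, can be illustrated entirely in user space. The sketch below is my own
analogy, assuming Linux with memfd_create; it does not use the Xen interface that
Shadow Kernels relies on. It maps one of two alternative "instruction stream"
pages at the same virtual address, much as Shadow Kernels asks the hypervisor to
change which machine frame backs a kernel text page.

    /* User-space analogy for Shadow Kernels: two alternative backings for the
     * same virtual address, switched by remapping.  Linux-specific
     * (memfd_create, MAP_FIXED); illustrative only, error handling omitted. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);

        /* Two "kernels": identical size, different contents. */
        int fd_a = memfd_create("kernel_a", 0);
        int fd_b = memfd_create("kernel_b", 0);
        ftruncate(fd_a, page);
        ftruncate(fd_b, page);
        write(fd_a, "original instruction stream", 28);
        write(fd_b, "specialised instruction stream (probes on)", 43);

        /* Reserve a fixed virtual address, then choose which backing appears
         * there. */
        char *va = mmap(NULL, page, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        mmap(va, page, PROT_READ, MAP_SHARED | MAP_FIXED, fd_a, 0);
        printf("at %p: %s\n", (void *)va, va);

        /* "Shadow kernel" switch: same virtual address, different backing. */
        mmap(va, page, PROT_READ, MAP_SHARED | MAP_FIXED, fd_b, 0);
        printf("at %p: %s\n", (void *)va, va);

        return 0;
    }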
The technique used in Shadow Kernels of modifying kernel instruction streams
is well-established. For instance, KSplice modifies the kernel instruction stream
to binary patch security updates into a kernel without rebooting the machine [7],
but this is a global change that affects all processes, whereas Shadow Kernels
can restrict that patch to an individual process. Furthermore, malware can use
memory management tricks to hide itself from detection by unmapping memory
containing the rootkit [133]. Shadow Kernels differs in that rather than hid-
ing malware it allows multiple kernel instruction streams to coexist. Similarly,
Mondrix uses changes to the MMU to provide isolation between Linux kernel
modules [148], albeit with a performance overhead of up to 15%.
2.6.3 Performance interference
A key concern with executing virtual machines in the cloud is performance
interference, whereby two or more virtual machines compete for resources.
Hypervisors are designed to have strong performance-isolation guarantees, with
coarse-grained scheduling and no sharing of data structures between virtualisation
domains [12]. In particular, many services in the cloud, as well as in other
circumstances [48], are latency-sensitive in that they require low and predictable
latency [32]. However, achieving predictable latency without performance isolation
is hard. This lack of perfect performance isolation makes it difficult to
virtualise some workloads [67]. Whilst executing in the cloud allows some
performance anomalies to be detected before services are deployed [134], this
remains an unsolved problem in the general case.
2.6.3.1 Measurement
Researchers have long studied methods of reducing performance interference in
operating systems, in particular with the rise of latency-sensitive applications
such as video streaming [66]. With the rise of hypervisors, there has been further
work on reducing performance interference whilst increasing the utilisation of
hardware, for instance by using a custom scheduler that limits the resources
consumed by virtual machines in their own domain and in driver domains, such as
domain zero [63].
However, in current cloud deployments, virtual machine workloads can interfere
badly with each other: the IOPS available to a virtual machine can fluctuate
wildly depending on which other virtual machines are executing [60], and poor
scheduling causes performance interference, for instance colocating a random and a
sequential load reduces performance for the sequential load [58]. Some work
improves on the performance guarantees in the cloud, for example with virtual
datacentres that have guaranteed throughput. One implementation of a virtual
datacentre is Pulsar, which modifies the hypervisors in the cloud to apply a leaky
bucket per virtual machine on shared resources so as to guarantee performance [4].
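The leaky bucket that such systems apply per virtual machine is, in its meter
formulation, the standard token-bucket algorithm. The sketch below is a generic
version of my own, not Pulsar's code; the cost units and rates are placeholders.
Each request drains tokens corresponding to its cost from the owning virtual
machine's bucket, which refills at the guaranteed rate, so a tenant exceeding its
allocation is throttled instead of interfering with others.

    /* Generic token/leaky-bucket policer: a sketch of per-VM rate limiting,
     * not Pulsar's implementation.  Rates and costs are placeholders. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct bucket {
        double tokens;       /* current level, in cost units                */
        double capacity;     /* maximum burst size                          */
        double fill_rate;    /* guaranteed rate, in cost units per second   */
        uint64_t last_ns;    /* time of the last refill                     */
    };

    /* Refill for the elapsed time, then admit the request iff enough tokens
     * remain; otherwise the caller should delay or reject it. */
    static bool bucket_admit(struct bucket *b, double cost, uint64_t now_ns)
    {
        b->tokens += ((now_ns - b->last_ns) / 1e9) * b->fill_rate;
        if (b->tokens > b->capacity)
            b->tokens = b->capacity;
        b->last_ns = now_ns;

        if (b->tokens < cost)
            return false;    /* over allocation: throttle this virtual machine */
        b->tokens -= cost;
        return true;
    }

    int main(void)
    {
        /* One bucket per virtual machine: 100 units/s, bursts of up to 50. */
        struct bucket vm = { .tokens = 50, .capacity = 50,
                             .fill_rate = 100, .last_ns = 0 };
        for (uint64_t t = 0; t < 5; t++)        /* simulated clock, 1 ms apart */
            printf("request at %llu ms: %s\n", (unsigned long long)t,
                   bucket_admit(&vm, 20, t * 1000000ull) ? "admitted"
                                                         : "throttled");
        return 0;
    }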
Whilst guaranteeing performance isolation is preferable, whenever a machine is
saturated by its virtual machines there is necessarily performance interference;
in such cases, monitoring and reporting the performance is at least possible.
There are many ways of measuring the performance of an operating system. Modern
operating systems, such as Linux, have a wealth of tools to help measure operating
system performance: Linux has ftrace, perf, SystemTap [43], KLogger [46] and
numerous domain-specific tools. Another method, originally implemented on a
modified Digital UNIX 4.0D kernel, reports the resource consumption of resource
containers, rather than of processes and threads [11].
However, none of these methods distinguishes poor application performance from the
overheads of virtualisation. That is, these tools are unable to report whether the
virtual machine is starved of resources. Not only do these tools not inform users
of virtualisation overhead, they are often unable to access the same set of
hardware features as a physical machine in order to report performance accurately
to domains (vPMU is an upcoming feature for Xen and Linux, as of 17 September
2015). Xenoprof is currently the only attempt to provide Xen virtual machines with
a way of measuring performance [99]. However, Xenoprof is incompatible with recent
versions of Xen. The technique that I present in Chapter 5 differs in that it
requires developers to annotate their programs to indicate the processing of
requests, much as X-trace requires [51], but then reports the overheads of
virtualisation, rather than the performance of the virtual machine, and gives
these details on a per-request basis. Calculating this overhead requires
applications to have information about how the virtual machine in which they
execute is scheduled. Having a hypervisor expose its inner state is similar to how
Infokernels expose kernel internals across the interface with applications [8].
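Purely to fix ideas before Chapter 5, the sketch below shows what such request
annotation might look like from an application's point of view. The
soroban_request_begin/soroban_request_end names, the returned overhead value and
the stub bodies are hypothetical placeholders of my own, not the actual interface.

    /* Hypothetical sketch of per-request annotation in a request-response
     * server.  All soroban_* names are placeholders invented for illustration;
     * the real interface is described in Chapter 5.  The stubs below merely
     * let the sketch compile and run. */
    #include <stdio.h>

    struct soroban_request { int unused; };

    static struct soroban_request *soroban_request_begin(void)
    {
        static struct soroban_request req;
        /* A real implementation would record when the request started and
         * obtain the hypervisor's scheduling information. */
        return &req;
    }

    static double soroban_request_end(struct soroban_request *req)
    {
        (void)req;
        /* A real implementation would return the latency attributed to the
         * virtual machine being scheduled out while servicing this request. */
        return 0.0;
    }

    static void handle_request(void)
    {
        struct soroban_request *req = soroban_request_begin();
        /* ... application-level processing of the request ... */
        double overhead_s = soroban_request_end(req);
        printf("virtualisation added %.3f ms to this request\n",
               overhead_s * 1e3);
    }

    int main(void)
    {
        handle_request();
        return 0;
    }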
2.6.3.2 Modelling
There has been work by the modelling community that investigates performance
interference between virtual machines. This work largely models which workloads
interact badly with each other, in order to build better virtual machine placement
algorithms. It differs from the technique that I present in Chapter 5, which
measures the performance of clouds as they execute. An example of modelling
performance interference is hALT, which uses machine learning, trained on a
dataset from Google [120], to model which workloads cause performance
interference [28].
Q-Clouds models CPU-bound virtual machines using a multiple-input multiple-output
model: online feedback from an application is taken as an input to the model, and
its output is used to place virtual machines more effectively [102]. TRACON is
similar to Q-Clouds, but focusses on I/O-intensive workloads [28]. Casale et al.
produce models of virtual machine disk performance based on monitoring the
hypervisor's batching of I/O requests and the arrival queue [22]. CloudScope
improves on modelling the performance of virtual machine interference by doing
away with the need for machine learning or queueing-based models: it models
virtual machine performance using Markov chains to achieve a low-error model that
is not tightly coupled to an application [25].
All of this work differs from Soroban in that it models the performance of an
entire virtual machine. The virtual machine being modelled is typically assumed to
be in a steady state for a prolonged period of time (perhaps several minutes), and
the model finds the placement of virtual machines that minimises performance
interference. Soroban, by contrast, is a measurement technique that reports the
additional latency incurred in servicing a single request in a request-response
system. That is, Soroban measures whether the virtual machine was scheduled out
during the servicing of a request and reports the corresponding cost.
2.6.3.3 Summary
I have shown that there is a field of work that considers how to instrument
and measure the performance of