Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

VEE’16, 04/02/2016

Jiannan Ouyang, John Lange (University of Pittsburgh)

Haoqiang Zheng (VMware Inc.)



CPU Consolidation in the Cloud

CPU consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU).

Motivation: improve datacenter utilization.

Figure 1. Average activity distribution of typical shared Google clusters, including Online Services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].

Problems with Preempted vCPUs

[Figure: four pCPUs (P) shared by vCPUs of VM-A (A) and VM-B (B); each vCPU is shown as either running or preempted.]

Performance problems: busy-waiting based kernel synchronization operations
• Lock Holder Preemption problem
• Lock Waiter Preemption problem
• TLB Shootdown Preemption problem

Lock Holder Preemption

Lock holder preemption [Uhlig 04, Friebel 08]
• A preempted vCPU is holding a spinlock
• Causes dramatically longer lock waiting time
  - context switch latency + CPU shares allocated to other vCPUs

Scheduling techniques
• Co-scheduling, relaxed co-scheduling [VMware 10]
• Adaptive co-scheduling [Weng HPDC’11]
• Balanced scheduling [Sukwong EuroSys’11]
• Demand-based coordinated scheduling [Kim ASPLOS’13]

Hardware-assisted techniques
• Intel Pause-Loop Exiting (PLE) [Riel 11]

Lock Waiter Preemption [Ouyang 13]

Linux uses a FIFO-ordered fair spinlock, named the ticket spinlock.

Lock waiter preemption
• A lock waiter is preempted, and blocks the queue
• P(waiter preemption) > P(holder preemption)

Preemptable Ticket Spinlock
• Key idea: proportional timeout

[Figure: ticket queue with tickets i, i+1, i+2, i+3 and timeouts 0, T, 2T, 3T.]
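The proportional-timeout idea can be sketched in user-space C. This is a simplified model under stated assumptions, not the Linux ticket spinlock and not the algorithm from [Ouyang 13]: the names `ticket_lock`, `ticket_take`, `ticket_position`, and `proportional_timeout` are hypothetical, and what a waiter does once its timeout expires is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space model of a ticket lock (illustration only). */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket that currently owns the lock */
} ticket_lock;

void ticket_lock_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

/* Join the FIFO queue and return our ticket number. */
unsigned ticket_take(ticket_lock *l) {
    return atomic_fetch_add(&l->next_ticket, 1u);
}

/* Number of tickets ahead of us; 0 means it is our turn.
 * Unsigned subtraction stays correct across counter wraparound. */
unsigned ticket_position(ticket_lock *l, unsigned ticket) {
    return ticket - atomic_load(&l->now_serving);
}

/* Proportional timeout: a waiter n places from the head waits up to n * T. */
uint64_t proportional_timeout(unsigned position, uint64_t t_unit) {
    return (uint64_t)position * t_unit;
}

/* Release: pass the lock to the next ticket in FIFO order. */
void ticket_unlock(ticket_lock *l) {
    /* At least one ticket must be outstanding (the current holder). */
    assert(atomic_load(&l->next_ticket) != atomic_load(&l->now_serving));
    atomic_fetch_add(&l->now_serving, 1u);
}
```

With three tickets taken and a unit of T, the waiters sit at positions 0, 1, and 2 and get timeouts 0, T, and 2T, which matches the 0, T, 2T, 3T pattern on the slide.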

TLB Shootdown Preemption

KVM Paravirt Remote Flush TLB [kvmtlb 12]
• VMM maintains vCPU preemption states and shares them with the guest.
• Use the conventional approach if the remote vCPU is running.
• Defer the TLB flush if the remote vCPU is preempted.
• Cons: preemption state may change after checking.

TLB shootdown IPIs as scheduling heuristics [Kim ASPLOS’13]

Shoot4U
• Goal: eliminate the problem
• Key idea: invalidate guest TLB entries from the VMM
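The check-then-defer logic of the kvmtlb approach, and the race named under "Cons", can be sketched as follows. This is a minimal user-space model, not the code from [kvmtlb 12]: the shared-state layout, the flag names, and `classify_flush_targets` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_VCPUS 64

/* Hypothetical per-vCPU flags shared between VMM and guest. */
enum {
    VCPU_RUNNING       = 1u << 0,  /* set by the VMM while the vCPU is scheduled */
    VCPU_FLUSH_PENDING = 1u << 1   /* set by the guest to request a deferred flush */
};

typedef struct {
    atomic_uint state[MAX_VCPUS];
} shared_vcpu_state;

void shared_state_init(shared_vcpu_state *s) {
    for (unsigned v = 0; v < MAX_VCPUS; v++)
        atomic_init(&s->state[v], 0u);
}

/* Split the target set: running vCPUs get a conventional IPI (returned
 * as a bitmap); preempted vCPUs get a deferred flush request instead. */
uint64_t classify_flush_targets(shared_vcpu_state *s, uint64_t targets,
                                unsigned nvcpus) {
    uint64_t ipi_mask = 0;
    assert(nvcpus <= MAX_VCPUS);
    for (unsigned v = 0; v < nvcpus; v++) {
        if (!(targets & (UINT64_C(1) << v)))
            continue;
        if (atomic_load(&s->state[v]) & VCPU_RUNNING) {
            ipi_mask |= UINT64_C(1) << v;
        } else {
            /* Race window: the vCPU may be scheduled between the load
             * above and this update, so the state just checked can
             * already be stale. */
            atomic_fetch_or(&s->state[v], VCPU_FLUSH_PENDING);
        }
    }
    return ipi_mask;
}
```

The window between the load and the update is the weakness the slide points out. Shoot4U avoids it by not depending on the guest's view of preemption state at all.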

Contributions

• An analysis of the impact that various low-level synchronization operations have on system benchmark performance.

• Shoot4U: a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations.

• An evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.

Performance Analysis

Overhead of CPU Consolidation

[Figure: slowdown (y-axis, 0 to 20, with one value annotated as 70.6) for the PARSEC benchmarks blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, and x264, with a line marking the max ideal slowdown.]

PARSEC runtime with a co-located VM relative to running alone (12-core VMs, measured on Linux/KVM, with PLE disabled)

CPU Usage Profiling (perf)

[Figure: percentage of CPU time (y-axis, 0 to 100%) for the PARSEC benchmarks blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, and x264, in two panels (1VM and 2VM), broken down into k:lock, k:tlb, k:other, and u:*.]

CDF of TLB Shootdown Latency (ktap)

[Figure: cumulative percent (y-axis, 0 to 1) versus latency in us (x-axis, log scale, 10^0 to 10^6) for 1VM and 2VM.]

How TLB Shootdown Works in VMs

• TLB (Translation Lookaside Buffer)
  - a per-core hardware cache for page table translation results
• TLB coherence is managed by the OS
  - TLB shootdown operations: IPI + invlpg
• Linux TLB shootdown is busy-waiting based

[Figure: a VMM multiplexing vCPUs of VM-A (A) and VM-B (B) on four pCPUs (P), annotated with the following steps.]

1. vIPIs
2. trap
3. pIPIs
4. inject virtual interrupts
5. vCPU is scheduled (TLB Shootdown Preemption)
6. invalidation & ACK
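The busy-waiting pattern behind these steps can be modeled in a few lines of C. This is a sketch under stated assumptions, not the Linux implementation: the `shootdown` struct and its functions are hypothetical, and IPI delivery and the actual invlpg are left out. It only shows why the initiator stalls until every target has run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per target vCPU that still owes an ACK. */
typedef struct {
    _Atomic uint64_t pending;
} shootdown;

void shootdown_init(shootdown *sd) {
    atomic_init(&sd->pending, UINT64_C(0));
}

/* Initiator: record the targets (sending the vIPIs is not modeled). */
void shootdown_begin(shootdown *sd, uint64_t targets) {
    atomic_store(&sd->pending, targets);
}

/* Target vCPU: called after it has invalidated its TLB entries. */
void shootdown_ack(shootdown *sd, unsigned vcpu) {
    assert(vcpu < 64);
    atomic_fetch_and(&sd->pending, ~(UINT64_C(1) << vcpu));
}

bool shootdown_done(shootdown *sd) {
    return atomic_load(&sd->pending) == 0;
}

/* Initiator: spin until all targets have ACKed. If one target vCPU is
 * preempted, this loop keeps spinning until that vCPU is scheduled again
 * and runs its handler (step 5 above). */
void shootdown_wait(shootdown *sd) {
    while (!shootdown_done(sd))
        ;  /* busy-wait */
}
```

A single preempted target is enough to hold the initiator in `shootdown_wait` for as long as that target stays descheduled.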

Shoot4U

Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g. Intel invvpid)

Key idea: invalidate guest TLB entries from the VMM
• Tell the VMM what TLB entries and vCPUs to invalidate (hypercall)
• The VMM invalidates and returns; no interrupt injection and no waiting

[Figure: a VMM multiplexing vCPUs of VM-A (A) and VM-B (B) on four pCPUs (P), annotated with the following steps.]

1. hypercall <vcpu set, addr range>
2. pIPIs
3. invalidation and ACK
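One way to picture the `<vcpu set, addr range>` argument of step 1 is a small request record built on the guest side. The sketch below is an assumption-laden illustration, not code from the Shoot4U repository: `shoot4u_request`, `shoot4u_make_request`, the 4 KiB page size, and the exclusive, page-aligned end address are all hypothetical choices.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096ul  /* assumption: 4 KiB pages */

/* Hypothetical container for the hypercall arguments. */
typedef struct {
    unsigned long vcpu_bitmap;  /* bit v set: invalidate on vCPU v */
    unsigned long start_addr;   /* page-aligned start of the range */
    unsigned long end_addr;     /* page-aligned end (assumed exclusive) */
} shoot4u_request;

/* Build a request from a list of target vCPU ids and a byte range. */
shoot4u_request shoot4u_make_request(const unsigned *vcpus, size_t n,
                                     unsigned long start, unsigned long len) {
    const unsigned long page_mask = ~(PAGE_SIZE_BYTES - 1);
    shoot4u_request req = { 0, 0, 0 };
    for (size_t i = 0; i < n; i++) {
        /* One unsigned long can name only this many vCPUs. */
        assert(vcpus[i] < sizeof(unsigned long) * CHAR_BIT);
        req.vcpu_bitmap |= 1ul << vcpus[i];
    }
    req.start_addr = start & page_mask;                               /* round down */
    req.end_addr = (start + len + PAGE_SIZE_BYTES - 1) & page_mask;   /* round up */
    return req;
}
```

A single bitmap word is enough for the 12-vCPU guests used in the evaluation; larger guests would need a wider encoding.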

Implementation

KVM/Linux 3.16, ~200 LOC (~50 LOC guest side)
• https://github.com/ouyangjn/shoot4u

Guest
• use hypercall for TLB shootdowns

VMM
• hypercall handler: vCPU set => pCPU set, and send IPIs
• IPI handler: invalidate guest TLB entries with invvpid

Shoot4U API

kvm_hypercall3(unsigned long KVM_HC_SHOOT4U, unsigned long vcpu_bitmap, unsigned long start_addr, unsigned long end_addr);
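The "vCPU set => pCPU set" step of the hypercall handler can be sketched as a bitmap translation. This is a user-space illustration, not the KVM code: the placement table `pcpu_of` and the function `vcpu_set_to_pcpu_set` are hypothetical, and the privileged parts (sending pIPIs, executing invvpid in the IPI handler) are omitted.

```c
#include <assert.h>
#include <limits.h>

/* Translate a bitmap of target vCPUs into a bitmap of pCPUs to interrupt.
 * pcpu_of[v] is an assumed placement table: the pCPU whose TLB may hold
 * entries belonging to vCPU v. */
unsigned long vcpu_set_to_pcpu_set(unsigned long vcpu_bitmap,
                                   const unsigned *pcpu_of,
                                   unsigned nvcpus) {
    const unsigned bits = (unsigned)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long pcpu_bitmap = 0;
    assert(nvcpus <= bits);
    for (unsigned v = 0; v < nvcpus; v++) {
        if (vcpu_bitmap & (1ul << v)) {
            assert(pcpu_of[v] < bits);
            /* Several vCPUs on one pCPU collapse into a single bit,
             * so that pCPU receives only one IPI. */
            pcpu_bitmap |= 1ul << pcpu_of[v];
        }
    }
    return pcpu_bitmap;
}
```

Because the targets are physical CPUs, which are always able to take the IPI, the handler never has to wait for a preempted vCPU to be scheduled.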

Evaluation

Dual-socket Dell R450 server
• 6-core Intel “Ivy-Bridge” Xeon processors with hyperthreading
• 24 GB RAM split across two NUMA domains
• CentOS 7 (Linux 3.16)

Virtual machines
• 12 vCPUs, 4 GB RAM on the same socket
• Fedora 19 (Linux 4.0)
• VM1: PARSEC Benchmark Suite; VM2: sysbench CPU test

Schemes
• baseline: unmodified Linux kernel
• kvmtlb [kvmtlb 12]
• Shoot4U
• Pause-Loop Exiting (PLE) [Riel 11]
• Preemptable Ticket Spinlock (PMT) [Ouyang 13]

TLB Shootdown Latency (Cycles)

Order of magnitude lower latency

TLB Shootdown Latency (CDF)

[Figure: cumulative percent (y-axis, 0 to 1) versus latency in us (x-axis, log scale, 10^0 to 10^6) for shoot4u-1VM, shoot4u-2VM, kvmtlb-1VM, kvmtlb-2VM, baseline-1VM, and baseline-2VM.]

PARSEC Performance (2 VMs)

[Figure: normalized execution time (y-axis, 0 to 1.1) for the PARSEC benchmarks blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, and x264 under the schemes baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, and ple+pmt+shoot4u.]

Revisiting Performance Slowdown

[Figure: slowdown (y-axis, 0 to 20, with one value annotated as 70.6) for the PARSEC benchmarks blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, and x264, comparing baseline with ple+pmt+shoot4u.]

Revisiting CPU Usage Profiling

[Figure: percentage of CPU time (y-axis, 0 to 100%) for the PARSEC benchmarks blackscholes, bodytrack, canneal, dedup, ferret, freqmine, raytrace, streamcluster, swaptions, vips, and x264, in two panels (baseline 2VM and ple+pmt+shoot4u 2VM), broken down into k:lock, k:tlb, k:other, and u:*.]

Conclusions

• We conducted a set of experiments to provide a breakdown of the overheads caused by preempted virtual CPU cores, showing that TLB operations can have a significant impact on performance with certain workloads.

• We presented Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest’s vCPUs.

• Our evaluation demonstrates the effectiveness of our approach, and illustrates how under certain workloads it is dramatically better than state-of-the-art techniques.


https://github.com/ouyangjn/shoot4u

Q & A

Jiannan Ouyang Ph.D. Candidate

University of Pittsburgh [email protected]

http://www.cs.pitt.edu/~ouyang/

The Prognostic Lab University of Pittsburgh

http://www.prognosticlab.org

Pisces Co-Kernel

Kitten Lightweight Kernel

Palacios VMM

References

• [Ouyang 13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013.

• [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards Scalable Multiprocessor Virtual Machines. In Proceedings of the 3rd Conference on Virtual Machine Research and Technology Symposium - Volume 3, VM’04, 2004.

• [Friebel 08] Thomas Friebel. How to Deal with Lock-Holder Preemption. Presented at the Xen Summit North America, 2008.

• [Kim ASPLOS’13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

• [VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010.

• [Barroso 13] L. A. Barroso, J. Clidaras, and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013.

• [Weng HPDC’11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011.

• [Sukwong EuroSys’11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.

• [Riel 11] R. v. Riel. Directed Yield for Pause Loop Exiting, 2011. URL http://lwn.net/Articles/424960/.

• [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.