TRANSCRIPT
Mark Achtemichuk, VCDX, Staff Engineer, VMware · Reza Taheri, Principal Engineer, VMware · Valentin Bondzio, Senior Staff TSE, VMware
SER2724BU
#VMworld #xPerfSeries #SER2724BU
Extreme Performance Series:
Performance Best Practices
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
#SER2724BU CONFIDENTIAL
Agenda
1 The Baseline
2 vNUMA
3 Keeping Things Up To Date
4 Power Management
5 Hyper-threading
Baseline
Baseline Best Practices
• * Use the most current release: vSphere, VCSA, VM Tools, vHW, OS, BIOS, Firmware
• HW selection makes a difference ex: bandwidth, offloads, processor architectures
• Refer to existing best practice documentation ex: SQL BPs, Latency Sensitive BPs
• * Rightsize your workloads, size into a pNUMA node, correct vCPU presentation
• * Evaluate your power management policy
• Use resource management properly, or not at all!
• * Keep Hyper-threading enabled
• Use DRS to manage contention
Baseline Best Practices
• Monitor oversubscription (e.g. pCPU:vCPU ratio, memory reclamation) via vROps
• Use paravirtualized drivers: vmxnet3, pvscsi
• Evaluate disabling interrupt coalescing: lower latency at a higher CPU cost
• Storage design needs to be optimized for flash, map app -> disk
• Understand what the workload is – java apps are different than databases
• Define and monitor application level KPIs
• You can’t compare Apples to Oranges
Baseline Best Practices
• What’s Important to Monitor:
– Compute – Contention – Ready, Co-Stop
– Memory – Oversubscription – Balloon, Swap (in-guest difficult)
– Storage – Service Time – Device and Kernel Latency
– Network – Health - Throughput
Baseline Best Practices
• Performance Best Practices for vSphere 6.5
https://www.vmware.com/techpapers/2017/Perf_Best_Practices_vSphere65.html
• Application Specific Best Practice Guides (SQL, Oracle, etc)
https://www.vmware.com/solutions/business-critical-apps.html
• VROOM! Blog
https://blogs.vmware.com/performance/
• Performance Community
https://communities.vmware.com/community/vmtn/performance
Troubleshooting Scenario
• Poor NUMA Locality (N%L)
• pNUMA doesn’t match vNUMA
• I see conflicting guidance
vNUMA
Troubleshooting Scenario
vNUMA – Optimal Configuration
Troubleshooting Scenario
1. While there are many advanced vNUMA settings, only in rare cases do they need to be changed from defaults.
2. Always configure the virtual machine vCPU count to be reflected as Cores per Socket, until you exceed the physical core count of a single physical NUMA node.
3. When you need to configure more vCPUs than there are physical cores in the NUMA node, evenly divide the vCPU count across the minimum number of NUMA nodes.
4. Don’t assign an odd number of vCPUs when the size of your virtual machine exceeds a physical NUMA node.
5. Don’t enable vCPU Hot Add unless you’re okay with vNUMA being disabled*
6. Don’t create a VM larger than the total number of physical cores of your host*
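A minimal sketch of rules 2-4 above as code — a hypothetical helper, not a VMware tool; `cores_per_pnuma_node` is whatever a single pNUMA node on your host provides:

```python
def vnuma_layout(vcpus: int, cores_per_pnuma_node: int) -> tuple[int, int]:
    """Suggest (nodes, vcpus_per_node) per the rules of thumb above."""
    if vcpus <= cores_per_pnuma_node:
        return (1, vcpus)  # rule 2: fits inside one pNUMA node
    # Rule 3: spread across the minimum number of NUMA nodes, evenly.
    nodes = -(-vcpus // cores_per_pnuma_node)  # ceiling division
    if vcpus % nodes != 0:
        # Rule 4: avoid vCPU counts that cannot be divided evenly.
        raise ValueError(f"{vcpus} vCPUs do not divide evenly across {nodes} nodes")
    return (nodes, vcpus // nodes)

# On a host with 10-core pNUMA nodes:
print(vnuma_layout(8, 10))   # (1, 8)  - stays in one node
print(vnuma_layout(16, 10))  # (2, 8)  - two nodes, 8 vCPUs each
```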
vNUMA – Rules of Thumb
Troubleshooting Scenario
• An in-house developed application that is “mission critical” runs as expected on the developer’s laptop but achieves only about 70% of that performance on ESXi
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Why Keep Things Up-to-date
Troubleshooting Scenario
The developer’s “laptop” vs. the ESXi “server”
Is There Contention?
10:05:56am up 2 days 43 min, 675 worlds, 1 VMs, 2 vCPUs; CPU load average: 0.05, 0.01, 0.01
PCPU USED(%): 0.3 0.3 0.1 0.3 0.4 0.0 0.1 0.1 0.1 0.2 138 0.3 0.2 0.3 0.0 0.0 0.1 15 0.2 0.1 0.0 0.3 0.1 0.1 AVG: 6.6
PCPU UTIL(%): 0.5 0.4 0.1 0.3 0.4 0.0 0.2 0.2 0.1 0.3 100 0.7 0.3 0.3 0.1 0.0 0.1 16 0.4 0.3 0.1 0.6 0.3 0.3 AVG: 5.1
CORE UTIL(%): 0.8 0.0 0.8 0.5 0.4 100 0.0 15 16 0.0 7.1 0.3 AVG: 11
ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT %RDY %IDLE %OVRLP
96337 148153 vmx 1 0.01 0.01 0.00 98.23 - 0.00 0.00 0.00
96339 148153 NetWorld-VM-96338 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96340 148153 NUMASchedRemapEpochInitial 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96341 148153 vmast.96338 1 0.04 0.06 0.00 98.18 - 0.00 0.00 0.00
96343 148153 vmx-vthread-6 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96344 148153 vmx-mks:prime95 1 0.00 0.01 0.00 98.23 - 0.00 0.00 0.00
96345 148153 vmx-svga:prime95 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96346 148153 vmx-vcpu-0:prime95 1 137.13 98.24 0.00 0.00 0.00 0.00 0.00 0.06
96348 148153 vmx-vcpu-1:prime95 1 0.31 0.52 0.00 97.70 2.71 0.02 94.99 0.01
96347 148153 PVSCSI-96338:0 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96350 148153 vmx-vthread-7:prime95 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
Processor Differences?
• Both processors are Haswell, very similar clock frequencies
– Intel Core i7-4700MQ (4 cores, 8 threads / 6 MB L3 Cache / 2.4 -> 3.4 GHz)
– Intel Xeon E5-2620 v3 (6 cores, 12 threads / 15 MB L3 Cache / 2.4 -> 3.2 GHz)
• Is it EVC?
What About Virtual Hardware?
ESXi  vHW  GA (ISO 8601)  ~ Intel CPU generation  Model name
5.5   10   2013-09-22     Ivy Bridge              E*-**** v2
6.0   11   2015-03-12     Haswell                 E*-**** v3
6.5   13   2016-11-15     Skylake                 E*-**** v5
Virtual Hardware Version Changes
• Some 38 changes were implemented in vHW 11
• Examples:
– benefit SMT FT VMs (mostly SVGA optimizations)
– reduces timer interrupts for idle Windows 2012+ VMs
• (less CPU consumption / contention when VMs are idle)
– enable RSC (LRO) for Windows 2012+ VMs
– improves some aspects of nested VM performance
– etc.
Troubleshooting Scenario
• I run performance tests on my software but always get varying results between runs. Could it be Power Management?
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Power Management Impact
What is Power Management?
• More Turbo bins on cores in C0 when other cores are in deep C-States
Reallocating power consumption within a processor package
[Diagram: frequency vs. C-state depth — cores at P0 climb Turbo bins (TB1-TB3) while sibling cores drop from C0/C1 into deep C6, freeing package power budget for the active cores.]
Where Can I See Power Management?
10:15:23am up 2 days 53 min, 674 worlds, 1 VMs, 2 vCPUs; CPU load average: 0.10, 0.09, 0.03
Power Usage: 147W, Power Cap: N/A
PSTATE MHZ: 2401 2400 2300 2200 2100 2000 1900 1800 1700 1600 1500 1400 1300 1200
CPU %USED %UTIL %C0 %C1 %C2 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7 %P8 %P9 %P10 %P11 %P12 %P13 %A/MPERF
(…)
4 0.3 0.5 0 11 88 95 0 0 0 0 0 0 0 0 0 0 0 0 4 95.2
5 0.0 0.1 0 3 97 9 0 0 0 0 0 0 0 0 1 0 0 0 91 77.8
6 0.1 0.1 0 7 93 0 0 0 0 0 0 0 0 0 0 0 0 0 100 105.5
7 0.5 0.7 1 1 99 100 0 0 0 0 0 0 0 0 0 0 0 0 0 117.1
8 2.5 2.4 2 16 81 17 0 0 0 0 0 0 0 0 0 0 0 0 83 103.9
9 0.1 0.3 0 1 98 6 0 0 0 0 0 0 0 0 0 0 0 0 94 59.7
10 0.4 0.7 1 9 90 7 0 0 0 0 0 0 0 0 0 0 1 0 92 54.8
11 129.4 100.0 100 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 132.5
12 3.1 3.1 3 12 85 85 0 0 0 0 0 0 0 0 0 0 0 0 15 102.4
13 0.3 0.5 0 17 83 12 0 0 0 0 0 0 0 0 0 0 0 0 88 79.8
14 0.4 0.5 1 16 84 43 0 0 0 0 0 0 0 0 0 0 0 0 57 94.1
15 0.1 0.3 0 2 97 100 0 0 0 0 0 0 0 0 0 0 0 0 0 73.0
16 3.7 3.1 3 4 93 5 0 0 0 0 0 0 0 0 0 0 0 0 95 126.0
17 0.0 0.1 0 5 95 3 0 0 0 0 0 0 0 0 0 0 0 0 97 50.8
18 0.4 0.7 1 9 90 7 0 0 0 0 0 0 0 0 0 0 1 0 92 54.8
19 1.4 1.4 1 14 85 22 0 0 0 0 0 0 0 0 0 0 0 0 78 103.3
(…)
Balanced or High Performance?
• Always set BIOS to ‘OS Controlled’
– Then the policy change is dynamic
• Balanced (default) allows for Turbo opportunities
– Great for populations of small virtual machines
– Some performance variability is okay
• High Performance caps Turbo opportunities
– Best for populations that have Large VMs (greater than 8 vCPU)
– Required for Latency Sensitive workloads
“It Depends”
Troubleshooting Scenario
• I ran my workload yesterday and it ran in 565 seconds; today it’s 1096 seconds. What happened?
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Interfering VMs
Problem definition
• The workload VM:
– 36 vCPUs
– Running the bzip2 test of SPECcpu2006
• Simple, no I/O
• But easy to see
• The Server:
– Broadwell-EP E5-2697 v4 @ 2.30GHz
– Turbo boost up to 2.8GHz for a 1.22X improvement
– 36 cores/72 HyperThreads
• Tools:
– Standard Linux tools inside the guest
– esxtop on the hypervisor
– Hardware event counters using PMC inside the guest
• Enable virtual Performance Monitoring Counters (vPMC), KB 2030221
• “perf stat” command to collect
Troubleshooting steps
• Fast case: 565 seconds
• mpstat(1) on the guest
– 100% CPU utilization
• All other guest tools report the same stats
• Slow case: 1096 seconds
• mpstat(1) on the guest
– 100% CPU utilization
• All other guest tools report the same stats
• Questions:
– Why did our performance drop by half?
– Can we reconcile guest stats with ESX stats?
Troubleshooting with esxtop
• In the fast case, we get the full utilization on all 36 cores
– The VM’s %RUN is nearly 3600%
– Turbo boost of 22% (PCPU USED% to PCPU UTIL% ratio) matches the hardware counters
• In the slow case, we have a second VM running!
– Our %RUN is about half of available cycles, matching the hardware counters
– Our %READY and %COSTOP add up to about half the available cycles
• That’s why the guest tools were fooled
• We need more CPUs!!!
– HyperThreading to the rescue!!
[Screenshots: esxtop with two VMs active vs. one VM active]
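The 22% Turbo figure above is just the PCPU USED% to PCPU UTIL% ratio; plugging in the fast-case numbers from these esxtop screenshots (~121% USED vs. ~99% UTIL), the arithmetic lines up with the 2.8 GHz / 2.3 GHz clock range:

```python
pcpu_used, pcpu_util = 121.0, 99.0   # fast case, read from esxtop
turbo_factor = pcpu_used / pcpu_util
print(f"{turbo_factor:.2f}x")        # 1.22x
print(f"{2.8 / 2.3:.2f}x")           # 1.22x from the clock specs
```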
Intel ® Hyper-Threading Technology
That doubles my CPUs, right? Not!
• Increases instruction level parallelism
• Logic is replicated, partitioned, or shared
• ~ 5% additional die size / cost
• ~ 25% more “performance”
• Most of the benefit comes from one HyperThread using the core while the other one is waiting for memory load
• Recent processors replicate some core functionality on each HyperThread
No problem: Enable HyperThreading!
• First, compare the one-VM case with and without HyperThreading
– Twice as many “physical” processors
– With HT, we have a new field: CORE UTIL%
• esxtop fields with HT enabled
– PCPU UTIL% versus CORE UTIL% with one HyperThread in use
• One HyperThread saturated keeps the core at 100%
• PCPU USED% reflects benefits of Turbo boost
[Screenshots: esxtop with HT enabled vs. HT disabled]
What is PCPU USED%?
• A measure of how much a given HyperThread uses the core
– A metric calculated by ESX
– When it’s the only HyperThread active, it gets 100% credit
• Its twin gets 0% credit since it’s idle
– During any periods that both HyperThreads are active, they each get 50% credit
• The average of PCPU USED credit of the two HyperThreads cannot go over 50%
• PCPU USED % is then adjusted up or down for Turbo and frequency scaling
– So with Turbo boost, the average of PCPU USED % of a core can go to, say, 2.8GHz/2.3GHz=61%
[Charts: PCPU USED credit — one hyperthread busy alone earns 100% credit; both hyperthreads saturated earn 50% each; both intermittently busy earn mixed credit (e.g. 25% and 55%).]
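The crediting rules above can be sketched as a toy model — this is not ESXi's actual accounting, just an illustration that assumes the two hyperthreads' busy intervals overlap maximally:

```python
def pcpu_used_credit(my_busy: float, twin_busy: float, freq_ratio: float = 1.0) -> float:
    """PCPU USED% credit: 100% credit while the twin hyperthread is idle,
    50% while both run, then scaled by the frequency ratio (Turbo / nominal).
    my_busy / twin_busy are fractions of the sample interval."""
    both = min(my_busy, twin_busy)   # assumed overlap of the two threads
    alone = my_busy - both           # time this thread ran alone
    return (alone + 0.5 * both) * freq_ratio * 100

print(pcpu_used_credit(1.0, 0.0, 2.8 / 2.3))  # ~121.7: alone, with Turbo
print(pcpu_used_credit(1.0, 1.0))             # 50.0: both saturated
```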
HyperThreading with two VMs
[Screenshots: esxtop with HT enabled vs. HT disabled]
• On this fully saturated Broadwell with HyperThreading enabled
– PCPU USED%: is 60%
• Discounted for HT; boosted for Turbo
• Our performance boost is only 1.16X despite
– CPU %RUN per VM has gone from ~1800% to ~3600%
– %READY and %COSTOP have largely disappeared
– The expected boost is 10-40%
Going a layer deeper
• Can we use other tools to diagnose why the HyperThreading boost is on the low side?
– Use Performance Monitoring Counters (PMC) built into the processor hardware
• PMCs can be virtualized on vSphere
– With vPMC, each VM only sees the event counts while it was running
• Analysis common for hard-core performance engineers
Enabling virtual Performance Monitoring Counters (vPMC) in a VM
Collecting hardware event counts in Linux
# perf stat -a -e cycles -e ref-cycles -e instructions -e cache-misses -e cache-references sleep 10
Performance counter stats for 'system wide':
1,009,203,198,874 cycles
828,986,892,878 ref-cycles
1,456,882,482,627 instructions # 1.44 insn per cycle
984,566,743 cache-misses # 6.782 % of all cache refs
14,516,610,550 cache-references
10.015823286 seconds time elapsed
Let’s check the math for this workload, which fully saturated all vCPUs:
36 vCPUs × 2.3 GHz × 10 seconds = 828,000,000,000 reference cycles
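The same check in code, using the perf numbers from above:

```python
vcpus, base_hz, seconds = 36, 2.3e9, 10
expected = vcpus * base_hz * seconds
print(f"{expected:,.0f}")              # 828,000,000,000
measured = 828_986_892_878             # ref-cycles reported by perf stat
print(f"delta: {measured / expected - 1:.2%}")  # ~0.12% - sane
```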
Processor PMC stats with HyperThreading, one or two VMs
• With HyperThreading Disabled, 1 VM and 2 VMs have nearly identical profiles
• With HyperThreading Enabled and 2 VMs, we get twice the cycles in VMs, but not the efficiency
• The L3 cache and Resource Stall stats tell us the VMs are interfering with each other
HT   VMs  Thruput    PCPU   PCPU   CORE   Cycles/s  Instr/s  IPC   Total L3   Total L3  Total
          (all VMs)  UTIL%  USED%  UTIL%  per VM    per VM         accesses   misses    Resource Stalls
off  1    1.0        99%    121%   -      2.7G      4.11G    1.52  23.4G      2.95G     442G
off  2    1.03       99%    121%   -      1.4G      2.15G    1.53  20.9G      3.01G     478G
ON   2    1.20       99%    60%    99%    2.7G      2.44G    0.90  25.7G      7.25G     998G
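The IPC column is just per-VM instructions over cycles; recomputing it from the table's G-rounded values reproduces it to within rounding:

```python
rows = {                        # (cycles/sec, instructions/sec) per VM
    ("HT off", 1): (2.7e9, 4.11e9),
    ("HT off", 2): (1.4e9, 2.15e9),
    ("HT on",  2): (2.7e9, 2.44e9),
}
for (ht, vms), (cycles, instrs) in rows.items():
    print(ht, vms, f"IPC = {instrs / cycles:.2f}")
# The table lists 1.52, 1.53, 0.90; the middle row recomputes as 1.54
# only because the G-rounded inputs lose a little precision.
```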
Yes, that was complex!
• VMs interfere with each other
– I/O latency
– Cache hit rate
– Memory latency
– Core resources
• HyperThreading can give a boost
– Often 1.2-1.3X boost
• Have to use a variety of tools for a detailed analysis
– Looking just at mpstat or even CORE UTIL% and PCPU UTIL% would have been misleading
• Do not oversubscribe above available resources!
What we did NOT say
• We did not say don’t overcommit
– Just that your VMs cannot use more resources than the hardware can offer
• We did not say don’t believe Linux tools
– Just that when there is a discrepancy between mpstat and what the app claims, use esxtop to investigate
• We did not say use hardware event counters as a first resort
– It’s a microscope
– PMC gives you insight into the processor, e.g. when VMs are interfering with each other
• We said: performance troubleshooting can be complex. Use the tools available at different layers of the vertical stack to get a full picture
Conclusion
• vSphere OOTB Performance is Excellent
• Tuning Required for Specific Workloads or Corner Cases
• Performance Requires Understanding HW to App
• Leverage Existing Best Practice Documentation for Support
• Performance is an Onion, Peel Back the Layers
• Links:
• https://blogs.vmware.com/performance/
• https://communities.vmware.com/community/vmtn/performance
Extreme Performance Series – Las Vegas
• SER2724BU Performance Best Practices
• SER2723BU Benchmarking 101
• SER2343BU vSphere Compute & Memory Schedulers
• SER1504BU vCenter Performance Deep Dive
• SER2734BU Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU Predictive DRS – Performance & Best Practices
• SER1494BU Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE Performance Best Practices
• SER2343BE vSphere Compute & Memory Schedulers
• SER1504BE vCenter Performance Deep Dive
• SER2849BE Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
• Don’t miss these popular Extreme Performance labs:
• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
– Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.
• HOL-1804-02-CHG: vSphere Challenge Lab
– Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.
Scan this QR code to access a short survey and provide us direct feedback.
Alternatively: www.vmware.com/go/perf
Thank you!