TRANSCRIPT
Mark Achtemichuk, VCDX, Staff Engineer, VMware · Reza Taheri, Principal Engineer, VMware · Valentin Bondzio, Senior Staff TSE, VMware
SER2724BU
#VMworld #xPerfSeries #SER2724BU
Extreme Performance Series:
Performance Best Practices
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
#SER2724BU CONFIDENTIAL
Agenda
1 The Baseline
2 vNUMA
3 Keeping Things Up To Date
4 Power Management
5 Hyper-threading
Baseline
Baseline Best Practices
• * Use the most current release: vSphere, VCSA, VM Tools, vHW, OS, BIOS, Firmware
• HW selection makes a difference ex: bandwidth, offloads, processor architectures
• Refer to existing best practice documentation ex: SQL BPs, Latency Sensitive BPs
• * Rightsize your workloads, size into a pNUMA node, correct vCPU presentation
• * Evaluate your power management policy
• Use resource management properly, or not at all!
• * Keep Hyper-threading enabled
• Use DRS to manage contention
Baseline Best Practices
• Monitor oversubscription (e.g. pCPU:vCPU ratio, memory reclamation) via vROps
• Use paravirtualized drivers: vmxnet3, pvscsi
• Evaluate disabling interrupt coalescing: lower latency at a higher CPU cost
• Storage design needs to be optimized for flash, map app -> disk
• Understand what the workload is – java apps are different than databases
• Define and monitor application level KPIs
• You can’t compare Apples to Oranges
Baseline Best Practices
• What’s Important to Monitor:
– Compute – Contention – Ready, Co-Stop
– Memory – Oversubscription – Balloon, Swap (in-guest difficult)
– Storage – Service Time – Device and Kernel Latency
– Network – Health - Throughput
Baseline Best Practices
• Performance Best Practices for vSphere 6.5
https://www.vmware.com/techpapers/2017/Perf_Best_Practices_vSphere65.html
• Application Specific Best Practice Guides (SQL, Oracle, etc)
https://www.vmware.com/solutions/business-critical-apps.html
• VROOM! Blog
https://blogs.vmware.com/performance/
• Performance Community
https://communities.vmware.com/community/vmtn/performance
Troubleshooting Scenario
• Poor NUMA Locality (N%L)
• pNUMA doesn’t match vNUMA
• I see conflicting guidance
vNUMA
Troubleshooting Scenario
vNUMA – Optimal Configuration
Troubleshooting Scenario
1. While there are many advanced vNUMA settings, only in rare cases do they need to be changed from defaults.
2. Always configure the virtual machine vCPU count to be reflected as Cores per Socket, until you exceed the physical core count of a single physical NUMA node.
3. When you need to configure more vCPUs than there are physical cores in the NUMA node, evenly divide the vCPU count across the minimum number of NUMA nodes.
4. Don’t assign an odd number of vCPUs when the size of your virtual machine exceeds a physical NUMA node.
5. Don’t enable vCPU Hot Add unless you’re okay with vNUMA being disabled*
6. Don’t create a VM larger than the total number of physical cores of your host*
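A minimal sketch of rules 2-4 above as code — a hypothetical helper, not a VMware tool; `cores_per_pnuma_node` is whatever a single pNUMA node on your host provides:

```python
def vnuma_layout(vcpus: int, cores_per_pnuma_node: int) -> tuple[int, int]:
    """Suggest (nodes, vcpus_per_node) per the rules of thumb above."""
    if vcpus <= cores_per_pnuma_node:
        return (1, vcpus)  # rule 2: fits inside one pNUMA node
    # Rule 3: spread across the minimum number of NUMA nodes, evenly.
    nodes = -(-vcpus // cores_per_pnuma_node)  # ceiling division
    if vcpus % nodes != 0:
        # Rule 4: avoid vCPU counts that cannot be divided evenly.
        raise ValueError(f"{vcpus} vCPUs do not divide evenly across {nodes} nodes")
    return (nodes, vcpus // nodes)

# On a host with 10-core pNUMA nodes:
print(vnuma_layout(8, 10))   # (1, 8)  - stays in one node
print(vnuma_layout(16, 10))  # (2, 8)  - two nodes, 8 vCPUs each
```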
vNUMA – Rules of Thumb
Troubleshooting Scenario
• An in-house developed application that is “mission critical” runs as expected on the developer’s laptop but achieves only about 70% of that performance on ESXi
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Why Keep Things Up-to-date
Troubleshooting Scenario
The developer’s “laptop” vs. the ESXi “server”
Is There Contention?
10:05:56am up 2 days 43 min, 675 worlds, 1 VMs, 2 vCPUs; CPU load average: 0.05, 0.01, 0.01
PCPU USED(%): 0.3 0.3 0.1 0.3 0.4 0.0 0.1 0.1 0.1 0.2 138 0.3 0.2 0.3 0.0 0.0 0.1 15 0.2 0.1 0.0 0.3 0.1 0.1 AVG: 6.6
PCPU UTIL(%): 0.5 0.4 0.1 0.3 0.4 0.0 0.2 0.2 0.1 0.3 100 0.7 0.3 0.3 0.1 0.0 0.1 16 0.4 0.3 0.1 0.6 0.3 0.3 AVG: 5.1
CORE UTIL(%): 0.8 0.0 0.8 0.5 0.4 100 0.0 15 16 0.0 7.1 0.3 AVG: 11
ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT %RDY %IDLE %OVRLP
96337 148153 vmx 1 0.01 0.01 0.00 98.23 - 0.00 0.00 0.00
96339 148153 NetWorld-VM-96338 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96340 148153 NUMASchedRemapEpochInitial 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96341 148153 vmast.96338 1 0.04 0.06 0.00 98.18 - 0.00 0.00 0.00
96343 148153 vmx-vthread-6 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96344 148153 vmx-mks:prime95 1 0.00 0.01 0.00 98.23 - 0.00 0.00 0.00
96345 148153 vmx-svga:prime95 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96346 148153 vmx-vcpu-0:prime95 1 137.13 98.24 0.00 0.00 0.00 0.00 0.00 0.06
96348 148153 vmx-vcpu-1:prime95 1 0.31 0.52 0.00 97.70 2.71 0.02 94.99 0.01
96347 148153 PVSCSI-96338:0 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
96350 148153 vmx-vthread-7:prime95 1 0.00 0.00 0.00 98.24 - 0.00 0.00 0.00
Processor Differences?
• Both processors are Haswell, very similar clock frequencies
– Intel Core i7-4700MQ (4 cores, 8 threads / 6 MB L3 Cache / 2.4 -> 3.4 GHz)
– Intel Xeon E5-2620 v3 (6 cores, 12 threads / 15 MB L3 Cache / 2.4 -> 3.2 GHz)
• Is it EVC?
What About Virtual Hardware?
ESXi  vHW  GA (ISO 8601)  ~ Intel CPU generation  Model name
5.5   10   2013-09-22     Ivy Bridge              E*-**** v2
6.0   11   2015-03-12     Haswell                 E*-**** v3
6.5   13   2016-11-15     Skylake                 E*-**** v5
Virtual Hardware Version Changes
• Some 38 changes were implemented in vHW 11
• Examples:
– benefit SMT FT VMs (mostly SVGA optimizations)
– reduces timer interrupts for idle Windows 2012+ VMs
• (less CPU consumption / contention when VMs are idle)
– enable RSC (LRO) for Windows 2012+ VMs
– improves some aspects of nested VM performance
– etc.
Troubleshooting Scenario
• I run performance tests on my software but always get varying results between runs. Could it be Power Management?
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Power Management Impact
What is Power Management?
• More Turbo bins on cores in C0 when other cores are in deep C-States
Reallocating power consumption within a processor package
[Diagram: frequency vs. C-state depth — cores at P0 climb Turbo bins (TB1-TB3) while sibling cores drop from C0/C1 into deep C6, freeing package power budget for the active cores.]
Where Can I See Power Management?
10:15:23am up 2 days 53 min, 674 worlds, 1 VMs, 2 vCPUs; CPU load average: 0.10, 0.09, 0.03
Power Usage: 147W, Power Cap: N/A
PSTATE MHZ: 2401 2400 2300 2200 2100 2000 1900 1800 1700 1600 1500 1400 1300 1200
CPU %USED %UTIL %C0 %C1 %C2 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7 %P8 %P9 %P10 %P11 %P12 %P13 %A/MPERF
(…)
4 0.3 0.5 0 11 88 95 0 0 0 0 0 0 0 0 0 0 0 0 4 95.2
5 0.0 0.1 0 3 97 9 0 0 0 0 0 0 0 0 1 0 0 0 91 77.8
6 0.1 0.1 0 7 93 0 0 0 0 0 0 0 0 0 0 0 0 0 100 105.5
7 0.5 0.7 1 1 99 100 0 0 0 0 0 0 0 0 0 0 0 0 0 117.1
8 2.5 2.4 2 16 81 17 0 0 0 0 0 0 0 0 0 0 0 0 83 103.9
9 0.1 0.3 0 1 98 6 0 0 0 0 0 0 0 0 0 0 0 0 94 59.7
10 0.4 0.7 1 9 90 7 0 0 0 0 0 0 0 0 0 0 1 0 92 54.8
11 129.4 100.0 100 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 132.5
12 3.1 3.1 3 12 85 85 0 0 0 0 0 0 0 0 0 0 0 0 15 102.4
13 0.3 0.5 0 17 83 12 0 0 0 0 0 0 0 0 0 0 0 0 88 79.8
14 0.4 0.5 1 16 84 43 0 0 0 0 0 0 0 0 0 0 0 0 57 94.1
15 0.1 0.3 0 2 97 100 0 0 0 0 0 0 0 0 0 0 0 0 0 73.0
16 3.7 3.1 3 4 93 5 0 0 0 0 0 0 0 0 0 0 0 0 95 126.0
17 0.0 0.1 0 5 95 3 0 0 0 0 0 0 0 0 0 0 0 0 97 50.8
18 0.4 0.7 1 9 90 7 0 0 0 0 0 0 0 0 0 0 1 0 92 54.8
19 1.4 1.4 1 14 85 22 0 0 0 0 0 0 0 0 0 0 0 0 78 103.3
(…)
Balanced or High Performance?
• Always set BIOS to ‘OS Controlled’
– Then the policy change is dynamic
• Balanced (default) allows for Turbo opportunities
– Great for populations of small virtual machines
– Some performance variability is okay
• High Performance caps Turbo opportunities
– Best for populations that have Large VMs (greater than 8 vCPU)
– Required for Latency Sensitive workloads
“It Depends”
Troubleshooting Scenario
• I ran my workload yesterday and it ran in 565 seconds; today it’s 1096 seconds. What happened?
• Analyze the problem
• Use tools to identify issue
• Lesson learned
Interfering VMs
Problem definition
• The workload VM:
– 36 vCPUs
– Running the bzip2 test of SPECcpu2006
• Simple, no I/O
• But easy to see
• The Server:
– Broadwell-EP E5-2697 v4 @ 2.30GHz
– Turbo boost up to 2.8GHz for a 1.22X improvement
– 36 cores/72 HyperThreads
• Tools:
– Standard Linux tools inside the guest
– esxtop on the hypervisor
– Hardware event counters using PMC inside the guest
• Enable virtual Performance Monitoring Counters (vPMC), KB 2030221
• “perf stat” command to collect
Troubleshooting steps
• Fast case: 565 seconds
• mpstat(1) on the guest
– 100% CPU utilization
• All other guest tools report the same stats
• Slow case: 1096 seconds
• mpstat(1) on the guest
– 100% CPU utilization
• All other guest tools report the same stats
• Questions:
– Why did our performance drop by half?
– Can we reconcile guest stats with ESX stats?
Troubleshooting with esxtop
• In the fast case, we get the full utilization on all 36 cores
– The VM’s %RUN is nearly 3600%
– Turbo boost of 22% (PCPU USED% to PCPU UTIL% ratio) matches the hardware counters
• In the slow case, we have a second VM running!
– Our %RUN is about half of available cycles, matching the hardware counters
– Our %READY and %COSTOP add up to about half the available cycles
• That’s why the guest tools were fooled
• We need more CPUs!!!
– HyperThreading to the rescue!!
[Screenshots: esxtop with two VMs active vs. one VM active]
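The 22% Turbo figure above is just the PCPU USED% to PCPU UTIL% ratio; plugging in the fast-case numbers from these esxtop screenshots (~121% USED vs. ~99% UTIL), the arithmetic lines up with the 2.8 GHz / 2.3 GHz clock range:

```python
pcpu_used, pcpu_util = 121.0, 99.0   # fast case, read from esxtop
turbo_factor = pcpu_used / pcpu_util
print(f"{turbo_factor:.2f}x")        # 1.22x
print(f"{2.8 / 2.3:.2f}x")           # 1.22x from the clock specs
```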
Intel ® Hyper-Threading Technology
That doubles my CPUs, right? Not!
• Increases instruction level parallelism
• Logic is replicated, partitioned, or shared
• ~ 5% additional die size / cost
• ~ 25% more “performance”
• Most of the benefit comes from one HyperThread using the core while the other one is waiting for memory load
• Recent processors replicate some core functionality on each HyperThread
No problem: Enable HyperThreading!
• First, compare the one-VM case with and without HyperThreading
– Twice as many “physical” processors
– With HT, we have a new field: CORE UTIL%
• esxtop fields with HT enabled
– PCPU UTIL% versus CORE UTIL% with one HyperThread in use
• One HyperThread saturated keeps the core at 100%
• PCPU USED% reflects benefits of Turbo boost
[Screenshots: esxtop with HT enabled vs. HT disabled]
What is PCPU USED%?
• A measure of how much a given HyperThread uses the core
– A metric calculated by ESX
– When it’s the only HyperThread active, it gets 100% credit
• Its twin gets 0% credit since it’s idle
– During any periods that both HyperThreads are active, they each get 50% credit
• The average of PCPU USED credit of the two HyperThreads cannot go over 50%
• PCPU USED % is then adjusted up or down for Turbo and frequency scaling
– So with Turbo boost, the average of PCPU USED % of a core can go to, say, 2.8GHz/2.3GHz=61%
[Charts: PCPU USED credit — one hyperthread busy alone earns 100% credit; both hyperthreads saturated earn 50% each; both intermittently busy earn mixed credit (e.g. 25% and 55%).]
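The crediting rules above can be sketched as a toy model — this is not ESXi's actual accounting, just an illustration that assumes the two hyperthreads' busy intervals overlap maximally:

```python
def pcpu_used_credit(my_busy: float, twin_busy: float, freq_ratio: float = 1.0) -> float:
    """PCPU USED% credit: 100% credit while the twin hyperthread is idle,
    50% while both run, then scaled by the frequency ratio (Turbo / nominal).
    my_busy / twin_busy are fractions of the sample interval."""
    both = min(my_busy, twin_busy)   # assumed overlap of the two threads
    alone = my_busy - both           # time this thread ran alone
    return (alone + 0.5 * both) * freq_ratio * 100

print(pcpu_used_credit(1.0, 0.0, 2.8 / 2.3))  # ~121.7: alone, with Turbo
print(pcpu_used_credit(1.0, 1.0))             # 50.0: both saturated
```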
HyperThreading with two VMs
[Screenshots: esxtop with HT enabled vs. HT disabled]
• On this fully saturated Broadwell with HyperThreading enabled
– PCPU USED%: is 60%
• Discounted for HT; boosted for Turbo
• Our performance boost is only 1.16X despite
– CPU %RUN per VM has gone from ~1800% to ~3600%
– %READY and %COSTOP have largely disappeared
– The expected boost is 10-40%
Going a layer deeper
• Can we use other tools to diagnose why the HyperThreading boost is on the low side?
– Use Performance Monitoring Counters (PMC) built into the processor hardware
• PMCs can be virtualized on vSphere
– With vPMC, each VM only sees the event counts while it was running
• Analysis common for hard-core performance engineers
Enabling virtual Performance Monitoring Counters (vPMC) in a VM
Collecting hardware event counts in Linux
# perf stat -a -e cycles -e ref-cycles -e instructions -e cache-misses -e cache-references sleep 10
Performance counter stats for 'system wide':
1,009,203,198,874 cycles
828,986,892,878 ref-cycles
1,456,882,482,627 instructions # 1.44 insn per cycle
984,566,743 cache-misses # 6.782 % of all cache refs
14,516,610,550 cache-references
10.015823286 seconds time elapsed
Let’s check the math for this workload, which fully saturated all vCPUs:
36 vCPUs × 2.3 GHz × 10 seconds = 828,000,000,000 reference cycles
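The same check in code, using the perf numbers from above:

```python
vcpus, base_hz, seconds = 36, 2.3e9, 10
expected = vcpus * base_hz * seconds
print(f"{expected:,.0f}")              # 828,000,000,000
measured = 828_986_892_878             # ref-cycles reported by perf stat
print(f"delta: {measured / expected - 1:.2%}")  # ~0.12% - sane
```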
Processor PMC stats with HyperThreading, one or two VMs
• With HyperThreading Disabled, 1 VM and 2 VMs have nearly identical profiles
• With HyperThreading Enabled and 2 VMs, we get twice the cycles in VMs, but not the efficiency
• The L3 cache and Resource Stall stats tell us the VMs are interfering with each other
HT   VMs  Thruput    PCPU   PCPU   CORE   Cycles/s  Instr/s  IPC   Total L3   Total L3  Total
          (all VMs)  UTIL%  USED%  UTIL%  per VM    per VM         accesses   misses    Resource Stalls
off  1    1.0        99%    121%   -      2.7G      4.11G    1.52  23.4G      2.95G     442G
off  2    1.03       99%    121%   -      1.4G      2.15G    1.53  20.9G      3.01G     478G
ON   2    1.20       99%    60%    99%    2.7G      2.44G    0.90  25.7G      7.25G     998G
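The IPC column is just per-VM instructions over cycles; recomputing it from the table's G-rounded values reproduces it to within rounding:

```python
rows = {                        # (cycles/sec, instructions/sec) per VM
    ("HT off", 1): (2.7e9, 4.11e9),
    ("HT off", 2): (1.4e9, 2.15e9),
    ("HT on",  2): (2.7e9, 2.44e9),
}
for (ht, vms), (cycles, instrs) in rows.items():
    print(ht, vms, f"IPC = {instrs / cycles:.2f}")
# The table lists 1.52, 1.53, 0.90; the middle row recomputes as 1.54
# only because the G-rounded inputs lose a little precision.
```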
Yes, that was complex!
• VMs interfere with each other
– I/O latency
– Cache hit rate
– Memory latency
– Core resources
• HyperThreading can give a boost
– Often 1.2-1.3X boost
• Have to use a variety of tools for a detailed analysis
– Looking just at mpstat or even CORE UTIL% and PCPU UTIL% would have been misleading
• Do not oversubscribe above available resources!
What we did NOT say
• We did not say don’t overcommit
– Just that your VMs cannot use more resources than the hardware can offer
• We did not say don’t believe Linux tools
– Just that when there is a discrepancy between mpstat and what the app claims, use esxtop to investigate
• We did not say use hardware event counters as a first resort
– It’s a microscope
– PMC gives you insight into the processor, e.g. when VMs are interfering with each other
• We said: performance troubleshooting can be complex. Use the tools available at different layers of the vertical stack to get a full picture
Conclusion
• vSphere OOTB Performance is Excellent
• Tuning Required for Specific Workloads or Corner Cases
• Performance Requires Understanding HW to App
• Leverage Existing Best Practice Documentation for Support
• Performance is an Onion, Peel Back the Layers
• Links:
• https://blogs.vmware.com/performance/
• https://communities.vmware.com/community/vmtn/performance
Extreme Performance Series – Las Vegas
• SER2724BU Performance Best Practices
• SER2723BU Benchmarking 101
• SER2343BU vSphere Compute & Memory Schedulers
• SER1504BU vCenter Performance Deep Dive
• SER2734BU Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU Predictive DRS – Performance & Best Practices
• SER1494BU Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE Performance Best Practices
• SER2343BE vSphere Compute & Memory Schedulers
• SER1504BE vCenter Performance Deep Dive
• SER2849BE Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
• Don’t miss these popular Extreme Performance labs:
• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
– Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.
• HOL-1804-02-CHG: vSphere Challenge Lab
– Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.
Scan this QR code to access a short survey and provide us direct feedback.
Alternatively: www.vmware.com/go/perf
Thank you!