![Page 1: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/1.jpg)
Mastering Performance Monitoring and Capacity Planning using vRealize Operations Manager
Reghuram Vasanthakumari, Staff Engineer, VMware Mohit Kataria, Product Owner, VMware
![Page 2: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/2.jpg)
Disclaimer
• This presentation may contain product features that are currently under development
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind
• Technical feasibility and market demand will affect final delivery
• Pricing and packaging for any new technologies or features discussed or presented have not been determined
2
![Page 3: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/3.jpg)
Agenda
1 Introduction to vRealize Operations Suite
2 Operations Management Goals
3 Real World Troubleshooting Scenarios
4 Q&A
3
![Page 4: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/4.jpg)
4
![Page 5: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/5.jpg)
Today’s Reality in Operations Management
Monitoring Data Overload Alert Storms
Finger Pointing
DBA
VI Storage
Over-provisioning
5
![Page 6: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/6.jpg)
Volume of Monitoring Data is Exploding
Chart labels: Metrics & Data Volume, Alert Volume, “Operations Gap”
• Traditional Stack (Server, Storage, Networking, Web, App Server and DB)
• Virtualized Infrastructure (incl. Storage and Network Virtualization)
• Distributed & Mobile Apps (incl. Public cloud, SaaS, Mash-ups, …)
“Commit to a comprehensive IT Operations Analytics strategy to optimize today's operations and support future I&O work” – Gartner
![Page 7: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/7.jpg)
Evolution of Operations Analytics Technology
Axes: Reactive → Proactive; Manual → Automated
Data collection (Metrics, logs, …) appears under every stage
Traditional Monitoring (Hyperic, SCOM, Nagios, …)
• Static thresholds
• Alerts
Event Correlation (BMC, HP, CA, IBM, …)
• Aggregation
• Masking & filtering
• Rules-based alert suppression
Performance Analytics (VR Ops 1.0-5.x, Netuitive, …)
• Self-learning
• Dynamic thresholds
• Super metrics
Predictive Analytics (vRealize Operations 6.0)
• Detect complex issues from multiple symptoms
• Remediation and automation engine
• Scale-out, data-agnostic platform
10x Alert Reduction
![Page 8: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/8.jpg)
VMware’s Approach to Operations Analytics
Operations Analytics & Automation
Performance & Availability
Logs & Unstructured Data
Topology Analysis
Configuration Health
Capacity Planning
![Page 9: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/9.jpg)
vRealize Operations Overview
Quality of Service
Operational Efficiency
Control and Compliance
vRealize Operations
Operations Console
Integrated Management Disciplines
Performance, Compliance, Configuration, Capacity, Availability
Resilient, Scale-Out Platform
App Visibility, Logs*, Analytics, Reporting/Alerting, Automation
Extensibility
SDK, Management Packs, APIs
*vRealize Log Insight is not part of vRealize Operations but included with vRealize Operations Insight and vRealize Suite
![Page 10: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/10.jpg)
Agenda
1 Introduction to vRealize Operations Suite
2 Operations Management Goals
3 Real World Troubleshooting Scenarios
4 Q&A
10
![Page 11: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/11.jpg)
Status Quo Goal
Operations Management Goals
Quality of Service
• Are you able to meet or exceed service level expectations?
• Can you remediate issues before end users are impacted?
• How many monitoring tools are you using?
Operational Efficiency
• What is your average Mean Time to Incident & Resolution?
• Do you manage your infrastructure capacity?
• How do you plan for future needs?
Control and Compliance
• Is your IT infrastructure compliant to regulatory standards?
• Can you proactively enforce IT standards in your organization?
![Page 12: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/12.jpg)
Status Quo
• Are you able to meet or exceed service level agreements?
• Can you remediate issues before end users are impacted?
• How many monitoring tools are you using?
• What is your average Mean Time to Remediate (MTTR)?
• Do you leverage automated capacity optimization to improve
resource utilization?
• Are you able to accurately forecast your future capacity needs?
• Is your IT infrastructure compliant to regulatory standards?
• Do you have the capability to create flexible groups and policies
for different resource types and teams?
• Can you proactively enforce IT standards in your organization?
Goal
What Operations Management Teams are Looking For?
Quality of
Service
Operational
Efficiency
Control
and
Compliance
12
![Page 13: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/13.jpg)
How VMware Helps in Delivering Quality of Service
Improve performance and
avoid disruption with self-
learning management tools
Key Capabilities
Benefits
90% reduction in alert volume
Proactively detect & avoid
incidents early-on
Quality of Service
Self-learning predictive analytics
Smart alerts identify problems
based on multiple symptoms
13
No new monitoring tools or point
products needed
Domain-specific management
packs for MS, SAP, NSX etc.
![Page 14: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/14.jpg)
Traditional Monitoring
• Static Thresholds
• Symptom Focused
• 100s of Alerts
Predictive Analytics
• Dynamic Thresholds
• Problem Based
• 10x Alert Reduction
Evolution of Traditional Monitoring towards Operational Analytics
14
Smart Alert 1
Smart Alert 2
Smart Alert 3
Smart Alert 4
Alert Storms
Problem Based Alerts combine multiple
symptoms
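To make the contrast between static and dynamic thresholds concrete, here is a minimal Python sketch. It assumes a simple rolling mean and standard deviation band, which is only an illustrative stand-in and not the self-learning algorithm used by vRealize Operations. The function names, the window size and the multiplier `k` are hypothetical.

```python
from statistics import mean, stdev


def static_alerts(samples, limit):
    """Return the indices of samples that exceed a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


def dynamic_alerts(samples, window=6, k=3.0):
    """Return the indices of samples that leave a band learned from
    the previous `window` samples (rolling mean +/- k * stdev)."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat history does not
        # produce a zero-width band.
        band = k * max(stdev(history), 1.0)
        if abs(samples[i] - mu) > band:
            alerts.append(i)
    return alerts


# Assumed data: a workload that normally runs hot (about 85% CPU)
# with one genuine anomaly, a sudden drop to 20%.
cpu = [84, 86, 85, 87, 85, 86, 84, 85, 20, 85]

print("static :", static_alerts(cpu, limit=80))
print("dynamic:", dynamic_alerts(cpu))
```

With a fixed 80% limit, every normal sample raises an alert and the real anomaly is missed. The learned band stays quiet during normal behavior and flags only the drop.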
![Page 15: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/15.jpg)
How is VMware Self-learning Analytics Different?
• Dynamic Thresholds: Dynamic Thresholds adapt to workload changes and eliminate alert storms and false positives
• Super Metrics: Super Metrics combine hundreds of KPIs into health, risk and efficiency scores
• Predictive Analytics: Problem Detection from multiple symptoms drives recommendation and proactive action
Health: Immediate Issues
Risk: Future Issues
Efficiency: Optimization Opportunities
![Page 16: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/16.jpg)
Applying Analytics to the Past, Present and Future Infrastructure and Application Behavior
Learned Behavior Expected Demand Real-time Events
< >
Historical Data Planned Projects Predicted Behavior
Automate Workflows
Improve Analytics & Avoid Risk
Identify Stress & Improve Efficiency
![Page 17: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/17.jpg)
vRealize
Operations
Adopting an Analytics Based Process
17
1. Identify key metrics to measure – do not focus on the UI!
2. Start with vSphere and gradually broaden scope
3. Build a library of best practices and repeatable workflows
4. Incent team to focus on issue prevention
5. Share your success with other teams
5 Steps to an Analytics Based Process
![Page 18: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/18.jpg)
Health Alert – “Performance” Troubleshooting
18
Performance alert contributing to
degraded health. Let’s click to
see details …
![Page 19: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/19.jpg)
Smart Alerts deliver Insight and Information
19
Correlate symptoms across
the stack
![Page 20: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/20.jpg)
Customize Alerts to Your Needs
20
Add remediation actions from vCenter, vRealize Orchestrator or Python scripts
Combine Analytics with symptoms and recommendations
![Page 21: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/21.jpg)
Status Quo
• Are you able to meet or exceed service level agreements?
• Do you use point products to manage your IT infrastructure?
• Can you remediate issues before end users are impacted?
• What is your average Mean Time to Incident & Resolution?
• Do you manage your infrastructure capacity?
• How do you plan for future needs?
• Is your IT infrastructure compliant to regulatory standards?
• Do you have the capability to create flexible groups and policies
for different resource types and teams?
• Can you proactively enforce IT standards in your organization?
Goal
What Operations Management Teams are Looking For?
Quality of
Service
Operational
Efficiency
Control
and
Compliance
21
![Page 22: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/22.jpg)
Performance Higher utilization
Ignore Waste Higher density
safe
Production Test-Dev
How would you like to
manage capacity risk?
What are your goals to
optimize your environment?
22
How Do You Model Your Capacity Needs?
Identify the Right Controls
Allocation and Demand Model
Over-commit ratios
Thresholds for capacity risk
Buffers
Business Hours
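As a rough illustration of how these controls interact, the following Python sketch computes remaining capacity under an allocation model (over-commit ratio plus buffer) and under a demand model (peak demand plus buffer). The formulas and function names are simplifying assumptions for illustration, not the calculations vRealize Operations performs.

```python
def remaining_vcpus(physical_cores, overcommit_ratio, buffer_pct, allocated_vcpus):
    """Allocation model: vCPUs that can still be handed out.

    The buffer is taken off the physical cores first, then the
    over-commit ratio converts the usable cores into allocatable vCPUs.
    """
    usable_cores = physical_cores * (1 - buffer_pct / 100)
    return usable_cores * overcommit_ratio - allocated_vcpus


def remaining_demand_ghz(total_ghz, buffer_pct, peak_demand_ghz):
    """Demand model: CPU capacity left after the observed peak demand."""
    usable_ghz = total_ghz * (1 - buffer_pct / 100)
    return usable_ghz - peak_demand_ghz


# Assumed cluster: 64 cores, 4:1 over-commit, 25% buffer, 150 vCPUs allocated.
print(remaining_vcpus(64, 4.0, 25.0, 150))
# Assumed cluster: 200 GHz total, 25% buffer, 120 GHz peak demand.
print(remaining_demand_ghz(200.0, 25.0, 120.0))
```

A production policy would typically use a lower over-commit ratio and a larger buffer than a test-dev policy, so the same hardware reports less remaining capacity.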
![Page 23: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/23.jpg)
Compute Storage
70% Utilized (Just right)
90% Utilized (Danger)
Network
35% Utilized (Over Provisioned)
• Capacity Monitoring and Analytics
– Capacity modeling for heterogeneous environments
– Out-of-the-box default policy configuration flow
– Enhanced forecasting functions and granular data
23
How VMware simplifies Capacity Management
• Project Planning
– Enhanced “What-If Scenarios”
– Plan projects, visualize changes and reserve capacity for future projects
– Extensible views, reports and alert definitions for capacity
Right-size environment
Run What-If Scenarios based on business needs
![Page 24: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/24.jpg)
Capacity Analytics
CONFIDENTIAL 24
Capacity Analytics to inform when,
why, what and where
Granular breakdown of
capacity metrics for
Compute, Memory,
Network and Storage
![Page 25: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/25.jpg)
Capacity Planning – New Project
Add new VMs to deploy SharePoint app into Cluster
Use existing profile of VMs to calculate capacity needs
Based on this new project Cluster will need more capacity
![Page 26: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/26.jpg)
Planning – Add Capacity
Capacity plan is good!
Plan another project to see how many ESXi hosts are needed to meet capacity shortfall
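The arithmetic behind such a what-if project can be sketched in a few lines of Python. This is a simplified assumption of how a shortfall might be estimated (VM profile times VM count, minus remaining capacity, rounded up to whole hosts). The `Resources` type, the function names and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_ghz: float
    mem_gb: float


def shortfall(remaining, vm_profile, vm_count):
    """Capacity the project needs beyond what the cluster has left."""
    return Resources(
        cpu_ghz=max(0.0, vm_profile.cpu_ghz * vm_count - remaining.cpu_ghz),
        mem_gb=max(0.0, vm_profile.mem_gb * vm_count - remaining.mem_gb),
    )


def hosts_needed(gap, host):
    """Whole hosts required to cover the larger of the CPU and memory gaps."""
    return max(
        math.ceil(gap.cpu_ghz / host.cpu_ghz),
        math.ceil(gap.mem_gb / host.mem_gb),
    )


# Assumed inputs: 20 new VMs profiled at 2 GHz / 16 GB each,
# a cluster with 20 GHz / 100 GB remaining, and 48 GHz / 128 GB hosts.
remaining = Resources(cpu_ghz=20.0, mem_gb=100.0)
profile = Resources(cpu_ghz=2.0, mem_gb=16.0)
host = Resources(cpu_ghz=48.0, mem_gb=128.0)

gap = shortfall(remaining, profile, vm_count=20)
print(gap, hosts_needed(gap, host))
```

In this example memory, not CPU, determines the host count, which is why both dimensions are checked.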
![Page 27: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/27.jpg)
Optimization – Identify Overprovisioned Resources
CONFIDENTIAL – Shared under NDA ONLY 27
Breakdown of
reclaimable capacity
![Page 28: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/28.jpg)
Automation – Take Action to Reclaim Capacity
28
One-click action
to optimize your capacity
![Page 29: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/29.jpg)
Status Quo
• Are you able to meet or exceed service level agreements?
• Do you use point products to manage your IT infrastructure?
• Can you remediate issues before end users are impacted?
• What is your average Mean Time to Remediate (MTTR)?
• Do you leverage automated capacity optimization to improve
resource utilization?
• Are you able to accurately forecast your future capacity needs?
• Is your IT infrastructure compliant to regulatory standards?
• Can you proactively enforce IT standards in your organization?
Goal
What Operations Management Teams are Looking For?
Quality of
Service
Operational
Efficiency
Control
and
Compliance
29
![Page 30: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/30.jpg)
How VMware Helps in Enabling More Compliance and Control
Get continuous compliance and
proactive management across
apps and infrastructure
Key Capabilities
Benefits
Control and Compliance
30
Proactive management via
flexible groups and policies
Adhere to vendor guidelines, security best practices and regulatory standards.
45% reduction in time spent on
ensuring compliance
Complete control with no need for
manual processes
![Page 31: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/31.jpg)
IT Compliance Challenges
31
Silo-ed Monitoring and Compliance
Monitoring Compliance
Not integrated
No Performance Correlation to Changes
Performance Changes
Managing Users and Access Controls
Need to have tight controls in place
Missing insights
Multitude of Requirements
Security Best Practices
Vendor Hardening Guidelines
Regulatory Standards
![Page 32: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/32.jpg)
VMware Covers the Spectrum of IT Compliance
32
• Achieve compliance with regulatory standards such as PCI, HIPAA etc.
• Ensure compliance with internal IT policies and security best practices.
• Adopt the latest guidelines from vendors such as Microsoft, Cisco etc.
• Deploy and operate VMware Products in a secure manner.
vSphere Security Hardening
Vendor Best Practices
Regulatory Compliance
Custom IT Policies
![Page 33: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/33.jpg)
Flexible Groups and Policies
33
• Proactive Management
– Prioritize critical workloads by defining thresholds, alerts and configuration settings for specific resource groups
– Define custom policies for specific workload types, applications or clusters.
– Apply to both vSphere and non vSphere object types
– Example: Production resources vs. development resources
![Page 34: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/34.jpg)
Monitor compliance to standards
PCI DSS Standard
Continuous Compliance Monitoring & Enforcement
34
Take action on non-compliant items by
launching Configuration Manager
![Page 35: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/35.jpg)
Operations Management in the Cloud Era
“Intelligent Operations from Apps to Storage”
1 Purpose built for mobile/cloud era
• Self-learning predictive analytics and smart alerts
• Capacity optimization across virtual and physical stack
2 Policy based automation
• Automated root cause analysis with compliance visibility
• Granular access control and orchestrated workflows
3 Fast time to value
• Fast and easy deployment as a virtual appliance
• Best for vSphere and supports multi hypervisors
4 From the trusted market leader
• Virtualization and cloud systems management leader
• The only integrated, open and comprehensive solution
START TODAY!
35
![Page 36: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/36.jpg)
Agenda
36
1 Introduction to vRealize Operations Suite
2 Operations Management Goals
3 Real World Troubleshooting Scenarios
4 Q&A
![Page 37: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/37.jpg)
How do Customers find problems in their infrastructure?
Start By
Search for problem
Phone call / support ticket
Big Visual
Blind Luck!
vR Ops God!
Alerts/Notifications
![Page 38: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/38.jpg)
One day in the life of a VMware Admin…
• A VM Owner complains to IaaS Team that her VM is slow.
• Her application architect has verified that:
– The VM CPU and RAM utilization is good.
– The disk latency is good.
– There are no dropped network packets.
– No change in the application settings
– No recent patch to Windows
What do you do?
• A: Check ESXi utilization. If it’s low, tell her to doubt no more.
• B: Buy her a nice lunch and flowers. Ask her to forget about it.
• C: Call your VMware TAM & MCS. That’s why you pay them, right?
• D: Roll up your sleeves. You were born for this!
![Page 39: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/39.jpg)
What’s wrong with these statements?
• Cluster CPU
– CPU Ratio is high at 1:5 times on cluster “XYZ”
– Rest all other cluster overcommit ratio looks good around 1:3
– Keep the over commitment ratio to 1:4.
– CPU usage is around 50% on cluster “ABCDE”. Since they are UAT servers, don’t worry.
– Rest other cluster CPU utilization is around 25%. This is good!
• Cluster RAM
– We recommend 1:2 overcommit ratio between physical RAM and virtual RAM.
– Memory Usage on most of the cluster is high around 60%
– Cluster “ABCD” is running peak at around 75%. CPU utilization should be less than 70%
– If we see that Active Mem% is also high than we should add more RAM to cluster
– % Active should not exceed 50-60% and Memory should be running at high state on each host
39
![Page 40: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/40.jpg)
Monitoring
• There are 2 levels to monitor in VMware:
– The VM.
• The VM is the most important, as that is all customers care about.
• They do not care about your infrastructure. It is a Service. IaaS.
– The Infra.
• Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore
• ESXi + hardware
• Storage & Fabric
• Network
• There are 4 areas to monitor
• The 4 areas above impact one another
![Page 41: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/41.jpg)
2 distinct layers
Consumer Layer (VM)
1 Performance: We check if the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view.
2 Capacity: We check if the VM is right-sized. If too small, increase its configuration. If too big, right-size it for better performance.
Provider Layer (SDDC)
1 Performance: We check if IaaS is serving everyone well. Make sure there is no contention for resources among all the VMs.
2 Capacity: Check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware.
3 Configuration: Check for Compliance and Config Drift. Availability: Get alerts for hardware faults or software that stops working.
![Page 42: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/42.jpg)
Performance
How do you know your IaaS is performing fast?
ESXi utilization at 10% means your ESXi is fast?
ESXi utilization at 90% means your ESXi is fast?
Storage is doing 10K IOPS?
Network is processing 8 Gbps?
What counter do you use as proof to your customers (VM Owners)?
Utilization?
Performance is measured by how well your IaaS serves the VMs.
Fast is relative to your customer. Use SLA as your defense line.
![Page 43: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/43.jpg)
Capacity
![Page 44: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/44.jpg)
Performance and Capacity Management
| Performance | Capacity |
|---|---|
| Focus is on the VM. In most cases, does not apply to IaaS. | Focus is on the IaaS. VM Capacity Management is just right sizing. |
| Primary counter: Contention or Latency. Utilization is largely irrelevant. | Primary counter: Contention or Latency. Secondary counter: Utilization. |
| Does not take into account Availability SLA. | Takes into account Availability SLA. Tier 1 is in fact Availability-driven. |
![Page 45: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/45.jpg)
The Consumer Layer The “dining area”
![Page 46: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/46.jpg)
How a VM gets its resource
Provisioned
Limit
Reservation
Entitlement
0 vCPU or 0 GB
Contention
Usage
Demand This is the counter
we need to measure
4 vCPU or 16 GB
![Page 47: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/47.jpg)
Dashboards
• Detail monitoring of a single VM
– When a customer complains that his VM is slow, can the help desk assess it right away?
• Large VMs Monitoring
– Because they are actually hurting your IaaS business
– This impacts both Performance and Capacity
• VM Right Sizing
• Excessive Usage
– Excessive usage by 1-2 VMs can impact the overall IaaS performance.
– VMs with excessive usage hurt the business if we do not charge for Network and Disk IOPS.
![Page 48: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/48.jpg)
Single VM Monitoring
• A VM Owner complains that his VM is slow.
– It was okay the day before
– How does Help Desk quickly determine where the issue is?
• How well does Infra serve the VM?
– VM CPU Contention
– VM RAM contention
– VM Disk latency. For each virtual disk, not average.
• Is VM undersized?
– VM CPU Utilisation
– VM RAM Consumed (not Usage)
– VM RAM Usage
– VM Disk IOPS
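The two questions above can be sketched as a small triage routine in Python. This is only an illustration of the order of checks (contention and latency first, utilization second). The metric names, the `limits` values and the return labels are hypothetical, not vRealize Operations metric keys or recommended SLA figures.

```python
def triage(vm, limits):
    """Classify a slow-VM complaint as 'infrastructure', 'undersized'
    or 'inconclusive' from a snapshot of its counters."""
    # Step 1: how well does the infrastructure serve the VM?
    # Disk latency is judged per virtual disk, not on the average.
    worst_disk_latency = max(vm["disk_latency_ms"].values())
    if (vm["cpu_contention_pct"] > limits["cpu_contention_pct"]
            or vm["ram_contention_pct"] > limits["ram_contention_pct"]
            or worst_disk_latency > limits["disk_latency_ms"]):
        return "infrastructure"
    # Step 2: is the VM itself undersized?
    if (vm["cpu_utilisation_pct"] > limits["busy_pct"]
            or vm["ram_usage_pct"] > limits["busy_pct"]):
        return "undersized"
    return "inconclusive"


# Assumed thresholds and sample VMs, for illustration only.
limits = {"cpu_contention_pct": 2.0, "ram_contention_pct": 1.0,
          "disk_latency_ms": 20.0, "busy_pct": 90.0}

slow_vm = {"cpu_contention_pct": 0.5, "ram_contention_pct": 0.0,
           "disk_latency_ms": {"scsi0:0": 4.0, "scsi0:1": 35.0},
           "cpu_utilisation_pct": 40.0, "ram_usage_pct": 55.0}

busy_vm = {"cpu_contention_pct": 0.5, "ram_contention_pct": 0.0,
           "disk_latency_ms": {"scsi0:0": 4.0},
           "cpu_utilisation_pct": 97.0, "ram_usage_pct": 60.0}

print(triage(slow_vm, limits), triage(busy_vm, limits))
```

In `slow_vm` the average disk latency (19.5 ms) is under the assumed limit, but one virtual disk is not, which is why the per-disk check matters.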
![Page 49: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/49.jpg)
![Page 50: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/50.jpg)
![Page 51: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/51.jpg)
![Page 52: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/52.jpg)
![Page 53: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/53.jpg)
Dashboard 1
Single VM
Monitoring
![Page 54: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/54.jpg)
Are the Large VMs oversized?
• They cause performance issues
– They impact others, and also themselves!
– The ESXi vmkernel scheduler has to find available cores for all the vCPUs, even when they are idle.
– Other VMs may be migrated from core to core. A counter in esxtop tracks this migration.
• They tend to have slower performance
– ESXi may not have all the vCPUs available for them.
• They reduce the consolidation ratio
– You can pack more vCPUs with smaller VMs than with big VMs.
– Unless you have progressive pricing, you make more money with smaller VMs as you sell more vCPUs.
![Page 55: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/55.jpg)
Dashboard of Large VMs
• Overall Picture
– A line chart showing Max CPU Demand among all the Large VMs
• If this is low, they are way oversubscribed. Remember, it only takes 1 VM to make this number high.
• This number should be 80% most of the time, indicating right sizing.
– A line chart showing Average CPU Demand
• If this chart is below 25% all the time for an entire month, then the large VMs are oversized.
• Heat Map of Large VMs
– Size by vCPU config, so it’s easy to see which are the biggest among these large VMs.
– Color by CPU Workload. Both high and low are bad. You want to see ~50% CPU utilisation
• To differentiate between the 2 ends, choose Black and Red. Expect to see mostly green.
• Top-N CPU Demand
– Allows us to zoom into specific time to see the past
• Line chart of a selected VM (automatically plotted)
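The two line charts in this dashboard amount to a per-interval maximum and average across the group of large VMs. A minimal Python sketch of that roll-up follows. The 25% ceiling comes from the slide, while the VM names, the sample values and the function names are made up for illustration.

```python
def summarize(demand_by_vm):
    """Return (max_line, avg_line): per-interval maximum and average
    CPU demand (%) across all VMs in the group."""
    columns = list(zip(*demand_by_vm.values()))
    max_line = [max(col) for col in columns]
    avg_line = [sum(col) / len(col) for col in columns]
    return max_line, avg_line


def looks_oversized(avg_line, ceiling=25.0):
    """True when the group average stays below the ceiling for the
    whole period, the slide's signal that the large VMs are oversized."""
    return all(value < ceiling for value in avg_line)


# Assumed CPU demand samples (%) for three large VMs over three intervals.
demand = {
    "db01": [10, 12, 15],
    "app01": [20, 18, 22],
    "bi01": [5, 40, 8],
}

max_line, avg_line = summarize(demand)
print(max_line, looks_oversized(avg_line))
```

Note how a single VM (`bi01` in the second interval) is enough to lift the maximum line, while the average line barely moves.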
![Page 56: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/56.jpg)
As expected, the Max of All VMs is low. We can go back in time and see over 3 months.
As expected, they are mostly Black. This means they are over provisioned.
This shows the Top 15 VMs. You can change the period to any time.
This is auto shown. We are showing CPU and RAM. You expect 70% range, not 20% like this example.
![Page 57: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/57.jpg)
![Page 58: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/58.jpg)
![Page 59: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/59.jpg)
Dashboard 2: Large VM Monitoring
![Page 60: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/60.jpg)
Any Excessive Utilization in our DC?
• A VM consumes 5 resources:
1. vCPU
2. vRAM (GB)
3. Disk Space
4. Disk IOPS
5. Network (Mbps)
• The first 3 you can bound and control
• The last 2 you can also limit, but normally you don’t. You should.
– The application team does not normally know how much IOPS or network throughput they need.
– Do you allow any VM to generate 100K IOPS?
– Do you allow any VM to saturate a 1 Gb link?
• Need a dashboard to track excessive usage
– Disk IOPS
– Network throughput
![Page 61: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/61.jpg)
Dashboard for Excessive Utilisation
• Excessive Storage consumption
– Line Chart:
• Max VM Disk IOPS among all VMs
• Average VM Disk IOPS
– Heat Map
• Size by IOPS. Color by Latency
• If you see a big box, that means you have a VM dominating your storage IOPS.
• Excessive Network consumption
– Similar concept as above
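A rough sketch of the idea behind this dashboard, assuming we already have one IOPS value per VM for a given moment. The function name, the dictionary layout and the demo numbers are hypothetical. It compares the max with the average and shows how much of the total one VM takes, which is what a "big box" in the heat map tells you.

```python
def iops_summary(iops_by_vm):
    """iops_by_vm: {vm_name: IOPS at one point in time}.

    Returns the max, the average, the busiest VM and its share of total IOPS.
    """
    total = sum(iops_by_vm.values())
    top_vm = max(iops_by_vm, key=iops_by_vm.get)
    return {
        "max": iops_by_vm[top_vm],
        "average": total / len(iops_by_vm),
        "top_vm": top_vm,
        "top_share": iops_by_vm[top_vm] / total if total else 0.0,
    }


# Hypothetical demo data: one VM dominating the storage IOPS.
demo = {"vm-1": 9000, "vm-2": 400, "vm-3": 350, "vm-4": 250}
summary = iops_summary(demo)
print(summary)
```

A max far above the average, together with a high top share, points at a single VM. The same function works for network throughput if you feed it Mbps instead of IOPS.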
![Page 62: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/62.jpg)
This tracks the IOPS from the VMs. From here we can tell there is a distinct peak. It looks like it’s coming from
1 VM, as the average is far lower. This is a cluster of 500 VMs, so even if 1 VM hits 13,200 IOPS, the
average did not even pass 15 IOPS.
Let’s zoom into the peak.
![Page 63: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/63.jpg)
Excessive Storage Dashboard
The peak was 13,212 IOPS on 24 May, around 3:16 am. Let’s find out
which VM.
![Page 64: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/64.jpg)
Excessive Storage Dashboard
• We can list the Top VMs generating the IOPS in any given period.
Bingo, it was VM 63ee that did those 13,212 IOPS.
Gotcha!
The dashboards are great.
But they do not tell you how the IOPS are distributed among all the VMs. They also do not tell you if the VMs are experiencing high latency.
You need a Heat Map for this.
![Page 65: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/65.jpg)
At a glance, we can tell the IOPS distribution among the VMs. We can also tell if they are getting low
latency or not.
![Page 66: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/66.jpg)
Dashboard 3: Excessive DC Utilization
![Page 67: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/67.jpg)
And that’s it! Once you have “passed” those dashboards, you’re done with the “dining area”!
![Page 68: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/68.jpg)
The Provider Layer: The “kitchen”
![Page 69: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/69.jpg)
Performance Management
• Overall Performance Monitoring
– Are any of our customers experiencing bad performance?
– CPU, RAM, Disk, Network
• If yes, who is affected?
– Different VMs may be affected differently.
– VM 007 may get hit on CPU, while VM 747 may get hit on Storage.
![Page 70: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/70.jpg)
Performance SLA Monitoring
• How do we prove that not a single VM, in any service tier, failed the SLA threshold we agreed for that tier in the past month?
• Since VMs move around in a cluster due to DRS and HA, we need to track at Cluster level.
• If you oversubscribe, there is a risk of Contention.
– For Tier 1, do not overcommit.
– For Tier 2 and 3, do overcommit.
![Page 71: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/71.jpg)
Using Max and Average to determine how VMs are served
If the Max is:
• Below what you think your customers can tolerate, then you are good.
• Near the threshold, then your capacity is full. Do not add more VMs.
• Above the threshold, move a few VMs out, preferably the large ones.
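The three cases above map naturally to a small decision function. This is a sketch under one loud assumption: the slide does not say how close "near" is, so the 10% band below (`near_margin`) is a hypothetical choice you should replace with your own.

```python
def capacity_state(max_value, threshold, near_margin=0.1):
    """Classify the cluster from the Max value seen by any VM.

    near_margin is a hypothetical band, expressed as a fraction of the
    threshold, that counts as 'near the threshold'.
    """
    if max_value > threshold:
        return "above: move a few VMs out, preferably the large ones"
    if max_value >= threshold * (1 - near_margin):
        return "near: capacity is full, do not add more VMs"
    return "below: you are good"


# Hypothetical demo values against a 5% contention threshold.
for value in (2.0, 4.8, 6.0):
    print(value, "->", capacity_state(value, threshold=5.0))
```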
![Page 72: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/72.jpg)
![Page 73: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/73.jpg)
![Page 74: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/74.jpg)
![Page 75: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/75.jpg)
![Page 76: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/76.jpg)
This dashboard is good as a summary. You stop here if there is no issue.
Yes, 1 dashboard!
![Page 77: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/77.jpg)
Which VMs are affected?
• The previous slides give us info at Cluster level.
– If there is no VM affected, it’s good. No need to analyse further.
– If there are VMs affected, we want to know which ones.
• We can address the above by listing the Top 30 VMs by:
– CPU Contention
– RAM Contention
– Disk Latency
– Network dropped packets (ensure it is 0)
– Network latency (this needs NetFlow)
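A Top-N list like the ones above is simple to sketch once you have the worst value per VM for the period. The function names and demo data are hypothetical. `heapq.nlargest` is from the Python standard library.

```python
import heapq


def top_n_vms(worst_value_by_vm, n=30):
    """Return the n VMs with the highest value, worst first, as (vm_name, value) pairs."""
    return heapq.nlargest(n, worst_value_by_vm.items(), key=lambda item: item[1])


def vms_with_dropped_packets(dropped_by_vm):
    """Dropped packets should be 0, so list every VM where they are not."""
    return sorted(vm for vm, dropped in dropped_by_vm.items() if dropped > 0)


# Hypothetical demo data.
cpu_contention_pct = {"vm-a": 2.1, "vm-b": 7.5, "vm-c": 0.0, "vm-d": 4.4}
dropped_packets = {"vm-a": 0, "vm-b": 12, "vm-c": 0, "vm-d": 0}

worst_two = top_n_vms(cpu_contention_pct, n=2)
dropping = vms_with_dropped_packets(dropped_packets)
print(worst_two)
print(dropping)
```

The same `top_n_vms` call works for RAM Contention and Disk Latency. Only the input changes.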
![Page 78: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/78.jpg)
These are the top 40 VMs which experienced the worst CPU Contention.
These are the top 40 VMs which experienced the worst RAM Contention.
These are the top 40 VMs which experienced the worst Disk Latency.
![Page 79: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/79.jpg)
And that’s it! If Performance is ok, it’s time to review Capacity
![Page 80: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/80.jpg)
Capacity Management based on Business Policy
http://virtual-red-dot.info/capacity-management-based-on-business-policy/
![Page 81: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/81.jpg)
Performance Policy
Group Discussion: What should your Performance Policy be?
![Page 82: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/82.jpg)
Capacity Management: Tier 1
5 line charts showing these over the past 3 months:
• Number of vCPUs left in the cluster.
• Amount of vRAM left in the cluster.
• Number of VMs left in the cluster.
• Maximum & Average storage latency experienced by any VM in the cluster.
• “Usable” space left in the datastore cluster.
If a number is approaching its low mark (your threshold), it’s time to increase supply (e.g. IOPS, Cluster).
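For Tier 1 there is no overcommit, so "what is left" is plain subtraction. The sketch below assumes 1 vCPU per usable physical core, and that the usable figures already exclude whatever you reserve for HA. Both are assumptions for illustration, as are the function name and the demo numbers.

```python
def tier1_remaining(usable_cores, usable_ram_gb, vms):
    """vms: list of (vcpu, vram_gb) tuples for the VMs already in the cluster.

    With no overcommit, every configured vCPU takes one usable core and
    every GB of vRAM takes one GB of usable physical RAM.
    Returns (vcpu_left, vram_gb_left).
    """
    used_vcpu = sum(vcpu for vcpu, _ in vms)
    used_ram_gb = sum(vram for _, vram in vms)
    return usable_cores - used_vcpu, usable_ram_gb - used_ram_gb


# Hypothetical cluster: 96 usable cores, 1024 GB usable RAM, three VMs.
vcpu_left, vram_left = tier1_remaining(96, 1024, [(8, 64), (16, 128), (4, 32)])
print(vcpu_left, vram_left)
```

Plot these two numbers over time and you get the first two line charts. When either one approaches your threshold, the cluster is full for Tier 1 purposes.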
![Page 83: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/83.jpg)
Capacity Management: Tier 2 or 3
5 line charts showing data over the past 3 months:
• The Maximum CPU Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The Maximum RAM Contention experienced by any VM in the cluster.
– This number has to be lower than the SLA we promise.
• The total number of VMs left in the cluster.
• The Maximum & Average storage latency experienced by any VM in the cluster.
• The disk capacity left in the datastore cluster.
![Page 84: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/84.jpg)
Key Takeaways
Agree on a Performance SLA.
Contention, not Utilization.
Capacity is defined by Performance.
![Page 85: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/85.jpg)
Thank you
![Page 86: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/86.jpg)
Appendix
![Page 87: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/87.jpg)
Understanding VM CPU Demand vs Usage
vSphere Reported
• CPU Usage: what the VM got right now
• Contention: what the VM could not get
vROps Reported
• CPU Demand: what the VM wants
If CPU Demand (what the VM wants) exceeds CPU Usage (what the VM got right now), the VM has a Performance Impact and needs Troubleshooting.
![Page 88: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/88.jpg)
Troubleshooting Population Pressure
• Entitlement: what the VM can ever get
• CPU Usage: what the VM got right now
• Contention: what the VM could not get
If the VM has Population Pressure, it needs one of:
• Move VM
• Add Physical Capacity
![Page 89: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/89.jpg)
vR Ops 6.0 Out of Box
Troubleshooting Memory
Allocation (No Overcommit)
• Most Conservative
• Configured Memory
• Wasteful in non-Production Env
Usage (Active)
• Most Aggressive
• Current Active Demand
Consumed (All Touched Bits)
• vSphere reported
• Moderate Approach
• Java, SQL, Exchange
Oracle VM
• Memory Configured: 1GB
• Memory Consumed: 721MB
• Memory Demand: 292MB
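To see how far apart the three models are, here is the arithmetic for the Oracle VM figures above. It assumes 1GB = 1024MB. The variable and function names are only for illustration.

```python
MB_PER_GB = 1024  # assumption: binary gigabytes

# Figures from the Oracle VM example on the slide.
configured_mb = 1 * MB_PER_GB
consumed_mb = 721
demand_mb = 292


def pct_of_configured(value_mb, configured=configured_mb):
    """Express a memory figure as a percentage of configured memory."""
    return round(100 * value_mb / configured, 1)


models = {
    "Allocation (configured)": configured_mb,
    "Consumed (all touched bits)": consumed_mb,
    "Usage (active demand)": demand_mb,
}
for name, value_mb in models.items():
    print(f"{name}: {value_mb} MB = {pct_of_configured(value_mb)}% of configured")
```

The same VM counts as roughly 100%, 70% or 29% of its configured memory depending on which model you pick, which is why the choice of model drives the capacity numbers.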
![Page 90: Mastering Performance Monitoring and Capacity Planning using](https://reader033.vdocuments.net/reader033/viewer/2022042619/58a1a43a1a28abcf5a8b9a80/html5/thumbnails/90.jpg)
Our Philosophy Is Not your Philosophy : Mem Consumed in 6.1
Total Memory Touched by VM
vSphere vROps 6.1